[PATCH] Add support for git option: `pack.packSizeLimit`

Cathy J. Fitzpatrick cathy at cathyjf.com
Sat Dec 16 12:20:53 GMT 2023


The standard `git-config(1)` setting of `pack.packSizeLimit` is used to
specify the maximum size of a pack file when repacking a repository.
This setting is important when working with a remote git hosting
provider that imposes a maximum file size on files stored on the remote
server. For example, GitHub currently imposes a maximum size of 100 MiB
per file stored on its servers.
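
As a rough illustration of choosing a value, the sketch below converts a
git-style size such as `90m` into bytes and compares it with a 100 MiB
cap. The helper names `size_to_bytes` and `within_cap` are hypothetical
and are not part of git or git-remote-gcrypt. The sketch assumes that
the "k", "m", and "g" suffixes scale by powers of 1024, as they do for
git configuration values.

```shell
# Hypothetical helper: convert a git-style size ("512k", "12m", "1g")
# into a number of bytes. Suffixes are treated as powers of 1024.
size_to_bytes() {
    case $1 in
        *[kK]) echo $(( ${1%?} * 1024 )) ;;
        *[mM]) echo $(( ${1%?} * 1024 * 1024 )) ;;
        *[gG]) echo $(( ${1%?} * 1024 * 1024 * 1024 )) ;;
        *)     echo $(( $1 )) ;;
    esac
}

# Hypothetical check: succeeds if the proposed limit does not exceed a
# 100 MiB cap (the GitHub figure quoted above).
within_cap() {
    [ "$(size_to_bytes "$1")" -le "$(size_to_bytes 100m)" ]
}
```

A limit chosen this way could then be applied with
`git config pack.packSizeLimit 90m`. Leaving some headroom below the
provider's cap seems prudent, because a pack may still exceed the limit
when a single object is itself larger than the limit.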

Until now, git-remote-gcrypt has ignored the `pack.packSizeLimit`
setting when repacking the repository. This setting has been ignored
because git-remote-gcrypt has supplied the `--stdout` flag to
`git-pack-objects(1)`, and that flag implicitly causes
`git-pack-objects(1)` to ignore the value of `pack.packSizeLimit`.

This patch modifies git-remote-gcrypt so that it will respect the value
of `pack.packSizeLimit`. This is achieved by modifying the invocation of
`git-pack-objects(1)` so that the `--stdout` argument is not supplied.
Instead, the pack files are written to the same temporary directory that
git-remote-gcrypt already uses for other purposes.
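
The difference between the two invocations can be sketched as follows.
The function names are hypothetical, and the sketch leaves out the
`GIT_ALTERNATE_OBJECT_DIRECTORIES` handling and the encryption step that
git-remote-gcrypt performs around this call.

```shell
# Old approach: a single pack is streamed to stdout, so
# pack.packSizeLimit has no effect.
# $1: file containing the list of objects to pack
pack_to_stdout() {
    git pack-objects --stdout < "$1"
}

# New approach: packs are written to "$2-<hash>.pack" (with matching
# .idx files) and the hash of each pack is printed, one per line.
# pack.packSizeLimit is respected, so several hashes may be printed.
# $1: file containing the list of objects to pack
# $2: path prefix for the generated pack files
pack_to_prefix() {
    git pack-objects "$2" < "$1"
}
```

The caller of the second form has to read back the printed hashes in
order to locate the files that were written.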

The code that invokes `git-pack-objects(1)` is also modified to handle
the possibility that more than one pack file might be produced (if the
size of the pack would exceed the value of `pack.packSizeLimit`).
Previously, git-remote-gcrypt was able to assume that
`git-pack-objects(1)` would always produce exactly one pack file, but
with this patch, that is no longer the case if the user has specified
`pack.packSizeLimit`. To address this, it was necessary to introduce a
loop in two places, to iterate over each of the generated pack files,
instead of assuming that there would always be exactly one pack file.
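
The shape of such a loop can be sketched as below. This is a simplified
stand-in rather than the code from the patch: the names `record_packs`
and `pack_lines` are hypothetical, and the sketch only records each pack
instead of encrypting it.

```shell
# Accumulates one "pack <hash>" line per pack that was processed.
pack_lines=

# $1: path prefix that was given to git pack-objects
# $2: newline-separated pack hashes, as printed by git pack-objects
record_packs() {
    while IFS= read -r id
    do
        [ -n "$id" ] || continue    # skip the blank line of an empty list
        if [ ! -f "$1-$id.pack" ]
        then
            echo "missing pack: $1-$id.pack" >&2
            return 1
        fi
        pack_lines="${pack_lines}pack $id
"
        rm -f -- "$1-$id.pack"      # the raw pack is no longer needed
    done <<EOF
$2
EOF
}
```

Feeding the list through a here-document, rather than a pipe, keeps the
loop in the current shell, so values accumulated inside the loop are
still visible after it ends. With a pipe, many shells would run the loop
in a subshell and discard those assignments.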

This patch does not change the git-remote-gcrypt protocol in any way.
Repositories created with the new version of git-remote-gcrypt can still
be read with older versions of git-remote-gcrypt. And, of course,
repositories created with older versions of git-remote-gcrypt can be
read with the new version. The change is fully backward- and
forward-compatible. Indeed, this is true of the `pack.packSizeLimit`
setting in general. As the manual for `git-config(1)` observes,
"the git:// protocol is unaffected" by the value of
`pack.packSizeLimit`.

Although storing repositories encrypted by git-remote-gcrypt on the
servers of Git hosting services such as GitHub has a variety of
drawbacks, it is a supported use case, and it can make sense for
certain kinds of repositories. This patch makes it easier to work with
these backends by handling any maximum file size restrictions imposed
by the services. To keep the interface simple, the patch relies solely
on a standard `git-config(1)` setting, namely `pack.packSizeLimit`.

This patch also modifies the README to document the behavior of the
`pack.packSizeLimit` setting as it affects git-remote-gcrypt. This patch
also amends the section of the README relating to the
*GCRYPT_FULL_REPACK* environment variable to clarify that, in order
to force a repack of the repository, the variable must be set to a value
_other than the empty string_.
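
The distinction between "set" and "set to a non-empty value" can be
sketched as below. The function name `wants_full_repack` is hypothetical
and the test shown is only an assumption about how such a check might be
written; it is not quoted from git-remote-gcrypt.

```shell
# Hypothetical check: succeeds only if GCRYPT_FULL_REPACK is set to a
# value other than the empty string. An unset variable and a variable
# set to "" are both treated as "no full repack".
wants_full_repack() {
    [ -n "${GCRYPT_FULL_REPACK:-}" ]
}
```

Under this reading, `GCRYPT_FULL_REPACK=1 git push` would force a
repack, whereas `GCRYPT_FULL_REPACK= git push` would not.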

Finally, this patch also includes a new test that, when run,
verifies the basic functionality of git-remote-gcrypt.

Signed-off-by: Cathy J. Fitzpatrick <cathy at cathyjf.com>
---
 README.rst           |  17 +++-
 git-remote-gcrypt    |  46 ++++++---
 tests/system-test.sh | 223 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 270 insertions(+), 16 deletions(-)
 create mode 100755 tests/system-test.sh

diff --git a/README.rst b/README.rst
index a7c41a2..6bc91d7 100644
--- a/README.rst
+++ b/README.rst
@@ -105,11 +105,26 @@ The following ``git-config(1)`` variables are supported:
     If this flag is set to ``true``, git-remote-gcrypt will refuse to push,
     unless ``--force`` is passed, or refspecs are prefixed with ``+``.
 
+``pack.packSizeLimit``
+    This is a standard git configuration variable.
+
+    In the context of git-remote-gcrypt, this variable, if set, specifies the
+    maximum size of the packfiles to be uploaded to the backend. As in
+    standard git, this value should be an integer, optionally suffixed with
+    "k", "m", or "g". If a packfile exceeds the maximum size, it will be
+    split into several files before being uploaded. This splitting is
+    transparent to the user and does not affect use of the repository.
+
+    This variable is useful when working with a backend that imposes a maximum
+    file size, such as GitHub, which currently imposes a maximum file size of
+    100m.
+
 Environment variables
 =====================
 
 *GCRYPT_FULL_REPACK*
-    When set (to anything), this environment variable forces a full repack when pushing.
+    When set (to anything other than the empty string), this environment
+    variable forces a full repack when pushing.
 
 Examples
 ========
diff --git a/git-remote-gcrypt b/git-remote-gcrypt
index 7e7240f..97684aa 100755
--- a/git-remote-gcrypt
+++ b/git-remote-gcrypt
@@ -739,7 +739,8 @@ do_push()
 	# The manifest is encrypted.
 	local r_revlist= pack_id= key_= obj_= src_= dst_= \
 		r_pack_delete= tmp_encrypted= tmp_objlist= tmp_manifest= \
-		force_passed=
+		force_passed= tmp_pack_prefix= r_new_pack_list= \
+		new_pack_object_ids= object_id=
 
 	ensure_connected
 
@@ -787,6 +788,7 @@ EOF
 		fi
 	fi
 
+	tmp_pack_prefix="$Tempdir/pack_raw"
 	tmp_encrypted="$Tempdir/packP"
 	tmp_objlist="$Tempdir/objlP"
 
@@ -798,17 +800,28 @@ EOF
 	# Only send pack if we have any objects to send
 	if [ -s "$tmp_objlist" ]
 	then
-		key_=$(genkey "$Packkey_bytes")
-		pack_id=$(export GIT_ALTERNATE_OBJECT_DIRECTORIES=$Tempdir;
-			pipefail git pack-objects --stdout < "$tmp_objlist" |
-			pipefail ENCRYPT "$key_" |
-			tee "$tmp_encrypted" | gpg_hash "$Hashtype")
-
-		append_to @Packlist "pack :${Hashtype}:$pack_id $key_"
-		if isnonnull "$r_pack_delete"
-		then
-			append_to @Keeplist "keep :${Hashtype}:$pack_id 1"
-		fi
+		# This will return more than one object_id if the user's git
+		# configuration includes `pack.packSizeLimit` and the size of the
+		# packfile is greater than the specified size limit. Hence, we need
+		# to iterate through the returned objects.
+		new_pack_object_ids=$(GIT_ALTERNATE_OBJECT_DIRECTORIES=$Tempdir \
+			git pack-objects "$tmp_pack_prefix" < "$tmp_objlist")
+		while IFS= read -r object_id
+		do
+			key_=$(genkey "$Packkey_bytes")
+			pack_id=$(pipefail ENCRYPT "$key_" < "$tmp_pack_prefix-$object_id.pack" | \
+				tee "$tmp_encrypted-$object_id" | gpg_hash "$Hashtype")
+			rm -f -- "$tmp_pack_prefix-$object_id.pack"
+
+			append_to @r_new_pack_list "$pack_id:$object_id"
+			append_to @Packlist "pack :${Hashtype}:$pack_id $key_"
+			if isnonnull "$r_pack_delete"
+			then
+				append_to @Keeplist "keep :${Hashtype}:$pack_id 1"
+			fi
+		done <<EOF
+$new_pack_object_ids
+EOF
 	fi
 
 	# Generate manifest
@@ -824,16 +837,19 @@ repo $Repoid
 $Extnlist
 EOF
 
-	# Upload pack
+	# Upload pack (or packs, if applicable)
 	if [ -s "$tmp_objlist" ]
 	then
-		PUT "$URL" "$pack_id" "$tmp_encrypted"
+		xecho "$r_new_pack_list" | while IFS=':' read -r pack_id object_id
+		do
+			PUT "$URL" "$pack_id" "$tmp_encrypted-$object_id"
+			rm -f -- "$tmp_encrypted-$object_id"
+		done
 	fi
 
 	# Upload manifest
 	PUT "$URL" "$Manifestfile" "$tmp_manifest"
 
-	rm -f "$tmp_encrypted"
 	rm -f "$tmp_objlist"
 	rm -f "$tmp_manifest"
 
diff --git a/tests/system-test.sh b/tests/system-test.sh
new file mode 100755
index 0000000..7e144d2
--- /dev/null
+++ b/tests/system-test.sh
@@ -0,0 +1,223 @@
+#!/usr/bin/env bash
+# SPDX-FileCopyrightText: Copyright 2023 Cathy J. Fitzpatrick <cathy at cathyjf.com>
+# SPDX-License-Identifier: GPL-2.0-or-later
+set -efuC -o pipefail
+shopt -s inherit_errexit
+
+# Unlike the main git-remote-gcrypt program, this testing script requires bash
+# (rather than POSIX sh) and also depends on various common system utilities
+# that git-remote-gcrypt carefully avoids using (such as mktemp(1)).
+#
+# The test proceeds by setting up a new repository, making some large commits
+# with random data into the repository, pushing the repository to another
+# remote using git-remote-gcrypt over the gitception protocol, and then cloning
+# the second repository and ensuring that the data it contains is correct.
+#
+# The random data is obtained from /dev/urandom. This script won't work
+# on systems that don't provide /dev/urandom.
+#
+# The following settings specify the parameters to be used for the test.
+num_commits=5
+files_per_commit=3
+random_source="/dev/urandom"
+random_data_per_file=5242880 # 5 MiB
+default_branch="main"
+test_user_name="git-remote-gcrypt"
+test_user_email="git-remote-gcrypt at example.com"
+pack_size_limit="12m" # If this variable is unset, there is no size limit.
+
+readonly num_commits files_per_commit random_source random_data_per_file \
+    default_branch test_user_name test_user_email pack_size_limit
+
+# Pipe text into this function to indent it with four spaces. This is used
+# to make the output of this script prettier.
+indent() {
+    sed 's/^\(.*\)$/    \1/'
+}
+
+section_break() {
+    echo
+    printf '*%.0s' {1..70}
+    echo $'\n'
+}
+
+assert() {
+    (set +e; [[ -n ${show_command:-} ]] && set -x; "${@}")
+    local -r status=${?}
+    { [[ ${status} -eq 0 ]] && echo "Verification succeeded."; } || \
+        echo "Verification failed."
+    return "${status}"
+}
+
+fastfail() {
+    "$@" || kill -- "-$$"
+}
+
+umask 077
+tempdir=$(mktemp -d)
+readonly tempdir
+# shellcheck disable=SC2064
+trap "rm -Rf -- '${tempdir}'" EXIT
+
+# Set up the PATH to favor the version of git-remote-gcrypt from the repository
+# rather than a version that might already be installed on the user's system.
+PATH=$(git rev-parse --show-toplevel):${PATH}
+readonly PATH
+export PATH
+
+# Unset any GIT_ environment variables to prevent them from affecting the test.
+git_env=$(env | sed -n 's/^\(GIT_[^=]*\)=.*$/\1/p')
+# shellcheck disable=SC2086
+IFS=$'\n' unset ${git_env}
+
+# Ensure a predictable gpg configuration.
+export GNUPGHOME="${tempdir}/gpg"
+mkdir "${GNUPGHOME}"
+# Use a wrapper for gpg(1) to avoid cluttering the test output with unnecessary
+# warnings about the obsolete `--secret-keyring` option. These warnings are
+# caused by git-remote-gcrypt passing an option to gpg(1) that only makes sense
+# for ancient versions of gpg(1), but addressing that (if it should be
+# addressed at all) is a task best left for another day.
+cat << 'EOF' > "${GNUPGHOME}/gpg"
+#!/usr/bin/env bash
+set -efuC -o pipefail; shopt -s inherit_errexit
+args=( "${@}" )
+for ((i = 0; i < ${#}; ++i)); do
+    if [[ ${args[${i}]} = "--secret-keyring" ]]; then
+        unset "args[${i}]" "args[$(( i + 1 ))]"
+        break
+    fi
+done
+exec gpg "${args[@]}"
+EOF
+chmod +x "${GNUPGHOME}/gpg"
+
+# Ensure a predictable git configuration.
+export GIT_CONFIG_SYSTEM=/dev/null
+export GIT_CONFIG_GLOBAL="${tempdir}/gitconfig"
+mkdir "${tempdir}/template" # Intentionally empty template directory.
+git config --global init.defaultBranch "${default_branch}"
+git config --global user.name "${test_user_name}"
+git config --global user.email "${test_user_email}"
+git config --global init.templateDir "${tempdir}/template"
+git config --global gpg.program "${GNUPGHOME}/gpg"
+[[ -n ${pack_size_limit:-} ]] && \
+    git config --global pack.packSizeLimit "${pack_size_limit}"
+
+# Prepare the random data that we'll be writing to the repository.
+total_files=$(( num_commits * files_per_commit ))
+random_data_size=$(( total_files * random_data_per_file ))
+random_data_file="${tempdir}/data"
+head -c "${random_data_size}" "${random_source}" > "${random_data_file}"
+
+# Create gpg key and subkey.
+echo "Step 1: Creating a new GPG key and subkey to use for testing:"
+(
+    set -x
+    gpg --batch --passphrase "" --quick-generate-key \
+        "${test_user_name} <${test_user_email}>"
+    gpg -K
+) 2>&1 | indent
+
+###
+section_break
+
+echo "Step 2: Creating new repository with random data:"
+{
+    git init -- "${tempdir}/first"
+    cd "${tempdir}/first"
+    for ((i = 0; i < num_commits; ++i)); do
+        for ((j = 0; j < files_per_commit; ++j)); do
+            file_index=$(( i * files_per_commit + j ))
+            random_data_index=$(( file_index * random_data_per_file ))
+            # shellcheck disable=SC2016
+            echo "Writing random file $((file_index + 1))/${total_files}:" \
+                '${tempdir}'/"first/$(( file_index )).data "
+            head -c "${random_data_per_file}" > "$(( file_index )).data" < \
+                <(tail -c "+${random_data_index}" "${random_data_file}" || :)
+            if command -v base64 > /dev/null; then
+                # shellcheck disable=SC2312
+                echo "First 24 bytes in base64:" \
+                    "$(fastfail head -c 24 "$(( file_index )).data" | \
+                        fastfail base64)" | indent
+            fi
+        done
+        git add -- "${tempdir}/first"
+        git commit -m "Commit #${i}"
+    done
+
+    echo
+    echo "For reference, here is the commit log for the repository:"
+    git log --format=oneline | indent
+} | indent
+
+###
+section_break
+
+echo "Step 3: Creating an empty bare repository to receive pushed data:"
+git init --bare -- "${tempdir}/second.git" | indent
+
+
+###
+section_break
+
+echo "Step 4: Pushing the first repository to the second one using gitception:"
+{
+    # Note that when pushing to a bare local repository, git-remote-gcrypt uses
+    # gitception, rather than treating the remote as a local repository.
+    (
+        set -x
+        cd "${tempdir}/first"
+        git push -f "gcrypt::${tempdir}/second.git#${default_branch}" \
+            "${default_branch}"
+    ) 2>&1
+
+    if command -v tree > /dev/null; then
+        echo
+        echo "For reference, here is the directory tree of second.git:"
+        tree "${tempdir}/second.git"
+    fi
+
+    echo
+    echo "Here is the size of each object file in second.git:"
+    (
+        cd "${tempdir}/second.git/objects"
+        find . -type f -exec du -sh {} +
+    ) | indent
+
+    echo
+    echo "Note that git-pack-objects(1) will try to ensure that each object is"
+    echo "smaller than pack.packSizeLimit (${pack_size_limit:-unlimited}" \
+        "here) but this isn't always"
+    echo "possible because each object contains at least one of our random"
+    echo "files, and each random file has a certain minimum size. As a result,"
+    echo "pack.packSizeLimit is more of a suggestion than a hard limit."
+ } | indent
+
+###
+section_break
+
+echo "Step 5: Cloning the second repository using gitception:"
+{
+    (
+        set -x
+        git clone -b "${default_branch}" \
+            "gcrypt::${tempdir}/second.git#${default_branch}" -- \
+                "${tempdir}/third"
+    ) 2>&1
+
+    echo
+    echo "Verifying that the first and third repositories have the same"
+    echo "commit log as each other:"
+    # shellcheck disable=SC2312
+    assert diff \
+        <(fastfail cd "${tempdir}/first"; fastfail git log --oneline) \
+        <(fastfail cd "${tempdir}/third"; fastfail git log --oneline) \
+            2>&1 | indent
+
+    echo
+    echo "Verifying that the first and third repositories have the same"
+    echo "files in their respective working directories:"
+    show_command=1 assert diff -r --exclude ".git" -- \
+        "${tempdir}/first" "${tempdir}/third" 2>&1 | indent
+} | indent
\ No newline at end of file
-- 
2.43.0



