Maintaining extra large binary files with Git and GCS

[Writing in progress]

Index

  • Overview
  • Limits of existing git servers
  • Gitea
  • Method of integration with GCS

Git is not good at version controlling large binary files, especially ones that change frequently. Binary files delta-compress poorly, so each revision is stored almost in full: the repository size grows rapidly, and cloning it onto a new machine or instance takes longer and longer.

Git hosting services also impose maximum file and repository size limits.

Besides, binary files do not get much benefit from Git. For example, we cannot merge changes to them from different branches, and we cannot get a meaningful history of the changes.

In this article I will discuss only the size issues and their solutions.

The table below compares the maximum allowed sizes on some popular Git hosting services.

                     GitHub                           Bitbucket                        GitLab
Single file limit    normally: 100 MB
                     with LFS: 2 GB
Repository size      recommended: <1 GB               normally: 2 GB
                     strongly recommended: <5 GB      with LFS and paid: up to 4 GB

One solution is to deploy your own Git server. Gitea makes this very easy, and a self-hosted server is free of the limits above. However, you will still need Git LFS for extra large files, and the issue of cloning and maintaining the repository remains.
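For reference, a minimal sketch of running Gitea with Docker; the container name, published ports, and volume name are illustrative assumptions:

#!/usr/bin/env bash
# A minimal sketch: run Gitea in a Docker container.
# The container name, host ports, and volume name are hypothetical.
docker run -d --name gitea \
  -p 3000:3000 \
  -p 2222:22 \
  -v gitea-data:/data \
  gitea/gitea:latest
# Port 3000 serves the web UI, 2222 maps to the container's SSH port
# for Git operations, and the gitea-data volume persists repositories.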

So, after experiencing all of the above methods, I have found that we can integrate Git with GCS, plus a bit of scripting, to make it work seamlessly.

How does it work?

In brief, the binary files themselves are ignored by Git. When a change is committed and tagged with bin-updated, the binaries get uploaded to a GCS bucket instead, and small .cloud placeholder files are kept in their place.
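For example, the ignore rules might look like the following; the bin/ directory is a hypothetical layout for where the binaries live:

#!/usr/bin/env bash
# A sketch of the ignore rules: keep binaries out of Git, but track
# the small .cloud placeholders. The bin/ layout is hypothetical.
cat >> .gitignore <<'EOF'
bin/*
!bin/*.cloud
EOF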

Prerequisite knowledge:

  • Bash scripting
  • Git Hooks : used to run scripts at specific points in the Git workflow, such as before a push or after a receive.
  • gcloud : used to authenticate to Google Cloud.
  • gsutil : used to upload files to GCS.

Setting up GCS
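A minimal setup sketch: authenticate, then create a bucket to hold the binaries. The bucket name and region below are hypothetical, and bucket names must be globally unique.

#!/usr/bin/env bash
# Authenticate this machine to Google Cloud.
gcloud auth login

# Create a bucket for the binary files. The name and region are
# hypothetical; pick your own.
gsutil mb -l us-central1 gs://my-project-binaries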

Scripting Git Hooks

On the client side:

pre-push hook:
The script uploads each binary file to GCS and leaves a text file at that file's location, with the extension .cloud, containing the GCS path of the file.
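A minimal sketch of such a hook, reusing the hypothetical bucket and bin/ layout from above:

#!/usr/bin/env bash
# .git/hooks/pre-push (client side) -- a minimal sketch.
# The bucket name and the bin/ layout are hypothetical assumptions.
set -euo pipefail

BUCKET="gs://my-project-binaries"

# Only upload when the latest commit message carries the bin-updated tag.
if git log -1 --pretty=%B | grep -q "bin-updated"; then
    for f in bin/*; do
        # Skip the placeholders themselves and anything that is not a file.
        [[ -f "$f" && "$f" != *.cloud ]] || continue
        gsutil cp "$f" "$BUCKET/$f"
        # Leave a .cloud placeholder recording where the file went.
        echo "$BUCKET/$f" > "$f.cloud"
    done
fi

Note that the .cloud placeholders have to be committed before the push for the server to receive them; this sketch only illustrates the upload step.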

On the server side:

post-receive hook:
The script finds all files with the .cloud extension and downloads the corresponding binaries from the GCS bucket into the same folder.
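A matching sketch for the server; the deploy path is a hypothetical assumption, since a bare repository needs an explicit work tree to check files out into:

#!/usr/bin/env bash
# hooks/post-receive (server side) -- a minimal sketch.
# The deploy path is a hypothetical assumption.
set -euo pipefail

WORK_TREE="/srv/deploy"
git --work-tree="$WORK_TREE" --git-dir="$PWD" checkout -f

# For every .cloud placeholder, download the real file from the
# GCS path recorded inside it, into the same folder.
find "$WORK_TREE" -name '*.cloud' -print0 | while IFS= read -r -d '' placeholder; do
    gcs_path="$(cat "$placeholder")"
    gsutil cp "$gcs_path" "${placeholder%.cloud}"
done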
