
Shrink Git Repositories With Repo Cleaner


With CI/CD tools like Jenkins, GitLab, Azure DevOps, and many more, git has most likely become a crucial building block of your DevOps strategy. When migrating to git from another version control system, you may (accidentally) migrate “as-is” without going through the learning curve first. I understand: working through the steep learning curve of git, branching strategies, and merging strategies may not be your priority. Take this approach, though, and sooner or later you will run into a nasty situation where your git repo has grown out of control. Let’s see how we can shrink git repositories with BFG repo-cleaner and git-sizer.

So, when is your git repo out of control?

A couple of red flags should give you a heads-up about the health of your git repository. One of them is its overall size. If, after adopting CI/CD, you notice that the git checkout step takes a significant amount of time, your repository might be a little heavy. Having too many references in your git repo increases the time a git fetch takes. Ideally, you want to pull your remote changes fast. If it takes longer than a minute, you may have too many references. You can cut the number of references by removing old git tags and long-lived branches.
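To get a feel for how many references a repository carries, a minimal sketch like the one below can help. The helper name count_refs is made up for this post; it only relies on git for-each-ref and takes the repository path as an optional argument.

```shell
#!/bin/sh
# count_refs is a hypothetical helper: $1 is the repository path (defaults to ".").
# It prints how many local branches, remote-tracking branches, and tags exist.
count_refs() {
    repo="${1:-.}"
    echo "branches: $(git -C "$repo" for-each-ref refs/heads | wc -l | tr -d ' ')"
    echo "remotes: $(git -C "$repo" for-each-ref refs/remotes | wc -l | tr -d ' ')"
    echo "tags: $(git -C "$repo" for-each-ref refs/tags | wc -l | tr -d ' ')"
    echo "total: $(git -C "$repo" for-each-ref | wc -l | tr -d ' ')"
}
```

Calling count_refs without arguments inside a working copy prints the four counters for that repository. What counts as "too many" depends on your setup, so treat the numbers as a trend to watch rather than a hard limit.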

On the other hand, having too many objects slows down traversal of your repo when cloning. The more objects there are, the longer it takes to download and push them, regardless of the file sizes. Sometimes git is (ab)used to store larger files like binary blobs, archives, artefacts, etc. Most of the time these files end up in your repo because you are used to doing it like that. Adding large files to your repository quickly adds to its overall size. Consider Azure Artifacts or Artifactory for storing artefacts. If you still want to store large files in git, please take a look at git-lfs. Of course, there are plenty of other signs that tell you something about the health of your repository, but for now we’re sticking to its size, references, and objects.
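A quick way to see how many objects a repository holds is git count-objects. The sketch below wraps it in a hypothetical helper called object_total, which adds up the loose objects ("count") and the packed objects ("in-pack") from the verbose output.

```shell
#!/bin/sh
# object_total is a hypothetical helper: $1 is the repository path (defaults to ".").
# It sums the "count" (loose) and "in-pack" (packed) fields of git count-objects -v.
object_total() {
    git -C "${1:-.}" count-objects -v |
        awk -F': ' '$1 == "count" || $1 == "in-pack" { n += $2 } END { print n }'
}
```

Running git count-objects -vH directly shows the same fields together with human-readable sizes, which is handy for a first impression.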

Identifying the root cause with git-sizer

Do you feel your repo needs some attention, or are you not sure yet? Luckily, in both cases we can use a wonderful tool called git-sizer. It gives us some insight into what’s going on inside the git repository. In my use case, I’m working with a 10GiB git repository. Let’s see what git-sizer shows us.

$ du -sh
10.0G .
$ git-sizer

Screenshot of git-sizer output

git-sizer reports a set of metrics about your repository, each with a level of concern. In this example, git-sizer has identified a 2.13GiB blob called docker/data/sql/Master.mdf. Now we know the root cause of why this repo is so heavy.
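If you want to cross-check what git-sizer reports, plain git can list the largest blobs in the history as well. The following is a minimal sketch that assumes nothing beyond git, awk, sort, head, and cut; the helper name largest_blobs and its arguments are made up for illustration.

```shell
#!/bin/sh
# largest_blobs is a hypothetical helper.
#   $1: repository path (defaults to ".")
#   $2: number of entries to print (defaults to 10)
# It prints "<size in bytes> <path>" for the biggest blobs reachable from any ref.
largest_blobs() {
    repo="${1:-.}"
    limit="${2:-10}"
    # rev-list emits "<object id> <path>"; cat-file looks up type and size
    # and echoes the path back through %(rest).
    git -C "$repo" rev-list --objects --all |
        git -C "$repo" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        awk '$1 == "blob"' |
        sort -k2,2nr |
        head -n "$limit" |
        cut -d' ' -f2-
}
```

For example, largest_blobs . 5 prints the five biggest blobs of the current repository. The sizes are uncompressed object sizes, so they will not match the on-disk size of the pack files exactly.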

Cleaning the repository with BFG repo-cleaner

Before you start, please follow the instructions on the BFG repo-cleaner website carefully, and make sure you have a bare clone of your repository. Yes, you could skip git-sizer and clean the repo by stripping out files bigger than a certain size. Personally, however, I recommend identifying any potential issues with git-sizer first. Once you have installed bfg on your local system, you can run the following command:

java -jar bfg.jar --delete-files <fileToBeRemoved> <name of your repo>

BFG prints a lot of information about where the particular file was found: in which branches, references, and commits, and even when it was added to your repo. It rewrites the affected commits and references, but the removed data stays on disk until git garbage-collects it. When ready, you can clean out the removed file(s) and push the rewritten history back to the remote.
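For the bare clone that BFG works on, a mirror clone plus a backup copy is a reasonable starting point. The sketch below assumes a hypothetical helper named prepare_mirror; the ".backup" suffix is simply a convention chosen for this example.

```shell
#!/bin/sh
# prepare_mirror is a hypothetical helper.
#   $1: remote URL or local path of the repository
#   $2: target directory for the bare mirror clone
prepare_mirror() {
    url="$1"
    dest="$2"
    # A mirror clone is bare and copies every ref, not only the branches.
    git clone --quiet --mirror "$url" "$dest" || return 1
    # Keep an untouched copy so the cleanup can be abandoned if something goes wrong.
    cp -R "$dest" "$dest.backup"
}
```

A call could look like prepare_mirror git@example.com:team/repo.git repo.git, where the URL is a placeholder. If you prefer the size-based cleanup instead of naming a file, the BFG documentation describes a --strip-blobs-bigger-than option (for example with the value 100M) that takes the place of --delete-files.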

Note: make sure you are inside the bare repo folder before running the following command:

git reflog expire --expire=now --all && git gc --prune=now --aggressive

This will take some time, roughly between 5 and 15 minutes in my case. Once it’s done, you can push the result back to the remote using git push.
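Before pushing, it may be worth checking that the file is really gone and that the repository actually shrank. The two helpers below are a sketch with made-up names; they rely only on git log, du, and awk.

```shell
#!/bin/sh
# history_touches is a hypothetical helper.
#   $1: repository path
#   $2: file path relative to the repository root
# It prints every commit, on any ref, that touches the given path.
history_touches() {
    git -C "$1" log --all --format='%h %s' -- "$2"
}

# repo_size_kb is a hypothetical helper: prints the on-disk size of $1 in kilobytes.
repo_size_kb() {
    du -sk "$1" | awk '{ print $1 }'
}
```

Empty output from history_touches repo.git docker/data/sql/Master.mdf means no remaining commit references the file. If commits still show up, keep in mind that BFG by default leaves the contents of your latest commit untouched, so the file has to be deleted there with a normal commit first.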

In my use case, I was able to remove 8GiB just by removing the .mdf object from all references.

Good luck cleaning up your repositories!