Git Internals

About

This PWA is about the internals of Git.

It is Open Sourced at github.com/HarshKapadia2/git_internals.

The content of this site is covered through talks as well. Feel free to watch the Git Internals talks.

Prerequisites

Before going through this PWA, it would be beneficial to know the basics of Git.

Watch the Git Basics talks.
Go through the git_basics PWA.
Misc resources

The `.git` Directory

Introduction

On executing the git init command in a directory, Git creates a hidden .git directory in that directory. The .git directory contains all the project history data on which Git can perform its version control functions. It also contains files to configure the way Git handles things for that particular repository.

The `.git` Directory Contents

.git
├───addp-hunk-edit.diff
├───COMMIT_EDITMSG
├───config
├───description
├───FETCH_HEAD
├───HEAD
├───hooks
│   └───<*.sample>
├───index
├───info
│   └───exclude
├───lfs
│   ├───cache
│   │   └───locks
│   │       └───refs
│   │           └───heads
│   │               └───<branch_names>
│   │                   └───verifiable
│   ├───objects
│   │   └───<first_2_SHA-256_characters>
│   │       └───<next_2_SHA-256_characters>
│   │           └───<entire_64_character_SHA-256_hash>
│   └───tmp
├───logs
│   ├───HEAD
│   └───refs
│       ├───heads
│       │   └───<branch_names>
│       ├───remotes
│       │   └───<remote_aliases>
│       │       └───<branch_names>
│       └───stash
├───MERGE_HEAD
├───MERGE_MODE
├───MERGE_MSG
├───objects
│   ├───<first_2_SHA-1_characters>
│   │   └───<remaining_38_SHA-1_characters>
│   ├───info
│   │   ├───commit-graph
│   │   └───packs
│   └───pack
│       ├───multi-pack-index
│       ├───<*.idx>
│       ├───<*.pack>
│       └───<*.rev>
├───ORIG_HEAD
├───packed-refs
├───rebase-merge
│   ├───git-rebase-todo
│   ├───git-rebase-todo.backup
│   ├───head-name
│   ├───interactive
│   ├───no-reschedule-failed-exec
│   ├───onto
│   └───orig-head
└───refs
    ├───heads
    │   └───<branch_names>
    ├───remotes
    │   └───<remote_aliases>
    │       └───<branch_names>
    ├───stash
    └───tags
        └───<tag_names>

The `index` File

This file contains the details of staged (added) files and is the staging area of the repository.

The words 'index', 'stage' and 'cache' are the same in Git and are used interchangeably.
Do not confuse this index file with the .idx Index files in .git/objects/pack. They are not related.

It is created when files are added for the first time and is updated every time the git add command is executed.

It is a binary file and just printing contents using cat .git/index will result in gibberish. Its contents can be accessed using the git ls-files --stage [plumbing command].

From the image above
- 100644 is the mode of the file. It is an octal number.
  Octal: 100644 Binary: 001000 000 110100100
  - The first six binary bits indicate the object type.
    
    001000 indicates a regular file. (As seen in this case.)
    
    001010 indicates a symlink (symbolic link).
    
    001110 indicates a gitlink.
  - The next three binary bits (000) are unused.
  - The last nine binary bits (110100100) indicate Unix file permissions.
    
    644 and 755 are valid for regular files.
    
    Symlinks and gitlinks have the value 0 in this field.
- The next 40 character hexadecimal string is the SHA-1 hash of the file.
- The next number is a stage number/slot, which is useful during merge conflict handling.
  - 0 indicates a normal un-conflicted file.
  - 1 indicates the base, i.e., the original version of the file.
  - 2 indicates the 'ours' version, i.e., the HEAD version with both changes.
  - 3 indicates the 'theirs' version, i.e., the file with the incoming changes.
- The last string is the name of the file being referred to.

Further reading on index files can be found in the Resources section.

The `HEAD` File

It is used to refer to the latest commit in the current branch.
Usually it does not contain a commit SHA-1, but contains the path to a file (of the name of the current branch) in the refs directory which stores the last commit’s SHA-1 hash in that branch.
It contains a commit’s SHA-1 hash when a specific commit or tag is checked out. (Detached HEAD state.)
More on the HEAD file.

Eg:

# in the 'main' branch
$ cat .git/HEAD
ref: refs/heads/main
$ git switch test_branch
Switched to branch 'test_branch'
$ cat .git/HEAD
ref: refs/heads/test_branch

The `refs` Directory

.git
├───...
└───refs
    ├───heads
    │   └───<branch_name(s)>
    ├───remotes
    │   └───<remote_alias(es)>
    │       └───<branch_name(s)>
    ├───stash
    └───tags
        └───<tag_name(s)>

This directory holds the reference to the latest commit in every local branch and fetched remote branch in the form of the SHA-1 hash of the commit.
It also stores the SHA-1 hash of the commit which has been [tagged].
The HEAD file references a file (of the name of the branch that is currently checked out) from the heads directory in this (refs) directory.

Do not confuse the .git/refs directory and the .git/logs/refs directory. They have different uses.

The `packed-refs` File

One file is created per branch and tag in the refs directory.
In a repository with a lot of branches and tags, there are a lot of references (refs) and most of them are not actively used/changed.
These refs occupy a lot of storage space and cause performance issues.
The git pack-refs command is used to solve this problem. It stores all the refs in a single file called packed-refs. The git gc command also does this.

If a ref is missing from the usual refs directory after packing, it is looked up in this file.
Subsequent updates (new commit or a pull or push with changes) to a branch whose ref is packed creates a new file with the name of the branch in the refs directory as usual, but does not update the hash of that branch in the packed-refs file with the latest one. (A new packed-refs file will have to be generated for that.)

The `logs` Directory

.git
├───...
└───logs
    ├───HEAD
    └───refs
        ├───heads
        │   └───<branch_name(s)>
        ├───remotes
        │   └───<remote_alias(es)>
        │       └───<branch_name(s)>
        └───stash

Contains the history of all commits in order.

Every row consists of the parent commit’s SHA-1 hash, the current commit’s SHA-1 hash, the committer’s name and e-mail, the Unix Epoch Time of the commit, the time zone, the type of action and message in order.
There are logs for every branch in the local Git repository and for the fetched branches from the remote Git repository/repositories (if any).
Inside the logs directory
- The HEAD file stores information about all the commands executed by the user, such as branch switches, commits, rebases, etc.
- The files in the refs directory only include branch specific operations and history, such as commits, pulls, resets, rebases, etc.

Do not confuse the .git/logs/refs directory and the .git/refs directory. They have different uses.

The `FETCH_HEAD` file

It contains the latest commits of the fetched remote branch(es).
It corresponds to the branch which was
- Checked out when last fetched.
  - From the image above, only one branch is displayed without the not-for-merge text. The odd one out (the main branch in this case) is the branch which was checked out while fetching.
- Explicitly mentioned using the git fetch <remote_repo_alias> <branch_name> command.

The `COMMIT_EDITMSG` File

The commit message is written in this file.
This file is opened in an editor on executing the git commit command.
It contains the output of the git status command commented out using the character.
If there has been a commit before, then this file will show the last commit message along with the git status output just before that commit.

The `objects` Directory

.git
├───...
└───objects
    ├───<first_2_SHA-1_characters>
    │   └───<remaining_38_SHA-1_characters>
    ├───info
    │   ├───commit-graph
    │   └───packs
    └───pack
        ├───multi-pack-index
        ├───<*.idx>
        └───<*.pack>

The most important directory in the .git directory.
It houses the data (SHA-1 hashes) of all the commit, tree and blob objects in the repository.
To decrease access time, objects are placed in buckets (directories), with the first two characters of their SHA-1 hash as the name of the bucket. The remaining 38 characters are used to name the object’s file.
More on the pack directory.

Do not confuse this directory (.git/objects/info) with the .git/info directory. They have different uses.

The `info` Directory

.git
├───...
└───info
    └───exclude

It contains the exclude file which behaves like the .gitignore file, but is used to ignore files locally without modifying .gitignore.
More on the exclude file.

Do not confuse this directory (.git/info) with the .git/objects/info directory. They have different uses.

The `config` File

This file contains the local Git repository configuration.
It can be modified using the git config --local command.

The `addp-hunk-edit.diff` File

Created when the e (edit) option is chosen in the git add --patch command.
Enables the manual edit of a hunk of a file to be staged.

The `ORIG_HEAD` File

It contains the SHA-1 hash of a commit.
It is the previous state of the HEAD, but not necessarily the immediate previous state.
It is set by certain commands which have destructive/dangerous behaviour, so it usually points to the latest commit with a destructive change.
It is less useful now because of the [git reflog command] which makes reverting/resetting to a particular commit easier.

The `description` File

This is the description of the repository.
This file is used by GitWeb, which hardly anyone uses today, so can be left alone.

Git Objects

Introduction

Git has two data structures, a mutable index that caches information about the working directory and the next revision to be committed, and an immutable, append-only object database (repository) containing four types of objects

Blob Object
Commit Object
Tree Object
Tag Object

Git carries out its version control using these objects to store data and the internal working of Git can be understood by understanding these objects.

Some part of the internal working of Git will be explored through an example in this section.

Feel free to follow along and create the graphs below using Git Graph.

Run git init to initialize an empty repository in the inside_git directory (root directory). A hidden directory .git is created in this root folder.

The command du -c is used to list the sub-directories of inside_git and their sizes on disk (in kbs).

The `blob` Object

A Blob Object stores the contents of a file.

Create a new file in the root folder.

Now the working tree (root directory) contains the .git directory and the new file master_file_1.txt.

Add the file to the staging area using git add . and run du -c once again.

Note that a new directory e6 has been added to .git/objects.

Use the dir (or ls) command to find out which file is present in the directory .git/objects/e6.

The file name 9de29bb2d1d6434b8b29ae775ad8c2e48c5391 is 38 characters long. On appending it to the folder name (e6), it becomes a 40 character string e69de29bb2d1d6434b8b29ae775ad8c2e48c5391. This is a SHA-1 hash. Git hashes the content of the file (and some more data) using the SHA-1 algorithm to produce a 40 character hexadecimal string. Every stage, [commit] and [tag] produces its own unique SHA-1 hash(es). (Being a 40 character string, hash collisions are VERY rare.) The first two characters of the hash are used for bucketing the hashes into folders, to decrease access time. To make things easy, Git sometimes uses just 4 to 8 characters of an object’s hash to refer to it.

As mentioned in the previous paragraph, Git hashes the contents of the file and other details to create a 40 character SHA-1 hash. To verify that, some content needs to be added to the file. The file will then have to be added again. (This will produce another hash.)

From the last command in the image above, it can be inferred that a new hash 1a3851c172420a2198cf8ca6f2b776589d955cc5 was generated. Check its contents using the cat command.

The output is gibberish because Git compresses file contents (and some additional data) with the zlib library and then stores it in the file. So to make sense of the gibberish, the content of the file needs to be de-compressed.

blob 16\0Git is amazing!\n is the content of the hashed file. (\0 and \n are not seen. Explained in the points below.)

Breaking it down

blob is the object type of the file. It is an abbreviation for 'Binary Large OBject'. These objects (files) store the content of the files.
16 is the file size (length). Git is amazing! consists of 15 characters, but the echo command adds a new line (line feed) character (\n) at the end of the text, making the length 16.
Just like the \n character which cannot be seen in the output, there is a NULL character (\0) between the length and file content.
Git is amazing!\n is the file content. (The \n is not visible.)

If blob 16\0Git is amazing!\n is hashed using SHA-1, the same hash (1a3851c172420a2198cf8ca6f2b776589d955cc5) will be generated!

So, Git generates the hash of the file using the string <object_type> <content_length>\0<file_content> and stores that string in the file after compressing it. (The name of the file is the last 38 characters of the 40 character hash that was generated. The first two characters are used for bucketing.)

Blob Objects do not store the diff/delta of files. They store the entire contents of files.

The process of finding the contents of the file using cat is pretty cumbersome. It is a better idea to use the git cat-file [plumbing command] provided by Git.

Variations of the git cat-file command that will be used

git cat-file -p <hash> (-p = pretty print) to display file data.
git cat-file -t <hash> (-t = type) to display file type (blob, commit, tree or tag).
git cat-file -s <hash> (-s = size) to display the file size (length).

The `commit` Object

A commit object links tree objects together into a history. It contains the name of a tree object (of the top-level source directory), a timestamp, a log message, and the names of zero or more parent commit objects.

Commit master_file_1.txt and then run du -c again.

From the above image it can be noticed that two new directories .git/objects/1b and .git/objects/d5 were created. Also, after committing the file, Git printed the first seven characters of the SHA-1 hash for that commit in the output.

Using the seven characters of the commit hash in the output, check the file type using the git cat-file -t command.

So the file type is commit, inferring that it is a file generated through a commit.

Print the contents of the commit object (file) using the git cat-file -p command.

Commit object content

tree 1b2190cdc2801ec3df6505dc351dee878ac7f2fc is the other SHA-1 hash that was generated (remember that two directories were generated in .git/objects on committing the file), of the type tree. The tree is the [snapshot] of the current state of the repository.
Parent commit’s SHA-1 hash (Not present here. Explained below.)
The next line has the details of the author (the one who wrote the code):
- Name
- e-mail ID
- Timestamp
The next line has the details of the committer (the one who committed the code):
- Name
- e-mail ID
- Timestamp
Commit message
Commit description (If provided. Not present here.)

The `tree` Object

A tree object is the equivalent of a (sub)directory: it contains a list of filenames, each with some type bits and the name of a blob or tree object that is that file, symbolic link, or directory’s contents. This object describes a snapshot of the source tree.

Check the contents of the tree file listed in the commit object (file).

The tree file has entries of the files & directories in the snapshot (current state) of the local repository. The format of each line is the same.

Tree object content format

100644 is the mode of the file. It is an octal number.
```
Octal: 100644
Binary: 001000 000 110100100
```
- The first six binary bits indicate the object type.
  - 001000 indicates a regular file. (As seen in this case.)
  - 001010 indicates a symlink (symbolic link).
  - 001110 indicates a gitlink.
- The next three binary bits (000) are unused.
- The last nine binary bits (110100100) indicate Unix file permissions.
  - 644 and 755 are valid for regular files.
  - Symlinks and gitlinks have the value 0 in this field.
blob is the object type. (It can be a tree object as well. Explained below.)

1a3851c172420a2198cf8ca6f2b776589d955cc5 is the SHA-1 hash of the file.
Name of the file.

So, each commit object points to a tree object and each tree object points to a set of blobs and/or trees, which correspond respectively to files and subdirectories.

The connections between the commit, tree and blob files till now. (HEAD is just a pointer to the latest commit.)

The blob e69de has been modified to blob 1a385 and so is not connected to the tree 1b219. Only the latest blob of every added file is connected to the new tree object when a commit is made.
This graph can be created using Git Graph.

Parent Commits

Create another file (master_file_2.txt), add it and commit it.

Check the contents of the commit file (using part of the hash 8282663 as seen in the above image).

A new line parent d5b8f77ce1dc1a37b29885026055c8656c3e0b65 is seen. Remember, this is the hash of the previous commit. Git is thus creating a graph. A Directed Acyclic Graph to be precise. (Check image below.)

Also, the HEAD will now automatically point to this (latest - 82826) commit rather than the parent (previous - d5b8f) commit as it was doing before. To verify, check where the HEAD is pointing.

It is pointing to the latest commit (82826).

Now check the contents of the tree object of the latest commit.

From the commit object, tree object and HEAD position, the connection graph looks as follows

This graph can be created using Git Graph.

Creating Directories

Create a new file (master_dir_1_file_3.txt) inside a directory (dir_1), add it, commit it and look at the contents of the commit file.

The commit file has the same format as before.

Check the contents of the tree file (with the hash f6a65 as seen in the above image).

It is surprising to note that the tree f6a65 points to another tree abecf! The name of the new tree is dir_1.

Check the contents of the dir_1 tree.

So it points to the file (master_dir_1_file_3.txt) inside the directory dir_1.

Have a look at how the tree f6a65 connected itself to the tree and blobs.

The graph of the repository as it stands now

This graph can be created using Git Graph.

Renaming Files

Rename master_file_1.txt to the_master_file.txt to see how Git handles it internally.

When the file is committed, Git is smart enough to recognize that a file was renamed and is not a new file, as can be seen in the last line of the above image. It can recognize this because the SHA-1 hash of the file has not changed (as the content of the file has not changed).

Check the contents of the commit and tree files.

From the last line, the hash 1a385 is same as the hash of the original file name (master_file_1.txt) and just the name of the file has been changed in the tree object instead of creating a new blob file. This is efficient space management by Git!

The structure of the repo.

This graph can be created using Git Graph.

Modifying Large Files

Add and commit a image to Git. The size of the image is 1.374 Mb (or 1374 kb), so it is a relatively huge file as compared to the other files (~ 1 kb/file).

Make a small change to the image file contents and then add and commit it again.

The SHA-1 hashes of master_image_1.png in the latest (6d2d2) and previous (27666) tree are different, so Git has created two different blobs (ca893 and 1f7af) for the same file, even when they only have a very small difference.

Run du -c now.

From the image above, there are two directories (.git/objects/1f and .git/objects/ca) with the same size (1376 kb).

The directory content size (1376 kb) is greater than the image size (1374 kb) as Git adds the file type and size (length) to the blob file and then hashes it.

So is Git inefficient at handling huge files? No. The content of the file has changed and this produces a different SHA-1 hash (1f7af) than the original SHA-1 hash (ca893), so Git is not able to handle the change like it did when a file was simply renamed. Having multiple copies of such a huge file is not a problem in the local repository, but it will take up a lot of bandwidth while pushing and pulling from a platform like GitHub. To avoid this, Git uses Delta Compression. It stores the difference (diff) of the older file from the new one and indicates the new one as the parent. This is looked into in the sub-section below.

The `pack` Directory

.git
├───...
└───objects
    ├───...
    └───pack
        ├───multi-pack-index
        ├───<*.idx>
        └───<*.pack>

Delta compression is carried out every time a clone, push or pull is executed, or if Garbage Collection (git gc) is run.

Delta compression creates two types of files in .git/objects/pack

Pack (.pack) file(s)
Index (.idx) file(s)

Multiple Packfiles can exist in one repo.
There is one Index file per Packfile.
The multi-pack-index (MIDX) file might also be created in extreme cases, but is not considered here.

The current state of the repo

Note that the size of .git/objects/pack in the above image is 0 kb.

Garbage Collection (git gc) will be used to carry out Delta Compression and then du -c will be used to view the changes.

Notice in the above image that the size of .git/objects/pack is now 1380 kb (from 0 kb) and a lot of the files in .git/objects have disappeared, except for .git/objects/e6.

The total size of the .git directory went down from 4220 kb (as seen in the first du -c image in this sub-section) to 2838 kb (as seen in the above image). This is a 32.75% reduction in the size of the local repository!

The contents of .git/objects/pack

As mentioned above, two types of files (a pack .pack file and an index .idx file) are created in .git/objects/pack.

Check the contents of the Packfile using the plumbing command git verify-pack -v path/to/pack/file/<file_name>.pack (-v = verbose).

From the above image, it can be understood that the Packfile contains all the Git objects. The Pack file is a file that contains all the Git Objects (along with their content) stored in it in a particular format. All the objects stored in the Packfile are removed from the .git/objects directory.

From the above image, it can also be understood that the size of the newly modified image (hash 1f7af) is very large in comparison to the original image (hash ca893). The blob of the original image (hash ca893) also has the hash of the modified image (1f7af) mentioned after it, indicating that its parent is the newly modified image file (hash 1f7af). Thus Git stored the entire new file and only a diff/delta for the older file with a pointer to the newer file, rather than storing the entire file again, making it space efficient.

The newer file (hash 1f7af) will usually be accessed more than the older one (hash ca893), so storing the entirety of the newer file and a delta/diff for the older one makes more sense than storing the entirety of the old file and a delta/diff for the new one. As the newer file will usually be accessed more, it would be inefficient to apply the delta/diff of the newer file to the entirety of the older file to generate the newer file every time. It is cheaper to apply the delta/diff of the older file to the entirety of the newer file, as the older file won’t be accessed as frequently.

The pack file has a graph in it just like the Directed Acyclic Graph that the Commit, Tree and Blob objects form.
The Index (.idx) file contains offsets into its corresponding Pack (.pack) file so that a specific object can be found quickly.
The Index (.idx) file has its own file structure.
Do not confuse these .idx Index files with the staging area index file. They are not related.

On running aggressive Garbage Collection (git gc --aggressive), Git got rid of all the files in .git/objects that were referenced in a tree and added them to the Pack file. The .git/objects/e6 directory did not get removed as it was not referenced (listed) in any Tree Object.

As mentioned at the start of this sub-section, these Packfiles and Index files are created every time a clone, push or pull is executed, or if Garbage Collection (git gc) is run. Why is this so? Network bandwidth and clone/push/pull command execution time are the main reasons. Applying Delta compression and putting in all objects into one file makes it simpler and faster to transfer data over the Network and also saves storage space (~32% space was saved through packing in this case).

Take a look at the log of the repository.

Further reading on Packfiles can be found in the Resources section.

Empty Commits

The --allow-empty option in git commit allows creating commits without changes in any files.

As empty commits have no changes to any files, they always point to the latest Tree Object in the branch.

To illustrate this, set up a repo using the following commands

$ git init
$ touch file_1.txt
$ git add .
$ git commit -m "Add file_1.txt"
$ git commit --allow-empty -m "Empty commit #1"
$ git commit --allow-empty -m "Empty commit #2"

# Now run
$ git log --oneline --graph
* 208cead (HEAD -> main) Empty commit #2
* 64cf914 Empty commit #1
* be0c1ec Add file_1.txt

Use the git cat-file -p <hash> command as done in previous sub-sections to create the graph.

The graph of the above repository

If the empty commit is the first commit in the repository (initial commit), then it will have its own empty Tree Object associated with it. In all other cases an empty commit will point to the latest Tree Object in the branch.
This graph can be created using Git Graph.

Resources

Talks

Tools

Git Graph: Visualize Commit, Tree and Blob objects (by Harsh Kapadia)

Git Internals

About

Prerequisites

The `.git` Directory

Introduction

The `.git` Directory Contents

The `index` File

The `HEAD` File

The `refs` Directory

The `packed-refs` File

The `logs` Directory

The `FETCH_HEAD` file

The `COMMIT_EDITMSG` File

The `objects` Directory

The `info` Directory

The `config` File

The `addp-hunk-edit.diff` File

The `ORIG_HEAD` File

The `description` File

Git Objects

Introduction

The `blob` Object

The `commit` Object

The `tree` Object

Parent Commits

Creating Directories

Renaming Files

Modifying Large Files

The `pack` Directory

Empty Commits

Resources

Talks

Tools

Articles

Misc

The Staging `index` File

Packfiles

Git Internals

About

Prerequisites

The .git Directory

Introduction

The .git Directory Contents

The index File

The HEAD File

The refs Directory

The packed-refs File

The logs Directory

The FETCH_HEAD file

The COMMIT_EDITMSG File

The objects Directory

The info Directory

The config File

The addp-hunk-edit.diff File

The ORIG_HEAD File

The description File

Git Objects

Introduction

The blob Object

The commit Object

The tree Object

Parent Commits

Creating Directories

Renaming Files

Modifying Large Files

The pack Directory

Empty Commits

Resources

Talks

Tools

Articles

Misc

The Staging index File

Packfiles

The `.git` Directory

The `.git` Directory Contents

The `index` File

The `HEAD` File

The `refs` Directory

The `packed-refs` File

The `logs` Directory

The `FETCH_HEAD` file

The `COMMIT_EDITMSG` File

The `objects` Directory

The `info` Directory

The `config` File

The `addp-hunk-edit.diff` File

The `ORIG_HEAD` File

The `description` File

The `blob` Object

The `commit` Object

The `tree` Object

The `pack` Directory

The Staging `index` File