6  Version Control with Git for Reproducible Research

6.1 What is Git and Why Bother?

Git is a version control system. Think of it as a “save” button for your entire project folder, not just a single file. It records every change you commit to your code, so you can return to any earlier version at any time. For research, this is invaluable.

Why use Git?

  • Reproducibility: You have a complete history of your analysis. A referee asks what you did six months ago? Git knows.
git log
commit 5f2652de4b9fb9f72e41996e78016bed6438a04e (HEAD -> master)
Author: Eric Roca <eric.roca@gmail.com>
Date:   Fri Aug 29 19:36:30 2025 +0200

    Commented

commit beccfd7db887e3ec894ae85088c5abcb6037e10d
Author: Eric Roca <eric.roca@gmail.com>
Date:   Fri Aug 19 9:19:44 2025 +0200

    Initial analysis
git diff beccfd7 5f2652d analysis.do
diff --git a/analysis.do b/analysis.do
index e2d2607..93f26b3 100644
--- a/analysis.do
+++ b/analysis.do
@@ -1,5 +1,12 @@
+/*
+    Eric Roca Fernandez
+*/
+
+// Open the database
 sysuse auto

+// Summarize the turn data
 sum turn

+// Initial regression
 reg weight mpg
  • Collaboration: It makes working with co-authors seamless. No more emailing analysis_v3_final_Johns_edits.do.
  • Experimentation: You can safely test new ideas (e.g., a different model specification) without breaking your main analysis.
  • Backup: When used with a remote hosting service, it’s a backup of your entire codebase.
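Going back to an earlier version can be sketched as follows. This is a sandbox demo run in a throwaway directory: the file contents, commit messages, and the name and e-mail address are placeholders invented for illustration.

```shell
# Sandbox demo: run it as-is, it does not touch your real project.
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.org"   # placeholder identity

# First version of a hypothetical do-file
printf 'sysuse auto\nreg weight mpg\n' > analysis.do
git add analysis.do
git commit -q -m "Initial analysis"
first=$(git rev-parse --short HEAD)          # remember the hash of this commit

# Second version changes the regression
printf 'sysuse auto\nreg weight mpg turn\n' > analysis.do
git commit -q -am "Add turn as a control"

# Print the file exactly as it was in the first commit (nothing is modified)
git show "$first:analysis.do"

# Bring the old version back into the working folder
git checkout "$first" -- analysis.do
```

The last command only replaces analysis.do in your folder (and stages it); the history itself is untouched, so the newer version remains available too.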

6.1.1 Git vs. GitHub (and others)

This is a common point of confusion.

  • Git is the software on your computer that tracks changes.
  • GitHub, GitLab, and Codeberg are websites that host your Git projects (Forgejo is the software that runs Codeberg, and you can also host it yourself). They are the “cloud” for your code, allowing you to store it remotely and share it with others.

You use Git locally on your machine, and then push your changes to a service like GitHub to collaborate or for backup.

6.2 The Basic Workflow

6.2.1 1. Initializing a Repository

In your main project folder, run this command once:

git init

This creates a hidden .git folder where Git will store the entire history of your project.
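A minimal sketch of this first step, using a temporary folder as a stand-in for your project. It also sets a name and e-mail address, because Git usually wants to know who you are before it accepts a first commit; the values shown are placeholders.

```shell
set -e
workdir=$(mktemp -d)    # stand-in for your main project folder
cd "$workdir"

git init -q

# The history lives in the hidden .git folder
ls -a

# Tell Git who you are, for this repository only (placeholder values)
git config user.name "Example Author"
git config user.email "author@example.org"

# Ask Git what it sees: no commits yet, nothing tracked
git status
```

Adding --global to the two git config commands would store the identity once for all repositories on your machine.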

6.2.2 2. The .gitignore File: Tell Git What to Ignore

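This is the most important step for a research project. You do not want to track data files, temporary files, or large output files (like PDFs or LaTeX tables) with Git. Git is built for plain-text files such as code: large or binary files bloat the repository and cannot be compared line by line, and output can always be regenerated from the code.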

Create a file named .gitignore in your project’s root directory and add the names of files and folders to ignore.

# .gitignore

# Data - NEVER track data with Git
# (data/ alone already covers every subfolder; the explicit paths are listed for clarity)
data/
/data/original/
/data/temporary/
/data/final/

# Output files
paper/
tables/
figures/
output/

# Stata specific
*.dta.bak
*.gph

# Python specific
__pycache__/
*.pyc
.ipynb_checkpoints

# Quarto / R specific
_freeze/
.quarto/
*.html
*.pdf

# OS-specific files
.DS_Store
Thumbs.db
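To check that the rules do what you expect, git check-ignore reports which rule matches a given path. The sketch below uses a shortened version of the rules above and a few hypothetical file names in a throwaway folder.

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q

# A shortened version of the ignore rules
printf 'data/\noutput/\n*.pdf\n' > .gitignore

# Hypothetical project files
mkdir -p data/original code output
touch data/original/survey.dta code/analysis.do output/table1.tex paper.pdf

# For each ignored path, show the file, line number, and pattern responsible
git check-ignore -v data/original/survey.dta output/table1.tex paper.pdf

# Untracked files Git would still pick up: only .gitignore and code/
git status --short
```

If a file you expected to be ignored still shows up in git status, running git check-ignore -v on it is a quick way to see whether any pattern matches.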

6.2.3 3. Saving Your Work: Add and Commit

Saving changes is a two-step process. Do this whenever you have completed a meaningful change to your code, not after every minor edit or every time you save a file. Think of Git as a history book: you would not record every single detail, just the important events.

  1. Stage changes with git add: Once you are ready to track changes, tell Git which files you want to include in your next snapshot.

    # Stage a specific file
    git add code/analysis.do
    
    # Stage all changes in the current directory (respects .gitignore)
    git add .
  2. Commit changes with git commit: Save the staged files as a snapshot in the project’s history. You must include a message describing the change.

    git commit -m "Add initial regression for model 1"

    Tip: Write clear, concise commit messages. They are your research logbook. “Fix bug” is bad. “Fix bug where sample was not correctly filtered for women” is good.
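Before committing, it helps to look at what is actually staged. The following sandbox sketch (hypothetical files, placeholder identity) stages one file, inspects the pending snapshot, and commits it.

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.org"   # placeholder identity

mkdir code
printf 'sysuse auto\nreg weight mpg\n' > code/analysis.do
printf 'scratch\n' > notes.txt

git add code/analysis.do     # stage only the do-file
git status --short           # "A" marks a staged new file, "??" an untracked one
git diff --staged            # exactly what the next commit will contain

git commit -q -m "Add initial regression for model 1"
git log --oneline            # the new snapshot appears in the history
```

Because notes.txt was never staged, it stays out of the commit and is still listed as untracked afterwards.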

6.3 Collaboration: Push, Pull, and Remotes

To collaborate, you first need to host your repository on a service like GitHub. After creating a repository on the website, it will give you a URL. You link your local repository to it like this:

# Add a remote named "origin" (a standard convention)
git remote add origin <URL_from_GitHub>
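  • git push: Upload your committed changes to the remote repository. If your default branch is called master rather than main (check with git branch), use that name instead.

    git push origin main
  • git pull: Download changes from the remote repository. Always do this before you start working.

    git pull origin main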
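You can try the remote setup without an account by letting a local bare repository play the role of the GitHub URL. This is only a sketch: the paths, file, and identity are placeholders, and git branch -M main is used so the branch is called main whatever your Git default is.

```shell
set -e
sandbox=$(mktemp -d)
cd "$sandbox"

# A bare repository stands in for the hosted repository
git init -q --bare remote.git

git init -q project
cd project
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.org"   # placeholder identity
printf 'sysuse auto\n' > analysis.do
git add analysis.do
git commit -q -m "Initial analysis"
git branch -M main                           # name the branch "main"

git remote add origin "$sandbox/remote.git"
git remote -v                                # confirm where "origin" points

# -u records origin/main as the upstream of main, so that a plain
# "git push" or "git pull" is enough from now on
git push -q -u origin main
git status -sb                               # first line: ## main...origin/main
```

With a real hosting service, the only difference is that the URL after git remote add origin is the one shown on the website.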

6.3.1 A Typical Co-author Workflow

  1. You (start of day): git pull to get your co-author’s latest changes.
  2. You: Work on your code, cleaning data or running a new analysis.
  3. You: Save your work: git add . then git commit -m "Add robustness check with alternative fixed effects".
  4. You (end of day): git push to upload your changes for your co-author to see.
  5. Co-author (next day): Repeats the cycle, starting with git pull.
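The cycle above can be rehearsed on a single machine with two clones of the same repository, one per author. Everything here is a sandbox assumption: the folder names you and coauthor, the identities, and the local bare repository that replaces the hosting service.

```shell
set -e
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q --bare shared.git                           # stand-in for the hosted repository
git -C shared.git symbolic-ref HEAD refs/heads/main     # make "main" its default branch

# You: create the project and publish it
git clone -q "$sandbox/shared.git" you
cd you
git config user.name "Example Author"
git config user.email "author@example.org"
printf 'sysuse auto\n' > analysis.do
git add analysis.do
git commit -q -m "Initial analysis"
git branch -M main
git push -q -u origin main

# Co-author: get the project, change it, and push
cd "$sandbox"
git clone -q "$sandbox/shared.git" coauthor
cd coauthor
git config user.name "Example Coauthor"
git config user.email "coauthor@example.org"
printf 'sum turn\n' >> analysis.do
git commit -q -am "Summarize the turn variable"
git push -q

# You (next morning): pull before starting to work
cd "$sandbox/you"
git pull -q
git log --oneline                                       # both commits are now in your copy
```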

6.3.2 Merge Conflicts: When You Edit the Same Thing

A merge conflict happens when you and a co-author change the same lines in the same file. Git doesn’t know which version to keep, so it asks you to decide.

Imagine you both edit line 5 of analysis.do. When you git pull, Git will stop and mark the file with a conflict:

// ... some stata code ...

<<<<<<< HEAD
reg gdp growth controls, robust
=======
reg gdp growth controls, cluster(region)
>>>>>>> fa345b2... Add clustered standard errors

// ... more stata code ...
  • <<<<<<< HEAD: Start of your version, which runs down to the ======= line.
  • =======: Separates the two versions.
  • >>>>>>> fa345b2...: End of the incoming version from your co-author, which starts right after the ======= line.

To resolve it:

  1. Open the file in your editor.

  2. Keep the lines you want (one version, or a combination of both) and delete the rest.

  3. Delete the Git markers (<<<<<<<, =======, >>>>>>>).

  4. The final code should be exactly what you want it to be. For example:

    // ... some stata code ...
    
    reg gdp growth controls, cluster(region)
    
    // ... more stata code ...
  5. Save the file, then git add and git commit to finalize the merge.
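To practise resolving a conflict safely, you can provoke one in a throwaway repository. In this sketch the co-author is simulated by a local branch named coauthor (an assumption for the demo; with a real co-author the conflict appears on git pull), and the resolution is written with printf instead of an editor.

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.org"   # placeholder identity
printf 'reg gdp growth controls\n' > analysis.do
git add analysis.do
git commit -q -m "Baseline regression"
git branch -M main

# The co-author's change, simulated on a branch
git checkout -q -b coauthor
printf 'reg gdp growth controls, cluster(region)\n' > analysis.do
git commit -q -am "Add clustered standard errors"

# Your change to the same line on main
git checkout -q main
printf 'reg gdp growth controls, robust\n' > analysis.do
git commit -q -am "Use robust standard errors"

# The merge stops with a conflict, so a failing exit status is expected here
git merge coauthor || echo "Merge stopped: conflict to resolve"

# List the files that still contain conflict markers
git diff --name-only --diff-filter=U

# Resolve: write the version you want to keep (normally done in your editor)
printf 'reg gdp growth controls, cluster(region)\n' > analysis.do
git add analysis.do
git commit -q -m "Merge co-author changes, keep clustered standard errors"
```

If you get lost halfway through a real conflict, git merge --abort returns the folder to its state before the merge started.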

6.4 Branches: A Safe Place to Experiment

By default, you work on a branch called main (or master, depending on your Git version and settings). A branch is like a parallel timeline for your project, and branches are incredibly useful for research.

gitGraph:
    commit id: "Initial version"
    branch revision-1
    checkout revision-1
    commit id: "Add new database"
    commit id: "Regressions with new data"
    branch revision-1-use_probit
    checkout revision-1-use_probit
    commit id: "Use probit model"
    checkout main
    merge revision-1

Why use branches?

  • Experiment safely: Want to try a completely new statistical method? Create a branch. If it doesn’t work out, you can just delete the branch, and your main code is untouched.

  • Respond to referee reports: This is a killer feature. When you get a “revise and resubmit,” create a branch for it.

    # Create a new branch for the first revision and switch to it
    git checkout -b revision-1

    Now you can make all the changes the referees requested. Your main branch still represents the original submitted version. When you’re done, you can merge the changes from revision-1 back into main.
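While the revision is in progress, you can compare it with the submitted version at any time. A sandbox sketch, with invented file contents and commit messages:

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.org"   # placeholder identity
printf 'reg y x\n' > analysis.do
git add analysis.do
git commit -q -m "Submitted version"
git branch -M main

# Start the revision and make a requested change
git checkout -q -b revision-1
printf 'reg y x z\n' > analysis.do
git commit -q -am "Add control requested by referee 2"

# Commits that exist on revision-1 but not on main
git log --oneline main..revision-1

# Line-by-line differences between the submitted and the revised code
git diff main revision-1 -- analysis.do

# main still holds the code exactly as submitted
git show main:analysis.do
```

The output of git log main..revision-1 is a convenient starting point for the response letter, since it lists every change made for the referees.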

6.4.1 Branching Workflow

  1. Create a new branch:

    git branch try-new-instrument

  2. Switch to it:

    git checkout try-new-instrument

  3. Work as usual: add, commit, push. Your changes are isolated on this branch. The first push of a new branch usually has to name it: git push -u origin try-new-instrument.

  4. Merge it back (if successful): When you’re happy with the experiment, switch back to main and merge the changes.

    git checkout main
    git merge try-new-instrument
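Both outcomes of an experiment, keeping it or throwing it away, are sketched below in a throwaway repository. The branch try-probit and all file contents are made up for the demo.

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q
git config user.name "Example Author"        # placeholder identity
git config user.email "author@example.org"   # placeholder identity
printf 'ivregress 2sls y (x = z1)\n' > analysis.do
git add analysis.do
git commit -q -m "Baseline IV regression"
git branch -M main

# A successful experiment: merge it, then remove the finished branch
git checkout -q -b try-new-instrument
printf 'ivregress 2sls y (x = z2)\n' > analysis.do
git commit -q -am "Use alternative instrument"
git checkout -q main
git merge -q try-new-instrument
git branch -d try-new-instrument      # -d only deletes branches that are already merged

# A failed experiment: discard it without touching main
git checkout -q -b try-probit
printf 'probit y x\n' > analysis.do
git commit -q -am "Try probit"
git checkout -q main
git branch -D try-probit              # -D deletes the branch although it was never merged

git branch                            # only main is left
```

After these steps, main contains the alternative instrument but no trace of the probit attempt.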