5 Replication with Stata: Best Practices

5.1 Stata: Setting Up for Reproducibility

Stata is a powerful tool for empirical research, but reproducibility requires discipline and structure. Here’s how to set up your project for robust, transparent replication.

5.1.1 1. Use Global Variables for Paths

Define all key directories at the top of your main do-file. This makes your code portable and easy to maintain.

global project_path "/path/to/your/project"
global data_original "$project_path/data/original"
global data_temp     "$project_path/data/temporary"
global code          "$project_path/code"
global output        "$project_path/paper"
global ado           "$project_path/ado"

Tip

Never hardcode paths in your scripts. Use these globals everywhere. Indicate in the README where to set these paths and what to adapt.
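When several coauthors or machines run the same code, one possible way to keep the path in a single place is to branch on the current user name. The sketch below is only an illustration: the user names and directories are hypothetical, and the fallback to the working directory is an assumption rather than part of the original setup. `c(username)` and `c(pwd)` are built-in Stata system values.

```stata
* Sketch: set project_path once per user, then derive everything else.
* The user names and directories below are hypothetical placeholders.
if "`c(username)'" == "alice" {
    global project_path "/Users/alice/research/project"
}
else if "`c(username)'" == "bob" {
    global project_path "C:/Users/bob/Documents/project"
}
else {
    * Assumption: an unknown user starts Stata from the project root
    global project_path "`c(pwd)'"
}

* All other paths are built from project_path, so nothing else changes
global data_original "$project_path/data/original"
global data_temp     "$project_path/data/temporary"

display "Project root: $project_path"
```

A replicator then only needs to add one branch (or start Stata in the project folder), which is the kind of instruction worth stating in the README.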

5.1.2 2. Managing Ado Packages for Replication

It is common to extend Stata with user-written packages (ado files) such as estout, ivreg2, or reghdfe. These packages are updated over time, and an update can change your results. Ideally, we would record which version of each package was used, but an ado package that does this is only slowly being developed. We can sidestep the issue by storing the ado files in a project-specific folder: we update them only when we choose to, and we bundle the folder with the replication package.¹

Create a project-specific ado folder (e.g., ado/, see above).
Install all required packages there from your master file.
- Once they are installed, comment out the ssc install lines so that packages are not checked (or updated) on every run.
Bundle this ado folder with your replication package so users get the exact versions you used.

// Point Stata's PLUS directory at the project-specific ado folder
sysdir set PLUS "$ado"

// Install packages into this folder
ssc install estout, replace

Important

Make sure to change the sysdir settings in your scripts to point to the correct ado folder. For instance, in your master file, add

sysdir set PLUS "$ado"

Tip

Do not include the ssc install lines in the replication package you share, so that users do not accidentally update packages and break reproducibility. Instead, include the ado folder itself.
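Until version tracking is automated, a simple log can at least document what the bundled folder contains. The following is a minimal sketch, assuming `$ado` is defined as above and estout is already installed there; the log file name is a hypothetical choice.

```stata
* Sketch: record which packages live in the project-specific ado folder.
* Assumes $ado is defined and estout has been installed into it.
sysdir set PLUS "$ado"

log using "$ado/package_versions.log", text replace name(pkgs)
ado dir          // lists the packages installed in PLUS
which estout     // prints where estout is found and its *! version line(s)
log close pkgs
```

Shipping this log next to the ado folder gives readers a human-readable record of the versions you ran.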

5.1.3 3. Comment Generously

Explain your logic, not just the syntax.
Comments help your future self and others understand why you did something.
Document assumptions, data quirks, and any manual steps.

/*
    Main regressions on population density and settlement patterns
    Add covariates little by little
*/
foreach v of global outcomes_ea {
    reg `v' $EA1
    eststo model_`v'_ea1
    reg `v' $EA2
    eststo model_`v'_ea2
}

5.1.4 4. Write Reusable Code: Functions and Loops

Encapsulate repetitive tasks in programs (functions).
Use loops to iterate over variables, controls, or specifications.
This makes it easy to add/remove variables or change specifications later.
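Before the full example from a real project, here is a much smaller sketch of the same idea. The program name `run_spec`, its option, and the stored-estimate names are hypothetical; the auto dataset ships with Stata and stands in for your data.

```stata
* Sketch: a small program that runs one specification and stores it.
sysuse auto, clear

capture program drop run_spec
program define run_spec
    // usage: run_spec outcome, controls(varlist)
    syntax varname, controls(varlist)
    regress `varlist' `controls'
    estimates store m_`varlist'
end

* Loop over outcomes; changing the controls means editing one line
local controls "weight length"
foreach y in price mpg {
    run_spec `y', controls(`controls')
}

* List what was stored
estimates dir
```

Because the specification lives in one place, adding an outcome or a control does not require touching the regression command itself.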

In your main do-file, indicate groups of variables that you reuse together.

/******************************************************
* Sets of controls
******************************************************/

#delimit ;
global myreg            "eclipses_log";
global distances        "distance_coast_km_log distance_addis_km_log distance_river_km_log 
                         distance_volcano_km_log distance_tectonic_km_log";
global geography        "rugavg elevavg malaravg calavg abs_lat south i.mht_enc";
global area             "area_km";
global motifs           "eclipse_related calendar_related thinking_related cloud_related 
                         lightning_related rock_related sand_related white_related purple_related 
                         curious_related";
global motifs_religion  "religion_related pray_related religious_related";

/******************************************************
* Sets of dependent variables
******************************************************/

global outcomes_ea_3            "v31 v30";
global outcomes_ea_1            "v33 v66 v90 tasks technology strategy writing explanation";
#delimit cr

The following is a relatively complex Stata program that chooses the appropriate regression model, sample restriction, and additional controls based on the outcome variable. It helped me avoid forgetting control variables and ensured that each outcome was estimated with the correct specification. For example, if the regressions with v90 had to be run under OLS, it would be enough to edit the global macros ols_regs and probit_regs, leaving the program itself untouched.

/******************************************************
* Programs
******************************************************/

/* Determine regression model and adjust it */
#delimit ;
capture program drop reg_model;
program reg_model, rclass;
    noisily di "`1'";
   /* Pass the outcome variable as argument */

   /* The sample should be limited when using V90, values 0 and 8 are for missing */
    if "`1'" == "v90" {;
        return local extras "if v90>0&v90<8";
    };
    else if (strpos("$outcomes_seshat_1", "`1'") > 0 | strpos("$outcomes_seshat_3", "`1'") > 0) & "`1'" != "writing" {;
            return local extras "if sd > 0";
    };
    else {;
        return local extras " ";
    };

    /* When the outcome is the number of tasks, control for the number of surveyed tasks */
    if "`1'" == "tasks" {;
        return local add_regressor "tasks_with_info";
    };

    /* When outcomes are from the folklore database we add as controls
         the number of published books
         the year of first publication
         the number of motifs
       Additionally, we clean the variables from terms that appear as related
       in ConceptNet
    */
    else if strpos("$outcomes_folklore", "`1'") {;
        local add_regressor "lnnmbr_title lnyear_firstpub lnmotifs_total";
        /* Use the global folklore_controls to determine the control we should add
           It is organized as a dictionary.
           First, we retrieve the position of the look-up word
           The next entry is the value associated to the searched key, so we add 1 to the previous index
       */
       if strpos("$folklore_controls", "`1'") {;
            local motif_position : list posof "`1'" in global(folklore_controls);
            local motif_position = `motif_position' + 1;
            local control : word `motif_position' of $folklore_controls;
            local add_regressor "`add_regressor' `control'";
        };
        return local add_regressor `add_regressor';
    };
    /* In the regressions using the Seshat, we control for total population throughout
       except (obviously) when the outcome variable is population density
   */
   else if strpos("$outcomes_seshat_1", "`1'") > 0 | strpos("$outcomes_seshat_3", "`1'") > 0 {;
        /* The previous expression also matches the Ethnographic Atlas variable `writing`.
           We create an ad-hoc rule to avoid adding `p_polity_population` to its regression.
        */
        if "`1'" == "writing" {;
            return local add_regressor " ";
        };
        else if "`1'" == "density" {;
            return local add_regressor " ";
        };
        else {;
            return local add_regressor "polity_population";
        };
    };
    else {;
        return local add_regressor " ";
    };

    /* Select the appropriate model */
    if strpos("$oprobit_regs", "`1'") {;
        return local method "oprobit";
        return local limit "iter(100)";
    };
    if strpos("$ols_regs", "`1'") {;
        return local method "reg";
        return local limit " ";
    };
    if strpos("$nbreg_regs", "`1'") {;
        return local method "nbreg";
        return local limit "iter(100)";
    };
    if strpos("$poisson_regs", "`1'") {;
        return local method "poisson";
        return local limit "iter(100)";
    };
    if strpos("$probit_regs", "`1'") {;
        return local method "probit";
        return local limit "iter(100)";
    };
end;

/* Add additional information to the regressions */
#delimit ;
capture program drop add_information;
program add_information;
    /* Pass the name of the locals and their values as arguments 
       Ex: add_information local_1 value_1 local_2 value_2 
    */
    tokenize `0';
    local n : word count `0';
    if mod(`n', 2) != 0 {;
        di as error "You should enter the information as local_1 value_1 local_2 value_2";
        exit 198;
    };
    /* Step through the arguments two at a time: name first, then value */
    forval i = 1(2)`n' {;
        local next = `i' + 1;
        qui estadd local ``i'' "``next''";
    };
end;
#delimit cr

Now you can loop over outcomes and always use the right model and controls:

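foreach v of global outcomes {
    reg_model "`v'"
    // reg_model returns r(method), r(add_regressor), r(extras) and r(limit)
    local method   "`r(method)'"
    local controls "`r(add_regressor)'"
    local extras   "`r(extras)'"
    `method' `v' `controls' `extras'
}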
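One caveat about the strpos() tests used in reg_model: strpos() matches substrings, so a short variable name may be found inside a longer one. The sketch below contrasts it with the whole-word lookup `list posof`, which reg_model already uses for folklore_controls. The global `demo_regs` and the candidate name are hypothetical and exist only for this illustration.

```stata
* Sketch: substring match versus whole-word match on a list of outcomes.
* demo_regs is a hypothetical global used only for this illustration.
global demo_regs "v30 v31 tasks"
local candidate "v3"

* strpos() looks for a substring, so "v3" is found inside "v30"
display strpos("$demo_regs", "`candidate'")

* list posof compares whole words and returns 0 when none matches
local pos : list posof "`candidate'" in global(demo_regs)
display `pos'
```

If your outcome names can be prefixes of one another, the whole-word test is probably the safer choice.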

5.1.5 5. Separate Analysis and Table Generation

First, generate all results and store them (e.g., with eststo).
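- Stata limits how many estimation results can be stored in memory at once, and I have hit that limit. Clear the results you no longer need (eststo clear) or save them to disk between blocks.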
Then, in a separate do-file, generate all tables and figures.
This keeps your workflow modular and easier to debug.
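One way to decouple the two steps, and to stay clear of the in-memory limit, is to write estimates to disk in the analysis script and read them back in the table script. This is a minimal sketch with built-in commands only; the file name is hypothetical and the auto dataset stands in for your data.

```stata
* --- In the analysis do-file ---
sysuse auto, clear
regress price mpg weight
* Writes price_model.ster to the working directory
estimates save "price_model", replace

* --- In the table do-file (possibly a new Stata session) ---
estimates clear
estimates use "price_model"
estimates store price_model
estimates table price_model, b(%9.3f) se(%9.3f)
```

In a real project the .ster files would go to a folder such as `$data_temp`, so the table script can be re-run without re-estimating anything.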

5.1.6 6. General Workflow Example

Set up paths and globals at the top of your master do-file.
Install dependencies (once, not in the replication package).

/*
  1-Master.do
  This file sets up the environment, loads data, and runs the main analysis.
  Adjust paths and globals as needed.
*/

// Paths
global project_path "/path/to/your/project"
global data_original "$project_path/data/original"
global data_temp     "$project_path/data/temporary"
global code          "$project_path/code"
global output        "$project_path/paper"
global ado           "$project_path/ado"

// Variables for analysis
global EA1 "v1 v2 v3"
global EA2 "v4 v5 v6"

// Ado directory
sysdir set PLUS "$ado"

// Install dependencies (run once, then comment out; omit from the shared package)
// ssc install estout, replace


// Analysis
do "$code/2-Data_cleaning.do"
do "$code/3-Analysis.do"
do "$code/4-Results.do"

Prepare data: load, clean, and save to data/temporary.

/*
  2-Data_cleaning.do
  This file handles data loading, cleaning, and preparation.
*/

use "$data_original/your_data.dta", clear

* Document your data cleaning steps here
save "$data_temp/cleaned_data.dta", replace
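Cleaning steps are easier to trust when the code checks its own assumptions. The following sketch shows a few built-in commands that can serve this purpose; the auto dataset stands in for your data, and the specific checks and the .a recode are illustrative assumptions, not steps from the original project.

```stata
* Sketch: make cleaning steps self-documenting with checks that stop
* the run if the data are not what the code expects.
sysuse auto, clear

* One row per car: make must uniquely identify observations
isid make

* Prices are strictly positive and never missing
assert price > 0 & !missing(price)

* Report, rather than silently drop, missing values
count if missing(rep78)

* Record why a change was made next to the change itself
replace rep78 = .a if missing(rep78)
notes rep78: Missing values recoded to .a (not reported) during cleaning
label data "Cleaned example data"
```

If a later data update violates one of these checks, the do-file stops at that line instead of producing silently different results.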

Run analysis: loop over outcomes, use functions for model selection, store results.

/*
  3-Analysis.do
  This file contains the main analysis code.
*/

// Main regressions, loop over outcomes
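foreach v of global outcomes {
    reg_model "`v'"
    // reg_model returns r(method), r(add_regressor), r(extras) and r(limit)
    local method   "`r(method)'"
    local controls "`r(add_regressor)'"
    local extras   "`r(extras)'"
    `method' `v' `controls' `extras'
    eststo `method'_`v'
}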

Generate tables/figures: in a dedicated script, using stored results. The tables (and figures) should never be edited manually. If you need to make changes, update the code and re-run the analysis. This ensures reproducibility and reduces errors.

/*
  4-Results.do
  This file generates tables and figures from the stored results.
*/

local table_1 ""
foreach v of global outcomes {
    forval i=0/2 {
        local table_1 "`table_1' `v'_r`i'"
    }
}
foreach v of global outcomes_seshat {
    forval i=1/2 {
        local table_1 "`table_1' `v'_`i'"
    }
}

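// Note: path_tables is not defined in 1-Master.do above; define it there, e.g. as "$output/tables"
#delimit ;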
estout `table_1'
    using "${path_tables}/Table_Development.tex",
    cells("b(fmt(3))" "se(fmt(3) par star)" "conleyse(fmt(3) par([ ]) star pval(conleyp))")
    starlevels(`"\sym{*}"' 0.1 `"\sym{**}"' 0.05 `"\sym{***}"' 0.01, label(" \(p<@\)"))
    varwidth(10)
    modelwidth(9)
    delimiter(&)
    end(\\)
    prehead(&\multicolumn{6}{c}{Ethnographic Atlas}&\multicolumn{2}{c}{Seshat}\\
            \cmidrule(lr){2-7}\cmidrule(lr){8-9})
    posthead("\midrule")
    prefoot("\midrule")
    mgroups("Population Density" "Settlement Patterns" "Population Density",  
        pattern(1 0 0 1 0 0 1 0) prefix(\multicolumn{@span}{c}{) suffix(}) span erepeat(\cmidrule(lr){@span})) 
    mlabels(none)
    varlabels($labels)
    numbers(\multicolumn{@span}{c}{( )})
    collabels(none)
    eqlabels(, begin("\midrule" "") none)
    substitute(_ \_ "\_cons " \_cons)
    interaction(" $\times$ ")
    level(95)
    style(esttab)
    rename(eclipses eclipses_log)
    replace
    keep(eclipses_log distance_volcano_km_log distance_tectonic_km_log volcanoes)
    order(eclipses_log distance_volcano_km_log distance_tectonic_km_log volcanoes)
    stats(fe tim geo ethnic con r2_p N, fmt(0 0 0 0 0 3 0)
        layout("\multicolumn{1}{c}{@}" "\multicolumn{1}{c}{@}" "\multicolumn{1}{c}{@}" "\multicolumn{1}{c}{@}" "\multicolumn{1}{c}{@}" "\multicolumn{1}{S}{@}" "\multicolumn{1}{c}{@}")
        labels("Fixed effects" "Time Fixed Effects" "Geography" "Ethnic" "Controls Seshat" "\midrule \(R^{2}\)/Pseudo-\(R^{2}\)" "Observations"));
#delimit cr
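The call above is tailored to one paper. The same export pattern on a tiny example may be easier to adapt; this sketch requires the user-written estout package, uses the auto dataset that ships with Stata, and the model names and output file name are hypothetical.

```stata
* Sketch: store two models and export them as LaTeX table rows.
* Requires the user-written estout package.
sysuse auto, clear
eststo clear
eststo m1: regress price mpg
eststo m2: regress price mpg weight

estout m1 m2 using "Table_Example.tex", ///
    cells(b(star fmt(3)) se(par fmt(3))) ///
    starlevels(* 0.10 ** 0.05 *** 0.01) ///
    stats(r2 N, fmt(3 0) labels("R-squared" "Observations")) ///
    style(tex) replace
```

The resulting file holds only table rows, so it is meant to be wrapped in a tabular environment in the paper rather than compiled on its own.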

5.2 LaTeX

Once your Stata code produces LaTeX tables and figures that are ready to use, integrating the results into your paper is straightforward. Again, tables and figures should never be edited manually: if something needs to change, update the code and re-run the analysis. This ensures reproducibility and reduces errors. In particular, do not export tables to Excel or Word to adjust the formatting by hand.

Table \ref{tab:development} presents the results of regressions using the Ethnographic Atlas and Seshat datasets.

\begin{table}[htbp]
\centering
\caption{Determinants of Population Density and Settlement Patterns}
\label{tab:development}
\begin{tabular}{l*{8}{c}}
\input{tables/Table_Development.tex}
\end{tabular}
\tablenotes{
    Notes: This table presents the results of the main regressions for each dataset. (...)
}
\end{table}

Here, the \input{tables/Table_Development.tex} command directly includes the LaTeX code generated by Stata, so the table is always up to date with your latest analysis. \tablenotes is a personal LaTeX macro for adding table-specific notes; it is not a standard command and must be defined in your preamble.

However, another challenge remains: referring to specific values from the table in the text. For instance, you might want to quote the coefficient for population density in the Ethnographic Atlas dataset. Automating this is possible but not straightforward in Stata. One option is to export the results to a CSV file, write placeholders in the text, and use a script to replace each placeholder with the actual value. As we will see in the next section, this is much easier in Quarto.
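The export half of that workflow can be sketched with built-in commands. The placeholder names and the file name below are hypothetical, the auto dataset stands in for your data, and the script that substitutes the values into the text is not shown.

```stata
* Sketch: write selected results to a CSV of placeholder,value pairs.
sysuse auto, clear
regress price mpg

* Format the numbers once, so text and tables round the same way
local b  = trim(string(_b[mpg],  "%12.3f"))
local se = trim(string(_se[mpg], "%12.3f"))

tempname fh
file open `fh' using "reported_values.csv", write text replace
file write `fh' "placeholder,value" _n
file write `fh' "coef_mpg,`b'" _n
file write `fh' "se_mpg,`se'" _n
file close `fh'

* Inspect the result
type "reported_values.csv"
```

Each placeholder in the manuscript then maps to exactly one row of this file, which keeps quoted numbers in sync with the latest run.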

¹ Check whether ado packages are redistributable.