4 Replication with Stata: Best Practices
4.1 Stata: Setting Up for Reproducibility
Stata is a powerful tool for empirical research, but reproducibility requires discipline and structure. Here’s how to set up your project for robust, transparent replication.
4.1.1 1. Use Global Variables for Paths
Define all key directories at the top of your main do-file. This makes your code portable and easy to maintain.
global project_path "/path/to/your/project"
global data_original "$project_path/data/original"
global data_temp "$project_path/data/temporary"
global code "$project_path/code"
global output "$project_path/paper"
global ado "$project_path/ado"
Never hardcode paths in your scripts. Use these globals everywhere. Indicate in the README where to set these paths and what to adapt.
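For example, every later script refers to these globals rather than to a literal path. A minimal sketch (the file names here are placeholders):
// Portable: works on any machine once $project_path is set
use "$data_original/survey.dta", clear
save "$data_temp/survey_clean.dta", replace
// Fragile: breaks as soon as the project moves or someone else runs it
// use "C:/Users/me/Dropbox/project/data/original/survey.dta", clear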
4.1.2 2. Managing Ado Packages for Replication
It is quite common to add new capabilities to Stata via user-written packages (ado files), for instance estout, ivreg2, or reghdfe. These packages are updated over time, and an update may change your results. Ideally, we would record somewhere which versions of these packages were used, but support for this is only slowly being worked on as an ado package itself. We can circumvent the issue by storing ado files within a project-specific folder. By doing so, we can update them as needed and bundle this folder with our replication package.1
- Create a project-specific ado folder (e.g., ado/, see above).
- Install all required packages there in your master file.
- Comment out the ssc install lines afterwards to avoid checking packages at every run.
- Bundle this ado folder with your replication package so users get the exact versions you used.
// Point the PLUS system directory to the project-specific ado folder
sysdir set PLUS "$ado"
// Install packages into this folder
ssc install estout, replace
Make sure to change the sysdir settings in your scripts so that they point to the project ado folder. For instance, in your master file, add
sysdir set PLUS "$ado"
Do not include the ssc install lines in the replication package you share, to avoid users accidentally updating packages and breaking reproducibility. Instead, include the ado folder itself.
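To check that Stata actually picks up the bundled copies, you can inspect the ado path and ask where a given command is loaded from; a quick sketch (reghdfe stands in for any package you bundled):
// List the system directories; PLUS should now point to the project's ado folder
sysdir
// Show the path and version line of the ado-file Stata will actually run
which reghdfe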
4.1.3 3. Comment Generously
Comments help your future self and others understand why you did something.
4.1.4 4. Write Reusable Code: Functions and Loops
- Encapsulate repetitive tasks in programs (functions).
- Use loops to iterate over variables, controls, or specifications.
- This makes it easy to add/remove variables or change specifications later.
In your main do-file, indicate groups of variables that you reuse together.
/******************************************************
* Sets of controls
******************************************************/
global myreg "eclipses_log";
global distances "distance_coast_km_log distance_addis_km_log distance_river_km_log
distance_volcano_km_log distance_tectonic_km_log";
global geography "rugavg elevavg malaravg calavg abs_lat south i.mht_enc";
global area "area_km";
global motifs "eclipse_related calendar_related thinking_related cloud_related
lightning_related rock_related sand_related white_related purple_related
curious_related";
global motifs_religion "religion_related pray_related religious_related";
/******************************************************
* Sets of dependent variables
******************************************************/
global outcomes_ea_3 "v31 v30";
global outcomes_ea_1 "v33 v66 v90 tasks technology strategy writing explanation";
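Grouping variables this way keeps each specification line short and lets you change a whole set of controls in one place. A minimal sketch of how the globals are then used (v31 is one of the outcomes listed above; the default end-of-line delimiter is assumed):
// Edit the globals, not every regression line
reg v31 $myreg $distances $geography $area, robust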
The following is a relatively complex Stata program that determines the appropriate regression model based on the outcome variable. It was useful for avoiding forgotten control variables and for ensuring the correct model specification. For instance, if regressions with v90 were to be run under OLS, it would be enough to modify the global macros ols_regs and probit_regs.
/******************************************************
* Programs
******************************************************/
/* Determine regression model and adjust it */
#delimit ;
capture program drop reg_model;
program reg_model, rclass;
    noisily di "`1'";
    /* Pass the outcome variable as argument */
    /* The sample should be limited when using V90, values 0 and 8 are for missing */
    if "`1'" == "v90" {;
        return local extras "if v90>0&v90<8";
    };
    else if (strpos("$outcomes_seshat_1", "`1'") > 0 | strpos("$outcomes_seshat_3", "`1'") > 0) & "`1'" != "writing" {;
        return local extras "if sd > 0";
    };
    else {;
        return local extras " ";
    };
    /* When the outcome is the number of tasks, control for the number of surveyed tasks */
    if "`1'" == "tasks" {;
        return local add_regressor "tasks_with_info";
    };
    /* When outcomes are from the folklore database we add as controls
       the number of published books,
       the year of first publication,
       the number of motifs.
       Additionally, we clean the variables from terms that appear as related in ConceptNet.
    */
    else if strpos("$outcomes_folklore", "`1'") {;
        local add_regressor "lnnmbr_title lnyear_firstpub lnmotifs_total";
        /* Use the global folklore_controls to determine the control we should add.
           It is organized as a dictionary:
           first, we retrieve the position of the look-up word;
           the next entry is the value associated to the searched key, so we add 1 to the previous index.
        */
        if strpos("$folklore_controls", "`1'") {;
            local motif_position : list posof "`1'" in global(folklore_controls);
            local motif_position = `motif_position' + 1;
            local control : word `motif_position' of $folklore_controls;
            local add_regressor "`add_regressor' `control'";
        };
        return local add_regressor `add_regressor';
    };
    /* In the regressions using the Seshat, we control for total population throughout,
       except (obviously) when the outcome variable is population density.
    */
    else if strpos("$outcomes_seshat_1", "`1'") > 0 | strpos("$outcomes_seshat_3", "`1'") > 0 {;
        /* The previous expression also matches the Ethnographic Atlas variable `writing`.
           We create an ad-hoc rule to avoid adding `p_polity_population` to its regression.
        */
        if "`1'" == "writing" {;
            return local add_regressor " ";
        };
        else if "`1'" == "density" {;
            return local add_regressor " ";
        };
        else {;
            return local add_regressor "polity_population";
        };
    };
    else {;
        return local add_regressor " ";
    };
    /* Select the appropriate model */
    if strpos("$oprobit_regs", "`1'") {;
        return local method "oprobit";
        return local limit "iter(100)";
    };
    if strpos("$ols_regs", "`1'") {;
        return local method "reg";
        return local limit " ";
    };
    if strpos("$nbreg_regs", "`1'") {;
        return local method "nbreg";
        return local limit "iter(100)";
    };
    if strpos("$poisson_regs", "`1'") {;
        return local method "poisson";
        return local limit "iter(100)";
    };
end;
/* Add additional information to the regressions */
#delimit ;
capture program drop add_information;
program add_information;
    /* Pass the names of the locals and their values as arguments.
       Ex: add_information local_1 value_1 local_2 value_2
    */
    tokenize `0';
    local n : word count `0';
    if mod(`n', 2) != 0 {;
        di "You should enter the information as local_1 value_1 local_2 value_2";
        exit;
    };
    forval i = 1/`n' {;
        if mod(`i', 2) == 1 {;
            local next = `i' + 1;
            qui estadd local ``i'' "``next''";
        };
        else {;
            continue;
        };
    };
end;
Now you can loop over outcomes and always use the right model and controls:
foreach v of global outcomes {
    reg_model "`v'"
    local method = r(method)
    local controls = r(add_regressor)
    local extras = r(extras)
    local limit = r(limit)
    `method' `v' `controls' `extras', `limit'
}
4.1.5 5. Separate Analysis and Table Generation
- First, generate all results and store them (e.g., with eststo), as in the sketch below.
- Note that there is a limit on the number of stored estimates; I have hit it.
- Then, in a separate do-file, generate all tables and figures.
- This keeps your workflow modular and easier to debug.
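A minimal sketch of this split, assuming the estout package is installed and using placeholder variable names (y, x1, x2):
// In 3-Analysis.do: run the regressions and store the estimates
eststo clear
eststo m_base: reg y x1
eststo m_full: reg y x1 x2
// In 4-Results.do: export the stored estimates without re-running anything
esttab m_base m_full using "$output/tables/example_table.tex", se replace
If you do hit the limit on stored estimates, estimates save can write each fitted model to disk and estimates use can reload it in the table script.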
4.1.6 6. General Workflow Example
- Set up paths and globals at the top of your master do-file.
- Install dependencies (once, not in the replication package).
/*
1-Master.do
This file sets up the environment, loads data, and runs the main analysis.
Adjust paths and globals as needed.
*/
// Paths
global project_path "/path/to/your/project"
global data_original "$project_path/data/original"
global data_temp "$project_path/data/temporary"
global code "$project_path/code"
global output "$project_path/paper"
global ado "$project_path/ado"
// Variables for analysis
global EA1 "v1 v2 v3"
global EA2 "v4 v5 v6"
// Ado directory
sysdir set PLUS "$ado"
// Install dependencies (run once, then comment out)
ssc install estout
// Analysis
do "$code/2-Data_cleaning.do"
do "$code/3-Analysis.do"
do "$code/4-Results.do"
- Prepare data: load, clean, and save to data/temporary.
/*
2-Data_cleaning.do
This file handles data loading, cleaning, and preparation.
*/
use "$data_original/your_data.dta", clear
* Document your data cleaning steps here
save "$data_temp/cleaned_data.dta", replace
- Run analysis: loop over outcomes, use functions for model selection, store results.
/*
3-Analysis.do
This file contains the main analysis code.
*/
// Main regressions, loop over outcomes
foreach v of global outcomes {
    reg_model "`v'"
    local method = r(method)
    local controls = r(add_regressor)
    local extras = r(extras)
    local limit = r(limit)
    eststo `method'_`v': `method' `v' `controls' `extras', `limit'
}
- Generate tables/figures: in a dedicated script, using stored results. The tables (and figures) should never be edited manually. If you need to make changes, update the code and re-run the analysis. This ensures reproducibility and reduces errors.
/*
4-Results.do
This file generates tables and figures from the stored results.
*/
local table_1 ""
foreach v of global outcomes {
    forval i = 0/2 {
        local table_1 "`table_1' `v'_r`i'"
    }
}
foreach v of global outcomes_seshat {
    forval i = 1/2 {
        local table_1 "`table_1' `v'_`i'"
    }
}
#delimit ;
estout `table_1'
    using "${path_tables}/Table_Development.tex",
    cells("b(fmt(3))" "se(fmt(3) par star)" "conleyse(fmt(3) par([ ]) star pval(conleyp))")
    starlevels(`"\sym{*}"' 0.1 `"\sym{**}"' 0.05 `"\sym{***}"' 0.01, label(" \(p<@\)"))
    varwidth(10)
    modelwidth(9)
    delimiter(&)
    end(\\)
    prehead(&\multicolumn{6}{c}{Ethnographic Atlas}&\multicolumn{2}{c}{Seshat}\\
        \cmidrule(lr){2-7}\cmidrule(lr){8-9})
    posthead("\midrule")
    prefoot("\midrule")
    mgroups("Population Density" "Settlement Patterns" "Population Density",
        pattern(1 0 0 1 0 0 1 0) prefix(\multicolumn{@span}{c}{) suffix(}) span erepeat(\cmidrule(lr){@span}))
    mlabels(none)
    varlabels($labels)
    numbers(\multicolumn{@span}{c}{( )})
    collabels(none)
    eqlabels(, begin("\midrule" "") none)
    substitute(_ \_ "\_cons " \_cons)
    interaction(" $\times$ ")
    level(95)
    style(esttab)
    rename(eclipses eclipses_log)
    replace
    keep(eclipses_log distance_volcano_km_log distance_tectonic_km_log volcanoes)
    order(eclipses_log distance_volcano_km_log distance_tectonic_km_log volcanoes)
    stats(fe tim geo ethnic con r2_p N, fmt(0 0 0 0 0 3 0)
        layout("\multicolumn{1}{c}{@}" "\multicolumn{1}{c}{@}" "\multicolumn{1}{c}{@}" "\multicolumn{1}{c}{@}" "\multicolumn{1}{c}{@}" "\multicolumn{1}{S}{@}" "\multicolumn{1}{c}{@}")
        labels("Fixed effects" "Time Fixed Effects" "Geography" "Ethnic" "Controls Seshat" "\midrule \(R^{2}\)/Pseudo-\(R^{2}\)" "Observations"));
#delimit cr
4.2 LaTeX
With Stata code that generates readily usable LaTeX tables and figures, you can easily integrate your results into your academic papers. Again, the tables (and figures) should never be edited manually. If you need to make changes, update the code and re-run the analysis. This ensures reproducibility and reduces errors. For instance, do not export tables as Excel or Word files to make formatting changes. Integrating the results in LaTeX is straightforward with the right Stata commands.
Table \ref{tab:development} presents the results of regressions using the Ethnographic Atlas and Seshat datasets.
\begin{table}[htbp]
\centering
\caption{Determinants of Population Density and Settlement Patterns}
\label{tab:development}
\begin{tabular}{l*{6}{c}}
\input{tables/Table_Development.tex}
\end{tabular}
\tablenotes{
Notes: This table presents the results of the main regressions for each dataset. (...)
}
\end{table}
Here, the \input{tables/Table_Development.tex} command directly includes the LaTeX code generated by Stata, ensuring that your tables are always up-to-date with your latest analysis. \tablenotes is a personal LaTeX macro for adding table-specific notes.
However, another challenge remains: referring to specific values in the table. For instance, you might want to reference the coefficient for population density in the Ethnographic Atlas dataset. While it is possible to automate this, it is not straightforward in Stata. For example, you can export the results to a CSV file, refer to them with placeholders in the text, and use a script to replace the placeholders with the actual values. As we will see in the next section, this is much easier in Quarto.
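One lightweight workaround is sketched below, under the assumption that a regression involving eclipses_log has just been run and that the folder and file names are placeholders: have Stata write the formatted number to a tiny .tex fragment, then \input{} that fragment wherever the value is cited.
// Write one formatted coefficient to a small .tex fragment
capture mkdir "$output/values"
local b : display %5.3f _b[eclipses_log]
file open fh using "$output/values/beta_eclipses.tex", write replace
file write fh "`b'"
file close fh
In the paper, \input{values/beta_eclipses.tex} then always reflects the latest run, with no manual copying.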
1. Check whether ado packages are redistributable.