Replicating gsynth

Posted on October 29^th, 2018

Introduction

Synthethic methods are a method of causal inference that seeks to combine traditional difference-in-difference types studies with time series cross sectional differences with factor analysis for uncontrolled/ unobserved measures. This method has been growing our of work initially from Abadie (Abadie, Diamond, and Hainmueller 2011a) and growing in importance/ research for state level policy analysis. It is interesting from the fact that it combines some elements of factor analysis to develop predictors and regression analysis to try to capture explained and unexplained variance.

Package

library(tidyverse)
library(gsynth)
library(panelView)

Method Assumptions

This methodology makes several critical assumptions.

The model takes the functional form of $Y_{i,t}=\delta_{i,t}D_{i,t}+x_{i,t}<sup>\prime\beta+\lambda_i</sup>{\prime}f_i+\epsilon_{i,t}$
Strict exogeneity (e.g. the error terms are indepdent to D, X, $\lambda$ and f)
Weak serial dependence of the error terms
Regularity of conditions

The method then uses bootstrapping for confidence intervals.

There is some basic data available in the package

data(gsynth)

The data that will be initially used will be the simdata dataset

id	time	Y	D	X1	X2	eff	error	mu	alpha	xi	F1	L1	F2	L2
116	1	4.96	0	0.75	0.23	0.00	-2.60	5	-0.36	1.13	0.25	1.39	0.01	-0.18
114	29	16.90	0	0.87	2.83	8.96	0.50	5	-1.53	3.24	0.18	0.25	-0.38	-0.78
101	10	1.44	0	1.52	-0.55	0.00	-0.74	5	-0.06	-1.60	0.22	-0.04	1.15	-0.88
125	6	15.78	0	2.20	2.91	0.00	-0.99	5	-0.65	0.70	1.52	0.54	0.33	-0.09
129	11	9.63	0	0.14	1.34	0.00	0.19	5	1.62	-0.53	0.38	-0.73	0.99	-0.51
101	12	5.47	0	1.43	0.33	0.00	0.04	5	-0.06	-1.46	-0.50	-0.04	0.55	-0.88
119	9	10.94	0	2.65	2.17	0.00	-0.09	5	-1.20	-1.51	0.12	-0.60	-0.33	1.02
134	18	13.71	0	1.72	2.28	0.00	-0.96	5	-0.81	0.33	0.05	1.02	1.53	1.00
140	4	7.83	0	2.38	-0.18	0.00	-0.21	5	1.28	1.91	1.37	-0.93	0.64	-1.13
133	25	12.05	0	1.08	1.97	3.45	0.30	5	0.32	-0.37	-0.69	0.66	-0.95	-0.29
109	15	17.43	0	2.98	2.98	0.00	-0.56	5	0.69	-1.29	-1.07	0.18	1.36	1.37
104	23	21.17	1	2.60	2.27	3.52	0.07	5	1.34	-0.37	1.01	2.02	-0.25	-0.62
128	30	4.79	0	2.31	-0.50	8.95	-2.55	5	1.63	-0.42	-0.14	0.33	0.92	0.39
119	8	6.02	0	0.83	0.99	0.00	-0.09	5	-1.20	-1.57	0.58	-0.60	0.44	1.02
122	7	8.50	0	0.71	1.23	0.00	-0.66	5	0.56	-0.26	-1.55	0.67	1.10	0.45
122	2	9.68	0	2.00	1.01	0.00	0.39	5	0.56	-1.46	-0.03	0.67	0.39	0.45
104	25	9.31	1	0.39	-0.53	5.33	0.02	5	1.34	-0.37	-0.69	2.02	-0.95	-0.62
109	16	12.81	0	-1.41	2.24	0.00	1.78	5	0.69	0.79	0.30	0.18	-0.60	1.37
143	8	6.43	0	-0.36	0.90	0.00	0.48	5	0.60	-1.57	0.58	-0.30	0.44	-0.54
134	27	2.32	0	-0.13	0.36	8.13	-0.71	5	-0.81	-1.05	-0.28	1.02	-0.78	1.00
125	16	3.00	0	0.66	-0.67	0.00	-1.01	5	-0.65	0.79	0.30	0.54	-0.60	-0.09
130	17	-7.68	0	-0.10	-3.48	0.00	-0.19	5	0.79	0.77	0.45	-1.22	2.19	-1.35
129	10	7.47	0	1.15	0.75	0.00	-0.21	5	1.62	-1.60	0.22	-0.73	1.15	-0.51
132	29	10.77	0	-0.02	1.23	11.04	-0.19	5	-0.96	3.24	0.18	1.39	-0.38	0.58
143	29	13.15	0	2.11	0.24	8.82	1.33	5	0.60	3.24	0.18	-0.30	-0.38	-0.54

It can be examined via the panelView package:

panelview(Y ~ D, data = simdata,  index = c("id","time"))

Modeling

out <- gsynth(Y ~ D + X1 + X2, data = simdata,
                  index = c("id","time"), force = "two-way",
                  CV = TRUE, r = c(0, 5), se = TRUE,
                  inference = "parametric", nboots = 1000,
                  parallel = FALSE)

Cross-validating ...
 r = 0; sigma2 = 1.84865; IC = 1.02023; PC = 1.74458; MSPE = 2.37280
 r = 1; sigma2 = 1.51541; IC = 1.20588; PC = 1.99818; MSPE = 1.71743
 r = 2; sigma2 = 0.99737; IC = 1.16130; PC = 1.69046; MSPE = 1.14540*
 r = 3; sigma2 = 0.94664; IC = 1.47216; PC = 1.96215; MSPE = 1.15032
 r = 4; sigma2 = 0.89411; IC = 1.76745; PC = 2.19241; MSPE = 1.21397
 r = 5; sigma2 = 0.85060; IC = 2.05928; PC = 2.40964; MSPE = 1.23876

 r* = 2


Simulating errors .............
Bootstrapping ...
..........

print(out)

Call:
gsynth.formula(formula = Y ~ D + X1 + X2, data = simdata, index = c("id",
    "time"), force = "two-way", r = c(0, 5), CV = TRUE, se = TRUE,
    nboots = 1000, inference = "parametric", parallel = FALSE)

Average Treatment Effect on the Treated:
        Estimate  S.E. CI.lower CI.upper p.value
ATT.avg    5.544 0.262     5.03    6.057       0

   ~ by Period (including Pre-treatment Periods):
          ATT   S.E. CI.lower CI.upper   p.value n.Treated
-19  0.392160 0.5468  -0.6796  1.46396 4.733e-01         0
-18  0.276548 0.4461  -0.5979  1.15098 5.354e-01         0
-17 -0.275393 0.5428  -1.3392  0.78841 6.119e-01         0
-16  0.441201 0.4389  -0.4191  1.30150 3.148e-01         0
-15 -0.889595 0.4914  -1.8527  0.07348 7.023e-02         0
-14  0.593891 0.4633  -0.3142  1.50199 1.999e-01         0
-13  0.528749 0.4132  -0.2811  1.33860 2.007e-01         0
-12  0.171569 0.5975  -0.9994  1.34258 7.740e-01         0
-11  0.610832 0.4474  -0.2661  1.48774 1.722e-01         0
-10  0.170597 0.4516  -0.7146  1.05581 7.056e-01         0
-9  -0.271892 0.5871  -1.4226  0.87886 6.433e-01         0
-8   0.094843 0.5037  -0.8924  1.08204 8.506e-01         0
-7  -0.651976 0.5567  -1.7430  0.43904 2.415e-01         0
-6   0.573686 0.4783  -0.3638  1.51113 2.304e-01         0
-5  -0.469686 0.5150  -1.4792  0.53978 3.618e-01         0
-4  -0.077766 0.5287  -1.1140  0.95850 8.831e-01         0
-3  -0.141785 0.5473  -1.2144  0.93085 7.956e-01         0
-2  -0.157100 0.4186  -0.9776  0.66341 7.075e-01         0
-1  -0.915575 0.4856  -1.8673  0.03616 5.936e-02         0
0   -0.003309 0.3638  -0.7164  0.70975 9.927e-01         0
1    1.235962 0.7289  -0.1927  2.66458 8.995e-02         5
2    1.630264 0.5615   0.5297  2.73079 3.691e-03         5
3    2.712178 0.5508   1.6326  3.79178 8.487e-07         5
4    3.466758 0.7437   2.0092  4.92431 3.136e-06         5
5    5.740132 0.5367   4.6882  6.79203 0.000e+00         5
6    5.280035 0.5741   4.1548  6.40529 0.000e+00         5
7    8.436485 0.4843   7.4874  9.38561 0.000e+00         5
8    7.839902 0.6408   6.5839  9.09592 0.000e+00         5
9    9.455115 0.5450   8.3869 10.52329 0.000e+00         5
10   9.638509 0.5035   8.6517 10.62527 0.000e+00         5

Coefficients for the Covariates:
    beta    S.E. CI.lower CI.upper p.value
X1 1.022 0.03004    0.963    1.081       0
X2 3.053 0.02958    2.995    3.111       0

out$est.att

             ATT      S.E.   CI.lower    CI.upper      p.value n.Treated
-19  0.392159788 0.5468489 -0.6796443  1.46396386 4.732961e-01         0
-18  0.276547958 0.4461493 -0.5978887  1.15098460 5.353532e-01         0
-17 -0.275392926 0.5427657 -1.3391941  0.78840827 6.118824e-01         0
-16  0.441201288 0.4389354 -0.4190963  1.30149888 3.148187e-01         0
-15 -0.889595124 0.4913755 -1.8526735  0.07348322 7.023098e-02         0
-14  0.593890957 0.4633226 -0.3142046  1.50198651 1.999097e-01         0
-13  0.528749012 0.4131945 -0.2810973  1.33859533 2.006643e-01         0
-12  0.171568737 0.5974650 -0.9994412  1.34257868 7.739889e-01         0
-11  0.610832288 0.4474116 -0.2660784  1.48774295 1.721720e-01         0
-10  0.170597468 0.4516495 -0.7146193  1.05581419 7.056379e-01         0
-9  -0.271891657 0.5871305 -1.4226462  0.87886293 6.433030e-01         0
-8   0.094842558 0.5036802 -0.8923526  1.08203768 8.506422e-01         0
-7  -0.651975701 0.5566512 -1.7429921  0.43904067 2.414998e-01         0
-6   0.573686472 0.4782973 -0.3637590  1.51113193 2.303589e-01         0
-5  -0.469685905 0.5150445 -1.4791545  0.53978268 3.618041e-01         0
-4  -0.077766449 0.5287181 -1.1140349  0.95850205 8.830650e-01         0
-3  -0.141784521 0.5472736 -1.2144212  0.93085211 7.955779e-01         0
-2  -0.157100323 0.4186374 -0.9776145  0.66341385 7.074627e-01         0
-1  -0.915575087 0.4855890 -1.8673120  0.03616184 5.936318e-02         0
0   -0.003308833 0.3638145 -0.7163722  0.70975454 9.927435e-01         0
1    1.235962010 0.7289023 -0.1926602  2.66458419 8.995247e-02         5
2    1.630264312 0.5615032  0.5297382  2.73079044 3.691436e-03         5
3    2.712177702 0.5508277  1.6325753  3.79178010 8.486982e-07         5
4    3.466757691 0.7436652  2.0092007  4.92431464 3.135799e-06         5
5    5.740132310 0.5366920  4.6882353  6.79202929 0.000000e+00         5
6    5.280034526 0.5741182  4.1547836  6.40528546 0.000000e+00         5
7    8.436484821 0.4842566  7.4873594  9.38561026 0.000000e+00         5
8    7.839901526 0.6408388  6.5838805  9.09592251 0.000000e+00         5
9    9.455114684 0.5449999  8.3869345 10.52329485 0.000000e+00         5
10   9.638509457 0.5034590  8.6517480 10.62527087 0.000000e+00         5

out$est.avg

        Estimate     S.E. CI.lower CI.upper p.value
ATT.avg 5.543534 0.262005 5.030014 6.057054       0

out$est.beta

       beta       S.E.  CI.lower CI.upper p.value
X1 1.021890 0.03004240 0.9630082 1.080772       0
X2 3.052994 0.02957771 2.9950231 3.110966       0

The default plot:

plot(out) +
  theme(plot.title = element_text(size = 12))

Modified:

plot(out, theme.bw = TRUE) +
  theme(plot.title = element_text(size = 12))

With the raw data:

plot(out, type = "raw", theme.bw = TRUE)+
  theme(plot.title = element_text(size = 12))

plot(out,type = "raw", legendOff = TRUE, ylim=c(-10,40), main="")+
  theme(plot.title = element_text(size = 12))

plot(out, type = "counterfactual", raw = "none", main="")+
  theme(plot.title = element_text(size = 12))

plot(out, type = "counterfactual", raw = "band", xlab = "Time",
     ylim = c(-5,35), theme.bw = TRUE)+
  theme(plot.title = element_text(size = 12))

plot(out, type = "counterfactual", raw = "all")+
  theme(plot.title = element_text(size = 12))

plot(out, type = "factors", xlab = "Time")+
  theme(plot.title = element_text(size = 12))

Voter Turnout

Now I am going to try to reproduce the results from Xu’s paper [xu_generalized_2017] and implemented in his package (Xu and Liu 2018).

Data

abb	year	turnout
AL	1920	21.02107
AL	1924	13.58233
AL	1928	19.03850
AL	1932	17.62002
AL	1936	18.69238
AL	1940	18.86406

Model

The paper builds two models, one will only the impact of voter education, another with additional controls. For the sake of time I will build the model with all of the controls used.

out <- gsynth(turnout ~ policy_edr + policy_mail_in + policy_motor, data = turnout,
                  index = c("abb","year"), force = "two-way",
                  CV = TRUE, r = c(0, 5), se = TRUE,
                  inference = "parametric", nboots = 1000,
                  parallel = FALSE)

Cross-validating ...
 r = 0; sigma2 = 77.99366; IC = 4.82744; PC = 72.60594; MSPE = 22.13889
 r = 1; sigma2 = 13.65239; IC = 3.52566; PC = 21.67581; MSPE = 12.03686
 r = 2; sigma2 = 8.56312; IC = 3.48518; PC = 19.23841; MSPE = 10.31254*
 r = 3; sigma2 = 6.40268; IC = 3.60547; PC = 18.61783; MSPE = 11.48390
 r = 4; sigma2 = 4.74273; IC = 3.70145; PC = 16.93707; MSPE = 16.28613
 r = 5; sigma2 = 4.02488; IC = 3.91847; PC = 17.05226; MSPE = 15.78683

 r* = 2


Simulating errors .............
Bootstrapping ...
..........

The model completed and appeared that it did well. Now to examine the figures to see if they match:

plot(out)

Now we can look at the estimated average treatment

out$est.avg %>%
  knitr::kable()

	Estimate	S.E.	CI.lower	CI.upper	p.value
ATT.avg	4.895788	2.315232	0.3580163	9.43356	0.0344641

plot(out, type = "counterfactual", raw = "all")+
  theme(plot.title = element_text(size = 12))

Now with Synth

Now I want to try to replicate the previous case study used in Synth (Abadie, Diamond, and Hainmueller 2011b) and see if it is possible.

library(Synth)

I am going to recode some variable in order to fit with the methods used in gsynth. The Synth package uses a dataprep function in order to format the data into the requirements.

data("basque")
basque_treat <- basque %>%
  mutate(treat = ifelse(grepl(pattern = "Basque", x = regionname) & year > 1969,1,0))
head(basque_treat) %>%
  knitr::kable()

regionno	regionname	year	gdpcap	sec.agriculture	sec.energy	sec.industry	sec.construction	sec.services.venta	sec.services.nonventa	school.illit	school.prim	school.med	school.high	school.post.high	popdens	invest
1	Spain (Espana)	1955	2.354542	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1	Spain (Espana)	1956	2.480149	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1	Spain (Espana)	1957	2.603613	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1	Spain (Espana)	1958	2.637104	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1	Spain (Espana)	1959	2.669880	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
1	Spain (Espana)	1960	2.869966	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

out_basque <- gsynth(gdpcap ~ treat, data = basque_treat, index = c("regionname","year"), force = "two-way",
                  CV = TRUE, r = c(0, 5), se = TRUE,
                  inference = "parametric", nboots = 1000,
                  parallel = FALSE, )

Cross-validating ...
 r = 0; sigma2 = 0.15833; IC = -1.31080; PC = 0.14556; MSPE = 0.02963
 r = 1; sigma2 = 0.02499; IC = -2.64303; PC = 0.04627; MSPE = 0.00994
 r = 2; sigma2 = 0.00985; IC = -3.07772; PC = 0.02745; MSPE = 0.00755*
 r = 3; sigma2 = 0.00581; IC = -3.12683; PC = 0.02166; MSPE = 0.00981
 r = 4; sigma2 = 0.00395; IC = -3.05355; PC = 0.01843; MSPE = 0.00329
 r = 5; sigma2 = 0.00256; IC = -3.04564; PC = 0.01435; MSPE = 0.00260

 r* = 5


Simulating errors .............
Bootstrapping ...
..........

Let’s check the results

plot(out_basque)+
  theme(plot.title = element_text(size = 12))

plot(out_basque, type = "counterfactual", raw = "all")+
  theme(plot.title = element_text(size = 12))

Interesting that this method lands on a slightly different estimate than the previous paper. I think that this is due to some missing covariates. It appears that gsynth doesn’t handle missing data too well, which is ok.

Thoughts

I think that a combination of Bayesian Hierarchical modeling with structural timeseries modeling could get somewhere close to the method, and take advantage of specifying knowns, group effects, etc. But the method is neat and a good way to do some quick analysis.

Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2011a. “Synth: An R Package for Synthetic Control Methods in Comparative Case Studies.” Journal of Statistical Software 42 (13): 1–17. http://www.jstatsoft.org/v42/i13/.

———. 2011b. “Synth: An r Package for Synthetic Control Methods in Comparative Case Studies.” Journal of Statistical Software, Articles 42 (13): 1–17. https://doi.org/10.18637/jss.v042.i13.

Xu, Yiqing, and Licheng Liu. 2018. Gsynth: Generalized Synthetic Control Method. https://CRAN.R-project.org/package=gsynth.

Citation

BibTex citation:

@online{dewitt2018
author = {Michael E. DeWitt},
title = {Replicating gsynth},
date = 2018-10-29,
url = {https://michaeldewittjr.com/articles/2018-10-28-replicating-gsynth},
langid = {en}
}

For attribution, please cite this work as:

Michael E. DeWitt. 2018. "Replicating gsynth." October 29, 2018. https://michaeldewittjr.com/articles/2018-10-28-replicating-gsynth