I'm having trouble replicating a fixed-effects estimation on data with a panel-like structure but no time index.
I've seen several good explanations of FE models and some simple applications in R. However, I'm working with a paper whose data has no time index but three different indices (person, village, block), where block (an administrative unit) is the fixed effect.
Here is what the authors do (FE-estimation):
Here are some of their results:
Question: I would like to replicate, say, column 3, the coefficient and the robust SE.
(Link to the data: https://www.aeaweb.org/articles?id=10.1257/aer.20150474 )
My approach so far:
To get an idea:
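library(plm) # plm(), pdata.frame(), vcovHC()
library(lmtest) # coeftest()
library(sandwich) # sandwich()
# making up some data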
person_id <- c(1,3,4,5,7,8)
person_id <- as.integer(person_id) # integer
village_id <- c(1,1,1,2,2,2)
village_id <- as.integer(village_id) # integer
block <- c("a","a","b","b","c","c") # character
block <- as.factor(block) # factor
treat <- c(0,1,1,0,1,0) # numeric
treat <- as.integer(treat) # integer
outcome <- c(13,7,8,22,91,2) # numeric
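# combining data (data.frame() keeps block as a factor; cbind() would coerce it to its integer codes)
df <- data.frame(person_id, village_id, block, outcome, treat)
# converting data, not really necessary (pdata.frame() replaces the deprecated plm.data())
pdata <- pdata.frame(df, index=c("person_id", "village_id"))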
# just for comparison
lm(outcome ~ treat, data=df) # no problem
lm(outcome ~ treat + block, data=df) # no problem
# using panel data structure, error: empty model (note: plm's argument is model=, not method=)
FE <- plm(outcome ~ treat, data=pdata, model="within")
# alternative, error: empty model
FE <- plm(outcome ~ treat, data=pdata, model="within", index=c("person_id", "village_id"))
It doesn't seem possible to simply create panel data with three indices, as in pdata <- plm.data(df, index=c("person_id", "village_id", "block")),
but I can't tell why. Still, it seems that R interprets one of those indices as "time".
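A minimal sketch of one thing that might be worth trying here, assuming block can serve as the only (individual) index, in which case plm generates a running time index within each block by itself. It rebuilds the made-up data from above so that it runs on its own; the names fe_block and lsdv are just illustrative, and whether this corresponds to the authors' specification is an assumption.

```r
library(plm)

# same made-up data as above, rebuilt so the snippet is self-contained
df <- data.frame(
  person_id  = c(1L, 3L, 4L, 5L, 7L, 8L),
  village_id = c(1L, 1L, 1L, 2L, 2L, 2L),
  block      = factor(c("a", "a", "b", "b", "c", "c")),
  outcome    = c(13, 7, 8, 22, 91, 2),
  treat      = c(0L, 1L, 1L, 0L, 1L, 0L)
)

# assumption: block is the only index; plm adds its own time index per block
fe_block <- plm(outcome ~ treat, data = df, index = "block", model = "within")

# dummy-variable (LSDV) version for comparison
lsdv <- lm(outcome ~ treat + block, data = df)

# the two treat coefficients should agree
coef(fe_block)["treat"]
coef(lsdv)["treat"]
```

The within transformation demeans outcome and treat inside each block, which should give the same slope as including block dummies in lm.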
I managed to set up a pooling model (it reproduces the coefficient exactly, though I don't know why; I would still prefer a within model):
pooling <- plm(DV_dap ~ gotminikit + paddyarea + block, data=r_farmlevel_year2, model="pooling", index=c("farmer_id", "village_id")) # coef 393.768 fits!
and adjusted the calculation of robust SE (just trial and error):
# vcovHC() is used here because pvcovHC() is deprecated in plm
coeftest(pooling, vcov=vcovHC(pooling, method="arellano", cluster="time", type="HC0")) # 135.377
coeftest(pooling, vcov=vcovHC(pooling, method="arellano", cluster="time", type="HC1")) # 135.927
coeftest(pooling, vcov=vcovHC(pooling, method="arellano", cluster="time", type="HC2")) # 136.705 - pretty close!
coeftest(pooling, vcov=vcovHC(pooling, method="arellano", cluster="time", type="HC3")) # 138.087
I don't have enough econometric background to decide between these ways of calculating the SE, but none of them yields exactly the reported value of 136.410.
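To make the comparison of the four types easier to rerun, here is a compact sketch on simulated data. The data frame sim, its column names and the object pooling_sim are hypothetical stand-ins, not the paper's data. It assumes, as in my calls above, that village_id is passed as the second index, so that cluster="time" effectively clusters by village.

```r
library(plm)
library(sandwich)  # provides the vcovHC generic

set.seed(42)
# hypothetical simulated data: 200 farmers in 20 villages in 5 blocks
sim <- data.frame(
  farmer_id  = 1:200,
  village_id = rep(1:20, each = 10)
)
sim$block   <- factor(letters[(sim$village_id - 1) %/% 4 + 1])
sim$treat   <- rbinom(200, 1, 0.5)
sim$outcome <- 3 * sim$treat + as.integer(sim$block) + rnorm(200)

pooling_sim <- plm(outcome ~ treat + block, data = sim, model = "pooling",
                   index = c("farmer_id", "village_id"))

# one robust SE for treat per small-sample correction type
types <- c("HC0", "HC1", "HC2", "HC3")
se_by_type <- sapply(types, function(tp) {
  v <- vcovHC(pooling_sim, method = "arellano", cluster = "time", type = tp)
  unname(sqrt(diag(v))["treat"])
})
round(se_by_type, 4)
```

The types differ only in how the squared residuals are rescaled before they enter the sandwich formula, so the point estimate is the same for all of them.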
A linear model (as suggested) gets me very close to the results but doesn't yield a perfect match:
lmodel <- lm(DV_dap ~ gotminikit + paddyarea + block, data=r_farmlevel_year2)
coeftest(lmodel, vcov = sandwich) # coef 393.768 SE 137.775
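As far as I understand, passing sandwich as the vcov gives the HC0 variant, while HC1 rescales it by n / (n - k), which is what Stata's robust option reports. The sketch below checks this on simulated data and also computes SEs clustered by village with sandwich::vcovCL. The data frame sim and its columns are hypothetical, and clustering at the village level is only my guess at what the authors might have done.

```r
library(sandwich)
library(lmtest)

set.seed(42)
# hypothetical simulated data: 200 farmers in 20 villages in 5 blocks
sim <- data.frame(village_id = rep(1:20, each = 10))
sim$block   <- factor(letters[(sim$village_id - 1) %/% 4 + 1])
sim$treat   <- rbinom(200, 1, 0.5)
sim$outcome <- 3 * sim$treat + as.integer(sim$block) + rnorm(200)

ols <- lm(outcome ~ treat + block, data = sim)

# heteroskedasticity-robust SEs without clustering
se_hc0 <- sqrt(diag(sandwich(ols)))["treat"]              # same as type = "HC0"
se_hc1 <- sqrt(diag(vcovHC(ols, type = "HC1")))["treat"]  # HC0 scaled by n / (n - k)

# cluster-robust SEs by village (assumption about the clustering level)
vc_village <- vcovCL(ols, cluster = ~ village_id)
se_cl <- sqrt(diag(vc_village))["treat"]

coeftest(ols, vcov = vc_village)
```

If the paper clusters its standard errors, the clustering level would matter more for the SE than the choice among HC0 to HC3.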
I would appreciate any hints :)