1. Full-sample baseline analysis

reg csr X1 lnsize bcash roa lev les tobinq age i.Ind i.year
est store m1
reg csr X2 lnsize bcash roa lev les tobinq age i.Ind i.year
est store m2
esttab m1 m2 using esttab1.rtf, replace
2. Group comparison using a dummy variable

reg csr X1 lnsize bcash roa lev tobinq age i.Ind i.year if zy==1
est store m1
reg csr X1 lnsize bcash roa lev tobinq age i.Ind i.year if zy==0
est store m2
reg csr X2 lnsize bcash roa lev tobinq age i.Ind i.year if zy==1
est store m3
reg csr X2 lnsize bcash roa lev tobinq age i.Ind i.year if zy==0
est store m4
esttab m1 m2 m3 m4 using esttab1.rtf, replace se scalars(N r2 F p) mtitles title("Figure 1")
3. Fixed effects or random effects

There are two forms of individual effects: fixed effects (FE) and random effects (RE).
Step 1. Separately test whether FE and RE are each better than pooled (mixed) regression.
Step 2. Choose between FE and RE with the Hausman test.

xtreg csr X1 lnsize bcash roa lev les tobinq age i.Ind i.year, fe
estimates store FE
xtreg csr X1 lnsize bcash roa lev les tobinq age i.Ind i.year, re
estimates store RE
hausman FE RE, constant sigmamore
Result: the smaller the p-value of the F test reported after FE estimation, the stronger the evidence that FE is better than pooled regression; the smaller the p-value of the RE test, the stronger the evidence that RE is better than pooled regression; the smaller Prob>chi2 in the Hausman test, the stronger the evidence that FE is better than RE.
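The decision sequence above can be sketched as follows; this is a minimal illustration with placeholder names y, x, and firm (the RE-vs-pooled comparison uses the Breusch-Pagan LM test via xttest0, which the text alludes to but does not name):

```stata
* Sketch of the pooled-vs-FE-vs-RE decision sequence (y, x, firm are placeholders)
xtset firm year

* FE: the F test "all u_i = 0" at the bottom of the output;
* a small p-value rejects pooled regression in favor of FE
xtreg y x, fe
estimates store FE

* RE: Breusch-Pagan LM test; rejecting H0 favors RE over pooled OLS
xtreg y x, re
estimates store RE
xttest0

* Hausman: a small Prob>chi2 favors FE over RE
hausman FE RE, sigmamore
```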
Defects of fixed effects: FE can only exploit variation over time, so variables that do not change over time (such as Industry in your current model) are automatically omitted. And if you add time dummy variables, the fixed-effects model becomes a two-way fixed-effects model.
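A minimal sketch of the two-way fixed-effects specification mentioned above (y, x, and firm are placeholder names):

```stata
* Two-way fixed effects: individual effects via xtreg, fe plus year dummies
xtset firm year
xtreg y x i.year, fe
```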
Accumulation: After xtreg, fe, the output reports an F test that all u_i = 0. If this test rejects (i.e. the p-value is small), pooled regression (reg) cannot be used. Next comes the choice between fe and re. For a wide, short panel (N much larger than T), the estimates from the random-effects and fixed-effects models can be very different. Which one is better is hard to generalize, since each has its own advantages and disadvantages; in practical applications, some criteria help:
1. Judge by the type of data: macro data generally use fixed effects (because many idiosyncratic fluctuations are averaged out by aggregation); micro data generally use random effects.
2. Judge by the nature of the individuals: if the individuals are the entire population and the goal is to describe that population (e.g. data for all provinces of a country), generally use fixed effects; if the individuals are a random sample of the population and the goal is to infer about the population (e.g. household survey data), generally use random effects.
3. Judge by the properties of the parameter estimators: the most commonly used tool is the Hausman test.

Miscellaneous commands:
encode firmcode, gen(firm)
xtset firm year
drop csr firmcode
rename csr1 csr
gen shortr = shortloan/size
* forced type conversion
destring gov2, replace force
destring var1, gen(var2)
esttab m1 m2 using esttab1.rtf, replace se scalars(N r2 F) mtitles title("Figure 1")

Common problems. reg command: reg, robust and reg, cluster(clustvar) both give robust standard errors but with different results — my understanding is that the former mainly deals with heteroskedasticity and the latter mainly with autocorrelation? xtreg command: vce(robust) and vce(cluster clustvar) give the same SE — why? And you suggest using the former for heteroskedasticity and the latter for cross-sectional correlation.
Does this mean that these two equivalent options can deal with heteroskedasticity and cross-sectional correlation at the same time? Answer: reg, robust only adjusts the standard errors for heteroskedasticity, using the White (1980) sandwich estimator. After xtset id year, xtreg, robust is by design equivalent to xtreg, vce(cluster id) (since Stata 11); xtreg, vce(cluster industry) is not the same thing (the xt prefix tells the system the panel structure). xtreg, vce(cluster industry) and reg, vce(cluster industry) have the same interpretation: the disturbances are assumed independent across industries, but correlated across firms within the same industry. Case 1: reg, vce(cluster year) is equivalent to assuming the disturbances are independent across years, but correlated across all firms within the same year. Case 2: reg, vce(cluster id) is equivalent to assuming the disturbances are independent across firms, but correlated within the same firm across years. Therefore, specifying vce(cluster var) does not necessarily mean serial correlation — the key lies in what the variable var is.
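The contrast above can be made concrete by running the same regression under each VCE assumption; this is an illustrative sketch in which y, x, id, and industry are placeholder names:

```stata
* The same regression under different variance assumptions
xtset id year

reg y x, robust                  // heteroskedasticity only (White 1980)
reg y x, vce(cluster id)         // + serial correlation within each firm
reg y x, vce(cluster industry)   // + correlation within each industry

xtreg y x, fe vce(robust)        // since Stata 11: identical to the next line
xtreg y x, fe vce(cluster id)
```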
Autocorrelation. Autocorrelation/serial correlation comes in several types: temporal autocorrelation (a variable is correlated with its own values in nearby years; mostly in time series — note that moving averages and similar adjustments artificially mix information across adjacent years, and statistics from the Bureau of Statistics may already have been adjusted this way); spatial autocorrelation (correlation across neighboring units; mostly in cross-sectional data); and disturbance-term autocorrelation — if an autocorrelated variable is omitted from the model, its autocorrelation is absorbed into the disturbance term.
Panel data tests. Multicollinearity has never been a big problem in econometrics: it can basically be diagnosed with the correlation coefficients and VIF, and if a VIF is too high the offending variable can simply be dropped. For the heteroskedasticity test: in general, just use robust standard errors directly, and never use the ordinary standard errors! I have not run formal autocorrelation tests: FDI stock data must be strongly autocorrelated, so I estimated a dynamic panel directly, and later comparison showed that the autocorrelation had little effect on the regression results.
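The collinearity diagnostics mentioned above look like this in practice (a sketch; y and x1-x3 are placeholder names):

```stata
* Pairwise correlations and variance inflation factors
pwcorr x1 x2 x3, sig     // correlation matrix with significance levels
reg y x1 x2 x3
estat vif                // a common rule of thumb: VIF > 10 signals trouble
```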
Threshold regression: requires a balanced panel. To turn an unbalanced panel into a balanced one:
xtbalance, range(2004 2007)

[Code example]
use thresholddata, clear
* Stata 12.0:
xtptm pollution, rx(pgdp) thrvar(fdi) regime(1) iters(300) trim(0.01) grid(100)
* Stata 14.0:
xthreg pollution, rx(pgdp) qx(fdi) thnum(1) bs(300) trim(0.01) grid(100)
* Double threshold:
xthreg y, rx(x1) qx(x2) thnum(2) bs(300 300) trim(0.05 0.05) grid(100) vce(robust)

[Command differences] In both commands, rx() gives the core explanatory variables whose effect depends on the threshold variable. In xthreg, qx() gives the threshold variable; in xtptm it is thrvar() (some older versions of xtptm also use qx()). In xthreg, thnum() gives the number of thresholds; in xtptm it is regime() (some older versions of xtptm also use thnum()). In xthreg, bs() gives the number of bootstrap replications; in xtptm it is iters(). In both commands, trim() gives the fraction of outliers removed in each threshold group, and grid() gives the number of grid points for the grid search (0 if not set; setting it reduces computation time and improves efficiency). On overall speed: xtptm takes significantly less time than xthreg and is more efficient. On output: xthreg's regression output is friendlier than xtptm's — xtptm names results by the order of the variables, while xthreg names them directly by variable name.

[Error cause 1] Some students report that the command errors out with a message such as "3200 conformability error".
It is likely that you have also written the core explanatory variable from rx() among the ordinary explanatory variables, which breaks the matrix structure, so the system reports an error and refuses to run. That is, a statement of the form
xthreg pollution pgdp, rx(pgdp) qx(fdi) thnum(1) bs(300) trim(0.01) grid(100)
is wrong. [Error cause 2] Some students report an error such as "thestm(): 3301 subscript invalid". It is likely that the fraction of outliers removed in each threshold group is too small, i.e. trim() is set too small (e.g. 0.01); this causes a subscript overflow when elements are referenced, making the subscript reference invalid. In this case, increase trim() a bit (e.g. 0.05), for example:
xthreg pollution pgdp population urbanization_level industrialization_level, rx(pgdp) qx(fdi) thnum(1) bs(300) trim(0.05) grid(100)
Descriptive statistics
1. To see the shape of the distribution (left- or right-skewed):
sum x1, d
return list    // r(p50) is the median
2. xtsum x1 reports descriptive statistics such as the standard deviation, decomposed into overall, between, and within components.
Basic data processing
* Directly handle out-of-range data (winsorize, or add the trim option to drop instead of replace):
winsor2 x1, replace cuts(1 99)
* Generate time dummy variables (dy* then represents all year dummies in reg):
tab year, gen(dy)
* Generate a lag within the panel (l.x is the one-period lag of x; requires tsset/xtset):
gen m = l.x
* Panel A:
duplicates drop firm year, force
tsset firm year
PSM
gen tmp = runiform()
sort tmp
* (the two steps above sort all observations randomly)
psmatch2 treat age, out(re78) logit neighbor(1) common caliper(.05) ties
pstest, both
psgraph

Command interpretation: the following is the syntax of psmatch2 from the help file:
psmatch2 depvar [indepvars] [if exp] [in range] [, outcome(varlist) pscore(varname) neighbor(integer) radius caliper(real) mahalanobis(varlist) ai(integer) population altvariance kernel llr kerneltype(type) bwidth(real) spline nknots(integer) common trim(real) noreplacement descending odds index logit ties w(matrix) ate]

Simply put: psmatch2 treatment-variable covariates [, options]. The key is the meaning of the options. This example uses nearest-neighbor matching within a caliper. out(re78) specifies the outcome variable. logit specifies that a logit model is used to fit the propensity score (the default is probit). neighbor(1) specifies 1:1 matching; for 1:3 matching set neighbor(3) — in this example, because the control group has a limited sample size, only 1:1 matching is suitable. common forcibly excludes treated observations whose propensity score is above the maximum or below the minimum propensity score of the control group. caliper(.05) sets the maximum allowed distance between a treated observation and its matched control to 0.05. ties forces all tied best matches to be recorded when a treated observation has more than one. pstest, both runs the post-matching balance test — in theory, only continuous variables can be balance-tested this way; categorical variables should be rearranged and tested with a χ² test or rank-sum test, but the output still has some reference value for them. psgraph illustrates the matching results graphically.
Which model to choose should mainly be based on economic meaning; it is not recommended to get entangled in specific statistical indicators. Whether to use two-way fixed effects should depend on how serious the endogeneity problems are if either dimension is ignored. Individual fixed effects can only remove endogeneity caused by time-invariant unobservable individual factors. Your dependent variable seems to be GDP — is FDI among the explanatory variables? Then I think time fixed effects should be considered. For example, some institutional factors may affect both GDP and FDI (such as intellectual property protection, trade barriers, etc.). Viewed at the national level, such institutional factors have a clear trend over time, so they must be removed or controlled; otherwise FDI becomes an endogenous variable. Strictly speaking, these factors may not be fully identical across provinces or industries; in that case, further adding interactions of province (or industry) dummies with time dummies can completely remove the influence of this type of unobservable factor.
fe or re. When estimating the panel FE model, a dummy variable among the explanatory variables was omitted, but this did not happen under RE. The Hausman test concluded that FE should be used — but the dummy variable is omitted; how can I fix this? The reason is the fixed-effects (within) transformation (Chen Qiang). Our teacher said that in this case you interact the dummy with some other variable: for example, x1 is the dummy and x2 is another variable that changes with the individual over time, and you use x1*x2 as an explanatory variable. The fixed-effects model effectively contains N-1 individual dummy variables, so if the FE model also includes a time-invariant dummy there is perfect multicollinearity (the dummy-variable trap). Alternatively, RE can be used together with the instrumental-variables method (and similar approaches) to address the endogeneity and avoid inconsistent parameter estimates.
The Hausman test (RE vs FE) commonly used in forums is only valid under the homoskedasticity assumption. Under heteroskedasticity (and note that top journals correct their standard errors, which means the real situation is most likely heteroskedastic), the method discussed here is needed. I deeply feel this method is an underestimated technique, while the commonly used (homoskedastic) Hausman test is overestimated! If you reject the null hypothesis (RE), you should use FE (plus the robust option).
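One widely used heteroskedasticity-robust alternative (not necessarily the exact method the quoted post refers to) is the regression-based Mundlak-type test: add the group means of the time-varying regressors to the RE model with clustered standard errors and test their joint significance; rejection favors FE. The user-written xtoverid command (from SSC) implements a related robust test. A sketch with placeholder names y, x, id:

```stata
* Mundlak-type regression-based Hausman alternative, robust to heteroskedasticity
xtset id year
bysort id: egen mean_x = mean(x)     // group mean of the time-varying regressor
xtreg y x mean_x, re vce(cluster id)
test mean_x                          // rejection favors FE over RE
```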
LR graph drawing (each command on one line, no wrapping)
xthreg lncsr lnsize, rx(x1) qx(x1) thnum(2) bs(500 500) trim(0.01 0.01) grid(100)
_matplot e(LR21), columns(1 2) yline(7.35, lpattern(dash)) connect(direct) msize(small) mlabp(0) mlabs(zero) ytitle("LR Statistics") xtitle("First Threshold") recast(line) name(LR21) nodraw
_matplot e(LR22), columns(1 2) yline(7.35, lpattern(dash)) connect(direct) msize(small) mlabp(0) mlabs(zero) ytitle("LR Statistics") xtitle("Second Threshold") recast(line) name(LR22) nodraw
_matplot e(LR22), columns(1 2) yline(7.35, lpattern(dash)) connect(direct) msize(small) mlabp(0) mlabs(zero) ytitle("LR Statistics") xtitle("Third Threshold") recast(line) name(LR23) nodraw
graph combine LR21 LR22 LR23, cols(1)
Testing autocorrelation & heteroskedasticity. Cross-sectional data mainly raise heteroskedasticity concerns; time series mainly raise autocorrelation concerns. When both exist at the same time: first consider whether the autocorrelation comes from model misspecification or is pure autocorrelation. If it is pure autocorrelation, FGLS can solve it. If, after solving the autocorrelation, heteroskedasticity remains, the variance structure is usually unknown, making weighted least squares inconvenient; in that case, use White's heteroskedasticity-robust standard errors, and the statistics built on them support valid inference. Another method: when heteroskedasticity and autocorrelation exist simultaneously, directly use HAC (heteroskedasticity- and autocorrelation-consistent) standard errors; inference based on them is correct, provided the sample is large enough. Finally, you still need to build a model suited to your own situation — the above is only a theoretical reference.

Q: How to test panel autocorrelation and heteroskedasticity with Stata? hettest, szroeter, etc. can only be used after regress, not after xtreg. If Stata lacks this feature but EViews has it, how can the panel autocorrelation and heteroskedasticity tests be done in EViews?
A: A) Test heteroskedasticity. We can use the fact that iterated GLS = MLE (refer to most standard textbooks). Try the following code:
xtgls y x1 x2 x3 x4 x5, i(country) t(year) igls panels(heteroskedastic)
estimates store hetero
xtgls y x1 x2 x3 x4 x5, i(country) t(year)
estimates store nohetero
local df = e(N_g) - 1
lrtest hetero nohetero, df(`df')
B) Test autocorrelation.
You can NOT use the fact given above: as you can easily see after writing down the likelihood function, GLS does not account for 1/2*(1+rho) (where rho is the correlation of the AR(1) model). However, you can use the following code:
tsset country year, yearly    // declare the time dimension
findit xtserial
net sj 3-2 st0039
net install st0039
xtserial y x1 x2 x3 x4 x5
xtgls is a random-effects estimation method; in your case, you can use xtreg with the vce(robust) option to correct for heteroskedasticity, which uses the Huber-White (1980) method.
Q: If the tests find heteroskedasticity and autocorrelation, how should they be corrected? In theory, feasible generalized least squares (FGLS) is needed — is that the xtgls and igls commands you provided? And in xtgls y x1 x2 x3 x4 x5, i(country) t(year) igls panels(heteroskedastic), what should i(country) and t(year) be?
A: First convert your country identifier to numbers. Your data should look like the sample below. Once you have found heteroskedasticity and autocorrelation using the tests provided, it is proper to apply FGLS with the following Stata commands:
iis country
tis year
xtgls y x1 x2 x3, i(country) t(year) panels(correlated) corr(psar1)
This code corrects for heteroskedasticity across sections (countries), for autocorrelation within countries (serial correlation), and for contemporaneous correlation (spatial correlation) between countries as well. It is similar (but not exactly the same) to Parks' method in SAS.
Data format:
Country  year  x1  x2  x3
1        1     xx  x
1        2     xx  x
2        1     xx  x
2        2     xx  x
3        1     xx  x
3        2     xx  x
Endogeneity and instrumental variables. It depends on what model you use. Standard OLS/2SLS with IV can use the ivreg command:
ivreg y (x = IV) controls    // x is the endogenous variable, IV its instrument
With multiple endogenous variables, or GMM:
ivreg2 y x1 x2 (x3 x4 x5 = z1 z2 z3), robust
ivregress 2sls Y X2 X i.country i.year (X1 X1X2 = Z1 Z1X2), r
You can also look at xtivreg2 for panels.
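A hedged sketch of an IV estimation with the usual diagnostics (ivreg2 must first be installed from SSC; y, x_endog, z1, z2, and controls are placeholder names):

```stata
* ssc install ivreg2
* "first" prints the first-stage results so instrument strength can be checked;
* with more instruments than endogenous regressors, ivreg2 also reports an
* overidentification statistic (Hansen J when robust is specified)
ivreg2 y controls (x_endog = z1 z2), robust first
```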
Lian Yujun: For the theory behind xtgls, see Greene (2000, 4th ed., ch. 15). Personally I think it is more suitable for "big T, small N" panels, or panels where N and T are comparable; it does not apply to the most widely used "large N, small T" panels. You can consider the xtscc command instead to get standard errors robust to heteroskedasticity, serial correlation, and cross-sectional correlation; for a description, see the Stata Journal, 2007 (findit xtscc; the document can be searched on the forum). xtgls is generally used to test heteroskedasticity under random effects, and xtscc under fixed effects. It is recommended to read help xtgls and help xtscc. From the help files: besides the small T, large N case, you can also use xtabond and xtabond2 (help ivreg2, help xtabond2).

(Heteroskedasticity- and autocorrelation-consistent (HAC) inference in an OLS regression)
. ivreg2 cinf unem, bw(3) kernel(bartlett) robust small
. newey cinf unem, lag(2)
(AC and HAC in IV and GMM estimation)
. ivreg2 cinf (unem = l(1/3).unem), bw(3)
. ivreg2 cinf (unem = l(1/3).unem), bw(3) gmm2s kernel(thann)
. ivreg2 cinf (unem = l(1/3).unem), bw(3) gmm2s kernel(qs) robust orthog(l1.unem)
(Examples using Large N, Small T Panel Data)
. use http://fmwww.bc.edu/ec-p/data/macro/abdata.dta
. tsset id year
(Autocorrelation-consistent inference in an IV regression)
. ivreg2 n (w k ys = d.w d.k d.ys d2.w d2.k d2.ys), bw(1) kernel(tru)
(Two-step efficient GMM in the presence of arbitrary heteroskedasticity and autocorrelation)
. ivreg2 n (w k ys = d.w d.k d.ys d2.w d2.k d2.ys), bw(2) gmm2s kernel(tru) robust
(Two-step efficient GMM in the presence of arbitrary heteroskedasticity and intragroup correlation)
. ivreg2 n (w k ys = d.w d.k d.ys d2.w d2.k d2.ys), gmm2s cluster(id)
!!! A command template is given below in the order of an empirical paper.
***** Modular command template for management empirical papers *****
***** by Wen Wendell, [email protected] *****

* 0. Data cleaning and variable generation
sum x1 x2 x3                      // view outliers
replace x1 = . if x1 == 9999997   // clean up outliers
gen age2 = age*age                // generate the square of age
drop if gender