Department of Health and Human Services 		  
		  Administration for Children and Families
          

Office of Planning, Research & Evaluation (OPRE)

3. META-ANALYSIS

This section provides a brief description of the meta-analysis methods that we use to accomplish the goals discussed above. Meta-analysis provides a set of statistical tools that allow one to determine whether the variation in impact estimates from evaluations of welfare-to-work programs is statistically significant and, if it is, to examine the sources of this variation. For example, it can be used to determine whether some of the variation is due to differences in the mix of program target groups, economic conditions in the places and the time periods in which the evaluations took place, or the types of services provided by the programs. Good descriptions of meta-analysis are available in Hedges (1984), Rosenthal (1991), Cooper and Hedges (1994), and Lipsey and Wilson (2001).

Separate meta-analyses of the mandatory and the voluntary welfare-to-work programs were conducted. The motivation of individuals entering these two types of programs would be expected to differ. In addition, as a result of differences in the evaluation designs that are typically used, a higher proportion of those assigned to the program group in voluntary programs than of those assigned to the program group in mandatory programs typically receive program services and financial work incentive payments.

The alternative to using meta-analysis to synthesize the evaluations of welfare-to-work interventions is a narrative review. Both approaches rely on comparisons among evaluated programs. It is important to recognize that, even if these comparisons are limited to programs that were evaluated through random assignment (as they are in this study), the comparisons themselves are nonexperimental in character and, thus, may be subject to bias.

Both meta-analysis and narrative synthesis rely on the information available in evaluation reports and, as discussed later, are therefore subject to numerous limitations. Nevertheless, meta-analysis offers a number of advantages. Possibly most importantly, it imposes discipline on drawing conclusions about why some programs are more successful than others, by formally testing whether apparent relationships between estimated program impacts and program, client, and environmental characteristics are statistically significant. Moreover, it can focus on one of these characteristics while statistically holding the others constant. In addition, given a set of evaluations that are methodologically solid (for example, based on random assignment), narrative synthesis typically gives equal weight to each, regardless of the statistical significance of the estimates of program impacts or the size of the sample upon which they are based. As discussed in the following section, meta-analysis uses a more sophisticated approach.

3.1 WEIGHTING

In conducting a meta-analysis of program impacts, it is essential to take account of the fact that the impact estimates for the individual programs are based on different sample sizes and, hence, have different levels of statistical precision. The reason for taking account of different levels of statistical precision is suggested by the following formal statistical model, which explains variation in a specific program impact, such as the impact on earnings, employment, AFDC receipts, or child outcomes:

$$E_i = E_i^* + e_i, \quad i = 1, 2, 3, \ldots, n$$

where E_i is the estimated effect or impact of a welfare-to-work intervention, E_i^* is the “true” effect (obtained if the entire target population had been evaluated), n is the number of interventions for which impact estimates are available, and e_i is the error due to estimation on a sample smaller than the population. It is assumed that e_i has a mean of zero and a variance of v_i.

To provide an estimate of the mean effect that takes account of the fact that v_i varies across intervention impact estimates, a weighted mean can be calculated, with the weight being the inverse of v_i, 1/v_i. The reason for weighting by the inverse of the variance of the estimates of program impacts is intuitive. In evaluations, estimates of impacts from policy interventions are usually obtained by using samples from the intervention’s target population. One subset of persons from this population who are assigned to the program is compared to another subset of persons from the same population who are not assigned. As a result of sampling from the target population, the impact estimates are subject to sampling error. The variance of an estimated impact (which typically becomes smaller as the size of the underlying sample increases) indicates the size of the sampling error. In general, a smaller variance implies a smaller sampling error and, hence, that an impact estimate is statistically more reliable. Because not all estimates of intervention impacts are equally reliable, they should not be treated the same. By using the inverse of the variance of the effect estimates as a weight, estimates that are obtained from larger samples and, therefore, are more reliable contribute more to the various statistical analyses than estimates that are less reliable. [7]
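The weighting described above can be sketched in a few lines of Python. This is only an illustration: the function name and the two impact estimates are hypothetical and do not come from the evaluations in the database.

```python
def inverse_variance_weighted_mean(impacts, variances):
    """Mean of the impact estimates, each weighted by 1 / its variance."""
    if len(impacts) != len(variances):
        raise ValueError("impacts and variances must have the same length")
    weights = [1.0 / v for v in variances]
    return sum(w * e for w, e in zip(weights, impacts)) / sum(weights)


# Hypothetical impact estimates (say, dollars of quarterly earnings).
impacts = [100.0, 200.0]
variances = [25.0, 100.0]  # the first estimate is the more precise one

weighted = inverse_variance_weighted_mean(impacts, variances)
unweighted = sum(impacts) / len(impacts)
print(weighted, unweighted)
```

In this made-up example the weighted mean is pulled toward the more precise first estimate, while the unweighted mean treats both estimates alike.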

We use such weights throughout in conducting the statistical analysis presented in this report. Typically, however, the evaluations used in this study did not report the exact value of the variance of the impact estimates, but instead reported that estimates of impacts were not statistically significant or were significant at the 1-, 5-, or 10-percent levels. Thus, the standard errors had to be imputed, except for those relatively rare instances when exact standard errors were provided. Once the standard errors were imputed, the variance could be computed as their square.

For impacts that are measured as proportions (e.g., the impact on the percentage of program group members who are employed or receiving AFDC), the imputation of the standard errors was done as follows:

$$\sigma = \sqrt{\frac{P_t (1 - P_t)}{N_t} + \frac{P_c (1 - P_c)}{N_c}},$$

where σ is the standard error of the program impact, P_t is the proportion with the outcome (e.g., receiving AFDC) in the treatment group, N_t is the number of people in the treatment group, P_c is the corresponding proportion in the control group, and N_c is the number of people in the control group.
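As a minimal sketch, the standard error formula for proportions can be computed as follows. The function name and the input values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def proportion_impact_se(p_t, n_t, p_c, n_c):
    """Standard error of a difference between two independent proportions."""
    return math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)


# Hypothetical example: half of each group receives AFDC, 100 people per group.
se = proportion_impact_se(p_t=0.5, n_t=100, p_c=0.5, n_c=100)
variance = se ** 2  # the variance used for weighting is the square of the SE
print(se, variance)
```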

For impacts that are measured as a continuous variable (e.g., earnings or the amount of AFDC received), imputation of the standard error is considerably more complex. First, for impacts that were significant at the 5- or 10-percent level, the p-value was assumed to lie at the midpoint of its possible range: if 0.1 > p > 0.05, p was assumed to equal 0.075; and if 0.05 > p > 0.01, p was assumed to equal 0.03. Second, cases for which impacts were significant at the 1-percent level have an unbounded t-value, and cases for which impacts were non-significant can have extremely small standard errors. For these cases, therefore, we used the following procedure: (1) we multiplied each of the standard errors imputed as described above for impacts that were significant at the 5- or 10-percent level by the square root of the sample size on which the impact estimate was based; (2) we computed the average of the values derived in (1); (3) for cases in which impacts were significant at the 1-percent level or were non-significant, we imputed the standard error by dividing the constant derived in (2) by the square root of the sample size on which the impact estimate was based.
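The procedure above can be sketched as follows. Two points are assumptions rather than details given in the text: the standard error is backed out of the assumed p-value as |impact| / z, using a two-sided test and a normal approximation, and the record layout, labels, and function names are hypothetical.

```python
import math
from statistics import NormalDist

# Midpoint p-values assumed in the text for each reported significance level.
MIDPOINT_P = {"10%": 0.075, "5%": 0.03}


def se_from_p_value(impact, p):
    """SE implied by a two-sided p-value (normal approximation; an assumption)."""
    z = NormalDist().inv_cdf(1 - p / 2)
    return abs(impact) / z


def impute_standard_errors(records):
    """records: (impact, sample_size, level) with level in '1%', '5%', '10%', 'ns'."""
    ses = [None] * len(records)
    scaled = []
    for k, (impact, n, level) in enumerate(records):
        if level in MIDPOINT_P:
            ses[k] = se_from_p_value(impact, MIDPOINT_P[level])
            scaled.append(ses[k] * math.sqrt(n))  # step (1)
    if not scaled:
        raise ValueError("need at least one estimate significant at 5% or 10%")
    constant = sum(scaled) / len(scaled)  # step (2)
    for k, (impact, n, level) in enumerate(records):
        if ses[k] is None:  # significant at 1% or non-significant
            ses[k] = constant / math.sqrt(n)  # step (3)
    return ses


# Hypothetical earnings impacts with their sample sizes and reported levels.
records = [(50.0, 400, "10%"), (80.0, 900, "5%"), (120.0, 1600, "1%"), (5.0, 100, "ns")]
ses = impute_standard_errors(records)
print(ses)
```

Under this sketch, the imputed standard errors for the 1-percent and non-significant cases depend only on sample size, not on the size of the impact itself.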

No measures of statistical significance are available for the cost-benefit estimates of program effectiveness because such estimates are a composite of separate impact estimates. Thus, in our analysis of these measures, we weight by the square root of the total sample used in the evaluation (i.e., by the square root of N_t + N_c), rather than by 1/v_i. For impact estimates for which both weights are available, however, the simple correlation between the two is quite high, around 0.85 to 0.9.

3.2 STEPS IN CONDUCTING THE STATISTICAL ANALYSIS OF THE PROGRAM EFFECT ESTIMATES

Descriptive Analysis. The first step in performing the meta-analysis was to conduct a descriptive analysis of program impacts. Thus, we present statistics for the means and medians of the impact estimates, their standard deviations, and their minimum and maximum values. Both weighted and unweighted means are reported. These statistics provide an overall picture of the size of the effects of welfare-to-work programs and how they vary.

Regression Analysis. The next step consists of using regression analysis to explain the variation among the program effect estimates. This analysis is limited to the evaluations of the mandatory programs, as there is an insufficient number of observations for the voluntary programs to conduct a regression analysis. The analysis performed for the voluntary programs is described below. For the reasons discussed above, we focus on regressions that are weighted by 1/v_i. However, given the problems in computing 1/v_i, we also estimated unweighted regressions for comparison purposes. It may be useful to point out that the R-squared in both the unweighted and the weighted regressions must be less than one because the program impact estimates are subject to sampling error. This would be true even if all the systematic sources of variation in the program impact estimates could be taken into account.
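To illustrate what weighting a regression by the inverse of the variance means, the following sketch fits a weighted least squares line with a single explanatory variable. It is a simplification under stated assumptions: the regressions in this report use several explanatory variables, and the function name and data here are hypothetical.

```python
def weighted_least_squares(x, y, variances):
    """Fit y = intercept + slope * x, weighting each observation by 1 / variance."""
    w = [1.0 / v for v in variances]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data: x might be a net participation rate in job search,
# y the estimated earnings impact, and variances the imputed v_i.
x = [0.0, 1.0, 2.0, 3.0]
y = [10.0, 12.0, 14.0, 16.0]
variances = [1.0, 4.0, 2.0, 8.0]

intercept, slope = weighted_least_squares(x, y, variances)
print(intercept, slope)
```

Setting all variances equal reproduces the unweighted regression, which is the comparison case mentioned above.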

To examine how the impacts of welfare-to-work interventions change over time, we pooled impact measures across the twenty post-random assignment calendar quarters in our database. Otherwise, however, we estimated separate regressions for intervention impact measures in four different post-random assignment calendar quarters: the 3rd, 7th, 11th, and 15th. [8] There are three reasons we did this. First, we can determine whether the importance of certain explanatory variables changes over time. For example, one might anticipate that job search would have a stronger influence on earnings during the early post-random assignment quarters than during later calendar quarters and that the opposite might be true of vocational training. Second, an evaluation of a welfare-to-work intervention usually reports impact estimates for several different calendar quarters. These impact estimates are not statistically independent of one another. Moreover, more quarters of impact estimates are available for some evaluated programs than for others. Thus, pooling across quarters would inappropriately give more weight to some evaluations than to others. Estimating separate regressions for different quarters helps circumvent these problems. Third, we conducted Chow tests on several of the impact measures to see whether different regression models were needed for different calendar quarters. The tests resoundingly rejected the hypothesis that the coefficient vector for calendar quarters 1-10 is the same as that for quarters 11-20. Although less strongly, they also rejected the hypotheses that the regression models were the same for quarters 1-5 as for quarters 6-10 and the same for quarters 11-15 as for quarters 16-20. These results imply that although impact estimates might be pooled across a few adjacent or nearly adjacent calendar quarters, separate regressions should be estimated for quarters that are far apart.

In estimating separate regressions for quarters 3, 7, 11, and 15, we conducted additional Chow tests to determine whether different regression models are required for program impacts that are estimated for one-parent families and for those that were estimated for two-parent families or for impacts for programs that provided services and for impacts for programs that only provided financial work incentives. This time, the tests strongly and consistently indicated that the coefficient vectors did not significantly differ for these different groups and, hence, that the impact estimates could be pooled across the groups.
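For readers unfamiliar with the Chow test, its usual textbook form is shown below. This is an assumption about the variant used, since the report does not state it. Here RSS_P is the residual sum of squares from the pooled regression, RSS_1 and RSS_2 are those from the two separate regressions (weighted sums of squares if the regressions are weighted), k is the number of estimated coefficients, and N_1 and N_2 are the numbers of observations in the two groups.

```latex
F = \frac{\left[ RSS_P - (RSS_1 + RSS_2) \right] / k}
         {(RSS_1 + RSS_2) / (N_1 + N_2 - 2k)}
```

Under the null hypothesis that the two coefficient vectors are equal, F follows an F distribution with k and N_1 + N_2 - 2k degrees of freedom. A large value therefore argues against pooling.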

In estimating the regressions, we needed explanatory variables that measure the difference that welfare reform programs made in terms of the type and range of services provided. For this purpose, we used the difference in participation rates in various activities (job search, basic education, work experience, and so forth) between those assigned to programs and those assigned to the control groups. Thus, we obtain measures that quantify the "net effect" of the introduction of a welfare reform program relative to the traditional program. These measures have an advantage over other effect indicators, such as stated policies or declared program intentions, in that they reflect what actually occurred. In addition, they take account of program non-participation, including caseload attrition due to an unassisted return to work or to leaving the welfare rolls of one's own volition. As relative or "net" effect indicators, they also take account of variations in the intensity of service provision between different programs and program sites.

There is a possibility, however, that the measures of program participation rates that are used as explanatory variables in the regressions are endogenously determined. This could occur, for example, if programs that have a client population of individuals who are mostly job ready (e.g., high school graduates with considerable previous work experience) tend to stress job search, while programs with large fractions of clients who are not job ready tend to emphasize basic education. Similarly, programs that are located at sites with low unemployment rates might tend to emphasize job search and those with high unemployment rates might make more use of vocational training. Under these circumstances, program participation rates would, in part, reflect client and site characteristics, causing estimates of the relation between these measures and program impacts to be biased. It should be borne in mind, however, that the regressions control directly for client and site characteristics. Moreover, as discussed above, the program participation rates that we actually use in the regressions are measured in terms of the degree to which each program changes the pre-program regime—that is, the difference between the program group and the control group. Although program designs may reflect the characteristics of the available client population or local environmental conditions, it is not apparent that changes in how programs are run would be affected by client and site characteristics, assuming that these characteristics remain fairly stable.

Homogeneity Tests. For the 7th and the 11th calendar quarters, the database contains ten impact estimates for programs that placed welfare recipients who volunteered to participate into temporary jobs that paid them a stipend while they learned work skills. This is an insufficient number to conduct a regression analysis of the sort we conducted with the mandatory programs. Thus, we have conducted formal tests of homogeneity instead. We also conducted tests of homogeneity of the measures of the effects of welfare-to-work programs on child well-being, because sample size is also limited. In this case, we have a variety of different measures for each of three different age groups, but relatively few estimates for most measures. The homogeneity tests allowed us to see whether the estimated impacts differ significantly from one another (e.g., whether impacts for expensive interventions differ from impacts for inexpensive interventions).

A homogeneity test relies on the Q statistic, where Q is the weighted sum of squares of the estimated impacts, E_i, about the weighted mean effect, Ē, and where (as before) the weights are the inverse of the variance of the estimated impacts, 1/v_i (Lipsey and Wilson, 2001, pp. 215-216).

Thus, the formula for Q is

$$Q = \sum_{i=1}^{n} \frac{1}{v_i} \left( E_i - \bar{E} \right)^2$$

Under the null hypothesis of homogeneity, Q is distributed as a chi-square with degrees of freedom equal to one less than the number of program effect estimates. If Q is below the critical chi-square value, then the dispersion of the program effect estimates around their mean is no greater than that expected from sampling error alone. If the null hypothesis of homogeneity is rejected (i.e., Q exceeds the critical value), this implies that there are differences among the program effect estimates that are due to systematic factors (e.g., differences in program or target group characteristics), not to sampling error alone.
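The homogeneity test can be sketched as follows. The function names and the two impact estimates are hypothetical. To keep the example free of external libraries, the critical chi-square value is supplied by the caller (3.841 is the 5-percent critical value for one degree of freedom).

```python
def q_statistic(impacts, variances):
    """Return (Q, degrees of freedom) for a set of impact estimates."""
    weights = [1.0 / v for v in variances]
    mean = sum(w * e for w, e in zip(weights, impacts)) / sum(weights)
    q = sum(w * (e - mean) ** 2 for w, e in zip(weights, impacts))
    return q, len(impacts) - 1


def rejects_homogeneity(q, critical_value):
    """True if Q exceeds the critical chi-square value chosen by the caller."""
    return q > critical_value


# Hypothetical impact estimates and variances.
impacts = [100.0, 200.0]
variances = [25.0, 100.0]

q, df = q_statistic(impacts, variances)
heterogeneous = rejects_homogeneity(q, critical_value=3.841)  # 5% level, df = 1
print(q, df, heterogeneous)
```

With these made-up numbers Q far exceeds the critical value, so the two estimates would be judged to differ by more than sampling error alone.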

To analyze the voluntary welfare-to-work programs and the child well-being impact measures, we first pooled all the available program impact estimates and then, using the test described above, determined whether they are distributed homogeneously. In those cases when they are not, we then divided the impact estimates into subgroups on the basis of various potential explanatory factors (e.g., differences in net government operational costs, services provided, client characteristics, or site environmental characteristics) and repeated the homogeneity test. If the impact estimates for the subgroups are more homogeneous than those for the full set of observations, then this suggests an explanation for at least some of the divergence in the impact estimates.




[7] There is an alternative weighting scheme that attempts to take account of factors that cause variation in program impacts but were not measured (e.g., the quality of leadership at program sites or local attitudes towards welfare recipients), as well as sampling variation (see Raudenbush, 1994). However, this method is laborious to implement, and we do not use it here.

[8] There were a few evaluations that did not report impact estimates for the quarters of interest, but did report them for nearby quarters (for example, for quarter 6 or 8 or 9, but not quarter 7). These values were included in conducting the analysis in order to maximize the number of quarterly observations on which the calculations are based. In addition, there were a few evaluations that reported program effects on annual earnings and annual AFDC receipts, but did not provide quarterly estimates of these impacts. In these instances, the annual estimates were divided by four and assigned to the quarter of interest that occurred during the year over which the annual impacts were measured.

 
