Administration for Children and Families
Office of Planning, Research & Evaluation (OPRE)

APPENDIX B: SAMPLE SIZE REQUIREMENTS

This appendix discusses the estimation of sample size requirements for various designs for evaluating quality enhancement ideas in Head Start. We present a mathematical formulation to demonstrate the sources of variance under each sample design that would be plausible to use for evaluating the quality enhancement ideas. We then present estimates of sample size requirements for these sample designs.

The minimum detectable effect sizes (MDEs) represent the smallest true impact, in effect size units (that is, expressed as a fraction of the standard deviation of the outcome measure), that can be detected with a high probability. They pertain to overall impact estimates when data are pooled across all programs, centers, classrooms, and children in the study.

The MDE formula used in the calculations can be expressed as follows:

(1) $\mathrm{MDE} = 2.802 \times \dfrac{\sqrt{(1 - R^2)\,\mathrm{Var}(\text{impact})}}{\sigma}$,

where 2.802 is a constant that applies when using a two-tailed test at the 5 percent significance level, 80 percent power, and infinite degrees of freedom; $R^2$ is the R-squared value from the regression adjusting impact estimates for baseline demographics and other variables; Var(impact) is the variance of the impact estimate (the mean treatment and control group difference in the outcome measure), which should take into account design effects arising from the clustering of the sample and from weighting; and $\sigma$ is the standard deviation of the outcome measure.
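Equation (1) can be sketched in a few lines of Python. The function names and the example inputs below are illustrative assumptions, not values from the report. The multiplier is recomputed from standard normal quantiles, which matches the infinite-degrees-of-freedom case described above.

```python
from math import sqrt
from statistics import NormalDist

def mde_multiplier(alpha=0.05, power=0.80):
    """Two-tailed multiplier under a normal approximation (infinite df)."""
    z = NormalDist()
    return z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)

def mde(var_impact, sigma=1.0, r_squared=0.0, multiplier=2.802):
    """Equation (1): minimum detectable effect in effect-size units."""
    return multiplier * sqrt((1 - r_squared) * var_impact) / sigma

factor = mde_multiplier()                       # close to the 2.802 used in the text
example = mde(var_impact=0.01, r_squared=0.2)   # hypothetical inputs
print(round(factor, 3), round(example, 3))
```

With the outcome standardized (sigma = 1), the hypothetical variance of 0.01 and R-squared of 0.2 give an MDE of roughly a quarter of a standard deviation.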

The main focus of this appendix is the derivation of the variance of the impact estimates when samples are clustered under each design alternative. The precision of the impact estimates decreases substantially as the sample becomes more clustered. Children in Head Start programs are clustered in classrooms; classrooms are clustered in centers; and centers are clustered in grantees. Thus, for a given sample size of children, precision levels are maximized when the unit of random assignment is at the child level, and they decrease when the unit of random assignment is at the classroom, center, or program level. Consequently, by increasing the number of grantees and centers that must participate in the study to yield a desired level of precision, clustering can have a large impact on the costs of the evaluation.

The design effect of weighting increases the variance of the impact estimates when the analysis incorporates differential weights that are constructed to adjust for differential participation rates of the enhancement and control groups. The design effect of weighting is a constant that is larger if the sample for either the enhancement group or control group is underrepresented and therefore differentially weighted (perhaps because of differential consent rates); it does not vary with sample size. It generally takes values between 1.1 and 1.5, with the lower values corresponding to samples with high participation rates. By making efforts to ensure high participation rates of centers, classrooms, and children, researchers can minimize the design effect of weighting. The likelihood of minimizing this term (and the possibility of other offsetting factors in the calculation of minimum detectable effects) has led us to calculate sample sizes for this appendix without the design effect of weighting. Nevertheless, it should be kept in mind as site recruitment and data collection efforts are planned because of its independent effect on the precision of the impact estimates.
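To get a feel for the range quoted above, the sketch below assumes the design effect of weighting enters as a multiplicative factor on Var(impact), so that the MDE scales with its square root. That multiplicative form is an assumption made for illustration; the appendix does not spell out the functional form. The baseline MDE of 0.20 is hypothetical.

```python
from math import sqrt

def mde_with_weighting(mde_unweighted, deff_weighting):
    """Scale an MDE by the square root of a multiplicative design effect (assumed form)."""
    if deff_weighting < 1.0:
        raise ValueError("a design effect below 1 would not inflate the variance")
    return mde_unweighted * sqrt(deff_weighting)

low = mde_with_weighting(0.20, 1.1)    # high participation rates
high = mde_with_weighting(0.20, 1.5)   # low participation rates
print(round(low, 3), round(high, 3))
```

Under this assumption, a design effect of 1.1 raises the MDE by about 5 percent and a design effect of 1.5 raises it by about 22 percent.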

  1. VARIANCE CALCULATIONS FOR GROUP-BASED EXPERIMENTAL DESIGNS

  In this section, we focus on the sources of variance under each sample design that could be used to evaluate quality enhancement ideas in Head Start. We show how Var(impact) from the MDE formula can be estimated under each sample design. The applicability of each design depends on the intervention under investigation. For example, some interventions must be tested at the center level due to potential spillover effects, whereas others could be tested at the classroom or even the child level.

    1. Non-Clustered Design

    We begin with the simplest, non-clustered design to illustrate the key components of the variance estimate. In this evaluation design, we identify a group of grantees (and/or centers) to participate in an evaluation of an individual child- or family-focused intervention, such as a mentoring program matching high school students and Head Start children or a special parent education program. In this type of non-clustered design, children within purposively selected (or volunteer) centers would be randomly assigned to a research status directly; no program-, center-, or classroom-level clustering would take place. The variance of the estimated impact on an outcome measure (that is, the difference between the mean outcome of treatment and control group members) must account for between-child variance only and can be expressed as follows:

      (2) $\mathrm{Var}(\text{impact}) = \dfrac{2\sigma^2}{cs}$,

      where $c$ is the number of centers in the sample, $s$ is the average number of children per center in each of the treatment and control groups (which are assumed to be equal in size), and $\sigma^2$ is the variance of the outcome measure.

      We assume equal numbers of treatment and control children, because, for a given total research sample size, a 50:50 split between the two groups yields the most precise estimates. A finite sample correction (equal to one minus the proportion of the population being selected) could be included in the equation; however, for simplicity and to produce conservative estimates, we do not include this term. We follow this approach for the remainder of this appendix.
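The following sketch evaluates equation (2) and checks the claim about the 50:50 split using the standard variance of a difference between two independent means. The sample sizes (20 centers, 15 children per group per center) are hypothetical and chosen only for illustration.

```python
from math import sqrt

def var_two_group(n_treatment, n_control, sigma2=1.0):
    """Variance of a difference in means for two independent simple random samples."""
    return sigma2 * (1.0 / n_treatment + 1.0 / n_control)

def var_nonclustered(c, s, sigma2=1.0):
    """Equation (2): c centers, s treatment (and s control) children per center."""
    return 2.0 * sigma2 / (c * s)

c, s = 20, 15                           # hypothetical design
var_eq2 = var_nonclustered(c, s)
mde_eq2 = 2.802 * sqrt(var_eq2)         # equation (1) with sigma = 1 and R-squared = 0

balanced = var_two_group(300, 300)      # 50:50 split of 600 children
unbalanced = var_two_group(200, 400)    # same total, unequal split
print(round(mde_eq2, 3), balanced < unbalanced)
```

With 300 children per group, equation (2) and the two-group formula agree, and the unequal split of the same 600 children yields a larger variance.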

      As discussed in the report, a non-clustered design might not be feasible or applicable for many interventions. Instead, the designs typically involve random assignment of groups (classes, centers, or grantees) to the enhancement or control condition. In these designs, the variance is larger than in equation (2) because the calculations must account for the additional variance that arises from clustering at the level of random assignment: outcomes for children within these clusters are correlated, and a different random assignment would yield a different set of centers or classrooms in the treatment and control groups. In addition, the variance calculations must account for the additional variance that arises if the programs, centers, and classrooms in the sample are meant to represent a broader Head Start population, rather than the impact estimates being generalized only to the sample included in the evaluation. We discuss these issues in additional detail in the next section.

    2. Fixed or Random Effects Designs

    A fundamental issue to consider when pooling across classrooms, centers, or grantees is whether site effects should be treated as fixed or random. This decision is likely to depend on whether the evaluation is taking a Stage 2 approach (smaller-scale evaluation, using volunteer sites) or a Stage 3 approach (field test). For most Stage 2 evaluations, sites (such as centers and grantees) will be volunteers that are purposively selected for the study because they are in a specific region, have a set of characteristics desired for the control group, and are willing and able to participate in the evaluation. In general, in these instances, the variance calculations for pooled impact estimates should not account for between-site variance terms. The sample was not randomly selected, so the pooled impact estimates can be generalized only to the study sites, rather than to a broader population of sites. Stated differently, the study can produce impact estimates that are internally valid, but not necessarily externally valid. For Stage 2 designs, where the goal is to determine whether an intervention can work, it is sufficient to demonstrate impacts within a purposively chosen sample (that is, a “best-case” scenario for the efficacy of the quality enhancement idea).

      In contrast, Stage 3 evaluations are designed so that the results can generalize to all Head Start centers and children (or to a well-defined subset, such as a region). In these evaluations, grantees and centers will be randomly selected from all Head Start grantees and centers (or from a well-defined subpopulation of programs). Study results can be generalized more broadly in these random-effects designs than in the fixed-effects designs. However, this generalization involves a cost in terms of precision levels: the variance formulas must be inflated to account for between-site effects. Stated differently, site effects must be treated in the variance formulas as random, not as fixed. Intuitively, in repeated sampling, a different set of sites would be selected for the evaluation, which could influence the impact findings. Hence, the variance expressions must account for the extent to which mean child outcomes vary across sites. In the remainder of this section, we present formulas for both fixed- and random-effects designs. These formulas can be derived from standard sampling theory for clustered designs (see Cochran 1977 and Kish 1965) or from random effects models assuming normally distributed error terms.

    3. Variance Calculations for Head Start Quality Enhancement Designs

    For evaluations of Head Start quality enhancements, we consider experimental designs in which any of the following units are randomly assigned to a research status:

      • Children

      • Classrooms

      • Centers

      • Grantees

      We also consider designs in which these units are randomly selected from a larger pool of units, either before or after random assignment. For example, if centers are randomly assigned to the enhancement or a control group, we consider the case in which grantees are randomly selected from a larger set of grantees (the random-effects case), as well as when they are selected purposively (the fixed-effects case). Each level of random assignment and random selection introduces additional layers of clustering into the variance formulas.

      1. Variance Estimates for Stage 3 Designs

      Table B.1 summarizes the designs that we consider for Stage 3. It also provides an example of how each design might be used for a Stage 3 evaluation, and displays equation numbers for the variance formulas for each design.

        For Stage 3 designs, we would randomly select grantees for the evaluation (and would include the appropriate variance terms for the grantee-level clustering). We would use either all centers or randomly select centers (and would do the same for classrooms within centers). No grantees or centers would be purposively selected; the goal would be to generalize the results across all Head Start programs, centers, and children. We present the designs from the least to the most clustered.

        Random Assignment of Children. A design that could be used for some evaluations of Head Start quality enhancements is to randomly assign children or families to receive either quality enhancement services or regular Head Start services. For example, an intensive parent education program might be offered to randomly selected families within a Head Start center. Families in the control group would receive the normal Head Start services, whereas those in the intervention group would receive the parent education program. This type of design is appropriate for interventions that are administered on an individual child or family basis (such as a child mentoring program involving a match between selected children and high school students), and for which potential spillover effects are small.

        Table B.1. Stage 3 Designs

        Child-Level Random Assignment
          Purposive selection: None
          Random (representative) selection: Grantee; center
          Sources of clustering: Grantee; center
          Results can generalize to: All Head Start children
          Equation number for variance formula: 3
          Examples from report: Parent education intervention; matching children with high school mentors
          Comments: Intervention must be at the child level (not classroom-wide); few quality enhancements meet this criterion.

        Classroom- or Teacher-Level Random Assignment
          Purposive selection: None
          Random (representative) selection: Grantee; centers; classrooms, either (a) using all classrooms or (b) randomly selecting classrooms
          Sources of clustering: Grantee; center; classrooms
          Results can generalize to: All Head Start classes
          Equation number for variance formula: 4
          Examples from report: Alternative mathematics curricula
          Comments: Randomly assign teachers after class lists are set, or randomly assign children to classrooms after teachers are randomly assigned. In a Stage 3 design, classroom random assignment would be very difficult to monitor.

        Center-Level Random Assignment
          Purposive selection: None
          Random (representative) selection: Grantee; centers, either (a) using all centers or (b) randomly selecting centers
          Sources of clustering: Grantee; center
          Results can generalize to: All Head Start centers
          Equation numbers for variance formula: 5, 6
          Examples from report: Alternative mathematics curricula; approaches to addressing children's behavioral problems; T/TA to use program data for quality improvement
          Comments: Centers are randomly assigned, so clustering is at the center level; it does not affect power to use all centers or randomly select centers.

        Grantee-Level Random Assignment (all centers used)
          Purposive selection: None
          Random (representative) selection: Grantee; use all centers
          Sources of clustering: Grantee
          Results can generalize to: All Head Start grantees
          Equation number for variance formula: 8
          Examples from report: T/TA to use program data for quality improvement
          Comments: This design would be best if grantees do not include large numbers of centers.

        Grantee-Level Random Assignment (centers randomly selected)
          Purposive selection: None
          Random (representative) selection: Grantee; randomly select centers
          Sources of clustering: Grantee; center
          Results can generalize to: All Head Start grantees
          Equation number for variance formula: 7
          Examples from report: T/TA to use program data for quality improvement
          Comments: This design would permit sampling centers so that each grantee includes a similar number of centers for analysis.

        T/TA = training and technical assistance.

        At Stage 3, the participating grantees and centers would be randomly selected to represent all Head Start programs. However, because the intervention focuses on children, the power of the design could be strengthened by selecting the children without regard to classroom. This step would eliminate classroom-level clustering from the variance calculations. For this design, the variance calculations would be approximated by:

        (3) $\mathrm{Var}(\text{pooled impact}) = \dfrac{2\sigma^2\rho_1(1 - c_1)}{g} + \dfrac{2\sigma^2\rho_2(1 - c_2)}{cg} + \dfrac{2\sigma^2(1 - \rho_1 - \rho_2)}{cgs}$,

        where $g$ is the total number of grantees in the sample; $c$ is the average number of centers per grantee; $s$ is the number of treatment or control children per center; $\rho_1$ is the grantee-level intraclass correlation in the outcome measure; $\rho_2$ is the center-level intraclass correlation; $\sigma^2$ is the variance of the outcome measure; $c_1$ is the correlation in mean outcomes between treatment and control group children within grantees; and $c_2$ is the correlation in mean outcomes between treatment and control group children within centers within grantees. The correlation in mean outcomes between treatment and control groups will be smaller if impacts are either large or of different sizes (and directions) for subgroups of children within treatment and control groups. In equation (3), the first term essentially represents the extent to which impacts differ across grantees, and the second represents the extent to which impacts differ across centers within grantees.
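A minimal sketch of equation (3) follows, returning the three terms separately so that the contribution of each level is visible. It reads the final numerator as one minus the sum of the two intraclass correlations. All parameter values (30 grantees, 2 centers per grantee, 10 children per group per center, intraclass correlations of 0.05, treatment-control correlations of 0.5) are hypothetical placeholders, not estimates from the report.

```python
from math import sqrt

def var_eq3_terms(g, c, s, rho1, rho2, c1, c2, sigma2=1.0):
    """Equation (3), split into grantee, center, and child components."""
    grantee = 2 * sigma2 * rho1 * (1 - c1) / g
    center = 2 * sigma2 * rho2 * (1 - c2) / (c * g)
    child = 2 * sigma2 * (1 - rho1 - rho2) / (c * g * s)
    return grantee, center, child

# Hypothetical inputs, for illustration only
terms = var_eq3_terms(g=30, c=2, s=10, rho1=0.05, rho2=0.05, c1=0.5, c2=0.5)
total = sum(terms)
mde = 2.802 * sqrt(total)   # equation (1) with sigma = 1 and R-squared = 0
print([round(t, 5) for t in terms], round(mde, 3))
```

In this example the child-level term is the largest single component, but the grantee and center terms together account for almost half of the total, even with modest intraclass correlations.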

        Random Assignment of Classrooms. Another design is to randomly assign classrooms (or teachers) within randomly selected centers to the intervention or control groups. For example, teachers could be randomly assigned to either try a new early mathematics curriculum or continue their usual practices in the Head Start classroom. This type of design is appropriate for interventions that are administered at the classroom level, and for which potential spillover effects are deemed to be small. Note, however, that because a very large number of classrooms and centers would be involved in a Stage 3 evaluation, it would be difficult to visit classrooms frequently enough to ensure that spillover does not occur. An enhancement involving classroom random assignment at Stage 3 would have to be intrinsically difficult to move from one classroom to another.

        For a Stage 3 design, the grantees and centers would be sampled to represent all Head Start programs. Under this design, grantee-level, center-level, and classroom-level clustering are present and the variance expression can be approximated as follows:

        (4) $\mathrm{Var}(\text{impact}) = \dfrac{2\sigma^2\rho_1(1 - c_1^{*})}{g} + \dfrac{2\sigma^2\rho_2(1 - c_2^{*})}{cg} + \dfrac{2\sigma^2\rho_3}{cgl} + \dfrac{2\sigma^2(1 - \rho_1 - \rho_2 - \rho_3)}{cgls}$,

        where $g$ is the total number of grantees in the sample; $c$ is the average number of centers per grantee; $l$ is the average number of treatment or control classrooms per center; $s$ is the average number of children per classroom; $\rho_1$ is the grantee-level intraclass correlation in the outcome measure; $\rho_2$ is the center-level intraclass correlation; $\rho_3$ is the classroom-level intraclass correlation; $\sigma^2$ is the variance of the outcome measure; $c_1^{*}$ is the correlation in mean outcomes between treatment and control group classrooms within grantees; and $c_2^{*}$ is the correlation in mean outcomes between treatment and control group classrooms within centers within grantees.

        Random Assignment of Centers. In some designs, centers within grantees would be randomly assigned to the intervention and control groups. This design would be chosen if the quality enhancement should be implemented center-wide because classroom spillover effects are likely, or because the quality enhancement will be more effective if everyone in the center is engaged in implementation. For example, teachers implementing strategies to promote positive social-emotional behavior in the classroom might be inclined to share those strategies with teachers in other classrooms who are having difficulties with children’s behavior. Similarly, individual teachers might implement the behavioral strategies more successfully if they are able to discuss their experiences and methods with other teachers in the center.

        The research design would call for random sampling of grantees and, possibly, of centers before random assignment. Because the intervention is fairly intensive to implement, the evaluation would be more cost-effective if, in every center that is randomly assigned, every classroom were included in the evaluation. In this case, clustering would occur only at the grantee and center levels, and the variance expression can be approximated as follows:

        (5) $\mathrm{Var}(\text{impact}) = \dfrac{2\sigma^2\rho_1(1 - c_1^{**})}{g} + \dfrac{2\sigma^2\rho_2}{cg} + \dfrac{2\sigma^2(1 - \rho_1 - \rho_2)}{cgs}$,

        where $g$ is the total number of grantees in the sample; $c$ is the average number of treatment or control centers per grantee; $s$ is the average number of children per center; $\rho_1$ is the grantee-level intraclass correlation in the outcome measure; $\rho_2$ is the center-level intraclass correlation; $\sigma^2$ is the variance of the outcome measure; and $c_1^{**}$ is the correlation in mean outcomes between treatment and control group centers within grantees.

        An alternative design involving random assignment of centers might test alternative mathematics curricula, which, at Stage 3, could be implemented by distributing manuals, videotaped lessons for teachers, group activities to be conducted among all teachers in the center as part of professional development activities, and assistance from the Head Start Training and Technical Assistance network. For this evaluation, the design could call for random selection of a sample of grantees, random assignment of centers, and data collection in a randomly selected group of classrooms in each center. This design would involve clustering at the grantee, center, and classroom levels, and the variance expression can be approximated as follows:

        (6) $\mathrm{Var}(\text{impact}) = \dfrac{2\sigma^2\rho_1(1 - c_1^{**})}{g} + \dfrac{2\sigma^2\rho_2}{cg} + \dfrac{2\sigma^2\rho_3}{cgl} + \dfrac{2\sigma^2(1 - \rho_1 - \rho_2 - \rho_3)}{cgls}$,

        where $g$ is the total number of grantees in the sample; $c$ is the average number of treatment or control centers per grantee; $l$ is the average number of classrooms per center; $s$ is the average number of children per classroom; $\rho_1$ is the grantee-level intraclass correlation in the outcome measure; $\rho_2$ is the center-level intraclass correlation; $\rho_3$ is the classroom-level intraclass correlation; $\sigma^2$ is the variance of the outcome measure; and $c_1^{**}$ is the correlation in mean outcomes between treatment and control group centers within grantees.
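The two center-level designs can be compared numerically. The sketch below implements equations (5) and (6), reading the final numerators as one minus the sum of the intraclass correlations and the classroom term in equation (6) as proportional to the outcome variance times the classroom-level intraclass correlation. The inputs are hypothetical and hold the number of children per center at 40 in both designs, so the difference reflects only the added classroom-level term.

```python
def var_eq5(g, c, s, rho1, rho2, c1, sigma2=1.0):
    """Equation (5): centers randomly assigned, all classrooms included."""
    return (2 * sigma2 * rho1 * (1 - c1) / g
            + 2 * sigma2 * rho2 / (c * g)
            + 2 * sigma2 * (1 - rho1 - rho2) / (c * g * s))

def var_eq6(g, c, l, s, rho1, rho2, rho3, c1, sigma2=1.0):
    """Equation (6): centers randomly assigned, classrooms randomly sampled."""
    return (2 * sigma2 * rho1 * (1 - c1) / g
            + 2 * sigma2 * rho2 / (c * g)
            + 2 * sigma2 * rho3 / (c * g * l)
            + 2 * sigma2 * (1 - rho1 - rho2 - rho3) / (c * g * l * s))

# Hypothetical inputs: 30 grantees, 2 centers per arm per grantee
v5 = var_eq5(g=30, c=2, s=40, rho1=0.05, rho2=0.05, c1=0.5)
v6 = var_eq6(g=30, c=2, l=2, s=20, rho1=0.05, rho2=0.05, rho3=0.05, c1=0.5)
print(round(v5, 6), round(v6, 6))
```

With these placeholder values, sampling two classrooms per center raises the variance by roughly a fifth relative to the design without classroom sampling.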

        Random Assignment of Grantees. Some interventions are appropriately implemented at the grantee level. For example, a change in management strategies, such as technical assistance to help programs use information from a variety of program data sources to assess and identify areas for quality improvement, would be most appropriately implemented across an entire grantee. Similarly, a management course for directors or a change in the way that education coordinators work with teachers would have grantee-wide effects.

        For an evaluation of this type of quality enhancement, we would obtain a representative sample of grantees and would randomly assign the sample members to implement the quality enhancement or not. Although many centers potentially would be affected by the enhancement, the evaluation could focus on a random sample of centers and children within centers. Clustering would occur at the grantee level and, if centers are randomly selected for data collection, at the center level as well. For a design involving random assignment of grantees and a random sample of centers within grantees, the variance expression can be approximated as follows:

        (7) $\mathrm{Var}(\text{impact}) = \dfrac{2\sigma^2\rho_1}{g} + \dfrac{2\sigma^2\rho_2}{cg} + \dfrac{2\sigma^2(1 - \rho_1 - \rho_2)}{cgs}$,

        where $g$ is the number of treatment or control grantees in the sample; $c$ is the average number of centers per grantee; $s$ is the average number of children per center; $\sigma^2$ is the variance of the outcome measure; $\rho_1$ is the grantee-level intraclass correlation in the outcome measure; and $\rho_2$ is the center-level intraclass correlation.

        For a design involving random assignment of grantees with all centers and classrooms included in the evaluation, the variance expression can be approximated as follows:

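        (8) $\mathrm{Var}(\text{impact}) = \dfrac{2\sigma^2\rho_1}{g} + \dfrac{2\sigma^2(1 - \rho_1)}{gs}$,

        where all terms are defined above, except that, because no center term appears, $s$ here must be read as the average number of children per grantee.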
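One consequence of equations (7) and (8) is that adding children cannot push the variance below the grantee-level term. The sketch below illustrates this with hypothetical inputs (30 grantees per research group and intraclass correlations of 0.05); in the second function, s is treated as children per grantee, which is an interpretation rather than something the appendix states.

```python
from math import sqrt

def var_eq7(g, c, s, rho1, rho2, sigma2=1.0):
    """Equation (7): grantees randomly assigned, centers randomly sampled."""
    return (2 * sigma2 * rho1 / g
            + 2 * sigma2 * rho2 / (c * g)
            + 2 * sigma2 * (1 - rho1 - rho2) / (c * g * s))

def var_eq8(g, s, rho1, sigma2=1.0):
    """Equation (8): grantees randomly assigned, all centers and classrooms included."""
    return 2 * sigma2 * rho1 / g + 2 * sigma2 * (1 - rho1) / (g * s)

g, rho1 = 30, 0.05                      # hypothetical values
floor = 2 * rho1 / g                    # limit of equation (8) as s grows
v7 = var_eq7(g, c=3, s=20, rho1=rho1, rho2=0.05)
v_small = var_eq8(g, 100, rho1)
v_large = var_eq8(g, 1000, rho1)
mde_floor = 2.802 * sqrt(floor)         # equation (1) with sigma = 1 and R-squared = 0
print(round(v_small, 5), round(v_large, 5), round(mde_floor, 3))
```

With these placeholder values, increasing the number of children per grantee tenfold barely moves the variance, and the MDE cannot fall below about 0.16 unless more grantees are recruited.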

      2. Variance Estimates for Stage 2 Designs

      The variance calculations for Stage 2 designs are simpler than those for Stage 3 because they have to account for clustering only at the level of random assignment and, possibly, at another stage, if the sample is drawn randomly at that level. Table B.2 summarizes the various designs that we consider for Stage 2, provides an example of how each design might be used for a Stage 2 evaluation, and displays equation numbers for the variance formulas for each design.

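        Table B.2. Stage 2 Designs

        Child-Level Random Assignment (grantees purposively selected)
          Purposive selection: Grantee
          Random (representative) selection: None
          Sources of clustering: None
          Results can generalize to: All children in the selected grantees
          Equation number for variance formula: 9
          Examples from report: Parent education intervention; mentoring by high school students
          Comments: Limit evaluation to smaller grantees.

        Child-Level Random Assignment (grantees and centers purposively selected)
          Purposive selection: Grantee; center
          Random (representative) selection: None
          Sources of clustering: None
          Results can generalize to: All children in the selected centers
          Equation number for variance formula: 9
          Examples from report: Parent education intervention; mentoring by high school students
          Comments: Limit evaluation to a small number of centers within selected grantees.

        Classroom- or Teacher-Level Random Assignment (centers randomly selected)
          Purposive selection: Grantee
          Random (representative) selection: Centers
          Sources of clustering: Centers; classrooms
          Results can generalize to: Children/classes in the selected grantees
          Equation number for variance formula: 11
          Examples from report: Special curriculum; approach to working with children with behavioral issues
          Comments: Random assignment at the classroom level introduces classroom-level clustering, so the power is not affected by the choice of (1) using all classrooms, or (2) using a randomly selected set of classrooms.

        Classroom- or Teacher-Level Random Assignment (centers purposively selected)
          Purposive selection: Grantee; center
          Random (representative) selection: None
          Sources of clustering: Classrooms
          Results can generalize to: Children/classes in the selected centers
          Equation number for variance formula: 10
          Examples from report: Special curriculum; approach to working with children with behavioral issues
          Comments: This design differs from the preceding one only in that centers are purposively selected, rather than randomly selected, thereby increasing power.

        Center-Level Random Assignment (grantees purposively selected)
          Purposive selection: Grantee
          Random (representative) selection: None
          Sources of clustering: Centers
          Results can generalize to: All children in the selected grantees
          Equation number for variance formula: 12
          Examples from report: Alternative mathematics curriculum; T/TA to use program data for quality improvements
          Comments: Random assignment at the center level introduces center-level clustering, so power is not affected by the choice of (1) using all centers, or (2) using a randomly selected set of centers. For power, it would be better to select students for the sample without regard to classroom.

        Center-Level Random Assignment (grantees and centers purposively selected)
          Purposive selection: Grantee; center
          Random (representative) selection: None
          Sources of clustering: Centers
          Results can generalize to: All children in the selected centers
          Equation number for variance formula: 12
          Examples from report: Alternative mathematics curriculum; T/TA to use program data for quality improvements
          Comments: The only reason to choose this design over the preceding one is to obtain a sample of volunteer centers. The power would not differ.

        Center-Level Random Assignment (classrooms randomly selected)
          Purposive selection: Grantee; center
          Random (representative) selection: Classrooms
          Sources of clustering: Centers; classrooms
          Results can generalize to: All children in the selected classrooms
          Equation number for variance formula: 13
          Examples from report: Alternative mathematics curriculum; T/TA to use program data for quality improvements
          Comments: This design would be used if classrooms within centers are selected for observation. The power would be much less than under the previous design.

        Random-assignment level not labeled in the source row (grantees and centers randomly selected)
          Purposive selection: None
          Random (representative) selection: Grantee; center
          Sources of clustering: Grantee; center
          Results can generalize to: All Head Start children
          Equation number for variance formula: 15
          Examples from report: T/TA to use program data for quality improvements
          Comments: Children would be sampled from within centers. This design is unlikely to be used for Stage 2 because its sample size requirements are high. For most designs, it is not practical to deploy a quality enhancement within a full grantee, and to then sample some centers and/or children to measure outcomes.

        Random-assignment level not labeled in the source row (grantees randomly selected)
          Purposive selection: None
          Random (representative) selection: Grantee
          Sources of clustering: Grantee
          Results can generalize to: All Head Start children
          Equation number for variance formula: 14
          Examples from report: T/TA to use program data for quality improvements
          Comments: This design has the same issues as the preceding one, although power would be somewhat better without center-level clustering.

        T/TA = training and technical assistance.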

        In each of the Stage 2 designs listed, grantees are purposively selected, which is consistent with conducting these evaluations with programs that agree to participate. In some cases, all centers would be included in the evaluation; in others, centers might be purposively selected (volunteer) or might be randomly selected. Including all centers in the evaluation might be feasible if all of them agree to participate when the grantee agrees to participate; this scenario is more likely to occur when grantees include a small number of centers. Alternatively, centers that do not want to participate could decline; their choice affects only whether results can be generalized to all centers in the participating grantees or only to the centers included in the evaluation. Random selection of centers should occur only if centers are randomly assigned, as random selection at the center level with random assignment at a different level would unnecessarily add to the sample size requirements for a Stage 2 evaluation.

        Similar principles can be identified for selection of classrooms within centers. If random assignment is conducted at the classroom level, clustering at the classroom level must be included in variance calculations, and, therefore, power will not be affected by the choice of whether or not to sample classrooms. However, if random assignment occurs at the center or grantee level, one should avoid adding a level of clustering by randomly selecting classrooms. Instead, the best choice from the perspective of maximizing power is either to include all classrooms (within the designated grantees and centers) in the evaluation or to select children for the study without regard to classroom.

        Random Assignment of Children. For some Stage 2 designs, children (or families) could be randomly assigned to a quality enhancement that focuses on individuals. A mentoring program for children and an intensive parent education program are examples of this type of quality enhancement.

        In these designs, clustering can be avoided by purposively selecting grantees and centers, and then selecting children for the study from a center-wide list (without regard to classroom). The variance for the impact estimates across centers can be approximated as follows:

        (9) $\mathrm{Var}(\text{impact}) = \dfrac{2\sigma^2}{cs}$,

        where $c$ is the total number of centers in the sample, $\sigma^2$ is the variance of the outcome measure, and $s$ is the average number of treatment or control children per center.

        Random Assignment of Classrooms Within Centers. Some Stage 2 designs could involve classroom-level random assignment. For example, an evaluation might test a computer-based language curriculum in randomly assigned classrooms. This type of design is appropriate for interventions that are administered at the classroom level and for which potential spillover effects are deemed to be small.

        In these designs, clustering is at the classroom level, but additional clustering can be avoided by purposively selecting grantees and centers for the study. Intuitively, if sampling were repeated, a different random allocation of classrooms would be selected to the treatment and control groups. Hence, the variance expressions must account for the extent to which mean child outcomes vary across classrooms. If grantees and centers agree to participate in the study, then center and grantee effects should be treated as fixed. Accordingly, the variance formula for these impact estimates can be expressed as follows:

        (10) $\mathrm{Var}(\text{impact}) = \dfrac{2\sigma^2\rho_3}{cl} + \dfrac{2\sigma^2(1 - \rho_3)}{cls}$,

        where $c$ is the total number of centers in the sample, $l$ is the average number of treatment (control) classrooms per center, $s$ is the average number of children per classroom, $\sigma^2$ is the variance of the outcome measure, and $\rho_3$ is the classroom-level intraclass correlation (the between-classroom variance as a proportion of the total variance).

        If centers are randomly selected to participate in the evaluation and classrooms are randomly assigned to the quality enhancement or control groups, then clustering would occur at the center and classroom levels. Under this design, the variance expression would be approximated by:

        (11) $\mathrm{Var}(\text{impact}) = \dfrac{2\sigma^2\rho_2(1 - c_2^{*})}{c} + \dfrac{2\sigma^2\rho_3}{cl} + \dfrac{2\sigma^2(1 - \rho_2 - \rho_3)}{cls}$,

        where $c$ is the total number of centers in the sample; $l$ is the average number of treatment or control classrooms per center; $s$ is the average number of children per classroom; $\rho_2$ is the center-level intraclass correlation; $\rho_3$ is the classroom-level intraclass correlation; $\sigma^2$ is the variance of the outcome measure; and $c_2^{*}$ is the correlation in mean outcomes between treatment and control group classrooms within centers. Because the additional center-level clustering term would add to the variance of the impact estimates, this design would require a larger sample than the previous design.
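The cost of treating centers as random rather than fixed can be illustrated by evaluating equations (10) and (11) on the same sample. The sketch reads the final numerator of equation (11) as one minus the sum of the center- and classroom-level intraclass correlations. The inputs (20 centers, 2 classrooms per group per center, 15 children per classroom, intraclass correlations of 0.05, and a treatment-control correlation of 0.5) are hypothetical.

```python
def var_eq10(c, l, s, rho3, sigma2=1.0):
    """Equation (10): classrooms randomly assigned, centers treated as fixed."""
    return 2 * sigma2 * rho3 / (c * l) + 2 * sigma2 * (1 - rho3) / (c * l * s)

def var_eq11(c, l, s, rho2, rho3, c2, sigma2=1.0):
    """Equation (11): classrooms randomly assigned, centers randomly selected."""
    return (2 * sigma2 * rho2 * (1 - c2) / c
            + 2 * sigma2 * rho3 / (c * l)
            + 2 * sigma2 * (1 - rho2 - rho3) / (c * l * s))

# Hypothetical inputs, for illustration only
v10 = var_eq10(c=20, l=2, s=15, rho3=0.05)
v11 = var_eq11(c=20, l=2, s=15, rho2=0.05, rho3=0.05, c2=0.5)
print(round(v10, 6), round(v11, 6), round(v11 / v10, 2))
```

With these placeholder values, the random-centers design has a variance about 40 percent larger than the fixed-centers design for the same number of centers, classrooms, and children.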

        Random Assignment of Centers. Other Stage 2 designs might involve random assignment of centers to different quality enhancements. For example, an approach to addressing children’s behavioral problems by working with the children and families with the support of a behavioral specialist might be most efficiently implemented center-wide.

        When centers are randomly assigned as part of a Stage 2 design, grantees would be voluntary participants (purposively selected). Although centers likely would be volunteers as well, random assignment at the center level would mean that the power would not be affected if centers were randomly selected for the evaluation instead. Whether or not classrooms are randomly selected for the evaluation would affect power, so we discuss the alternatives here.

        Under one design option, classrooms would not be sampled within the treatment and control group centers. For this option, either all relevant classrooms in the selected centers are included in the research sample or children are sampled directly into the research sample without regard to their classrooms. In practice, the researcher would be choosing between including all consenting children from each center in the research sample and randomly selecting children for the research sample from the pool of children for whom consent had been given. For both options, clustering occurs at the center level, but not at the classroom level. Intuitively, if sampling were repeated, a different set of centers would be allocated to the treatment and control groups, but not a different set of classrooms within centers. Consequently, the variance of an impact estimate can be expressed as follows:

        (12) Var(impact) = [2σ²ρ₂/c] + [2σ²(1 − ρ₂)/(cs)]

        where c is the average number of treatment or control centers; s is the average number of children per center; σ² is the variance of the outcome measure; and ρ₂ is the center-level intraclass correlation in the outcome measure.

        For a center-based experimental evaluation, another design option that conserves project resources is to sample classrooms within the study centers. In this case, clustering occurs at both the center and classroom levels. In the presence of both center- and classroom-level clustering, the variance formula can be expressed as follows:

        (13) Var(impact) = [2σ²ρ₂/c] + [2σ²ρ₃/(cl)] + [2σ²(1 − ρ₂ − ρ₃)/(cls)]

        where c is the average number of treatment or control centers; l is the average number of classrooms per center; s is the average number of children per classroom; σ² is the variance of the outcome measure; ρ₂ is the center-level intraclass correlation in the outcome measure; and ρ₃ is the classroom-level intraclass correlation. The addition of a classroom-level clustering term to the variance calculation increases the sample size requirements of this design relative to the previous one.
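To make the difference between equations (12) and (13) concrete, the following Python sketch evaluates both variance expressions in effect-size units (σ = 1) and converts them to MDEs with equation (1). The function names and the inputs (100 treatment centers, 20 children per center drawn either without regard to classroom or as two sampled classrooms of 10, and the FACES-based values ρ2 = 0.056 and ρ3 = 0.073 reported later in this appendix) are illustrative assumptions, not a prescribed design.

```python
import math

MULTIPLIER = 2.802  # two-tailed test, 5 percent significance, 80 percent power


def var_centers_only(c, s, rho2, sigma2=1.0):
    """Equation (12): clustering at the center level only."""
    return 2 * sigma2 * rho2 / c + 2 * sigma2 * (1 - rho2) / (c * s)


def var_centers_and_classrooms(c, l, s, rho2, rho3, sigma2=1.0):
    """Equation (13): clustering at the center and classroom levels."""
    return (2 * sigma2 * rho2 / c
            + 2 * sigma2 * rho3 / (c * l)
            + 2 * sigma2 * (1 - rho2 - rho3) / (c * l * s))


def mde(var_impact, r2=0.0, sigma=1.0):
    """Equation (1): minimum detectable effect in standard deviation units."""
    return MULTIPLIER * math.sqrt((1 - r2) * var_impact) / sigma


# Same number of children per center (20) under both options.
mde_12 = mde(var_centers_only(c=100, s=20, rho2=0.056))
mde_13 = mde(var_centers_and_classrooms(c=100, l=2, s=10, rho2=0.056, rho3=0.073))
print(round(mde_12, 3), round(mde_13, 3))
```

Under these assumed inputs the MDE is larger when classrooms are sampled, because the classroom-level term is added while the number of children per center stays the same.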

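        Random Assignment of Grantees. For some evaluation designs, the quality enhancement would be implemented at the grantee level. For example, a change in approach to managing the Head Start program would have grantee-wide effects. Nevertheless, the sample size requirements for the evaluation of this option are likely to be comparable to those of a Stage 3 evaluation. If all centers within each grantee are included in the research sample, clustering occurs only at the grantee level, and the variance of an impact estimate can be expressed as follows: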

        (14) Var(impact) = [2σ²ρ₁/g] + [2σ²(1 − ρ₁)/(gs)]

        where g is the average number of treatment or control grantees in the sample; s is the average number of children per grantee; σ² is the variance of the outcome measure; and ρ₁ is the grantee-level intraclass correlation in the outcome measure.

        If grantees are randomly assigned and the evaluation includes only a random sample of centers within each of the grantees, the variance formula can be expressed as follows:

        (15) Var(impact) = [2σ²ρ₁/g] + [2σ²ρ₂/(cg)] + [2σ²(1 − ρ₁ − ρ₂)/(cgs)]

        where g is the average number of treatment or control grantees in the sample; c is the average number of centers per grantee; s is the average number of children per center; σ² is the variance of the outcome measure; ρ₁ is the grantee-level intraclass correlation in the outcome measure; and ρ₂ is the center-level intraclass correlation.

    7. Estimating Correlations

      Before we can calculate the sample sizes required for the various evaluation designs, we must obtain estimates of the key parameters in the variance formulas we have presented in this appendix. In particular, we must have estimates for the following correlations that enter into the variance formulas:

      ρ1 = the extent to which mean outcomes differ across grantees (that is, the intraclass correlation at the grantee level)

      ρ2 = the extent to which mean outcomes differ across centers within grantees (that is, the intraclass correlation at the center level)

      ρ3 = the extent to which mean outcomes differ across classrooms within centers (that is, the intraclass correlation at the classroom level)

      c1 = the correlation between the mean outcomes of treatment and control group children within grantees

      c2 = the correlation between the mean outcomes of treatment and control group children within centers in grantees

      c1* = the correlation between the mean outcomes of treatment and control group classrooms within grantees

      c2* = the correlation between the mean outcomes of treatment and control group classrooms within centers in grantees

      c1** = the correlation between the mean outcomes of treatment and control group centers within grantees

      We discuss our estimates of these parameters in the following section.

      1. Intraclass Correlations

        To obtain estimates for the intraclass correlations, we used data from the Head Start Family and Child Experiences Survey (FACES), 2000 cohort. The Head Start FACES data are a representative sample of children and classrooms selected by multi-stage random sampling of grantees, centers, and classrooms. The samples of children per classroom and of classrooms per center generally are small, but they nevertheless are representative of Head Start classrooms. To obtain as large a sample as possible, we included all children in the sample (ages 3 to 5 years) who had valid scores on the Peabody Picture Vocabulary Test (a measure of children’s receptive vocabulary) for the spring (end-of-Head Start year) assessments. Data from the Head Start National Reporting System would be ideal for estimating these correlations, but they are not readily available for this analysis.

        Our analysis indicates that ρ1, the grantee-level intraclass correlation, is 0.247; ρ2, the center-level intraclass correlation, is 0.056; and ρ3, the classroom-level intraclass correlation, is 0.073. These estimates indicate that child assessment scores are more divergent across grantees than they are across centers within a single grantee or across classrooms within a center. Essentially, there is no sorting by ability across Head Start classrooms or across centers within a grantee. Instead, all classrooms and centers include children with the same variation in assessment scores as exists across the grantee as a whole. However, from one grantee to another, there are greater differences in assessment scores. Thus, the populations served by each Head Start grantee seem to vary, whereas, within grantees, the centers and classrooms look similar. Stratified random sampling would generate lower grantee-level intraclass correlations within strata, which would increase the power of designs that involve grantee-level clustering.¹

        The intraclass correlation at the classroom level (0.073) is slightly higher than at the center level (0.056), suggesting that teacher effects (the influence on scores of individual strong teachers) are stronger than center effects. The same pattern of correlations at the classroom (teacher) and school levels is found in studies of older children.

      3. Correlation Between Treatment and Control Group Means

        The correlations between treatment and control group means ideally should be estimated from evaluation data, and having more than one evaluation on which to base these estimates would be very useful. For example, data from the Head Start Impact Study, in which children were randomly assigned to Head Start programs or to a control group, many of whom received alternative center-based child care, could help to estimate values for c1 and c2.² Data from the Preschool Curriculum Evaluation Research (PCER) evaluation, in which most sites randomly assigned classrooms to an enhanced curriculum or to regular services, would provide a useful source of information for estimating c1* and c2*. We are not aware of any evaluations that have randomly assigned early childhood centers to an intervention or control group.

        We do not currently have access to data from the Head Start Impact Study or from PCER to help to generate estimates of these parameters. However, we have estimated a value for c1 using data from the Early Head Start evaluation, in which families with a pregnant woman or an infant in 17 sites were assigned to Early Head Start or a control group that did not receive Early Head Start services. The estimate for c1, using data from the three-year-old followup, is 0.5. Additional refinement of these assumptions will be possible as data become available from evaluations using classroom- or center-level random assignment.

  3. ESTIMATED SAMPLE SIZES UNDER ALTERNATIVE STAGE 2 AND STAGE 3 DESIGNS

    Tables B.3–B.7 present estimated sample sizes for the Stage 2 and Stage 3 designs that we have discussed in Section A. The following assumptions underlie all of the sample size estimates:

    • A two-tailed test at 80 percent power and a 5 percent significance level. We adopt a two-tailed test, rather than a one-tailed test, because we want to be able to measure any inadvertently negative effects of the enhancements, as well as the positive impacts.

    • Children are divided equally between treatment and control groups. Unbalanced assignment reduces the precision of the estimates.

    • Interview nonresponse is 10 percent. As a result, 90 percent of the initial sample is available for analysis at the spring followup.

    • One enhancement is compared with regular Head Start services. If additional enhancements are to be compared with one another or with a control group, the sample sizes would have to increase accordingly. The tables show the number of units (grantees, centers, or classrooms) in the treatment group. These numbers must be doubled if one enhancement is compared with a control group, tripled if two enhancements are evaluated against a control group, and so on.

    • Intraclass correlations (ρ1, ρ2, and ρ3) are based on calculations from FACES 2000 data. We assume that the grantee-level intraclass correlation (ρ1) is 0.123, which is lower than the FACES estimate of 0.247 because we would use stratified random sampling, rather than sampling grantees from the full universe of Head Start programs. We use the FACES estimates for the center-level intraclass correlation (ρ2 = 0.056) and the classroom-level intraclass correlation (ρ3 = 0.073). In future work, more FACES samples could be used to estimate these correlations, although the FACES sample is small for estimating some of them. Samples drawn from the National Reporting System data would be ideal, as a sufficient sample of centers within grantees, classrooms within centers, and children within classrooms could be obtained.

    • Correlation between treatment and control group outcomes. For child-level correlations, we assume a value of .30 for the correlation between treatment and control group children within grantees, and a value of .50 for the correlation between the mean outcomes of treatment and control group children within centers in grantees. For classroom-level correlations, we assume a value of .30 for the correlation between mean outcomes of treatment and control group classrooms within grantees, and a value of .15 for the correlation between the mean outcomes of treatment and control group classrooms within centers in grantees. For the center-level correlation, we assume a value of .10 for the correlation between mean outcomes of treatment and control group centers within grantees.

    • We assume an average of four centers per grantee, two classes per center, 15 children per center, and 10 children per class. These assumptions are consistent with information about center-based Head Start programs from the Head Start Program Information Report (PIR) that indicates the average number of centers per grantee is 8; the median number of classrooms per center is 3; the median number of children per classroom is 18; and the median number of children per center is approximately 50.³
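The nonresponse and number-of-enhancements assumptions above reduce to simple arithmetic. The Python sketch below illustrates them with values from the first row of Table B.3 (28,440 children in the analysis sample and 237 enhancement grantees); the function names are hypothetical and are not part of the original calculations.

```python
def initial_sample(analysis_sample, response_rate=0.90):
    """Children to enroll so that `analysis_sample` respond at followup."""
    return analysis_sample / response_rate


def total_units(units_per_group, n_enhancements=1):
    """Units across all research groups: each enhancement plus one control group."""
    return units_per_group * (n_enhancements + 1)


# Table B.3, MDE = 0.1 and R2 = 0: 28,440 children in the analysis sample
# and 237 grantees in the enhancement group.
children_to_enroll = round(initial_sample(28_440))
grantees_one_enhancement = total_units(237)
grantees_two_enhancements = total_units(237, n_enhancements=2)
print(children_to_enroll, grantees_one_enhancement, grantees_two_enhancements)
```

With one enhancement, the result matches the 31,600 initial children and 474 total grantees shown in Table B.3; a second enhancement would raise the grantee total to 711 under the same assumptions.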

    In addition, we present the estimates for alternative assumptions that affect the power of the sample to detect impacts:

    • Alternative levels of minimum detectable effect sizes. To address the issue that smaller effect sizes are plausible for improvements in Head Start practice, yet smaller effect sizes require larger samples, we show the minimum sample sizes necessary to detect effect sizes of 0.1, 0.2, 0.25, and 0.33.

    • The regression R² value. Regression-adjusted impact estimates have greater precision because the regression adjustment eliminates some of the variability in the intervention and control group mean outcomes. If impacts are estimated without any regression adjustment, then the R² is zero, and the estimated impact is the simple difference in means. If we have some baseline information on family demographics (for example, the mother’s education level), then the R² typically is about 0.2. If we have baseline information on the same (or similar) outcome variables as those measured at followup, then the R² is likely to be approximately 0.5. Therefore, we use these three values of R² (0, 0.2, and 0.5) to represent plausible alternative levels of information available for the impact analysis.
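Because Var(impact) is inversely proportional to the number of randomized units when the number of children per unit is held fixed, equation (1) implies that the required number of units scales with (1 − R²)/MDE². The Python sketch below shows that scaling; the helper name and the baseline of 400 units are hypothetical values used only for illustration.

```python
def rescale_units(baseline_units, baseline_mde, target_mde, target_r2=0.0):
    """Rescale a unit count computed at R2 = 0 to a new MDE and R2.

    Assumes the number of children (and lower-level clusters) per unit is
    unchanged, so Var(impact) is inversely proportional to the unit count.
    """
    return baseline_units * (1 - target_r2) * (baseline_mde / target_mde) ** 2


# Hypothetical baseline: 400 units needed at MDE = 0.1 with no covariates.
print(rescale_units(400, 0.1, 0.2))                 # doubling the MDE
print(rescale_units(400, 0.1, 0.1, target_r2=0.5))  # adding strong baseline covariates
```

Doubling the target MDE cuts the requirement to one-quarter, and an R² of 0.5 cuts it in half, which is the pattern visible across the rows of Tables B.3 through B.7.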

    Table B.3. Required Sample Sizes of Grantees, Centers, and Children to Detect Target Minimum Detectable Effect Sizes When Grantees Are Randomly Assigned, Stage 3
    MDE | R² | Total Number of Grantees | Enhancement Grantees | Total Number of Centers | Total Number of Children (15 per center) | Initial Sample of Children
    0.1 | 0 | 474 | 237 | 1,896 | 28,440 | 31,600
    0.1 | 0.2 | 378 | 189 | 1,512 | 22,680 | 25,200
    0.1 | 0.5 | 236 | 118 | 944 | 14,160 | 15,733
    0.2 | 0 | 118 | 59 | 472 | 7,080 | 7,867
    0.2 | 0.2 | 94 | 47 | 376 | 5,640 | 6,267
    0.2 | 0.5 | 60 | 30 | 240 | 3,600 | 4,000
    0.25 | 0 | 76 | 38 | 304 | 4,560 | 5,067
    0.25 | 0.2 | 60 | 30 | 240 | 3,600 | 4,000
    Source: Authors’ calculations.

    Note: The sample size calculations assume a two-tailed test of statistical significance at 80 percent power and a 5 percent significance level; grantees are equally divided among treatment and control groups; a 90 percent response rate to the follow-up interview; intraclass correlations and correlations between treatment and control groups as described in the text; four centers per grantee; and 15 children randomly selected from each center for the research sample. The formula used to calculate sample sizes is:

    MDE = 2.802 * √[(1 − R²) Var(impact)] / σ,

    where:

    Var(impact) = [2σ²ρ₁/g] + [2σ²ρ₂/(cg)] + [2σ²(1 − ρ₁ − ρ₂)/(cgs)]

    R² is the regression R-squared value; g is the number of treatment or control grantees in the sample; c is the average number of centers per grantee; s is the average number of children per center; ρ₁ is the grantee-level intraclass correlation in the outcome measure; ρ₂ is the center-level intraclass correlation; and σ² is the variance of the outcome measure.

    We have omitted examples that result in estimated sample sizes below 30 grantees, as the group mean estimates would be unstable if the group has fewer than 30 members.
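As a check on the note above, the following Python sketch solves the MDE formula for g using the stated assumptions (ρ1 = 0.123, ρ2 = 0.056, four centers per grantee, 15 children per center, σ = 1). It treats g as the number of grantees in each research group and rounds to the nearest whole grantee; both readings are assumptions, adopted because they reproduce the Enhancement Grantees column of Table B.3. The function and constant names are hypothetical.

```python
MULTIPLIER = 2.802           # two-tailed test, 5 percent significance, 80 percent power
RHO1, RHO2 = 0.123, 0.056    # grantee- and center-level intraclass correlations
CENTERS_PER_GRANTEE = 4
CHILDREN_PER_CENTER = 15


def grantees_per_group(mde, r2):
    """Unrounded number of grantees per research group for a target MDE."""
    c, s = CENTERS_PER_GRANTEE, CHILDREN_PER_CENTER
    # Var(impact) * g, in effect-size units (sigma = 1)
    unit_var = 2 * (RHO1 + RHO2 / c + (1 - RHO1 - RHO2) / (c * s))
    return MULTIPLIER ** 2 * (1 - r2) * unit_var / mde ** 2


for mde in (0.1, 0.2, 0.25):
    for r2 in (0.0, 0.2, 0.5):
        g = round(grantees_per_group(mde, r2))
        print(mde, r2, g, 2 * g)  # enhancement grantees, total grantees
```

The final combination (MDE = 0.25, R² = 0.5) falls below 30 grantees per group, which is why it does not appear in the table.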

    Table B.4. Required Sample Sizes of Centers and Children to Detect Target Minimum Detectable Effect Sizes When Centers Are Randomly Assigned, Stage 3
    MDE | R² | Total Number of Centers | Enhancement Centers | Total Number of Children (15 per center) | Initial Sample of Children
    0.1 | 0 | 1,043 | 522 | 15,646 | 17,384
    0.1 | 0.2 | 834 | 417 | 12,517 | 13,907
    0.1 | 0.5 | 522 | 261 | 7,823 | 8,692
    0.2 | 0 | 261 | 130 | 3,911 | 4,346
    0.2 | 0.2 | 209 | 104 | 3,129 | 3,477
    0.2 | 0.5 | 130 | 65 | 1,956 | 2,173
    0.25 | 0 | 167 | 83 | 2,503 | 2,781
    0.25 | 0.2 | 134 | 67 | 2,003 | 2,225
    0.25 | 0.5 | 83 | 42 | 1,252 | 1,391
    0.33 | 0 | 96 | 48 | 1,437 | 1,596
    0.33 | 0.2 | 77 | 38 | 1,149 | 1,277
    Source: Authors’ calculations.

    Note: The sample size calculations assume a two-tailed test of statistical significance at 80 percent power and a 5 percent significance level; children are equally divided among treatment and control groups; a 90 percent response rate to the follow-up interview; intraclass correlations and correlations between treatment and control groups as described in the text; four centers per grantee, with two assigned to the enhancement group and two to the control group; and 15 children randomly selected from each center for the research sample. The formula used to calculate sample sizes is:

    MDE = 2.802 * √[(1 − R²) Var(impact)] / σ,

    where:

    Var(impact) = [2σ²ρ₁(1 − c₁**)/g] + [2σ²ρ₂/(cg)] + [2σ²(1 − ρ₁ − ρ₂)/(cgs)],

    R² is the regression R-squared value; g is the total number of grantees in the sample; c is the average number of treatment or control centers per grantee; s is the average number of children per center; ρ₁ is the grantee-level intraclass correlation in the outcome measure; ρ₂ is the center-level intraclass correlation; σ² is the variance of the outcome measure; and c₁** is the correlation in mean outcomes between treatment and control group centers within grantees.

    We have omitted examples that result in estimated sample sizes below 30 centers, as the group mean estimates would be unstable if the group has fewer than 30 members.

    Table B.5. Required Sample Sizes of Centers, Classrooms, and Children to Detect Target Minimum Detectable Effect Sizes When Classrooms Are Randomly Assigned, Stage 3
    MDE | R² | Total Number of Centers | Total Number of Classrooms | Total Number of Children (10 per class) | Initial Sample of Children
    0.1 | 0 | 848 | 1,695 | 16,952 | 18,836
    0.1 | 0.2 | 678 | 1,356 | 13,562 | 15,069
    0.1 | 0.5 | 424 | 848 | 8,476 | 9,418
    0.2 | 0 | 212 | 424 | 4,238 | 4,709
    0.2 | 0.2 | 170 | 339 | 3,390 | 3,767
    0.2 | 0.5 | 106 | 212 | 2,119 | 2,354
    0.25 | 0 | 136 | 271 | 2,712 | 3,014
    0.25 | 0.2 | 108 | 217 | 2,170 | 2,411
    0.25 | 0.5 | 68 | 136 | 1,356 | 1,507
    0.33 | 0 | 78 | 156 | 1,557 | 1,730
    0.33 | 0.2 | 62 | 125 | 1,245 | 1,384
    0.33 | 0.5 | 39 | 78 | 778 | 865
    Source: Authors’ calculations.

    Note: The sample size calculations assume a two-tailed test of statistical significance at 80 percent power and a 5 percent significance level; children are equally divided among treatment and control groups; a 90 percent response rate to the follow-up interview; intraclass correlations and correlations between treatment and control groups as described in the text; four centers per grantee; two classrooms per center, with one assigned to the enhancement group and one to the control group; and 10 children randomly selected from each classroom for the research sample. The formula used to calculate sample sizes is:

    MDE = 2.802 * √[(1 − R²) Var(impact)] / σ,

    where:

    Var(impact) = [2σ²ρ₁(1 − c₁*)/g] + [2σ²ρ₂(1 − c₂*)/(cg)] + [2σ²ρ₃/(cgl)] + [2σ²(1 − ρ₁ − ρ₂ − ρ₃)/(cgls)]

    R² is the regression R-squared value; g is the total number of grantees in the sample; c is the average number of centers per grantee; l is the average number of treatment or control classrooms per center; s is the average number of children per classroom; ρ₁ is the grantee-level intraclass correlation in the outcome measure; ρ₂ is the center-level intraclass correlation; ρ₃ is the classroom-level intraclass correlation; σ² is the variance of the outcome measure; c₁* is the correlation in mean outcomes between treatment and control group classrooms within grantees; and c₂* is the correlation in mean outcomes between treatment and control group classrooms within centers within grantees.
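The Python sketch below evaluates the four-term variance expression for Table B.5 under the assumptions stated in the text (ρ1 = 0.123, ρ2 = 0.056, ρ3 = 0.073, four centers per grantee, one treatment and one control classroom per center, 10 children per classroom, σ = 1). It reads the grantee term with the factor (1 − c1*) and the center term with the factor (1 − c2*), using the assumed values c1* = 0.30 and c2* = 0.15, as the symbol definitions in the note suggest. That reading is an assumption, though it reproduces the center and classroom counts in the table. The function and constant names are hypothetical.

```python
MULTIPLIER = 2.802
RHO1, RHO2, RHO3 = 0.123, 0.056, 0.073  # grantee, center, classroom ICCs
C1_STAR, C2_STAR = 0.30, 0.15           # treatment-control classroom correlations
CENTERS_PER_GRANTEE = 4
CLASSROOMS_PER_GROUP = 1                # per center, in each research group
CHILDREN_PER_CLASSROOM = 10


def grantees_needed(mde, r2):
    """Unrounded number of grantees for a target MDE (sigma = 1)."""
    c, l, s = CENTERS_PER_GRANTEE, CLASSROOMS_PER_GROUP, CHILDREN_PER_CLASSROOM
    unit_var = 2 * (RHO1 * (1 - C1_STAR)
                    + RHO2 * (1 - C2_STAR) / c
                    + RHO3 / (c * l)
                    + (1 - RHO1 - RHO2 - RHO3) / (c * l * s))
    return MULTIPLIER ** 2 * (1 - r2) * unit_var / mde ** 2


def total_centers(mde, r2):
    """Unrounded total number of centers across all grantees."""
    return CENTERS_PER_GRANTEE * grantees_needed(mde, r2)


centers = total_centers(0.1, 0.0)
print(round(centers), round(2 * CLASSROOMS_PER_GROUP * centers))
```

Each center contributes one classroom to each research group, so the total number of classrooms is twice the number of centers before rounding.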

    Table B.6. Required Sample Sizes of Centers and Children to Detect Target Minimum Detectable Effect Sizes When Centers Are Randomly Assigned, Stage 2
    MDE | R² | Total Number of Centers | Enhancement Centers | Total Number of Children (15 per center) | Initial Sample of Children
    0.1 | 0 | 374 | 187 | 5,603 | 6,225
    0.1 | 0.2 | 299 | 149 | 4,482 | 4,980
    0.1 | 0.5 | 187 | 93 | 2,801 | 3,113
    0.2 | 0 | 93 | 47 | 1,401 | 1,556
    0.2 | 0.2 | 75 | 37 | 1,121 | 1,245
    0.25 | 0 | 60 | 30 | 896 | 996
    Source: Authors’ calculations.

    Note: The sample size calculations assume a two-tailed test of statistical significance at 80 percent power and a 5 percent significance level; children are equally divided among treatment and control groups; a 90 percent response rate to the follow-up interview; intraclass correlations and correlations between treatment and control groups as described in the text; four centers per grantee, with two assigned to the enhancement group and two to the control group; and 15 children randomly selected from each center for the research sample. The formula used to calculate sample sizes is:

    MDE = 2.802 * √[(1 − R²) Var(impact)] / σ,

    where:

    Var(impact) = [2σ²ρ₂/c] + [2σ²(1 − ρ₂)/(cs)],

    R² is the regression R-squared value; c is the average number of treatment or control centers; s is the average number of children per center; σ² is the variance of the outcome measure; and ρ₂ is the center-level intraclass correlation in the outcome measure.

    We have omitted examples that result in estimated sample sizes below 30 centers, as the group mean estimates would be unstable if the group has fewer than 30 members.
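The child counts in Table B.6 appear to be computed from the unrounded number of centers and then rounded, with the initial sample obtained by dividing by the 90 percent response rate. The Python sketch below builds a full table row on that assumption, using ρ2 = 0.056, 15 children per center, and σ = 1; the function name and the dictionary keys are hypothetical.

```python
MULTIPLIER = 2.802
RHO2 = 0.056              # center-level intraclass correlation
CHILDREN_PER_CENTER = 15
RESPONSE_RATE = 0.90


def table_b6_row(mde, r2):
    """One row of Table B.6, rounding only at the final step."""
    s = CHILDREN_PER_CENTER
    unit_var = 2 * (RHO2 + (1 - RHO2) / s)                 # Var(impact) * c
    c = MULTIPLIER ** 2 * (1 - r2) * unit_var / mde ** 2   # centers per group
    children = 2 * c * s                                   # both research groups
    return {
        "total_centers": round(2 * c),
        "enhancement_centers": round(c),
        "children": round(children),
        "initial_sample": round(children / RESPONSE_RATE),
    }


print(table_b6_row(0.1, 0.0))
print(table_b6_row(0.2, 0.0))
```

Rounding only at the end explains why, for example, 93 total centers can sit alongside 47 enhancement centers in the MDE = 0.2, R² = 0 row.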

    Table B.7. Required Sample Sizes of Centers, Classes, and Children to Detect Target Minimum Detectable Effect Sizes When Classrooms Are Randomly Assigned, Stage 2
    MDE | R² | Total Number of Centers | Enhancement Classrooms | Total Number of Children (10 per class) | Initial Sample of Children
    0.1 | 0 | 260 | 260 | 5,204 | 5,782
    0.1 | 0.2 | 208 | 208 | 4,163 | 4,626
    0.1 | 0.5 | 130 | 130 | 2,602 | 2,891
    0.2 | 0 | 65 | 65 | 1,301 | 1,445
    0.2 | 0.2 | 52 | 52 | 1,041 | 1,156
    0.2 | 0.5 | 33 | 33 | 650 | 723
    0.25 | 0 | 42 | 42 | 833 | 925
    0.25 | 0.2 | 33 | 33 | 666 | 740
    Source: Authors’ calculations.

    Note: The sample size calculations assume a two-tailed test of statistical significance at 80 percent power and a 5 percent significance level; children are equally divided among treatment and control groups; a 90 percent response rate to the follow-up interview; intraclass correlations and correlations between treatment and control groups as described in the text; four centers per grantee; two classrooms per center, with one assigned to the enhancement group and one to the control group; and 10 children randomly selected from each classroom for the research sample. The formula used to calculate sample sizes is:

    MDE = 2.802 * √[(1 − R²) Var(impact)] / σ,

    where:

    Var(impact) = [2σ²ρ₃/(cl)] + [2σ²(1 − ρ₃)/(cls)]

    R² is the regression R-squared value; c is the total number of centers in the sample; l is the average number of treatment (control) classrooms per center; s is the average number of children per classroom; σ² is the variance of the outcome measure; and ρ₃ is the between-classroom variance as a proportion of the total variance.

    We have omitted examples that result in estimated sample sizes below 30 classrooms, as the group mean estimates would be unstable if the group has fewer than 30 members.
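The omission rule can be illustrated for Table B.7 with the Python sketch below. It assumes ρ3 = 0.073, one treatment classroom per center, 10 children per classroom, σ = 1, rounding to the nearest classroom, and a cutoff applied to the rounded count; the function and constant names are hypothetical.

```python
MULTIPLIER = 2.802
RHO3 = 0.073                 # classroom-level intraclass correlation
CHILDREN_PER_CLASSROOM = 10
MIN_GROUP_SIZE = 30          # rows below this are omitted from the table


def enhancement_classrooms(mde, r2):
    """Unrounded classrooms per research group (one per center, so also centers)."""
    s = CHILDREN_PER_CLASSROOM
    unit_var = 2 * (RHO3 + (1 - RHO3) / s)   # Var(impact) * c when l = 1
    return MULTIPLIER ** 2 * (1 - r2) * unit_var / mde ** 2


rows = []
for mde in (0.1, 0.2, 0.25, 0.33):
    for r2 in (0.0, 0.2, 0.5):
        n = round(enhancement_classrooms(mde, r2))
        if n >= MIN_GROUP_SIZE:
            rows.append((mde, r2, n))

for row in rows:
    print(row)
```

Under these assumptions, eight of the twelve combinations survive the cutoff, matching the eight rows of Table B.7; all MDE = 0.33 combinations require fewer than 30 classrooms per group.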



1 The increase in power due to stratification of the sample will be offset in part by a reduction in degrees of freedom that depends on the number of strata.

2 The Head Start Impact Study is not a perfect source of information for these parameters because it did not involve a comparison of children randomly assigned to enhanced Head Start services and regular Head Start services. However, the child-level random assignment to different types of preschool program services should be informative about values for c1 and c2.

3 When classrooms are randomly assigned and the center has only 2 or 3 classrooms, one classroom is assigned to the enhancement group and one to the control group. Under this design, there are not enough degrees of freedom to estimate classroom effects (since there is only one classroom in each research group). To estimate impacts, classrooms will need to be pooled across centers and centers will need to be treated as if they were randomly selected.

 
