Office of Planning, Research & Evaluation (OPRE)

Chapter 4: Overview of Methods for Analyzing Impacts on Children and Families

Purpose of Interim Analysis

This chapter describes key aspects of the analysis methods used in estimating Head Start’s impact on children’s cognitive and social-emotional development, health, and parenting practices. The narrative begins with a discussion of the key outcome domains, constructs, and measures that were selected for this examination of program impacts including procedures used to create scales and other derived variables. It also reviews the methods used to estimate overall average program impacts, and impacts for particular subgroups and concludes with a discussion of ways to deal with the nonadherence to random assignment discussed in Chapter 2.

Outcome Domains and Measures

As discussed in Chapter 1, a wide variety of data sources and measures are being used to assess the impact of Head Start on fostering and enhancing child development, including direct child assessments, parent/primary caregiver interviews, interviews with providers of early care services used by participating study children, and observations of children’s early care settings. For this initial assessment of the impact of Head Start, a selected set of key outcome measures in the cognitive, social-emotional, health, and parenting domains was used. These measures are described below and summarized in Exhibit 4.1 [1] for the combined treatment and comparison groups, presented separately by age group (3- and 4-year-olds) and based on information collected in spring 2003:

  1. Cognitive Domain:
    • Pre-reading skills. These skills focus primarily on letter recognition, an important step toward becoming a proficient reader. This domain is measured by the Woodcock-Johnson-III Letter-Word Identification subtest and the Letter Naming Task.

    • Pre-writing skills. Children’s ability to draw shapes and write letters and words is assessed. This domain is measured by the Woodcock-Johnson III Spelling subtest and McCarthy Draw-a-Design Test.

    • Vocabulary knowledge. This skill is indicative of children’s oral language development and general knowledge. This domain is measured by the PPVT-III and the Color Naming Task.

    • Oral comprehension and phonological awareness. This includes the child’s ability to understand and make inferences from phrases and sentences spoken in English and to understand that spoken sentences are made up of component words, that compound words are made up of simpler words, and that words are made up of component syllables and sounds (phonemes). This domain is measured by the Woodcock-Johnson III Oral Comprehension subtest and the Elision subtest of the Comprehensive Test of Phonological and Print Processing-Preschool Edition (CTOPPP).

    • Early math skills. Child assessments include basic math skills and understandings that are essential for the development of more advanced quantitative capabilities. This domain is measured by the Woodcock-Johnson III Applied Problems subtest and the Counting Bears Task.

    • Parent-reported literacy skills. Finally, a measure of parents’ perceptions of their child’s literacy skills is included, using information from the parent interview.

Exhibit 4.1: Spring 2003 Outcome Measures, Data Source or Scoring Method, and Descriptive Statistics for the Combined Sample (notes 1, 2)

Domain | Outcome Measure | Source/Scoring | 3-Year-Old Group: Mean (note 3) | 3-Year-Old Group: SD | 3-Year-Old Group: Range | 4-Year-Old Group: Mean (note 3) | 4-Year-Old Group: SD | 4-Year-Old Group: Range
Cognitive | Peabody Picture Vocabulary Test-III (adapted) | Child Assessment; IRT scoring of the adapted version | 252.01 (82.17) | 35.56 | 148-382 | 292.63 (87.55) | 38.75 | 174-414
Cognitive | Comprehensive Test of Phonological and Print Processing (CTOPPP): Elision | Child Assessment; IRT scoring | 241.59 | 43.63 | 131-379 | 274.47 | 48.14 | 132-385
Cognitive | Letter Naming Task | Child Assessment; Number of letters identified correctly | 4.71 | 7.20 | 0-26 | 10.40 | 9.70 | 0-26
Cognitive | Color Naming/Identification | Child Assessment; Number of colors identified correctly | 13.49 | 6.90 | 0-20 | 16.78 | 5.30 | 0-20
Cognitive | Counting Bears | Child Assessment; Measure of one-to-one counting | 2.77 | 1.32 | 1-5 | 3.68 | 1.34 | 1-5
Cognitive | McCarthy Scales of Children’s Abilities: Draw-a-Design | Child Assessment; Number of shapes drawn correctly | 3.13 | 1.17 | 0-12 | 4.46 | 2.03 | 0-15
Cognitive | Woodcock-Johnson III Tests of Achievement: Letter-Word Identification | Child Assessment; W score generated by the Woodcock-Johnson Compuscore and Profiles Program | 303.77 (91.53) | 25.05 | 264-392 | 322.41 (92.44) | 27.80 | 264-408
Cognitive | Woodcock-Johnson III Tests of Achievement: Spelling | Child Assessment; W score generated by the Woodcock-Johnson Compuscore and Profiles Program | 345.11 (92.47) | 22.58 | 277-426 | 369.66 (90.99) | 25.44 | 277-442
Cognitive | Woodcock-Johnson III Tests of Achievement: Applied Problems | Child Assessment; W score generated by the Woodcock-Johnson Compuscore and Profiles Program | 375.43 (88.16) | 28.39 | 318-436 | 395.98 (87.57) | 25.53 | 318-436
Cognitive | Woodcock-Johnson III Tests of Achievement: Oral Comprehension | Child Assessment; W score generated by the Woodcock-Johnson Compuscore and Profiles Program | 435.48 (92.25) | 14.04 | 418-489 | 443.52 (90.68) | 17.92 | 418-489
Cognitive | Test de Vocabulario en Imágenes Peabody (adapted) (note 4) | Child Assessment; IRT scoring of the adapted version | 250.18 (90.31) | 40.41 | 160-383 | 293.56 (88.18) | 43.80 | 149-442
Cognitive | Batería Woodcock-Muñoz Pruebas de aprovechamiento-Revisada: Identificación de letras y palabras (note 4) | Child Assessment; W score calculated from the Woodcock-Muñoz scoring table in the Test Record | 351.17 (93.49) | 12.52 | 316-392 | 357.54 (86.21) | 11.50 | 316-423
Cognitive | Parent (reported) Emergent Literacy Scale | Parent Interview; Sum of five items | 2.61 | 1.45 | 0-5 | 3.55 | 1.39 | 0-5
Socio-emotional | Social Skills and Positive Approaches to Learning | Parent Interview; Sum of seven items | 12.39 | 1.72 | 4-14 | 12.48 | 1.72 | 4-14
Socio-emotional | Total Child Behavior Problems Scale | Parent Interview; Sum of 12 items | 6.01 | 3.66 | 0-22 | 5.70 | 3.59 | 0-19
Socio-emotional | Aggressive Behavior Scale | Parent Interview; Sum of four items | 3.01 | 1.72 | 0-8 | 2.79 | 1.69 | 0-8
Socio-emotional | Hyperactive Behavior Scale | Parent Interview; Sum of three items | 1.85 | 1.55 | 0-6 | 1.74 | 1.47 | 0-6
Socio-emotional | Withdrawn Behavior Scale | Parent Interview; Sum of three items | 0.57 | 0.95 | 0-6 | 0.68 | 0.96 | 0-6
Socio-emotional | Social Competencies Checklist | Parent Interview; Home inventory from the Developing Skills Checklist; Sum of 12 items | 10.98 | 1.32 | 0-12 | 11.04 | 1.32 | 1-12
Parenting Practices | Parent used time out in the last week | Parent Interview; One item | 0.64 | 0.48 | 0-1 | 0.64 | 0.48 | 0-1
Parenting Practices | Number of times parent used time out in the last week | Parent Interview; One item | 1.77 | 2.25 | 0-28 | 1.68 | 2.50 | 0-100
Parenting Practices | Parent spanked child in the last week | Parent Interview; One item | 0.45 | 0.50 | 0-1 | 0.37 | 0.48 | 0-1
Parenting Practices | Number of times parent spanked child in the last week | Parent Interview; One item | 0.90 | 1.52 | 0-21 | 0.70 | 1.21 | 0-20
Parenting Practices | Parental Safety Practices Scale | Parent Interview; Average score for five items | 3.71 | 0.33 | 2-4 | 3.72 | 0.33 | 2-4
Parenting Practices | Removing Harmful Objects Scale | Parent Interview; Average score for seven items | 3.89 | 0.32 | 1-4 | 3.89 | 0.32 | 1-4
Parenting Practices | Restricting Child Movement Scale | Parent Interview; Average score for four items | 3.89 | 0.29 | 1-4 | 3.88 | 0.31 | 1-4
Parenting Practices | Safety Devices Scale | Parent Interview; Average score for two items | 3.34 | 0.75 | 1-4 | 3.39 | 0.77 | 1-4
Parenting Practices | Family Cultural Enrichment Scale | Parent Interview; Sum of seven items | 3.65 | 1.41 | 0-7 | 3.95 | 1.43 | 0-7
Parenting Practices | How many times child was read to in the last week by parent or other family member | Parent Interview; One item | 2.86 | 0.94 | 1-4 | 2.88 | 0.95 | 1-4
Health | Child seen by dentist since last September | Parent Interview; One item | 0.60 | 0.49 | 0-1 | 0.65 | 0.48 | 0-1
Health | Overall child’s health status | Parent Interview; One item | 0.78 | 0.41 | 0-1 | 0.80 | 0.40 | 0-1
Health | Child had injury in last month requiring medical treatment | Parent Interview; One item | 0.09 | 0.28 | 0-1 | 0.12 | 0.32 | 0-1
Health | Child has health insurance | Parent Interview; One item | 0.92 | 0.27 | 0-1 | 0.88 | 0.32 | 0-1
Health | Child has place for routine medical care | Parent Interview; One item | 0.98 | 0.14 | 0-1 | 0.97 | 0.17 | 0-1
Health | Child has condition that requires ongoing medical care | Parent Interview; One item | 0.13 | 0.34 | 0-1 | 0.11 | 0.32 | 0-1
Health | Child has an unmet health care need | Parent Interview; One item | 0.02 | 0.13 | 0-1 | 0.03 | 0.18 | 0-1

1 The combined sample includes children assessed in all languages in fall 2002 (including English, Spanish, and other languages) and in English in spring 2003.

2 Woodcock-Johnson III provides a Compuscore program that does not convert a raw score of 0 to a standard score. The W ability and standard scores presented in this exhibit reflect a correction factor to accommodate children who had a 0 raw score.

3 Scores in parentheses are standard scores for tests where available.

4 Indicates administered to Spanish-speaking children only in the combined sample.

  2. Social-Emotional Domain:
    • Social skills and approaches to learning. Parents were asked to rate their child’s social skills and positive approaches to learning.[2] Social skills focused on cooperative and empathic behavior, such as “Makes friends easily,” “Comforts or helps others,” and “Accepts friends’ ideas in sharing and playing.” Approaches to learning deal with curiosity, imagination, openness to new tasks and challenges, and having a positive attitude about gaining new knowledge and skills. Examples include “Enjoys learning,” “Likes to try new things,” and “Shows imagination in work and play.” The two scales are based on an instrument used in the Head Start Family and Child Experiences Survey (FACES).[3]

    • Social competencies. Parents were asked to provide information on social capabilities using a Social Competencies Checklist, also used in FACES 2000. The checklist consisted of 12 items; for each item, the parent was asked to report whether the child engaged in that behavior or exhibited that attribute “regularly” or “very rarely or not at all.” Examples of the items included “Shares newly learned ideas,” “Takes care of personal belongings,” “Helps with simple household tasks,” and “Notices when others are happy, sad, angry.” The total scale score could range from zero (all items rated “rarely or not at all”) to 12 (all items rated “does regularly”).

    • Problem behavior. Parents were asked to rate their children on items dealing with aggressive or defiant behavior, such as “Hits and fights with others,” “Has temper tantrums or hot temper,” and “Is disobedient at home.” Other items dealt with inattentive or hyperactive behavior, including “Can’t concentrate, can’t pay attention for long,” and “Is very restless and fidgets a lot.” A third set of items dealt with shy, withdrawn, or depressed behavior, e.g., “Feels worthless or inferior,” and “Is unhappy, sad, or depressed.” For each item, the parent was asked to judge whether the behavioral description was “not true,” “sometimes true,” or “very true” of the child. The Total Behavior Problem scale derived from parent ratings contained 14 rating items, and the total scale score could range from zero (all items marked “not true”) to 28 (all items marked “very true”). The Aggressive Behavior subscale contained four items, and scores could range from zero to eight. The Hyperactive Behavior subscale contained three items, and scores could range from zero to six. The Withdrawn Behavior subscale contained three items, and scores could range from zero to six. These scales were also used in FACES 2000, and their development was based on prior work by Rutter, Achenbach, Zill and Peterson, and others (see ACF, 2001). The mean scores obtained in the Head Start Impact Study were closely comparable to mean scores obtained from parents of an independent national sample of Head Start children surveyed for FACES 2000.[4]

  3. Health Domain:
    • Access to health care. Parents were asked to report on various health care services, two of which are used in this report:

      • Whether the child has health insurance. Parents were asked if the child was covered by Medicaid or a state health insurance program, or by health insurance through their job or the job of another employed adult.

      • Whether the child has received dental care. Parents were asked if the child had ever seen a dentist.
    • Child’s health status. Parents were asked to report on their child’s health status:

      • Child’s health status (excellent or very good). Parents were asked if, overall, the child’s health was excellent, very good, good, fair, or poor. This outcome was coded “yes” for those who reported that their child’s health was excellent or very good.

      • Whether the child needs ongoing medical care. Parents were asked if their child had an illness or condition that requires regular ongoing medical care.

      • Whether the child received medical care for an injury in the last month. Parents were asked how many times their child, in the last month, had seen a doctor or other medical professional or visited a clinic or emergency room for an injury. This outcome was coded “yes” if the parent reported any such occurrences in the last month.
  4. Parenting Practices Domain:
    • Educational activities. Parents were asked to report on the types of educational activities they did with their child:

      • Reading to the child at home. Parents reported on the item “How many times have you or someone in your family read to [CHILD] in the past week?” Possible responses range from 1 (not at all) to 4 (every day).

      • Cultural enrichment activities. Parents reported on a 7-item checklist of activities the parent, or another family member, may have done with the child during the past month. The seven activities are going to a movie; going to a play or concert; visiting an art gallery or museum; visiting a playground, park, or zoo; attending a community, ethnic, or religious event; talking about family or cultural heritage; and going on errands. A total score was computed by summing the number of different activities the parent and child participated in together, with a possible score of 0 (none) to 7 (all).

    • Discipline strategies. Parents reported on the following:

      • Use of physical discipline. Parents reported on the item “Sometimes children mind pretty well and sometimes they don’t. Have you spanked [CHILD] in the past week for not minding?” For parents who responded yes, the Frequency of physical discipline was also created from parent reports on the item “About how many times in the past week?” Responses ranged from 0 to 21 times.

      • Use of time out. Parents reported on the item “Have you used ‘time out’ or sent [CHILD] to his/her room in the past week for not minding?” For parents who responded yes, the Frequency of time out was also created from parent reports on the item “About how many times in the past week?” Responses ranged from 0 to 100 times.

    • Child safety practices. Parents reported on a 10-item scale that assessed how often the 10 different safety precautions were used, including keeping harmful objects out of reach, using car seats, supervising the child during bath time, and having a first aid kit and working smoke detector at home. Possible responses ranged from 1 (never) to 4 (always). In addition to a total overall scale score, exploratory factor analyses yielded three separate subscales that were also used in the analysis: removing harmful objects from the home, restricting child movements from dangerous situations, and having safety devices available for the child.

Creation of Test Scores and Scales

As noted in Exhibit 4.1, IRT analysis was used to develop the adapted (shortened) versions of the PPVT-III and TVIP to significantly reduce the time required to test individual children (i.e., reducing the burden on the child). IRT analysis achieves this goal by treating test items as interchangeable components that can be added or substituted without altering the underlying test scale, i.e., higher ability children did not have to be administered easier items, and lower ability children did not have to be administered more difficult items to get a reliable test score for each child.

IRT analysis was also used to score child assessments for the PPVT-III, TVIP, and CTOPPP Elision tests. The advantage of IRT analysis for scoring is that it uses the actual pattern of right, wrong, and omitted responses to the items administered in an assessment and uses the item difficulty, discrimination, and guessing behavior to place each child on a continuous ability scale. In this case, data from the multiple waves of existing FACES data collection were used to conduct the IRT analysis. If an assessment is shortened (as in the case of the PPVT-III or TVIP), IRT analysis also has advantages over the use of simple raw scores. By using the overall pattern of right and wrong answers, and the characteristics of each item to estimate ability, IRT analysis can compensate for the possibility that a low-ability child will correctly respond to several difficult items by guessing. Unlike raw scores, which treat omitted items as if they had been answered incorrectly, IRT procedures use the pattern of actual responses to estimate the probability of correct responses for all assessment questions, including any omitted items. The Compuscore and Profiles Program (Riverside Publishing, 2001) was used to score the Woodcock-Johnson III subtests, and publisher look-up tables (Riverside Publishing, 1990) were used to score the Woodcock-Muñoz. The total number of correct responses (or raw score) was used for the Letter Naming Task, Color Naming/Identification, Counting Bears, and the McCarthy Draw-a-Design subtest.
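The scoring idea described above can be illustrated with a minimal sketch. The report does not name the IRT model or the estimation routine, so the three-parameter logistic (3PL) form, the grid-search estimator, and every item parameter below are assumptions chosen only because the text mentions difficulty, discrimination, and guessing.

```python
import math
from typing import List, Optional, Tuple

# Hypothetical item parameters: (discrimination a, difficulty b, guessing c).
Item = Tuple[float, float, float]


def p_correct(theta: float, item: Item) -> float:
    """3PL probability that a child with ability theta answers the item correctly."""
    a, b, c = item
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta: float, items: List[Item],
                   responses: List[Optional[int]]) -> float:
    """Log-likelihood of a response pattern (1 = right, 0 = wrong, None = omitted)."""
    total = 0.0
    for item, r in zip(items, responses):
        if r is None:
            continue  # an omitted item is skipped, not scored as wrong
        p = p_correct(theta, item)
        total += math.log(p) if r == 1 else math.log(1.0 - p)
    return total


def estimate_ability(items: List[Item], responses: List[Optional[int]]) -> float:
    """Maximum-likelihood ability on a coarse grid bounded to [-4, 4] (illustrative)."""
    grid = [x / 100.0 for x in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))


# Four items ordered from easy to hard, all with a = 1.2 and guessing c = 0.2.
items = [(1.2, -1.0, 0.2), (1.2, 0.0, 0.2), (1.2, 1.0, 0.2), (1.2, 2.0, 0.2)]

# Same raw score (2 correct), different patterns.
theta_easy_right = estimate_ability(items, [1, 1, 0, 0])
theta_hard_right = estimate_ability(items, [0, 0, 1, 1])
print(theta_easy_right, theta_hard_right)

# A child who omitted the last item still gets a predicted probability for it.
theta_omit = estimate_ability(items, [1, 1, 0, None])
print(p_correct(theta_omit, items[3]))
```

In this sketch the two children have identical raw scores, yet the child who missed the easy items and got only the hard ones right receives the lower ability estimate, because correct answers on hard items are partly attributed to guessing.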

In addition to the direct child assessments discussed above, the following additional scales were developed using items from the parent interview:[5]

  • Problem behaviors. As discussed above, scales were developed to assess children’s social-emotional development. The items making up these ratings were drawn from three measures of children’s positive behavior and behavior problems: the Entwisle Scale of Personal Maturity (Entwisle, Alexander, Cadigan, & Pallis, 1987), the Child Behavior Checklist for Preschool-Aged Children (Achenbach, Edelbrock, & Howell, 1987), and the Home Inventory in the Developing Skills Checklist (CTB/McGraw-Hill, 1990).

  • Maternal depression and locus of control. These two measures were used as covariates in the statistical models and, in the case of depression, as a moderator of program impact (see following discussion). The Depression Scale is derived from the CES-D Depression Scale (Ross, Mirowsky, & Huber, 1983), and the Locus of Control Scale is derived from the Pearlin Mastery Scale (Pearlin & Schooler, 1978).

  • The Parent Emergent Literacy Scale (PELS). PELS is a parent-report on five literacy items originally developed for use in FACES 2000: child can recognize most/all of the letters of the alphabet; child can count to 20; child pretended to write his/her name in the last month; child can write his/her first name; and child can identify the primary colors.

IRT was used to generate the scale scores for the caregiver’s depression and locus of control scales. For the remaining scales, the scale score results from a summation of the item responses for each scale.
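As a rough illustration of the summation scoring described above, the sketch below scores one summed scale and one averaged scale. The numeric codes (0, 1, 2) for the three response labels and the handling of unanswered items are assumptions; the report gives the labels and score ranges but not the coding rules.

```python
from typing import Optional, Sequence

# Assumed numeric codes for the three-point behavior items. They are consistent
# with the stated ranges (e.g., four items scored 0 to 8) but are not given in the text.
RESPONSE_CODES = {"not true": 0, "sometimes true": 1, "very true": 2}


def sum_scale(responses: Sequence[str]) -> Optional[int]:
    """Return the summed scale score, or None if any item lacks a valid response."""
    if any(r not in RESPONSE_CODES for r in responses):
        return None  # treating incomplete scales as missing is an assumption
    return sum(RESPONSE_CODES[r] for r in responses)


def mean_scale(ratings: Sequence[int]) -> float:
    """Average-score scale, e.g., safety items rated 1 (never) to 4 (always)."""
    return sum(ratings) / len(ratings)


# Hypothetical parent responses to the four Aggressive Behavior items.
aggressive = ["sometimes true", "not true", "very true", "sometimes true"]
print(sum_scale(aggressive))      # 1 + 0 + 2 + 1 = 4
print(mean_scale([4, 4, 3, 4]))   # 3.75
```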

The Analysis Sample

The sample used in this report to estimate impacts on children and families was chosen to maximize the data available by including every completed child assessment and parent interview from the spring 2003 wave of data collection (the end of the first Head Start program year). Observations were compiled independently for child assessments and parent interviews, and information was included from one of these sources even when the other source was missing. For this reason, and also due to item nonresponse for specific questions in completed questionnaires, sample sizes are not identical for all analyses, i.e., different outcome variables involve slightly differing numbers of observations. The comparability of the Head Start and non-Head Start samples established at random assignment is maintained to the greatest extent possible in each instance by adjusting the initial sampling weights to offset observable differences between respondents and nonrespondents at baseline (see the discussion below and Appendix 1.2 for details).

Rather than dropping observations with missing data on background variables from the fall 2002 data collection, a statistical “hot-deck” procedure was used to impute missing background variables for cases with either (1) no fall 2002 parent interviews, or (2) incomplete fall 2002 data caused by item nonresponse. Appendix 4.1 provides details of the imputation process, including initial missing data rates for all the imputed variables.
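A hot-deck procedure of this general kind can be sketched as follows. This is a minimal illustration, not the study's procedure (which is documented in Appendix 4.1): the imputation classes, variable names, and random donor selection are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence

Record = Dict[str, object]


def hot_deck_impute(records: List[Record], target: str,
                    class_vars: Sequence[str], seed: int = 0) -> List[Record]:
    """Fill missing `target` values with a value from a random donor in the same class."""
    rng = random.Random(seed)

    # Pool the observed values of `target` within each imputation class.
    donors = defaultdict(list)
    for rec in records:
        if rec[target] is not None:
            donors[tuple(rec[v] for v in class_vars)].append(rec[target])

    completed = []
    for rec in records:
        new = dict(rec)               # leave the input records untouched
        new["was_imputed"] = False
        if new[target] is None:
            pool = donors.get(tuple(rec[v] for v in class_vars))
            if pool:                  # no donor in the class: value stays missing
                new[target] = rng.choice(pool)
                new["was_imputed"] = True
        completed.append(new)
    return completed


# Hypothetical fall 2002 background records.
records = [
    {"id": 1, "age_group": 3, "language": "English", "mother_education": "HS diploma"},
    {"id": 2, "age_group": 3, "language": "English", "mother_education": None},
    {"id": 3, "age_group": 3, "language": "English", "mother_education": "Some college"},
    {"id": 4, "age_group": 4, "language": "Spanish", "mother_education": "Less than HS"},
    {"id": 5, "age_group": 4, "language": "Spanish", "mother_education": None},
]
completed = hot_deck_impute(records, "mother_education", ["age_group", "language"])
for rec in completed:
    print(rec["id"], rec["mother_education"], rec["was_imputed"])
```

Flagging imputed values, as the `was_imputed` field does here, keeps the imputation visible to later analyses.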

The set of completed questionnaires and assessments was divided into two separate samples, one for children entering Head Start 1 year before anticipated kindergarten entry—referred to as the 4-year-old group—and one for children entering Head Start 2 years prior to expected kindergarten entry—the 3-year-old group. This corresponds to the structure of the original random assignment, which was done separately for the two age groups to allow a separate experimental examination of each group of newly entering children. Analysis weights were established separately for the child assessments and for parent interviews. The analysis weights (which initially were based on the probability of selection into the study sample during random assignment) were adjusted to compensate for nonrespondents by increasing the weight for responding children with similar individual and family background characteristics on variables measured for all randomly assigned cases in 2002, prior to random assignment. (See Appendix 1.2 for details of the weight-adjustment procedures.)
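The weight adjustment described above can be sketched as a weighting-class adjustment, in which respondents in each cell absorb the weight of the nonrespondents in that cell. The cell definitions and weights below are hypothetical, and the study's actual procedure (Appendix 1.2) may differ.

```python
from collections import defaultdict
from typing import Dict, List

Case = Dict[str, object]


def adjust_for_nonresponse(cases: List[Case]) -> List[Case]:
    """Inflate respondents' base weights so each cell keeps its full weighted total."""
    total = defaultdict(float)       # weighted count of all sampled cases per cell
    responding = defaultdict(float)  # weighted count of respondents per cell
    for c in cases:
        total[c["cell"]] += c["base_weight"]
        if c["responded"]:
            responding[c["cell"]] += c["base_weight"]

    adjusted = []
    for c in cases:
        if not c["responded"]:
            continue  # nonrespondents drop out of the analysis file
        factor = total[c["cell"]] / responding[c["cell"]]
        adjusted.append({**c, "analysis_weight": c["base_weight"] * factor})
    return adjusted


# Hypothetical cells formed from baseline characteristics known for every case.
cases = [
    {"id": 1, "cell": "A", "base_weight": 10.0, "responded": True},
    {"id": 2, "cell": "A", "base_weight": 10.0, "responded": True},
    {"id": 3, "cell": "A", "base_weight": 20.0, "responded": False},
    {"id": 4, "cell": "B", "base_weight": 15.0, "responded": True},
    {"id": 5, "cell": "B", "base_weight": 15.0, "responded": True},
]
weighted = adjust_for_nonresponse(cases)
for c in weighted:
    print(c["id"], c["analysis_weight"])   # cell A weights double; cell B is unchanged
```

The adjusted weights sum to the same total as the original base weights, so the respondents continue to represent the full randomly assigned sample.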

The weighted data, therefore, are intended to represent the same universe for all spring 2003 outcomes examined: the national population of newly entering 3-year-olds and, separately, the national population of newly entering 4-year-olds. For some purposes, the universe at each age level is divided by the primary language used to assess the child in fall 2002 and spring 2003. As discussed in Chapter 1, children who could not complete all the assessment batteries in English had their assessments primarily administered in Spanish or, for a small fraction of the sample, some other non-English language. In examining Head Start’s impact on child cognitive and social-emotional development, we report separately on the set of children assessed initially in Spanish and the children assessed initially in English or some other language.[6] Like all subgroups defined by characteristics independent of the intervention and not affected by random assignment, these subsamples are, but for chance,[7] well-matched between children originally randomized into the Head Start versus non-Head Start groups and represent a valid experimental examination in and of themselves. Thus, the separate language-of-assessment analyses provide measures of Head Start’s impact on each subpopulation that are as unbiased as those the full study provides for the full population.

In general, all the analyses described in this chapter encompass children assessed in all languages in fall 2002 (including Spanish, English, and other), other than those in Puerto Rico, and are carried out separately for the 3-year-old and 4-year-old cohorts. The one exception concerns spring outcome measures collected for children assessed initially (i.e., in fall 2002) in Spanish. As noted elsewhere, child assessments used in spring 2003 for these children were supplemented with two cognitive assessments designed specifically for Spanish-speaking children and administered only to those children whose original fall assessments were conducted in Spanish, i.e., the TVIP (adapted) and the Woodcock-Muñoz Letter-Word Identification Test. The current report includes these two tests in the impact analyses for those children whose initial fall assessments were conducted primarily in Spanish.

Methods of Estimating Head Start Impacts

The impact of Head Start is assessed from three points of view, reflecting three philosophies for arriving at the best evidence from a randomized experimental design: (1) differences in average outcomes, (2) differences in outcomes adjusted for fall 2002 demographic characteristics, and (3) differences in outcomes adjusted for both children’s demographic characteristics and fall 2002 “starting points” on the outcome measures used in the particular analysis (e.g., fall PPVT measure). Each method is discussed below.

Difference in Average Outcomes

The National Head Start Impact Study, like other evaluations that use random assignment to allocate slots to program participants, provides a framework for attributing child outcomes to the effects of the program, rather than to other factors that may influence child development. Unlike pre-test/post-test analyses and other comparison group approaches, this framework makes accurate impact measurement possible without considering any individual child’s starting point. If enough individuals are randomized to the Head Start and non-Head Start groups, and if all randomized individuals are included in the follow-up analysis, important differences in later outcomes are almost certain to result from the intervention being examined rather than other factors. In rare instances, these groups can differ by chance alone on background factors affecting outcomes. However, statistical tests are used to decide whether outcome differences are significant, which limits to 5 percent or less the probability of concluding that Head Start had an impact when the observed difference arose by chance alone. Actual measurement, and adjustment for possible chance differences in starting points, is not essential under this design (although it can be useful for certain reasons, as discussed below).

The simplicity of the basic Head Start/non-Head Start comparison of spring outcomes, without recourse to other data, provides a powerful motivation for evaluating program impacts in just this way. The transparency of the methodology, and its lack of dependence on sometimes complex statistical methods, makes these “difference-in-means” results good candidates as initial measures of Head Start’s impact. Appendix 4.2 presents the most basic version of this analysis, contrasting the average outcome level for the Head Start group with the average outcome level for the non-Head Start group using unweighted data. This is as close to a simple “randomize and see what happens” approach to experimental evaluation as possible. However, the unweighted estimates can be biased because they do not take into account the differential probabilities of selection of children in the sample. The child weights account for the sampling of PSUs, grantee/delegate agencies, centers, and children within centers so that the study sample can be used to represent the national Head Start population.

These weighted difference-in-means impact estimates are reported in Chapters 5-8 and are also included in Appendix 4.2 for comparison purposes. Statistical tests determine which of the measured outcome differences between Head Start and non-Head Start children can be considered real impacts rather than simply due to sampling error. For continuous outcome variables (e.g., PPVT-III scale score), the tests are based on ordinary least-squares (OLS) regression models that replicate the difference-in-means calculation by expressing spring 2003 outcomes as the sum of an intercept term and a shift in the intercept produced by a dummy variable for inclusion in the Head Start group.[8] For discrete outcome variables (e.g., use of dental care), logistic regressions were used to perform the equivalent computations, with results converted to a scale of 0 to 1 to obtain measures of impact on the probability that a particular outcome occurs (e.g., getting a dental check-up).[9] Appendix 4.3 describes both procedures in detail, including a formal statement of the regression equation in mathematical notation.
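The equivalence between a weighted difference in means and a regression on a single dummy variable can be checked with a short sketch. The scores, weights, and function names below are hypothetical, and the closed-form weighted least-squares formulas stand in for whatever software the study used.

```python
import math
from typing import List, Sequence, Tuple


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def diff_in_means(y: List[float], t: List[int], w: List[float]) -> float:
    """Weighted mean outcome of the Head Start group minus that of the control group."""
    treat = [(yi, wi) for yi, ti, wi in zip(y, t, w) if ti == 1]
    ctrl = [(yi, wi) for yi, ti, wi in zip(y, t, w) if ti == 0]
    return (weighted_mean(*zip(*treat)) - weighted_mean(*zip(*ctrl)))


def wls_on_dummy(y: List[float], t: List[int], w: List[float]) -> Tuple[float, float]:
    """Weighted least squares of y on a constant and the dummy t: (intercept, slope)."""
    t_bar, y_bar = weighted_mean(t, w), weighted_mean(y, w)
    sxy = sum(wi * (ti - t_bar) * (yi - y_bar) for yi, ti, wi in zip(y, t, w))
    sxx = sum(wi * (ti - t_bar) ** 2 for ti, wi in zip(t, w))
    slope = sxy / sxx
    return y_bar - slope * t_bar, slope


# Hypothetical spring scores, group indicators (1 = Head Start), and child weights.
y = [260.0, 248.0, 255.0, 240.0, 251.0, 243.0]
t = [1, 1, 1, 0, 0, 0]
w = [1.0, 2.0, 1.5, 1.0, 2.5, 1.0]

impact = diff_in_means(y, t, w)
intercept, slope = wls_on_dummy(y, t, w)
print(impact, slope)   # the two impact estimates agree up to rounding error
print(intercept)       # equals the weighted control-group mean
```

The regression form is convenient because covariates can later be added to the same equation without changing how the impact coefficient is read.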

Outcomes Adjusted for Fall 2002 Demographic Characteristics

While an intact randomized sample and complete outcome data ensure that no systematic biases enter into the simple difference-in-means estimates of Head Start’s impact, more sophisticated analysis methods provide further advantages. In addition to assignment to the Head Start study group, other factors such as a child’s background and family characteristics may influence her/his outcomes in later months. If these factors can be included in models that “explain” child outcomes as the joint result of Head Start access and demographic background characteristics, uncertainty about the process used to generate outcomes will decline. In addition, confidence in the role of measured factors, including assignment to the Head Start group, will increase. This effect, known statistically as “reducing variance,” will increase the chances of detecting as statistically significant any impact Head Start has on the outcomes of interest. Correspondingly, the smallest impacts the study can detect with 80 percent certainty, known as “minimum detectable effects,” become smaller as additional factors are taken into account. This makes the research more capable of detecting Head Start impacts should such impacts occur.
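The link between variance reduction and minimum detectable effects can be made concrete with a textbook approximation. The sketch below assumes simple random assignment of individuals, a two-sided 5 percent test, and no clustering or weighting effects. The sample sizes and the share of variance explained are hypothetical, so the output does not describe this study's actual precision.

```python
from math import sqrt
from statistics import NormalDist


def mde_sd_units(n_treat: int, n_control: int, r_squared: float = 0.0,
                 alpha: float = 0.05, power: float = 0.80) -> float:
    """Minimum detectable effect in standard-deviation units (simplified formula)."""
    z = NormalDist()
    # About 1.96 + 0.84 = 2.80 for a two-sided 5 percent test with 80 percent power.
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # Covariates that explain a share r_squared of outcome variance shrink the
    # standard error of the impact estimate by a factor of sqrt(1 - r_squared).
    standard_error = sqrt((1 - r_squared) * (1 / n_treat + 1 / n_control))
    return multiplier * standard_error


# Hypothetical group sizes; compare no covariates with covariates explaining 40 percent.
unadjusted = mde_sd_units(1000, 700)
adjusted = mde_sd_units(1000, 700, r_squared=0.40)
print(round(unadjusted, 3), round(adjusted, 3))
```

Under these assumptions, explaining 40 percent of the outcome variance shrinks the minimum detectable effect by a factor of sqrt(0.6), or roughly 23 percent.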

To add the explanatory power of background factors to the analysis, the regression models used to obtain difference-in-means estimates can be extended to express outcomes (or, in the case of logistic models, the probability of a particular outcome) as a function of both assignment to the Head Start group (the dummy variable used previously) and a set of key demographic variables measured in fall 2002. This regression equation includes a constant, the dummy variable modeling assignment to the Head Start group, and a set of key background variables. Appendix 4.3 shows this extension of the formal mathematical model underlying all the impact analyses.
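The formal statement of these models appears in Appendix 4.3 and is not reproduced in this chapter. As a sketch only, with notation that is assumed here rather than taken from the report, the models described above have the following generic form, where Y_i is child i's spring 2003 outcome, T_i is the Head Start assignment dummy, X_ik are the K fall 2002 background variables, and the coefficient on T_i is the estimated impact:

```latex
% Difference-in-means model: intercept plus a shift for the Head Start group
Y_i = \beta_0 + \beta_1 T_i + \varepsilon_i

% Extension with fall 2002 demographic covariates
Y_i = \beta_0 + \beta_1 T_i + \sum_{k=1}^{K} \gamma_k X_{ik} + \varepsilon_i

% Logistic analogue for a binary outcome (e.g., child has seen a dentist)
\Pr(Y_i = 1) = \frac{1}{1 + \exp\!\left[-\left(\beta_0 + \beta_1 T_i + \sum_{k=1}^{K} \gamma_k X_{ik}\right)\right]}
```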

The background variables used were selected in five stages, starting with a focus on the four different outcome domains (cognitive, social-emotional, health, and parenting) but then coming together into a single set of variables:

  • Specification of the likely predictors of child and family outcomes for each domain, based on past research and the set of child and family measures collected by the study in fall 2002.

  • Merger of the four sets of predictors (one for each outcome domain) into a single comprehensive list.

  • Identification of any covariate whose role in the regression equations is at times unstable and whose coefficient, therefore, cannot always be estimated.[10]

  • Removal of the unstable covariates from all regressions.[11]

  • Removal (from all regressions) of covariates whose values in either age group may have been affected by the group to which a given child was randomly assigned, Head Start or non-Head Start (see next subsection).

These steps resulted in a single uniform set of covariates included in all the impact regressions that take account of child and family demographic characteristics, a list provided in Exhibit 4.2. Each demographic variable used is posited to relate to the outcomes [12] in linear fashion. For background variables that provide two-way categorizations of all the children in the sample (the great majority), this reduces to a simple shift in the average outcome level between the two groups.

Exhibit 4.2: Fall 2002 Demographic Variables Included in the Statistical Models Estimating the Impact of Head Start

Child Covariates

  • Child Gender
  • Child Age in Months as of 9/1/02
  • Child Race/Ethnicity, Black (all models except for cognitive outcomes for the Spanish-English language group, and logistic models of parenting and health outcomes)
  • Child Race/Ethnicity, Hispanic
  • Child Has Special Needs

Parent Covariates

  • Caregiver Depression Scale
  • Primary Caregiver’s Age as of 9/1/02
  • Both Biological Parents Live with Child
  • Biological Mother Is a Recent Immigrant
  • Mother’s Highest Level of Educational Attainment
  • Primary Caregiver’s Self-Reported Health Status
  • Parents Are Separated or Divorced
  • Mother Had a Birth as a Teenager
  • Caregiver’s Locus of Control Scale

Household Covariates

  • Grandparent Lives in the Household (all models except for cognitive outcomes for the Spanish-English language group, and logistic models of parenting and health outcomes)
  • Number of Household Moves in Last 12 Months
  • Household Monthly Income Range
  • Household Receives TANF

Outcomes Adjusted for Initial Fall “Starting Points”

Another set of factors helps explain child outcomes and increase the precision of the estimated impacts of Head Start: the initial fall starting points for the key outcome measures used in the impact analyses. A child’s cognitive abilities measured at the beginning of her or his Head Start enrollment strongly predict her or his cognitive abilities at the end of a year in the program (or in the non-Head Start comparison group). For this reason, a third set of the Head Start impact estimates was calculated, adjusting for each child’s initial fall 2002 value on the respective outcome measures used to calculate the spring 2003 impact estimates. Thus, for example, to better explain Head Start’s impact on the spring 2003 PPVT-III (adapted) scores, the fall 2002 score on the same cognitive assessment (in this case each child’s fall 2002 PPVT-III (adapted) score) was added to the regression analyses discussed above. Appendix 4.4 provides the particular fall 2002 cognitive assessment score, or social-emotional, health, or parenting indicator, added in this fashion for each of the spring 2003 outcomes for which impacts are examined.13

There is no question that spring outcomes are, on average, higher for children who tested higher on that measure in the fall and lower for children who tested lower in the fall. Similarly, children who engaged in a particular type of behavior in the fall, or parents who adopted certain child-rearing practices in the fall, were more likely to do so the following spring. This makes the pre-test version of the outcome variable especially helpful in explaining outcomes observed in the post-test period of spring 2003, thus obtaining more precise measures of Head Start’s impact in the later period. Controlling for pre-test levels of spring outcomes may also remove potential differences between the Head Start and non-Head Start samples due to nonresponse in the spring data collection. While nonresponse adjustment to the analysis weights was used to offset differential response rates, including pre-test measures as covariates helps to offset any remaining difference.
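A small sketch may help show what adjusting for the fall starting point does to an impact estimate. It assumes a single pre-test covariate, no analysis weights, and a common slope in both groups; under those assumptions the coefficient on the assignment dummy equals the raw difference in spring means minus the pooled within-group slope times the fall difference. The function name and the scores are hypothetical.

```python
from statistics import mean

def pretest_adjusted_impact(fall_t, spring_t, fall_c, spring_c):
    """Return (raw, adjusted, slope) for a spring outcome adjusted for its fall value.

    The adjusted figure equals the coefficient on the assignment dummy in an
    OLS regression of the spring score on a constant, the dummy, and the fall score.
    """
    sxy = sxx = 0.0
    # Pooled within-group slope of spring score on fall score.
    for fall, spring in ((fall_t, spring_t), (fall_c, spring_c)):
        mf, ms = mean(fall), mean(spring)
        sxy += sum((f - mf) * (s - ms) for f, s in zip(fall, spring))
        sxx += sum((f - mf) ** 2 for f in fall)
    slope = sxy / sxx
    raw = mean(spring_t) - mean(spring_c)
    adjusted = raw - slope * (mean(fall_t) - mean(fall_c))
    return raw, adjusted, slope

# Hypothetical scores: the non-Head Start group happens to start 4 points higher in fall.
fall_hs, spring_hs = [80, 90, 100, 110], [98, 106, 114, 122]
fall_non, spring_non = [84, 94, 104, 114], [97.2, 105.2, 113.2, 121.2]

raw, adjusted, slope = pretest_adjusted_impact(fall_hs, spring_hs, fall_non, spring_non)
```

Here the toy spring scores were built with a 4-point impact. The raw spring difference understates it because the comparison group started ahead, and the pre-test adjustment removes that starting gap.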

But adjustment for each outcome measure’s starting point using pre-test data creates some ambiguity in interpreting the resulting impact estimates. Any differences in initial fall pre-test measures between the Head Start and non-Head Start groups will be statistically controlled when the fall 2002 outcome measures are added to the spring 2003 impact analyses that previously contained only demographic characteristics of children and families.14 The ambiguity arises in deciding whether controlling for the initial differences in fall pre-test measures in this way enhances or diminishes the reliability of the impact estimates.

A good deal is at stake in this assessment since (based on a procedure described below) as many as 10 of the 27 pre-test measures considered as possible adjustment factors may have differed to an important extent between the Head Start and non-Head Start groups in the 3-year-old cohort, and as many as 12 of 27 measures in the 4-year-old cohort.

Whether one should adjust for such factors depends on the reason the fall 2002 measure of the spring 2003 outcome differs systematically between the Head Start group and non-Head Start group. In many randomized impact studies, removing initial pre-test differences on outcome measures between the intervention group and the control group can only enhance the impact estimates. Specifically, initial differences on average fall outcome measures between the two groups may occur for one or more of the following reasons:

  • chance differences between the types of children put in the two groups during random assignment;

  • corruption of the random assignment process that undercuts the intention of giving every child an equal probability of being randomly assigned to the Head Start group regardless of his or her characteristics or any other individual-specific factors; or

  • omission of different types of children from the Head Start group’s spring 2003 analytic sample than from the non-Head Start group’s spring 2003 analytic sample, due to potential differential nonresponse rates between the groups in collecting the spring outcome data.15

If one or more of these factors contributes to the observed initial differences between groups on the initial fall outcome measures, then adjustment for this initial difference in fall outcome scores will improve that impact estimate. If, on the other hand, the initial differences on fall outcome scores result from an early impact of the Head Start intervention, then the inclusion of the initial fall scores in the model may attenuate the resulting impact estimates. In this instance, Head Start does not get “full credit” for the impacts it achieves by the time outcomes are measured in spring 2003 since some of those potential early impacts—the portion that may have occurred by the time the initial fall 2002 data were collected—are not counted. These potential early impacts will be removed from the spring impact estimates by the inclusion of the initial fall outcome measures in the analyses.
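The "full credit" issue can be written out under a simple linear adjustment. This is an illustrative sketch, not the formal model in the appendices: the symbols are introduced here for exposition, and a single fall covariate with a common slope in both groups is assumed.

```latex
% T = 1 for the Head Start group and 0 otherwise; X = fall 2002 score; Y = spring 2003 score.
% \hat{\beta} is the within-group slope of Y on X.
\begin{align*}
\hat{\delta}_{\text{unadj}} &= \bar{Y}_{T=1} - \bar{Y}_{T=0} \\
\hat{\delta}_{\text{adj}}   &= \hat{\delta}_{\text{unadj}}
                               - \hat{\beta}\,\bigl(\bar{X}_{T=1} - \bar{X}_{T=0}\bigr)
\end{align*}
% Case 1: the fall gap reflects only chance, flawed randomization, or nonresponse.
%         Subtracting \hat{\beta} times the gap removes a spurious component.
% Case 2: the fall gap reflects an early impact \gamma of Head Start on X.
%         With \Delta the total spring impact, the adjusted estimate then targets
\[
\Delta - \beta\gamma ,
\]
% so the portion \beta\gamma achieved before fall data collection is not counted.
```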

In many randomized impact studies, the pre-test versions of the important program outcome variables, as well as all demographic characteristics, are measured prior to, or at the point of, random assignment, when one can be sure that differences between the intervention and control groups (if any) do not reflect early impacts of the program.16 In those instances, removal of any measured differences between the two groups in pre-test values or demographic characteristics when calculating subsequent impacts necessarily improves the estimates, whether those differences are caused by chance, corruption of random assignment, or differential nonresponse during data collection. So where feasible, experimental evaluations measure all sample members’ characteristics prior to randomization and adjust for them in the impact analysis without fear of potentially doing harm to the estimates. Two of the demographic variables used in this analysis (see Exhibit 4.2) fit into this category: child gender and race/ethnicity. Both come from rosters completed by program intake staff prior to random assignment.

Unfortunately, data collection prior to random assignment was infeasible for many of the important demographic variables in Exhibit 4.2 and all the initial developmental and behavioral “starting point” measures listed in Appendix 4.4. When (1) in-depth in-person information must be collected for a large, highly geographically dispersed experimental sample—implying high costs of data collection per sample member—and (2) notification of acceptance into the program must occur quickly once eligible applicants are identified, most background data cannot be collected prior to random assignment. Additional data collection for persons ultimately found ineligible for inclusion in the program or the research would be unethical and inordinately costly. These were exactly the circumstances of the National Head Start Impact Study when random assignment took place in mid-2002.17

Due to these constraints, most of the fall 2002 data on children and families in the study were collected over a 3-month period from October 2002 through December 2002 (with most completed by mid-November) at a considerable lag from random assignment. As a result, the possibility that some early Head Start impacts may have preceded fall 2002 data collection for many children cannot be ruled out. Moreover, these potential early impacts could account for some, or all, measured differences in characteristics between the Head Start and non-Head Start samples at that point. There are two exceptions, however, among the demographic variables measured after random assignment: whether the biological mother of the child in question was a teen parent and whether she first arrived in the US within the last 5 years. We do not believe these measures could have been affected substantially by the program18 and, hence, have included them in all analyses involving demographic background factors. All the remaining demographic variables in Exhibit 4.2 (other than those noted earlier as collected prior to random assignment), including measures of living arrangements, marital status, income, educational attainment, and the health status of the child’s primary caregiver, in theory could have been influenced by the Head Start intervention prior to measurement in fall 2002. This same concern arises for the fall 2002 measures of developmental and behavioral “starting points” listed in Appendix 4.4. Any potential early impact of Head Start on these variables will be excluded from spring impact estimates when the analysis controls for pre-test measures in calculating effects.

To make the decision of which control variables (“covariates”) to include in the analyses, a statistical procedure was developed for this study that tests whether appreciable early impacts on factors measured in fall 2002 could have occurred. Rather than presume that no such impacts occurred unless the data prove otherwise (as one would do if the usual test for statistical significance were used), the procedure adopted requires strong evidence that early impacts of an appreciable magnitude did not occur. Only then does the tradeoff between possible small omissions of potential early fall impacts from the spring impact estimates on the one hand, and gains in statistical precision plus removal of nonresponse bias on the other, become favorable and warrant the inclusion of the initial fall outcome scores in the analyses.

The procedure adopted seeks a 90 percent assurance that Head Start’s potential early impact on fall demographic characteristics, and on initial fall outcome measures, was small or nonexistent.19 In such cases, the risk of potentially excluding a small portion of the overall impact from the spring 2003 impact estimates is more than offset by expected precision gains and nonresponse bias reductions from including the initial fall outcome variables in the analyses.20 Otherwise, it may be unwise to adjust for the initial fall outcome scores in computing spring impact estimates.
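The logic of demanding positive evidence that an early impact is small, rather than merely failing to detect one, can be sketched as a confidence-interval check. This is only one plausible reading and not the study's actual procedure, which is defined in the report's own notes. The 0.1 effect-size threshold, the normal approximation, the unweighted data, and the function name are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def early_impact_is_small(fall_hs, fall_non_hs, max_effect_size=0.1, assurance=0.90):
    """Check whether a confidence interval for the fall effect size lies within +/- max_effect_size.

    Returns (passes, (lower, upper)). The threshold of 0.1 is a hypothetical value.
    """
    n1, n2 = len(fall_hs), len(fall_non_hs)
    pooled_sd = sqrt(((n1 - 1) * stdev(fall_hs) ** 2 + (n2 - 1) * stdev(fall_non_hs) ** 2)
                     / (n1 + n2 - 2))
    effect = (mean(fall_hs) - mean(fall_non_hs)) / pooled_sd
    # Approximate standard error of a standardized difference (ignores error in the SD).
    se = sqrt(1 / n1 + 1 / n2)
    z = NormalDist().inv_cdf(0.5 + assurance / 2)
    lower, upper = effect - z * se, effect + z * se
    return (-max_effect_size < lower and upper < max_effect_size), (lower, upper)

# Hypothetical fall scores generated deterministically so the example is reproducible.
fall_hs = [85 + (i % 31) for i in range(1000)]
fall_non_hs = [85 + ((7 * i) % 31) for i in range(1000)]

passes_large, interval_large = early_impact_is_small(fall_hs, fall_non_hs)
passes_small, interval_small = early_impact_is_small(fall_hs[:50], fall_non_hs[:50])
```

The design choice to note is the direction of the burden of proof. With 1,000 children per group the interval is narrow enough to fit inside the threshold, so the covariate would be admitted. With 50 per group the interval is too wide to rule out an appreciable early impact, so it would be excluded even though the observed difference is small.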

Demographic characteristics that failed the test, other than those noted above as appropriate for inclusion on other grounds, were simply omitted from the impact analyses presented in each of the following chapters. Impact estimates computed both with and without the inclusion of the initial fall outcome measures are presented in the tables within each chapter.21 However, the actual discussion of the relevant impact findings highlights the impact estimates that, in our view, provide the best evidence of Head Start impact. Based on the test for appreciable initial differences in fall measures, when strong evidence is found that Head Start exposure has at most a small effect on the fall pre-test measure, the discussion favors the impact estimate that includes the fall outcome measure as a covariate, with an anticipated gain in precision and nonresponse bias removal as a result. When such evidence is not available, a cautious approach is taken emphasizing impact estimates that exclude the initial fall outcome measure as an explanatory variable.22 This avoids all risk of excluding part of Head Start’s overall impact when reporting spring findings.

Presentation of Results

Tables of results presented in subsequent chapters include all three perspectives for measuring Head Start’s impacts discussed above, i.e., (1) simple differences in average outcomes, (2) impacts adjusted purely for demographic characteristics that are clearly unaffected by random assignment to the Head Start or non-Head Start group, and (3) impacts adjusted for both demographic characteristics and fall 2002 developmental/behavioral starting points. Based on the potential risks and rewards of adjusting impact estimates for differences in fall 2002 pre-test outcome measures (as discussed above) for each statistically significant impact, the tables highlight the single most appropriate measure among the three. Overall conclusions of the research, including the summary of findings presented in the Executive Summary, draw from findings on the preferred measure only; in practice, the pattern of results does not differ much across the three approaches. For context, the tabular results also include the average outcome levels for the Head Start and non-Head Start samples.

The discussion of the preferred findings also provides the corresponding effect sizes, which are defined as the impact estimates divided by the standard deviation of the outcome measure in the population, providing a “yardstick” for gauging the quantitative importance of a measured impact in relation to the natural variation of the child or family outcome Head Start is seeking to affect. Many researchers have used Cohen’s (1987) guidelines for interpreting the relevance of effect sizes, with an effect size of 0.2-0.5 being considered small, 0.5-0.8 moderate, and over 0.8 large.23 Within the field of education research, some researchers have argued that an effect size has to be at least 0.25 or 0.33 of a standard deviation to be considered “educationally meaningful” (Slavin, 1990; Wolf, 1986).24, 25

In contrast, Glass et al. (1981)26 and McCartney and Rosenthal (2000)27 have asserted that the effect sizes derived from a given study always should be interpreted within the context of the empirical literature on comparable interventions designed to produce similar effects. In the NICHD Study of Early Child Care, the quality of child care predicted children’s cognitive performance at 54 months (range of effect sizes was 0.04 to 0.08).28 The Tennessee study examining the benefits of smaller class sizes in the early school grades yielded effect sizes that ranged between 0.13 and 0.27 on several direct assessments of children’s reading and math performance (Finn & Achilles, 1990).29 A meta-analysis of evaluations of family support programs yielded the following weighted mean effect sizes across several key outcome domains: children’s cognitive development (0.253), social-emotional development (0.258), physical health and development (0.091), parenting attitudes and knowledge (0.182), parenting behavior (0.246), and family functioning/family resources (0.284) (ACF, 2001).30 Finally, another recent meta-analysis of 33 studies focusing primarily on early childhood education programs for low-income 3- and 4-year-olds revealed a weighted mean effect size of 0.118 across the studies reviewed (Aos, Lieb, Mayfield, Miller, & Pennucci, 2004).31 Following this contextual approach to judging the magnitude of impacts, this report uses the following convention: less than 0.2 is small; 0.2-0.5 is moderate; and greater than 0.5 is large. This allows interpretation of effect sizes within the broader context of findings from other similar early childhood intervention studies.
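The effect-size definition and this report's size convention can be expressed as a short sketch. The function names and the example impact and standard deviation are hypothetical; only the cut points (0.2 and 0.5) and the 0.118 figure come from the text above.

```python
def effect_size(impact, outcome_sd):
    """Impact estimate divided by the standard deviation of the outcome measure."""
    return impact / outcome_sd

def magnitude(es):
    """Label an effect size using this report's convention (sign is ignored)."""
    size = abs(es)
    if size < 0.2:
        return "small"
    if size <= 0.5:
        return "moderate"
    return "large"

# Hypothetical example: a 4.5-point impact on an outcome with a standard deviation of 15.
example = effect_size(4.5, 15.0)
example_label = magnitude(example)

# The weighted mean effect size cited above from the 2004 meta-analysis.
meta_label = magnitude(0.118)
```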

Analysis of Subgroups and Moderating Factors

To this point, the discussion has focused on ways to measure the impact of Head Start on the average child or family in the program. Of course, impacts will likely vary across different subsets of the children and families served. For example, Head Start may benefit boys more than girls (or the reverse), or it may benefit families headed by a single parent more than two-parent families (or the reverse).

In addition to an interest in the overall national impact of Head Start on children’s school readiness, Congress mandated an examination of how impacts vary for different types of children and families. The intent is to understand “what drives the overall impacts” when the program is having an effect of important magnitude for the average Head Start participant. In particular, there is interest in determining the extent to which the benefits of Head Start may be widespread—i.e., whether the benefits reach many types of children and families to produce the overall average effect rather than benefiting some but having little or no effect on others.

Identifying groups of children (or families) that benefit more or less from Head Start may have important policy and program implications. It can suggest areas where the program needs to be strengthened or enhanced to ensure that all participants advance in their development. For example, Head Start programs are required to serve children with special needs so it is important to understand the extent to which these children benefit from their participation over and above an interest in determining if Head Start improves the lives of the average participant. In addition, prior early childhood research has indicated that some groups of children follow different developmental paths and may, as a consequence, be assisted by Head Start in distinctive ways, such as children in racial and ethnic minority groups and non-English speaking children and parents.

This interest in “who benefits?” motivates two types of analyses. The first considers the impact of Head Start on individual subgroups of program participants, asking for example: Does Head Start help Hispanic children? Children with single parents? Immigrant families? Mothers who first gave birth as teens? Special needs children? This same set of results can be considered in total to determine whether certain subgroups “drive” the overall average impact or whether widespread benefits accrue to many different subgroups.

The second set of analyses considers whether impacts differ in magnitude between distinct types of children and families. For example, Head Start may have smaller effects on children of recent immigrants than on other children or larger effects on two-parent families than single-parent families. Interest in these comparisons stems from several sources:

  • Researchers want to know what factors “moderate” the influence of early childhood services (such as those provided by Head Start) on child development and family functioning. In this case, the term “moderate” means alter the size of the impact of those services when they are provided to one type of child (or family) versus another. For example, the extent to which a child’s primary caregiver reports symptoms of depression may moderate how much Head Start is able to help him/her develop good social skills, or a child’s home language may moderate the program’s ability to expand reading readiness by getting parents to read more to their child.

  • As noted above, Congress required that the study identify the types of children and families that benefit most from Head Start participation, a question that implicitly relates impacts for one type of child/family to impacts for another. For example, do younger children benefit more than older children? Single-parent families more than two-parent families?

  • Head Start program operators might seek to enhance services in ways that would particularly benefit subgroups found to be experiencing smaller impacts than other subgroups, such as children with special needs or families with diverse cultural or linguistic backgrounds.

With sufficient data, all subgroup impacts and moderator influences would become apparent when the difference in outcomes between Head Start and non-Head Start families in one subgroup is calculated and compared to the difference in outcomes between Head Start and non-Head Start families in another subgroup. But because data are limited, the study cannot decisively answer all questions about Head Start’s impact on different subpopulations. Still, where evidence is strong that an impact on a particular subgroup, or a difference in impacts between subgroups, has occurred, the subgroup analysis will produce a finding that is not difficult to interpret: real impacts in the measured direction have taken place.32 In contrast, a non-significant finding is more ambiguous and could indicate either: (1) that there is in fact no impact, or difference in impact, for some subgroup(s) or (2) that impacts exist but are too small in magnitude to reach the threshold of what the data are able to detect. This means that statements about the subgroup and/or differential impacts that did take place will be much less equivocal than statements about impacts being lacking or undifferentiated between groups. The latter ambiguity makes it hard to be conclusive about which subsets of children and families “drive” the overall results (since additional subpopulations may contribute but escape detection), judge impacts to be widespread as opposed to narrowly concentrated, or identify subgroups that have similarly sized impacts.
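The two kinds of subgroup question, whether an impact exists within a subgroup and whether impacts differ between subgroups, can be sketched as follows. This is a simplified illustration that assumes unweighted differences in means, independent subgroups, and no covariates; the function names and scores are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def impact(head_start, non_head_start):
    """Difference in mean outcomes and its standard error for one subgroup."""
    estimate = mean(head_start) - mean(non_head_start)
    se = sqrt(variance(head_start) / len(head_start)
              + variance(non_head_start) / len(non_head_start))
    return estimate, se

def impact_difference(subgroup_a, subgroup_b):
    """Difference between two subgroup impacts (the moderator effect) and its standard error."""
    (est_a, se_a), (est_b, se_b) = impact(*subgroup_a), impact(*subgroup_b)
    return est_a - est_b, sqrt(se_a ** 2 + se_b ** 2)

# Hypothetical outcome scores: (Head Start group, non-Head Start group) per subgroup.
boys = ([12, 14, 16, 18], [10, 11, 13, 14])
girls = ([11, 13, 15, 17], [10, 12, 14, 16])

boys_impact, boys_se = impact(*boys)
girls_impact, girls_se = impact(*girls)
difference, difference_se = impact_difference(boys, girls)
```

The standard error of the difference combines the uncertainty of both subgroup estimates, so it is always larger than either one alone. In these toy data the estimated impacts are 3.0 and 1.0, but the 2.0-point difference has a standard error of about 2.4, which illustrates why differences between subgroups are harder to establish than impacts within them.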

As a consequence, this preliminary examination of subgroup impacts will, at times, seem incomplete. But the reader should keep in mind that the goal here is to present all the evidence available in the data on subgroup-related questions. Each piece is valid in its own right; yet the full picture may be a patchwork due to statistically inconclusive, ambiguous findings for many subgroups and potential moderating influences.

Exhibit 4.3 lists the subgroup-defining (i.e., moderating) factors examined in the current analyses across the four outcome domains of interest: cognitive, social-emotional, health, and parenting. All subgroups and moderators considered here were identified in advance of the data analysis on the basis of their program and policy importance to Head Start or their relevance to understanding early childhood development generally. On this basis, different subpopulations are germane to different outcome domains. Within a given domain, all subgroups are tested for a common group of outcomes (the set used for the overall impact analysis) so that findings, both statistically significant and insignificant, can be assessed as a group to determine how widespread the benefits of Head Start are and identify within this pattern child/family types that definitely benefit.33 The reasons for selecting the particular subgroups and moderators in the exhibit are discussed below:

  • Whether the child has special needs. Parents reported whether their child had one or more special needs (e.g., learning disability) as of fall 2002. Evidence from Head Start FACES and other studies indicates that children with special needs have lower cognitive scores and lower levels of social skills and higher levels of problem behavior at the beginning and end of the program year, especially after controlling for differences in family circumstances. This could lead to either statistically significant differences in effect sizes or significant impacts on one of these two subsets of children, special needs and non-special needs, and not the other, or both of these patterns. In addition, the existence and size of the health impacts that Head Start can achieve may also be influenced by the presence of disabilities. Consequently, it is important to learn whether the benefits that children with disabilities derive from the opportunity to participate in Head Start are similar to or greater than those of children without disabilities, and whether benefits exist at all in each instance.

  • Child race/ethnicity. Children were categorized as African-American, Hispanic, or White/Other (which includes Asian and Native American). Close to two-thirds of all children enrolled in Head Start are from racial and ethnic minority groups, especially African-Americans and Hispanics. These children, on average, enter Head Start with relatively lower cognitive scores, lower levels of parent-reported social skills, and higher levels of parent-reported problem behavior, even when lower parent education and family income levels are taken into consideration. In addition, many minority children enter Head Start with greater health needs than non-minority children and may not have the same opportunities to obtain health, vision, and dental services. Finally, there is some evidence of the use of harsher discipline by low-income African-American parents.34 As a result, one might expect the impact of Head Start to be greater for minority children than for White children from low-income families. Differences in impact between minority and White children, as well as between Hispanic and African-American children, will be important to detect, as will the existence of impacts for each one of these racial/ethnic groups in its own right.

  • Child gender. Previous research has found significant gender differences in the frequency of cooperative, pro-social, and problem behavior among young children. For example, aggressive and hyperactive behaviors tend to be more common among boys than girls, whereas there is less of a gender difference with respect to withdrawn or depressed problem behavior. Some evidence indicates that raising young boys may present more challenges to parents than raising young girls.35 Given these rate differences, it is of both theoretical and practical relevance to ask separately whether Head Start conveys benefits for males and for females, and whether the size of impacts differs between the two groups.

  • Language of child assessment. Children were assessed in either English in both fall 2002 and spring 2003, or in Spanish in the fall and English in the spring. Hispanic children from Spanish-speaking families are one of the fastest growing segments of the Head Start child population. But many local programs have to struggle to provide staff that are fluent enough in Spanish to communicate easily with both children and parents, as well as to assist children in their acquisition of English. It is important, therefore, to learn whether the benefits that these children derive from Head Start are equivalent to, less than, or greater than those of children from other language backgrounds, and whether impacts occur at all for each of the language groups in its own right.

  • Child’s home language. Parents reported the language that was most often spoken at home, which was categorized as English or not English (this variable was used as a moderator for health and parenting outcomes to capture the language of the parent rather than the child). Families whose home language is not English may be harder to engage in efforts to improve their parenting skills and may not be tied into the social welfare safety net as much as those whose home language is English, due to language and cultural barriers. As a consequence, it is important to learn whether children from non-English-speaking homes benefit from the opportunity to participate in Head Start and if those gains differ from Head Start induced impacts in English-speaking homes.

  • Parent’s marital status. Single-parent families and families with parents who are separated or divorced may be unable to provide their children with as much stability and parental resources as married-parent families, leading to greater needs in the children (as well as emotional and behavioral issues). For example, previous research has indicated that children from separated, divorced, and unmarried family situations have higher levels of difficulties in elementary school, compared with children from stable married families. In addition, Head Start emphasizes parental involvement, and parents from married-parent families may be more able to take an active part in the Head Start program, resulting in greater benefits to their children. It is, therefore, important to learn whether the impact of Head Start is different for children from different types of home environments, and to test separately for impacts in households of each distinctive marital arrangement.

  • Primary caregiver’s depression rating. As part of the fall 2002 interview, parents also reported on the CES-D scale for depressive symptoms.36 A frequent occurrence within low-income families, especially in single-parent families, is depression in the child’s primary caregiver, typically, the mother. Such instances of depression may pose an obstacle to the parent’s participating in Head Start as much as is optimal and in providing support for the child’s learning, social-emotional development, and health care needs. Parents who are struggling with mental health issues may also be less receptive to efforts to strengthen their discipline practices and to increase their engaging in various educational activities at home with their child. Consequently, the impact of Head Start may be less in situations of high levels of caregiver depression at baseline. On the other hand, to the extent that Head Start is a compensatory program designed to make up some of the support and resources that the home is not providing, children with depressed primary caregivers might be expected to show greater benefits from Head Start. The program may also assist the depressed parent in obtaining the necessary services and supports for his/her depression and thus indirectly benefit the child’s social development and emotional well-being. For all these reasons, an important question is whether degree of depression moderates the size of Head Start’s impact.

  • Mother’s age at first birth. Parents were asked, “How old were you when you gave birth for the first time?” as part of the fall 2002 parent interview. Mothers were divided into two groups, those who had given birth before or after age 19 (referred to as “teen” versus “non-teen” mothers), and this variable was used as a moderator for parenting outcomes. Mothers who first gave birth as adolescents are at greater risk for poor childrearing practices. However, because of their heightened risk, Head Start may make special efforts to engage them in services, which may provide greater benefits to them than to mothers who were older when they first became parents.

  • Child’s achievement at the start of Head Start. The child’s score on the outcome variable as of fall 2002 was included as a moderator for both cognitive and social-emotional outcomes. FACES has repeatedly found that children who enter Head Start with lower cognitive scores (e.g., those in the lowest quartile of the Head Start child distribution) show larger cognitive gains from fall to spring than the children with average or above-average entering scores. This phenomenon has been interpreted as indicating that Head Start is of particular benefit to children with larger cognitive deficits. Others have argued that the finding is merely a manifestation of “regression to the mean,” wherein those who are lagging at a particular point of measurement make the largest advances simply by moving back closer to the middle of the distribution. Since this is the natural tendency in a developmental process in which children show spurts of growth at different times, it is not necessarily an indication that Head Start is especially effective for those with greater initial deficits. The crucial issue is whether the progress of children with lower initial achievement is further sped by participation in Head Start, precisely what comparison of the progress made by the Head Start and non-Head Start group children at lowest levels of initial achievement will illuminate. If true program-created impacts have occurred, these cognitive gains may also have positive carryover to social and emotional development, making it important to also look at outcomes in those domains for children with the lowest initial cognitive scores.37
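The regression-to-the-mean argument in the last item above can be illustrated with a small simulation. Every number in it is hypothetical, including the score scale, the amount of measurement noise, and the 3-point impact. It is meant only to show why fall-to-spring gains among low scorers cannot be read as impacts, while the Head Start versus non-Head Start comparison among low scorers can.

```python
import random
from statistics import mean

random.seed(0)
IMPACT = 3.0  # hypothetical true impact, identical for every child

def simulate_child(head_start):
    """Return (fall_score, spring_score): stable ability plus independent test noise."""
    ability = random.gauss(100, 10)
    fall = ability + random.gauss(0, 8)
    spring = ability + random.gauss(0, 8) + (IMPACT if head_start else 0.0)
    return fall, spring

head_start_group = [simulate_child(True) for _ in range(20000)]
comparison_group = [simulate_child(False) for _ in range(20000)]

def mean_gain_low_scorers(children, cutoff=90):
    """Average fall-to-spring gain among children whose fall score fell below the cutoff."""
    return mean(spring - fall for fall, spring in children if fall < cutoff)

gain_head_start = mean_gain_low_scorers(head_start_group)
gain_comparison = mean_gain_low_scorers(comparison_group)
impact_among_low_scorers = gain_head_start - gain_comparison
```

In this setup, low-scoring children in the simulated comparison group show a sizable average gain with no program at all, purely because a low fall score partly reflects bad luck on the test day. Subtracting that gain from the Head Start group's gain leaves an estimate close to the built-in impact.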

Exhibit 4.3: List of Variables Used As Moderators by Outcome Domain

Outcome Domain: Cognitive
Moderators:
  • Child Has Special Needs
  • Child’s Race/Ethnicity
  • Caregiver Depression
  • Language of Child Assessment
  • Caregiver Married
  • Fall Measure (PPVT, Bear Rate, Draw Score (3-year-old group only), and Color Score (3-year-old group only))

Outcome Domain: Social Emotional
Moderators:
  • Child Has Special Needs
  • Child’s Race/Ethnicity
  • Caregiver Depression
  • Language of Child Assessment
  • Caregiver Married
  • Caregiver Separated or Divorced
  • Child’s Gender
  • PPVT

Outcome Domain: Health
Moderators:
  • Child Has Special Needs
  • Child’s Race/Ethnicity
  • Caregiver Depression
  • Caregiver Married
  • Home Language

Outcome Domain: Parenting
Moderators:
  • Child’s Race/Ethnicity
  • Caregiver Depression
  • Language of Child Assessment
  • Caregiver Married
  • Child’s Gender
  • Was Mother a Teen at First Birth?
  • Home Language


This set of moderator variables defines subgroups for which separate impact estimates can be calculated. It is important, therefore, to ensure that each moderator and each subgroup determined by a categorical moderator is independent of the intervention. This is so that subgroups are well-matched between children originally randomized into the Head Start or non-Head Start groups and represent valid experimental analyses in and of themselves. For example, if Head Start participation led to greater awareness on the part of parents of a child’s special needs before the fall 2002 data collection, comparisons of the children parents report as special needs in the Head Start group versus the non-Head Start group would not be based on a consistent, matching set of individuals. This could bias measures of impact in this population.38 To avoid this risk, the same evidence of lack of early program impact was required on each moderator variable used (and in both age groups) as previously required of covariates for the regression analysis. Indeed, many of the moderators and subgroup definers in Exhibit 4.3 are the same as the covariates described earlier (compare to Exhibit 4.2 above). All the others listed here were either measured prior to random assignment (e.g., home language) or have also been convincingly shown to have effect sizes that are at most judged to be small by the statistical standard used (e.g., mother’s marital status).

A single regression provides information on both topics of interest: how impacts vary with the moderating factor examined and, if that factor is a 0/1 indicator of membership in a particular group, how large an impact Head Start had on each of the subgroups defined by the moderator variable. This analysis interacts the dummy variable for assignment to the Head Start group with each moderator variable in turn, allowing impact to vary with that factor. For 0/1 moderators such as gender (e.g., 0=boy, 1=girl), Head Start’s impact on individual subgroups can be inferred from the coefficient on the random assignment dummy variable (impact on the omitted subgroup, in this case boys) and the sum of that coefficient and the coefficient on the interaction term itself (impact on the included group, or girls). For moderators that indicate membership in a subgroup, this procedure replicates a difference-in-differences approach that estimates each subgroup impact as the Head Start/non-Head Start difference in mean outcomes for individuals in that subgroup and measures the effect of the moderator on the size of impact as the difference between those two estimates. Continuous moderators (e.g., parental depression) produce regressions that indicate if the impact of Head Start varies with the value of the moderating variable.
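The logic of the interaction regression can be illustrated with a small sketch. The eight data points, the variable names, and the bare-bones least-squares solver below are all hypothetical and are not taken from the study (which used SUDAAN with analysis weights and covariates). The sketch only shows that, for a 0/1 moderator with no other covariates, the coefficient on the assignment dummy equals the impact on the omitted subgroup and that adding the interaction coefficient gives the impact on the included subgroup.

```python
# Hypothetical toy data: outcome y, assignment z (1 = Head Start group),
# and a 0/1 moderator g (0 = boy, 1 = girl). Not study data.
z = [0, 0, 1, 1, 0, 0, 1, 1]
g = [0, 0, 0, 0, 1, 1, 1, 1]
y = [10.0, 12.0, 14.0, 16.0, 11.0, 13.0, 18.0, 20.0]

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# Design matrix columns: intercept, assignment dummy, moderator, interaction.
rows = [(1.0, zi, gi, zi * gi) for zi, gi in zip(z, g)]
k = 4
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
b0, b_z, b_g, b_zg = solve(xtx, xty)  # ordinary least squares via normal equations

impact_boys = b_z           # impact on the omitted subgroup (g = 0)
impact_girls = b_z + b_zg   # impact on the included subgroup (g = 1)

def mean_diff(group):
    """Head Start minus non-Head Start mean outcome within one subgroup."""
    treated = [yi for yi, zi, gi in zip(y, z, g) if gi == group and zi == 1]
    control = [yi for yi, zi, gi in zip(y, z, g) if gi == group and zi == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

print(impact_boys, mean_diff(0))
print(impact_girls, mean_diff(1))
print(b_zg)  # how much the impact differs between the two subgroups
```

In this unweighted, covariate-free case the regression reproduces the subgroup mean differences exactly. Once fall 2002 covariates or analysis weights are added, the two would generally differ somewhat.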

Depending on the analysis perspective adopted (see earlier discussion) various fall 2002 measures are added to the regressions as covariates. Tests of the statistical significance of both potential moderating influences and the average impact of Head Start in a given moderator-defined subgroup are derived along lines similar to those used in testing for the overall impact of Head Start. Details of the moderator and subgroup analysis regression approach and test procedures appear in Appendix 4.3.

Two special restrictions were placed on the subgroup/moderator analyses. First, to avoid findings that may exaggerate contrasts between subgroups due to the vagaries of small-sample analysis, subgroups with fewer than 50 observations in either the Head Start or non-Head Start group were not examined. Second, certain observations could not be included in particular subgroup analyses for certain moderators. For example, children with deceased parents could not be classified in examining how impacts vary with mother’s current marital status and so were left out of that particular moderator regression. Similarly, children who could not be assessed in either English or Spanish in fall 2002 due to lack of familiarity with these languages were dropped from the analysis sample when examining initial cognitive ability as a moderator of impacts.39

Impact findings for subgroups and moderators are presented in the following chapters using two of the three perspectives introduced in the examination of overall results: (1) mean differences adjusted for fall 2002 demographic characteristics, and (2) mean differences adjusted for both fall 2002 demographic characteristics and developmental and behavioral starting points. The preferred perspective is again highlighted in the discussion for each impact or impact difference considered.

Estimating the Impact of Program Participation

All of the impact estimates described to this point measure the effect of Head Start on the average child randomly assigned to the Head Start group. However, as discussed in Chapter 2, not all of these children actually participated in federally funded Head Start services, the intended treatment. This is not an unexpected phenomenon: in the normal course of events, some children and families accepted into Head Start never participate, because their interest in what the program has to offer has declined since application, because other center-based arrangements have been found, or because other events interrupt plans to attend (e.g., moving to another city or distant neighborhood). This suggests two different versions of the research questions posed at the beginning of the study:

  • How much does Head Start help the typical child and family admitted to the program, on average?

  • How much does Head Start help the families and children that actually participate in Head Start, on average?

It will be harder to improve the average outcome of everyone accepted than the average outcome of participants, assuming that non-participants gain little or nothing from the program. If the non-participation rate (also known as the “no-show” rate) exceeds 5 or 10 percentage points, the magnitude of the difference may matter.

Answers to both questions matter for policy and program administration purposes. Head Start programs are typically funded for a fixed number of slots, regardless of whether all slots are used. In that sense, the Federal program pays for slots rather than actual participants where the two differ, so impacts per family or child admitted have some relevance to the fiscal picture. Also, the Head Start program can offer opportunities to participate but it cannot compel any child to attend. Hence, the impact of admission into the program, whether taken or not, measures the typical result of what grantees do (provide access) rather than the effect of delivering services to every selected child and family.

Yet the question of how much children gain from actually participating in Head Start’s services remains an important one. For local programs at full attendance (not simply full enrollment, on paper) impacts per participant correspond with Federal funding per slot. Moreover, if impacts per participant are large but impacts per admitted child comparatively small, the evaluation will show the value of increasing participation rates as an adjunct or alternative to expanding the number of children accepted into the program. Procedures for estimating program impacts on participants are discussed below.

An Experimentally Based Strategy of Estimation

A research study in which random assignment to the intervention group dictates access to program services but not actual utilization of those services cannot directly estimate the average impact of program participation. This is because the Head Start group includes “no-shows” who, when granted access to the program, did not actually participate. The non-Head Start group includes equivalent types of children and families who would not have participated had they been given access.

One could look at outcomes only for actual participants in the Head Start group (excluding the “no-shows”). But the subset of the non-Head Start group that corresponds statistically to these individuals cannot be identified in equivalent fashion—there is no information to identify which of the non-Head Start children would have participated in the program had they been granted access.

Fortunately, the best way to estimate Head Start’s impact on the average participant does not require that one knows anything about why no-shows arise, or how they differ from other families and children in the sample. If it did, any impact measure produced for the participant population would have the same drawbacks that affect quasi-experimental estimates from non-randomized studies: selection bias caused by pre-existing differences between participants and comparison group members. But if one can assume that no-shows experience zero impact from Head Start, it is possible to avoid these kinds of assumptions about (or analyses of) selection into and out of the program. That is, “no-shows” can be entirely different from participants in measured and unmeasured ways, but it is unnecessary to understand how they are different or to make any adjustments for their distinctive characteristics.

This is possible by using the original comparison of all Head Start-group members to all non-Head Start group members but interpreting it in a different way. The new interpretation says that the Head Start group’s impact—how its outcomes differ from what would have transpired without a Head Start program—has two components:

  • The impact on “no-shows” who by definition do not participate in the program, even though admitted, which can logically be assumed to be zero.

  • The impact on everyone else assigned to the Head Start group—i.e., on the Head Start participants who comprise the rest of the experimentally determined intervention group.

This assumption alone, the presumption that children and families who never receive Head Start services remain unaffected by their assignment to the program group, makes it possible to translate the measured effect of the program on the entire Head Start sample (which the experimental design provides directly using research methods described in earlier sections) into an estimate of the average effect of Head Start on just the participants.40 It does not matter what the average effect would have been on non-participants had they participated. Nor does it matter whether non-participants have different outcomes than participants due to “selection” or pre-existing differences. Before focusing on this assumption’s plausibility, the next section traces the implications for measuring in a reliable fashion the impact of Head Start on the average participant.

It is important to understand that the overall impact on all children is simply a weighted average of the impact on participants and the impact on the “no-shows”:

Impact (on all children) = P (impact on participants) + Q (impact on no-shows)

where P is the number of participants in the Head Start group and Q is the number of “no-shows” in the Head Start group. If the assumption of zero impact on the “no-shows” is incorporated into this equation, one gets the following expression:

Impact (on all children) = P (impact on participants) + 0

{Impact (on all children)}/P = (impact on participants)

This does not say that the average effect on participants is the same as the average effect on the whole sample, derived from previously described analyses, when no-shows experience a 0 effect. Instead, the average effect on any set of individual children or families depends on both the total amount of gains accruing to all the individuals in the group (the measures represented by the “Impact (____)” terms above) and the number of individuals in the group. In effect, P just rescales the total gain to all Head Start sample group members by dividing by the number of participants rather than the number of Head Start group members overall.41
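The rescaling can be made concrete with a short arithmetic sketch. All numbers below are hypothetical, and the function name is introduced only for illustration. The sketch assumes, as the text does, that no-shows experience zero impact, and it treats the rescaling factor as the share of the Head Start group that participated.

```python
def no_show_adjusted_impact(overall_impact, participation_rate):
    """Average impact on participants, assuming zero impact on no-shows.

    overall_impact: average impact on all children assigned to Head Start.
    participation_rate: share of the Head Start group that participated.
    """
    if not 0.0 < participation_rate <= 1.0:
        raise ValueError("participation_rate must be in (0, 1]")
    return overall_impact / participation_rate

# Hypothetical illustration (not study results).
n_assigned = 1000        # children assigned to the Head Start group
n_participants = 850     # children who actually participated
overall_impact = 1.7     # average gain per assigned child, in scale points

# The total gain is the same either way; only the divisor changes.
total_gain = overall_impact * n_assigned
impact_per_assigned = total_gain / n_assigned
impact_per_participant = total_gain / n_participants

# Equivalent shortcut: divide the overall impact by the participation rate.
shortcut = no_show_adjusted_impact(overall_impact, n_participants / n_assigned)

print(impact_per_assigned, impact_per_participant, shortcut)
```

Because the participation rate cannot exceed 1, the per-participant impact is never smaller in magnitude than the overall impact, and the two coincide only when every assigned child participates.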

The most important aspect of this “no-show adjusted” estimated average impact on program participants is demonstrated by Bloom (1984)42 in the seminal article on this topic and shown to be equivalent to the instrumental variables estimator by Angrist, Imbens, and Rubin (1996).43 It can be calculated from the initial overall average impact estimate and information on which (or how many) intervention group members participate in the program and which do not. It also has the crucial property that it cannot suffer from selection bias due to pre-existing differences between participants and no-shows or participants and comparison group members. Specifically, if the original experimental comparison of average outcomes between all Head Start group members and all non-Head Start group members is not biased by systematic differences between these two randomly generated groups at baseline, the simple rescaling of the original estimate cannot be biased. This theorem, based solely on the assumption of zero impacts on non-participants, provides a broadly accepted basis for the now almost universal practice of reporting impact estimates for participants-only along with the all-intervention-group impact findings.44
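A small simulation may help show why the rescaled estimate avoids selection bias while a direct participant-versus-comparison contrast does not. Everything in this sketch is an assumption made for illustration: the sample sizes, the “ability” trait, the participation rule, and the true effect of 5 points are hypothetical and do not describe the Head Start Impact Study data.

```python
import random

random.seed(0)
TRUE_EFFECT = 5.0   # hypothetical gain for a child who actually participates
N = 20000           # hypothetical number of children per randomized group

def draw_child(assigned_to_head_start):
    """Return (spring outcome, participated) for one simulated child."""
    ability = random.gauss(50.0, 10.0)   # pre-existing trait, unobserved
    would_participate = ability > 42.0   # no-shows differ systematically
    participated = assigned_to_head_start and would_participate
    # No-shows and comparison children receive zero impact by construction.
    outcome = ability + (TRUE_EFFECT if participated else 0.0)
    return outcome, participated

head_start = [draw_child(True) for _ in range(N)]
comparison = [draw_child(False) for _ in range(N)]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Experimental contrast of the two complete randomized groups.
overall_impact = mean(y for y, _ in head_start) - mean(y for y, _ in comparison)
participation_rate = mean(1.0 if p else 0.0 for _, p in head_start)

# No-show adjusted estimate: rescale the experimental contrast.
adjusted = overall_impact / participation_rate

# Naive contrast: participants only versus the whole comparison group.
naive = mean(y for y, p in head_start if p) - mean(y for y, _ in comparison)

print(round(adjusted, 2), round(naive, 2))
```

In this setup the rescaled estimate lands close to the true participant effect even though participants were selected on an unobserved trait, whereas the naive contrast is inflated by the participants’ pre-existing advantage.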

Adjusting for Head Start Participation by Members of the Non-Head Start Sample

The study had no way to fully ensure that the children and families randomly assigned to the non-Head Start group did not participate in federally funded Head Start. The grantees and delegate agencies whose applicants made up the research sample agreed not to serve those families using Federal Head Start funds during the 2002-03 program year. But other grantees and delegate agencies in nearby communities (or, in the case of several large cities, in overlapping neighborhoods) did not enter into such agreements and, for reasons of privacy, could not be told the identities of the children and families involved in the study even had agreement been reached not to serve them. Moreover, no mechanisms existed for enforcing the commitments made by the participating grantees and delegate agencies.

In light of these limitations and the strong attraction of Head Start to many families, it is not surprising that a number of families from the non-Head Start sample in fact obtained Head Start services for their children during that year. A total of 17.6 percent of the children in the non-Head Start group are known to have participated in a federally funded Head Start program for at least 1 day during the analysis period once analysis weights are applied. Though some of these enrollments may have been very brief, Head Start likely had some of the same impacts in this subset of the comparison group as it did for those randomly assigned to the Head Start sample. If so, measured impacts from the comparison of the two complete samples will understate the average impact of the intervention, for all children granted access to Head Start and, following the adjustment described in the previous section, for children in the Head Start sample who actually participated in the program. The consequences of this “contamination” of the research design may be slight and (as discussed earlier in the report) have precedent among randomized evaluations of social programs. Still, it is important to take program participation by the non-Head Start sample into account in looking at the findings and, if possible, to develop measures of impact that reduce or remove any potential “contamination bias” that may have occurred as a result.

A number of strategies have been suggested for analyzing members of a randomly selected comparison group who receive an intervention. These individuals are referred to in the literature as “crossovers” based on the fact that they “crossed over” the line between the two situations created at random assignment. One option for accounting for these individuals (generally viewed as unacceptable) is to assume that the program had no impact on them and interpret unadjusted findings as reflective of the intervention’s full impact. This may be a good first approximation, depending on the duration and intensity of program involvement among comparison group members and the number of such individuals. But it is at best a lower bound for the desired measure,45 Head Start’s impact on the average participant compared to a statistically equivalent group with no Head Start participation at all. If no exploration is done of the potential consequences of crossover behavior, the study will be able to say with full confidence that “Head Start’s average impact is at least this big, and possibly bigger,” but it will not be able to say with confidence what the upper extreme might be in terms of Head Start’s impact.

Of the three known approaches to addressing the possible magnitude of the canceling-out problem for crossovers (see below), the current report focuses on the most intuitively plausible and straightforward method: removing the “contaminated” cross-over cases from the non-Head Start sample and recalculating the impact of the program without them. In using this strategy, it is acknowledged that the resulting estimates are no longer fully experimental, i.e., they do not emerge from the comparison of two complete sets of individuals made statistically equivalent through random assignment. Specifically, if families who obtain access to Head Start despite having been assigned to non-Head Start status differ from other families randomly assigned, the removal of the crossovers will change the composition of the remaining non-Head Start sample, i.e., the latter set of individuals will no longer match the complete Head Start sample to which it is compared. Moreover, the counterparts to crossovers cannot be removed from the Head Start sample; they cannot be identified, since no one knows which children randomly assigned to enter Head Start in the study sites would still have managed to participate in the program had they been assigned originally to the non-Head Start comparison group.

The mismatch caused by removal of crossovers from the analysis sample creates a new form of possible bias, not contamination bias (since the contaminated crossovers have been removed) but selection bias because the removed cases were selected non-randomly. If selection bias exists, it may skew the impact estimates up or down, depending on the ways crossover children differ from other children in the non-Head Start sample. For example, it may be that non-Head Start sample children who gain access to the program face greater developmental challenges than other children and fare worse in the spring regardless of their program participation. This would be the case if the families of the most disadvantaged children, and/or the Head Start providers to which those families apply, press particularly hard to obtain a strong pre-kindergarten experience for those children, and in particular a Head Start experience. If this is the case, removing crossovers from the non-Head Start sample will skew the average outcome of the comparison sample upward and thus lead to an understatement of the program’s impact. Alternatively, children with relatively favorable prospects may be the most likely to participate in Head Start as crossovers, possibly because their parents work harder to support their growth in a variety of ways, one of which may be obtaining access to Head Start when assigned to the non-Head Start sample. If this occurs, the non-Head Start sample loses some of its highest achieving children in the spring, causing what remains of that group to have artificially low outcomes compared to the full Head Start sample. In this case, the resulting impact estimates following the removal of crossovers will overstate the impact of the program.

The reliability of adjusted impact measures that omit crossovers from the comparison group will hinge on the researcher’s ability to measure and adjust for how crossovers differ from non-crossovers. This means the success of the technique will depend on the ability to model selection into the program within the non-Head Start sample, a challenge facing all attempts to measure social program impacts absent random assignment. In the current report, this problem is handled in the same way that nonresponse is treated in sample surveys: the non-crossover members of the control group (who now constitute a non-experimental comparison group rather than an experimental control group) are weighted back up to represent the complete non-Head Start group by adjusting weights in cells defined by available baseline characteristics.

After dropping crossovers from the comparison group sample, the analysis weight adjustments used originally to offset missing data caused by non-response to the spring 2003 collection of follow-up data were recalculated. As explained earlier in the report, children and families omitted from impact calculations due to lack of outcome data are “put back” statistically by increasing the analysis weights (i.e., the degree of influence) of other observations with outcome data that have similar background characteristics. This strategy, applied previously (and separately) to the Head Start and non-Head Start samples, was repeated for just the non-Head Start sample with crossovers treated as additional “non-respondents.” This technique adjusts for the absence of potentially contaminated control group members from the now trimmed-back analysis sample by increasing the analysis weights applied to similar non-Head Start sample members who did not cross over and who therefore remain in the analysis. Appendix 1.2 on analysis weights explains in detail this re-weighting process. The regression analysis used to calculate impacts then controls for remaining differences in observed background characteristics between the two samples being compared using covariates as described above.
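The re-weighting step can be sketched as follows. This is a minimal illustration, not the procedure documented in Appendix 1.2: the record layout, the field names, the cell definition (site and age group), and the weights are all hypothetical. Within each cell, crossovers are dropped and the weights of the remaining non-Head Start children are inflated so that the cell keeps its original weighted total.

```python
from collections import defaultdict

def reweight_without_crossovers(records):
    """Drop crossovers and inflate remaining weights within each cell.

    records: list of dicts with hypothetical keys 'cell', 'weight',
    and 'crossover'. A cell made up entirely of crossovers has no one
    left to carry its weight, so that weight is lost; a real procedure
    would need to collapse such cells first.
    """
    total_by_cell = defaultdict(float)
    kept_by_cell = defaultdict(float)
    for r in records:
        total_by_cell[r["cell"]] += r["weight"]
        if not r["crossover"]:
            kept_by_cell[r["cell"]] += r["weight"]

    adjusted = []
    for r in records:
        if r["crossover"]:
            continue  # treated like an additional non-respondent
        factor = total_by_cell[r["cell"]] / kept_by_cell[r["cell"]]
        adjusted.append({**r, "weight": r["weight"] * factor})
    return adjusted

# Hypothetical non-Head Start sample; cells are (site, age group).
non_head_start = [
    {"cell": ("site A", 3), "weight": 10.0, "crossover": True},
    {"cell": ("site A", 3), "weight": 10.0, "crossover": False},
    {"cell": ("site A", 3), "weight": 20.0, "crossover": False},
    {"cell": ("site B", 4), "weight": 5.0, "crossover": False},
    {"cell": ("site B", 4), "weight": 15.0, "crossover": False},
]

trimmed = reweight_without_crossovers(non_head_start)
for r in trimmed:
    print(r["cell"], round(r["weight"], 3))
```

The adjustment restores each cell’s weighted size but corrects only for differences captured by the cell definitions, which is the limitation the text goes on to discuss.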

Just as was true of the original nonresponse adjustments to analysis weights and regression modeling of background characteristics, there is no assurance that the combination of these methodologies effectively or completely offsets the potential bias of using an incomplete comparison group sample. Only the distinctive attributes of the missing non-Head Start sample members captured by the stratification variables used to re-weight the data (in this case principally site and child age; see Appendix 1.2 for details) or the background covariates in the regression equations will be compensated for, leaving considerable potential for remaining differences in attributes to skew the cross-over adjusted results. Additional research in future reports will consider how much of an influence this possible remaining selection bias may have on the magnitude of the estimates by exploring the other two strategies in the literature for addressing crossovers in random assignment impact evaluations:

  • Sensitivity analyses of how much larger or smaller outcomes (or impacts) for crossovers would have to be—compared with known values of these quantities for other sample members—for the omission of this group to substantially influence the character of or “story” in the findings;

  • Construction of specific alternative scenarios to serve as upper and lower bounds on the size of true impacts, including a scenario sometimes used in the literature that treats crossovers symmetrically with the no-show adjustment described in the previous section. Treating crossovers as has been proposed for no-shows requires much stronger assumptions that generally cannot be justified as the principal response to the crossover problem. Specifically, one would have to assert that (a) the average impact of Head Start on crossovers equals that on the corresponding children and families in the Head Start sample even though the former entered the program through a different indirect or surreptitious route, often with different provider agencies, and (b) that average impact does not differ appreciably from Head Start’s impact on non-crossover-type sample members. Though not justifiable on their face, a reinterpretation of these assumptions as part of an upper bound scenario makes this approach worth pursuing in future research.
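The symmetric scenario mentioned in the second strategy is commonly computed by dividing the overall impact by the difference between the Head Start group’s participation rate and the comparison group’s crossover rate. The sketch below illustrates that calculation under the strong assumptions listed above. The overall impact and participation rate are hypothetical; only the 17.6 percent crossover rate comes from the text. The report itself defers this analysis to future research, so this is not a result of the study.

```python
def symmetric_crossover_adjusted_impact(overall_impact,
                                        participation_rate,
                                        crossover_rate):
    """Rescale the overall impact by the net difference in participation.

    Assumes zero impact on no-shows and the same average impact on
    crossovers as on corresponding Head Start group participants.
    """
    net_rate = participation_rate - crossover_rate
    if net_rate <= 0.0:
        raise ValueError("participation must exceed the crossover rate")
    return overall_impact / net_rate

overall_impact = 1.7         # hypothetical, in scale points
participation_rate = 0.85    # hypothetical share of Head Start group served
crossover_rate = 0.176       # share of non-Head Start group served (from text)

no_show_only = overall_impact / participation_rate
symmetric = symmetric_crossover_adjusted_impact(
    overall_impact, participation_rate, crossover_rate)

print(round(no_show_only, 3), round(symmetric, 3))
```

Because the divisor shrinks when crossovers are netted out, a positive overall impact always yields a larger estimate here than under the no-show adjustment alone, which is why the text treats the result as part of an upper bound scenario.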

For now, the discussion of findings in subsequent chapters focuses on the information the impact study supplies directly, without recourse to these types of quasi-experimental and simulation analysis methods.

Presentation of Results for Participants and Adjusted for Crossovers

Chapters 5-8 present the average impact of access to Head Start (referred to as “intent to treat” estimates). Related appendices present the average impact of Head Start participation, using the no-show adjustment for those outcomes for which overall average impacts or subgroup/moderator impacts are reported as statistically significant. Overall average impacts, presented in the appendix tables for reference, are divided by the corresponding participation rate to obtain average impacts on participants, using subgroup-specific participation rates where appropriate. Tests of the statistical significance of the participant-only impact findings are identical to those of their full-sample antecedents. Given the maintained assumption that Head Start had no effect on non-participants, a zero (non-zero) impact on the entire Head Start group will occur if, and only if, a zero (non-zero) impact occurs for the average participant. Thus, hypothesis test results for all Head Start group members imply hypothesis test results for participants. One can, therefore, reject the null hypothesis that the average impact on participants is zero whenever the full-sample analysis shows impact to be statistically significant for the broader set of all Head Start group members.
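The claim that significance tests carry over unchanged can be checked with a short sketch. The numbers are hypothetical, and the sketch makes the simplifying assumption that the participation rate is treated as a known constant, so that the estimate and its standard error are divided by the same number.

```python
overall_impact = 1.7        # hypothetical impact on all Head Start group members
standard_error = 0.6        # hypothetical standard error of that estimate
participation_rate = 0.85   # hypothetical, treated as a fixed constant

# Test statistic for the full Head Start group.
t_full_sample = overall_impact / standard_error

# Rescaling by a constant changes the estimate and its standard error alike.
participant_impact = overall_impact / participation_rate
participant_se = standard_error / participation_rate
t_participants = participant_impact / participant_se

print(round(t_full_sample, 4), round(t_participants, 4))
```

The two test statistics agree, so any threshold that declares the full-sample impact significant does the same for the participant impact.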

Additionally, the appendix tables that summarize statistically significant impacts include a third set of estimates of effects on participants adjusted to remove the influence of crossovers, i.e., of children assigned to the non-Head Start sample who nonetheless participated in Federal Head Start for at least a day during the 2002-03 program year. These estimates are not attenuated by the potential impacts of Head Start on crossover children as is true of other impact findings, but they may be subject to uncorrected selection bias up or down. Only impacts for an entire age group are examined in this way, not the more detailed findings for subgroups and moderating factors. After computing crossover-adjusted impacts as described earlier in this chapter, the “no-show” adjustment is applied to these estimates to convert results into average impacts on participants for inclusion in the tables. Separate tests of the statistical significance of the crossover-adjusted estimates are presented with the findings, constructed using the procedures described already for the non-crossover-adjusted analyses but applied to the new estimates from the re-weighted data.




1 For certain cognitive measures, information is provided for both Item Response Theory maximum likelihood values and standard scale scores.

2 For each item, the parent was asked to judge whether the behavioral description was “not true,” “sometimes true,” or “very true” of the child. There were seven items in this scale, and scores could range from zero (meaning all the items were rated “not true” of the child) to 14 (meaning all the items were rated “very true” of the child). Mean scores on the scale obtained from parents of Head Start children in the Head Start Impact Study were closely comparable to mean scores obtained from parents of an independent national sample of Head Start children in FACES 2000.2 As in FACES, social skills and positive approaches to learning scores tended to be skewed toward the higher end of the range because parents tended to rate their children as exhibiting most of the positive attributes asked about in the rating instrument. Nonetheless, the scale has shown significant relationships with other measures of children’s social development and with relevant child and family characteristics.

3 Administration for Children and Families. (2001). Retrieved 10/15/04 from: http://www.acf.hhs.gov/programs/opre/hs/faces/index.html#instru.

4 Zill, N., et al. (2003). Head Start FACES 2000: A Whole-Child Perspective for Program Performance. Fourth Progress Report. Washington, DC: Administration for Children and Families, US Department of Health and Human Services.

5 Citations for these scales are in Appendix 4-1.

6 Both sets of children had advanced sufficiently in their English language skills by spring 2003 to be administered follow-up assessments primarily in English (with continued use of several measures administered in Spanish) for use as outcome data for the impact analysis, with the exception of children in Puerto Rico. All Puerto Rican children in the sample spoke Spanish as their native language at the time of random assignment and continued to be assessed in Spanish in the spring. For this reason, Puerto Rico sample members, and hence the Puerto Rico-based portion of the national Head Start program, are not included in the report: The cognitive measures thought crucial to gauging Head Start’s impact are not available for these children on a comparable basis at this early age. Puerto Rico will be added to the analysis sample in subsequent years.

7 In addition to chance, the comparability of the Head Start and non-Head Start samples for the different language groups will depend on the success of the nonresponse weight adjustments made to the overall sample to deal with possible differential nonresponse in the spring 2003 data collection.

8 The coefficient on the dummy variable in this specification provides the impact estimate and is computationally identical to the simple difference-in-means estimate. Its statistical significance is tested using the same estimate of variance (i.e., standard deviation) as the equivalent difference-in-mean estimate and the same Student’s t distribution.

9 Impacts are initially estimated in terms of log-odds ratios, which become probabilities when passed through the logistic transformation. If assignment to the Head Start group has a statistically significant impact on the log-odds ratio when tested using the usual maximum-likelihood test procedure of a logistic model, one can conclude that it also significantly influences the probability of the outcome in question.

10 Unstable coefficients arise for a variety of reasons, most often because a two-way categorical variable has very few (or, for the replicate subsamples used to calculate variances for all the regression coefficients, no) observations in one of its cells. The SUDAAN estimation procedure used for all impact regressions could not produce numeric values of the desired coefficients in these instances.

11 Removal of unstable covariates resulted in the consolidation of certain 5-, 7-, and 8-way categorizations of children/families as sets of dummy variables into 2- and 3-way categorizations, represented in the regressions by 1 or 2 dummy variables in each instance (omitting one of the consolidated categories each time). This collapsing of categories was necessary for child race/ethnicity, mother’s education, primary caregiver’s self-reported health status, and family income range in order to get all impact regressions involving demographic covariates to converge to estimated equations that contain no missing coefficients.

12 Or, in the case of the logistic models, to the log-odds ratio. (back)

13 As noted previously, the estimation procedures for including covariates in the formal regression model are described in Appendix 4.3. (back)

14 The measure of program impact from the regression models—the coefficient on the variable indicating membership in the Head Start group—will include only that portion of the overall difference between the two groups in spring 2003 that is not accounted for by other variables in the model. Fall measures that are systematically higher (or, for factors that Head Start participation might reduce such as parental use of physical discipline, lower) for the Head Start group than the non-Head Start group and that predict child-by-child variations in spring outcomes to some degree will account for some of the systematically higher (or lower) spring outcomes for the Head Start group, precluding the coefficient measuring program impact from doing so. (back)

15 Differences in average fall “outcomes” between the two groups would arise in this instance only to the extent that the analysis weight adjustments for dealing with nonresponse described earlier do not completely compensate for differential nonresponse. (back)

16 Regardless of the intervention, or the channels through which it might influence the initial true characteristics of sample members or the way those characteristics get reported to the evaluation, that influence cannot possibly begin for one group (but not the other) when no one knows which children are in which group. This is precisely the situation that must hold prior to random assignment for the Head Start applicant families and the grantees operating the program (and even for members of the evaluation team collecting the data). (back)

17 In-depth in-person data collection was made necessary because of the number and complexity of the child assessment scales needed for each child, which could only be administered in person by highly trained staff. The high unit cost of this type of data collection dictated that the minimum number of children be put in the sample and assessed, putting a premium on identifying children certain to be included in the evaluation before initiating field data collection. Thus, data collection could not begin until a firm determination was made that a particular child would be randomly assigned into the study sample. This required that the Head Start provider organization deem the child appropriate for services based on its local service-targeting priorities. However, at the point this determination was made, the grantee also faced substantial pressure to notify families selected for the Head Start group that their children would be allowed to participate in the program. This forced random assignment to take place—and in many cases actual Head Start program participation to begin—almost immediately after eligibility was determined, leaving no time for baseline data collection prior to that point. Postponing random assignment long enough for extensive in-person testing of children to take place would have imposed an unacceptable hardship on families and Head Start agencies left wondering which children would be served by the program. Accelerating data collection to substantially precede eligibility determination would inevitably have led to many costly interviews and assessments being conducted for children and families who in the end proved ineligible for inclusion in the study. (back)

18 A biological mother of a child applying for Head Start could not have become a parent for the first time following random assignment, so teen-parent status for all mothers was already established well ahead of the point where Head Start participation began and could have had an effect on actual fertility. Similarly, with all families in the research sample living in the U.S. at the time of application, random assignment could not have changed the fact or timing of immigration. (back)

19 Appendix 4.5 describes the procedure used. “Small” is defined on a relative basis (an effect size of 0.2 or smaller) that takes account of how much the fall measure varies in the population being studied using a guideline suggested by Cohen (1988) that keys off effect size (the ratio of impact to standard deviation, a measure of variation). See Jacob Cohen. (1988). Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ: Lawrence Erlbaum Associates Inc. (back)
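The effect-size guideline can be expressed in a few lines. This is a minimal sketch using the definition given in the footnote (impact divided by standard deviation) and the 0.2 cutoff; the function names and the numbers are hypothetical, and Appendix 4.5 remains the authority on the actual procedure.

```python
def effect_size(impact, standard_deviation):
    """Ratio of an estimated impact to the standard deviation of the measure."""
    return impact / standard_deviation

def is_small_effect(impact, standard_deviation, threshold=0.2):
    """True when the absolute effect size is at or below the threshold."""
    return abs(effect_size(impact, standard_deviation)) <= threshold

# Hypothetical fall-measure differences on a scale with standard deviation 10.
small_case = is_small_effect(impact=1.5, standard_deviation=10.0)   # effect size 0.15
larger_case = is_small_effect(impact=3.0, standard_deviation=10.0)  # effect size 0.30
```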

20 This is true notwithstanding prior steps to remove a portion of the nonresponse bias, if any, through sample weight adjustments. These adjustments, described in Appendix 4.1, remove only those differences attributable to a subset of the fall background characteristics one might want to take into account. By stipulating that the influence of all background factors is approximately linear, impact regressions can adjust for many more (actually, an almost unlimited number of) fall measures that could differ between the Head Start children in the analysis sample and the non-Head Start children in the analysis sample due to differential nonresponse in the spring. The regressions are not constrained by the rapidly shrinking cell sizes that typically limit the number of factors that can be taken into account through stratified matching and reweighting of the data. (back)

21 Language of assessment in the fall (Spanish, English, Other) affected the way impact estimates were calculated for the combined analysis sample. Three different versions of the impact estimates were computed for the following outcome measures: PPVT-III adapted, CTOPPP Elision, WJ-III Oral Comprehension, WJ-III Spelling, Letter Naming Task, and WJ-III Applied Problems. For each outcome, different impact estimates were derived using as fall 2002 covariates: (i) “English PPVT-III adapted” and “Spanish PPVT-III adapted”, (ii) “English PPVT-III adapted” only, or (iii) neither language-specific PPVT-III variable. A similar specification was used to estimate impacts on WJ-III Letter-Word Identification scores using the language-specific versions of this test in fall 2002. The exhibits used to present findings in Chapters 5 through 8 present the results of versions “ii” and “iii” with version “i” discussed in a footnote to the tables. (back)

22 For instances in which three different versions of the impact equations were estimated (see Footnote 21), results favor the version of estimated impact that includes “English PPVT-III adapted” from fall 2002 (or its WJ-III Letter-Word Identification score equivalent) as a covariate. There is strong evidence that Head Start exposure has at most a small effect on this measure for the sample assessed primarily in English while such evidence is lacking for children assessed primarily in Spanish. (back)

23 Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Erlbaum. (back)

24 Slavin, R.E. (1990). Cooperative Learning: Theory, Research, and Practice. Englewood Cliffs, NJ: Prentice Hall. (back)

25 Wolf, F.M. (1986). Meta-analysis: Quantitative Methods for Research Synthesis. Newbury Park, CA: Sage. (back)

26 Glass, G.V., B. McGaw, and M.L. Smith. (1981). Meta-Analysis in Social Research. London: Sage. (back)

27 McCartney, K. and R. Rosenthal. (2000). “Effect size, practical importance, and social policy for children.” Child Development, Vol. 71(1), pp. 173-180. (back)

28 NICHD Early Child Care Research Network & Duncan, G. (2003). “Modeling the impacts of child care quality on children’s preschool cognitive development.” Child Development, 74(5), pp. 1454-1475. (back)

29 Finn, J.D. and C.M. Achilles. (1990). “Answers and questions about class size: A statewide experiment.” American Educational Research Journal, 27(3), pp. 557-577. (back)

30 Administration for Children and Families. (2001). National Evaluation of Family Support Programs: Final Report. Washington, DC: Author. (back)

31 Aos, S., R. Lieb, J. Mayfield, M. Miller, and A. Pennucci. (2004). Benefits and Costs of Prevention and Early Intervention Programs for Youth. Document #: 04-07-3901: Washington State Institute for Public Policy. (back)

32 Strong evidence of a subgroup impact or difference in impacts takes a more complex form here than usual. Because so many different subgroups and subgroup differences are tested, at least some will appear to be statistically significant by chance alone. This follows from the construction of standard tests of statistical significance for individual results. Because not all uncertainty can be removed from the analysis of sample-based data, individual tests of statistical significance must be constructed to allow the possibility of an incorrect conclusion on some occasions—typically, a 5 percent chance. Thus 1 in every 20 cases in which no impact (or difference in impact) has occurred will produce a statistically significant finding. When, in the face of no actual impacts, many tests are run for impacts by subgroup or between subgroups, some of the tests are certain to have this feature—i.e., they will produce “false positive” results. One has no way of knowing which, if any, of the potentially many statistically significant results on subgroups or subgroup impact differentials constitutes a 1-in-20 “false positive.” Hence, testing for whether a “false positive” exists among many statistically significant results (a statistical finding that would itself carry some uncertainty) is of no value absent the ability to determine which one it is. More useful would be a test of whether all significant findings on subgroups are “false positives”; until this possibility is ruled out, one should not draw strong conclusions from subgroup analysis. Two procedures are used here for ruling out false positives. Subgroup impacts for a particular outcome measure such as oral comprehension or use of dental care cannot all be “false positives” when overall impact on that outcome is significant, since some subgroup must have benefited if the average child or family did. 
In addition, the set of significant impacts for a particular subgroup (or for the difference in impact between two subgroups) is very unlikely to consist of only “false positives” when an important share of all the impacts tested for that subgroup (or of all differences in impacts tested between two subgroups) are individually significant. Adopting a cautious approach for this purpose, the analysis in later chapters requires that three times the share of significant results predicted by chance alone when no true impacts occur be significant in order to consider the whole set of results real, i.e., that at least 15 percent of all tests run for a given subgroup or subgroup comparison be statistically significant at the 95 percent confidence level. When this standard is met, or the preceding standard based on a significant average impact for the full sample, it is appropriate to consider each individual subgroup finding reliable in its own right. (back)
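The arithmetic behind this footnote can be sketched as follows. The first function shows how quickly the chance of at least one false positive grows with the number of tests, under the added assumption (not stated in the report) that the tests are independent. The second applies the 15 percent standard to a list of p-values. Function names and p-values are hypothetical.

```python
from fractions import Fraction

def chance_of_any_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n_tests is significant by chance
    when no true impacts exist, assuming the tests are independent."""
    return 1.0 - (1.0 - alpha) ** n_tests

def meets_share_standard(p_values, alpha=0.05, required_share=Fraction(15, 100)):
    """True when the share of individually significant tests is at least
    the required share (15 percent, i.e., three times the 5 percent rate)."""
    significant = sum(1 for p in p_values if p < alpha)
    return Fraction(significant, len(p_values)) >= required_share

# Hypothetical p-values for 20 tests run on one subgroup: 3 are below 0.05.
example_p_values = [0.01, 0.03, 0.04] + [0.50] * 17

any_false_positive_20 = chance_of_any_false_positive(20)
passes_standard = meets_share_standard(example_p_values)
```

With 20 independent tests and no true impacts, the chance of at least one significant result is roughly 64 percent, which is why a single significant subgroup finding is treated with caution unless one of the two standards is met.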

33 The word “definitely” here has its usual statistical meaning of something that is true with a very high degree of certainty, 95 percent certainty in this case. Some such conclusions will be wrong, 1 in 20 on average, but these constitute “false positives” unavoidable in any statistical analysis. As long as one does not add subgroups to be examined during the course of the analysis, a process guaranteed to produce a statistically significant “false positive” finding eventually, mistaken conclusions on who benefits, each taken on its own terms, continue to be very unlikely events. (back)

34 Pinderhughes, E.E., K. Dodge, J. Bates, G. Pettit, and A. Zelli. (2000). “Discipline responses: Influences of parents’ socioeconomic status, ethnicity, beliefs about parenting, stress, and cognitive-emotional processes.” Journal of Family Psychology, 14(3), pp. 380-400. (back)

35 Leaper, C. (2002). “Parenting Girls and Boys.” In M.H. Bornstein (Ed.), Handbook of Parenting, Vol. 1, 2nd Edition. Hillside, NJ: Erlbaum. (back)

36 Ross, C.E., J. Mirowsky, and J. Huber. (1983). “Dividing work, sharing work, and in-between: marriage patterns and depression.” American Sociological Review, 48, 809-823. For this analysis the continuous scale scores were used rather than clinical cutoff scores for depression. (back)

37 In this comparison, any “regression to the mean” that occurs for the Head Start sample will be matched and hence cancelled out by similar regression in the non-Head Start sample. (back)

38 Continuing the example, the children identified by parents in the Head Start group might have less severe needs than other special needs children and hence better developmental outcomes in spring 2003 data. Their inclusion in the Head Start portion of the special needs subgroup analysis, but not in the comparison group portion, would lead to a misleadingly favorable estimate of the program’s impact. Their absence from the non-special needs subgroup analysis would produce an inappropriately unfavorable indication of Head Start’s impact in that subpopulation. (back)

39 Some cognitive measures were collected for all children in fall 2002 regardless of language background and could be analyzed as moderators for all sample members. These included tests administered without major reliance on a particular spoken language, such as counting bears, color naming, and the McCarthy drawing test, as well as the PELS measure based on parent interviews. (back)

40 Appendix 4.6 discusses the basis for this assumption. (back)

41 Impact (average Head Start group member) = Impact (all)/N (all) (back)

42 Bloom, H.S. (1984) “Accounting for no-shows in experimental evaluation designs.” Evaluation Review, 8, pp. 225-246. (back)

43 Angrist, J.D., G.W. Imbens, and D.B. Rubin. (1996). “Identification of causal effects using instrumental variables,” Journal of the American Statistical Association, 91, pp. 444-472. (back)

44 The National Early Head Start Evaluation, for example, reports primarily “no-show-adjusted” estimates of impact on participants rather than highlighting more prominently the more directly obtained impact findings for the average intervention group member. (back)

45 The unadjusted estimate is a lower bound because (a) if true impact on crossovers is zero, it gives the correct answer and (b) if true impact on crossovers is more than zero and in the same direction as the program’s impact on participants in the Head Start group, it is too low by virtue of the impact on crossovers canceling out a portion of the impact on participants when the overall non-Head Start and Head Start groups are compared. (back)
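A small numeric sketch may help make the lower-bound argument concrete. It assumes a simplified setting that is not spelled out in this footnote: a share of the Head Start group participates, a share of the non-Head Start group crosses over into the program, and the group difference is divided by the participation share. All names and values are hypothetical.

```python
def unadjusted_estimate(impact_on_participants, impact_on_crossovers,
                        participation_share, crossover_share):
    """Estimated impact per participant when crossovers are ignored."""
    # Average gain in each group relative to receiving no Head Start at all.
    head_start_group_gain = participation_share * impact_on_participants
    non_head_start_group_gain = crossover_share * impact_on_crossovers
    # The observed difference nets out whatever crossovers gained.
    difference = head_start_group_gain - non_head_start_group_gain
    # Spread the group difference over participants only.
    return difference / participation_share

# Hypothetical values: true impact on participants is 5 points.
no_crossover_impact = unadjusted_estimate(5.0, 0.0, 0.8, 0.1)    # case (a)
with_crossover_impact = unadjusted_estimate(5.0, 5.0, 0.8, 0.1)  # case (b)
```

In case (a) the estimate recovers the assumed true impact of 5 points; in case (b) it falls below 5 because the gain among crossovers cancels part of the difference between the groups.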

 
