CHAPTER VI: FIELD TESTING QUALITY ENHANCEMENTS
Quality enhancements that have been carefully developed, replicated, and evaluated on a small scale are ready for the next step: a broad field test in a representative group of Head Start program grantees and centers. Whereas the small-scale evaluation is designed to address whether the enhancement can work in Head Start programs that are willing to try it, the field test is designed to address whether the enhancement works in a group of programs that are representative of all Head Start programs in selected regions or nationally. To extrapolate the results of the evaluation to the Head Start program nationally will require a very large sample because of clustering at the classroom, center, and grantee stages. A larger sample increases the cost of the field test, through higher implementation costs, higher data collection costs, or both, which will limit the number of field tests that can be completed.
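The effect of clustering on the required sample size can be illustrated with a short sketch. It uses the standard design-effect approximation for equal-sized clusters, 1 + (m - 1) x ICC, and treats only one level of clustering (children within centers), whereas the design discussed here involves classrooms, centers, and grantees. All numeric values, including the intraclass correlation, are hypothetical and are not drawn from Head Start data.

```python
def design_effect(cluster_size: float, icc: float) -> float:
    """Design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1.0 + (cluster_size - 1.0) * icc


def effective_sample_size(n_children: int, cluster_size: float, icc: float) -> float:
    """Number of independent observations the clustered sample is worth."""
    return n_children / design_effect(cluster_size, icc)


# Hypothetical inputs, chosen only for illustration.
n_children = 3000          # children in the evaluation sample
children_per_center = 30   # average children sampled per center
icc = 0.15                 # assumed share of outcome variance between centers

deff = design_effect(children_per_center, icc)
n_eff = effective_sample_size(n_children, children_per_center, icc)
print(f"design effect: {deff:.2f}, effective sample size: {n_eff:.0f}")
```

Under these assumed values, the clustered sample carries far less information than the same number of independently sampled children would, which is why a nationally representative field test needs many grantees and centers rather than simply many children.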
Funding for programs such as Head Start must be used wisely to ensure that the programs are as effective as possible at enhancing children’s development. Thus, conducting research and evaluation as new ideas are implemented can help to ensure that the Head Start program continually identifies good ideas and refines them so that, over time, the Head Start community learns what works best for children. In this way, Head Start can reinforce its position as a national laboratory for good early childhood practice.
At times, policymakers might be ready to move to a larger-scale implementation of an idea that seems to address a pressing Head Start program need. It is likely that several relevant ideas would be in the development stages or beyond, and reports summarizing experiences and lessons as the idea is refined would help policymakers to consider and, perhaps, test alternatives. Even if ideas have not yet emerged from Stages 1 and 2, an evaluation still could be conducted if they are implemented strategically to support a rigorous evaluation approach.
The newly instituted Head Start National Reporting System (NRS), which provides assessment data covering vocabulary, letter recognition, and early mathematics for all four-year-old Head Start children, has increased the urgency of conducting a large-scale evaluation of quality enhancement ideas. Head Start programs recently received the results of the first year of Head Start NRS testing and many wondered what their next steps should be. Companies are offering preschool curriculum materials and technical assistance to help programs to address perceived gaps in instructional quality. Although there is likely to be no shortage of ideas for helping Head Start programs to improve their NRS assessment scores, there also is little solid evidence of the efficacy of the options being offered to programs. Many programs will adopt whatever approach or curriculum is presented to them directly or is well advertised. However, decisions about alternative approaches would be better informed if the Administration for Children and Families (ACF) could spearhead an effort to systematically test alternative promising approaches.
This chapter describes research designs that can provide nationally representative estimates of the effects of Head Start quality enhancement ideas on classroom environments, teacher and staff practices, and children’s development. We first describe the general approach to these designs, including ways in which the Head Start training and technical assistance (T/TA) system can be strategically deployed and administrative data used to conduct the evaluation at reasonable cost. Some quality enhancement ideas will be best implemented at the grantee level, while many others will be implemented at the center level. We have selected two examples of quality enhancements, a grantee-wide self-study and improvement process and alternative literacy curricula, which can be implemented at the grantee and center level respectively. These examples illustrate the issues in evaluating quality enhancement ideas at the Stage 3 level. Section B describes the design of an evaluation of a grantee-wide quality enhancement initiative to use program data in a self-study process to identify and implement quality improvement efforts, and Section C describes the design of an evaluation of alternative literacy curricula, which would be implemented within centers.
APPROACH TO FIELD TESTING QUALITY ENHANCEMENTS
Field Test Design
Implementation
Measuring Outcomes and Other Family and Program Characteristics Using Head Start Administrative Data
- Comprehension of Spoken English. This component is an English-language screener to identify children whose English is insufficient to participate in the full assessment.1 It consists of items from two subtests from the Oral Language Development Scale of the PreLAS 2000 (Duncan and DeAvila 1998). The first set of 10 items uses the “Simon Says” game to request that children follow simple commands, such as, “Touch your ear” and, “Point to the door.” In the second set of 10 items, children are asked to name or describe the function of objects in pictures.
- Vocabulary. This section, adapted from the Peabody Picture Vocabulary Test, third edition (PPVT-III), includes 24 items that represent a range of difficulty. The PPVT-III was shortened for this purpose using Item Response Theory (IRT) techniques based on data from the field test and from the Family and Child Experiences Study (FACES).
- Letter Naming. This section, developed for the Head Start Quality Research Centers curriculum intervention studies, presents all 26 pairs of upper- and lowercase letters of the alphabet in three groupings. (The Spanish-language version of the assessment, described below, contains 30 letters.) Children are asked to name the letters they know.
- Early Math. This section, adapted from the mathematics assessment used in the Early Childhood Longitudinal Study—Kindergarten cohort (ECLS-K), includes 17 items on number understanding, shape recognition, relative size judgments and measures, and simple word problems involving counting or basic addition and subtraction.
TRAINING AND TECHNICAL ASSISTANCE TO SUPPORT PROGRAM ASSESSMENT TO IMPROVE QUALITY
- Assessments of children conducted by the program during the year, using locally selected measures and gauging children’s progress across a broad set of child outcomes
- Services received by children and families during the year; intensity; and participation rates
- Family and community service needs
- Observational assessments of the quality of classroom instruction, availability of materials to support learning, and adult-child interactions
- Staff satisfaction surveys
- Discussions with teachers, staff, parents, and partners in the community about service needs, adequacy, quality of current services, and areas for improvement
Study Design
- Did staff implement all features of the enhancement, including data collection, data analysis to identify program strengths and weaknesses, identification of strategies for program improvement, and setting priorities for improvement plans for the coming year?
- What level of T/TA was needed to implement the enhancement? What were the education levels and experience levels of T/TA staff? How much T/TA was provided, and over what time period?
- What challenges were encountered in implementing the enhancement, and how were they resolved?
- To what extent are control-group programs collecting data, analyzing the data, and identifying areas for improvement?
- Is the quality of the classroom environment and learning activities higher in programs that implemented the enhancement relative to the control group?
- Is the quality of teacher-child interactions higher in programs that implemented the enhancement relative to the control group?
- Is parent participation in the program’s parent education programs higher in programs that implemented the enhancement relative to the control group?
- Are children who need health, mental health, and disabilities services linked with these services more effectively in programs that implemented the enhancement relative to the control group?
- Do children in programs using a continuous improvement process progress further in vocabulary, early literacy skills, and early mathematics skills than those in control-group programs?
- Do children in programs using a continuous improvement process show greater social skills and fewer behavioral problems than those in control-group programs?
- Do children in programs using a continuous improvement process show greater sustained attention to task, greater engagement of adults and peers, and more self-control than those in control-group programs?
- What are the impacts of the continuous improvement process on outcomes for key subgroups of children? What are the outcomes for Head Start programs with different characteristics?
Implementing the Quality Enhancement
Outcomes Measurement and Data Collection Plans
APPROACHES TO ENHANCING EARLY LITERACY
- Ladders to Literacy. This program consists of 60 teacher-led activities organized into three units: (1) print awareness (for example, understanding the parts of a book and reading from left to right), (2) phonological awareness (understanding rhymes, understanding syllables, and linking sounds with letters), and (3) oral language (vocabulary). Activities require varying amounts of preparation, but all of them can be easily included in classroom routines.
- Creative Curriculum Approach to Literacy. This program shows teachers how to incorporate language development and early literacy activities into the Creative Curriculum, one of two comprehensive preschool curricula commonly used in Head Start. Ideas are provided for activities that fit into all 11 of the curriculum’s interest areas, and they can be used throughout the daily program schedule.
- Waterford Early Reading Program. The first level of this reading program focuses on developing phonological awareness; letter recognition; and understanding of print concepts, such as reading from left to right. Phonological Awareness and Writing offer companion modules for children at the initial levels of the program. This curriculum is computer software-based.
- Breakthrough to Literacy. This curriculum includes ideas for books to be read aloud and discussed in the classroom; small-group instruction in language, phonological awareness, letter recognition, and words and sentences; language and literacy centers for independent activity; and writing instruction.
- Let’s Begin with the Letter People. This curriculum uses songs, stories, and puppets representing letters in an engaging approach to learning letter recognition and phonological-awareness skills.
Major Activities and Timetable for the Evaluation
- Draft study design and protocol for recruiting program grantees and centers and submit for OMB review
- Develop data collection instruments and prepare and submit review packages for IRB and OMB review
- Select grantees, recruit grantees and centers, and randomly assign centers
- Implement the curriculum in randomly selected centers and conduct implementation and cost study site visits
- Obtain parents’ consent and select a sample of children for the research sample
- Collect data from enhancement and control groups during the fall and spring of the Head Start year; observe classrooms during the spring
- Analyze the data and report findings
Study Design
- Did staff implement the curriculum fully? Did they implement it with a high degree of fidelity?
- What strategies were used to implement the curriculum? How much training was provided, and over what time period? What amount and types of technical assistance were provided, and over what time period? What were the education levels and experience of T/TA staff?
- What challenges were encountered in implementing the curriculum, and how were they resolved?
- To what extent are control-group classrooms providing language environments and early literacy skills instruction similar to those provided by classrooms using the enhancement curricula?
- Does the use of a language/literacy curriculum increase the amount of time and quality of related activities, such as reading books, introducing new words, discussing the content of books, and engaging in activities to promote letter recognition and phonemic awareness?
- Does the use of a language/literacy curriculum reduce the amount of time spent in play, on gross motor activities, and on cooperative activities?
- Do children in centers using a language/literacy curriculum progress further in vocabulary, book knowledge, and early literacy skills, such as letter recognition and phonological awareness, than those in control-group programs?
- Do children in centers using a language/literacy curriculum make less progress in social-emotional development, including increases in cooperative behavior, initiative, persistence, and self-control (relative to children in centers without that curriculum)?
- What were the impacts of the language/literacy curriculum on outcomes for key subgroups of children? What were the outcomes for Head Start programs with different characteristics?
Implementing the Quality Enhancement and Measuring Fidelity
Outcomes Measurement and Data Collection Plans
ANALYSIS AND REPORTS
- Implementation. This report would describe how the enhancement was implemented and the degree of fidelity to implementation.
- Impacts. A longer version of the report described above would address a technical audience and would describe the research design, sample characteristics, analytic approaches, and findings with sufficient detail for other researchers to evaluate the quality of the findings.
- Cost-Effectiveness. This report would present estimates of the costs of the enhancement, describe how those cost estimates were obtained, and compare them with impacts on the outcomes most closely related to the enhancement, presented in effect size units.
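As an illustration of comparing costs with impacts in effect size units, the sketch below computes a standardized mean difference (the enhancement-minus-control difference divided by the pooled standard deviation) and relates it to a per-child cost. This is one common convention, offered here as an assumption; the report does not specify how effect sizes or cost comparisons would be computed. The scores and the cost figure are hypothetical, and a full analysis would also account for clustering of children within centers.

```python
from math import sqrt
from statistics import mean, stdev


def effect_size(enhancement_scores, control_scores):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(enhancement_scores), len(control_scores)
    s1, s2 = stdev(enhancement_scores), stdev(control_scores)
    pooled_sd = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (mean(enhancement_scores) - mean(control_scores)) / pooled_sd


def cost_per_effect_unit(cost_per_child, es):
    """Cost per child for each full standard deviation of impact."""
    if es == 0:
        raise ValueError("effect size is zero; the ratio is undefined")
    return cost_per_child / es


# Hypothetical spring assessment scores and per-child cost.
enhancement_scores = [12.0, 14.0, 16.0]
control_scores = [10.0, 12.0, 14.0]
es = effect_size(enhancement_scores, control_scores)
ratio = cost_per_effect_unit(200.0, es)
print(f"effect size: {es:.2f}, cost per child per SD of impact: {ratio:.2f}")
```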
Weighting the Sample
Estimating Impacts of the Quality Enhancement
Adjusting for Participation
Subgroup Analyses
- Full-day or part-day program
- High-fidelity implementation or incomplete implementation
- Teachers’ qualifications
- Center or program size
- Gender
- Children’s English proficiency
- Mothers’ education levels
- Family income levels
- Parents’ employment status
COST-EFFECTIVENESS ANALYSIS
- Identify the costs associated with the enhancement and ask center directors and teachers to provide this information
- Obtain the full budget for centers (or grantees) assigned to the enhancement group for the year preceding enhancement implementation and for the current year and estimate the cost of the enhancement as the difference in budgets (adjusting for normal cost inflation from one year to the next)
- Obtain the full budget for the centers (or grantees) assigned to the enhancement group and for those assigned to the control group and estimate the cost of the enhancement as the difference between enhancement and control group costs
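The two budget-based approaches listed above can be sketched as simple calculations. The function names, the inflation rate, and all dollar amounts are hypothetical; the sketch assumes a single multiplicative inflation adjustment and a comparison of group mean budgets, neither of which is specified in the approaches themselves.

```python
from statistics import mean


def cost_from_year_over_year(prior_budget, current_budget, inflation_rate):
    """Enhancement cost as the current-year budget minus the prior-year
    budget inflated to current-year dollars."""
    return current_budget - prior_budget * (1.0 + inflation_rate)


def cost_from_group_contrast(enhancement_budgets, control_budgets):
    """Enhancement cost per center as the mean enhancement-group budget
    minus the mean control-group budget."""
    return mean(enhancement_budgets) - mean(control_budgets)


# Hypothetical center budgets, in dollars.
yoy_cost = cost_from_year_over_year(
    prior_budget=1_000_000.0, current_budget=1_080_000.0, inflation_rate=0.03
)
contrast_cost = cost_from_group_contrast(
    enhancement_budgets=[520_000.0, 480_000.0],
    control_budgets=[450_000.0, 470_000.0],
)
print(f"year-over-year estimate: {yoy_cost:,.0f}")
print(f"group-contrast estimate: {contrast_cost:,.0f}")
```

The year-over-year estimate attributes every budget change beyond inflation to the enhancement, whereas the group-contrast estimate relies on random assignment to make the two groups' budgets comparable in the absence of the enhancement.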
Obtaining measures of the impact of quality enhancements that are representative of Head Start programs regionally or nationally requires a large sample of program grantees, centers, and children that are participating in the evaluation. The larger scale of the evaluation is feasible because of two important Head Start program resources: (1) the Head Start T/TA system; and (2) administrative data, including the NRS. These resources have the potential to provide an important foundation for implementing new enhancements and collecting data to support an evaluation so that the Head Start community can test good ideas and determine what works best for children and families. Nevertheless, to support large-scale evaluations, the T/TA system and the NRS would have to be used in new ways.
This section provides an overview of our approach to field tests of quality enhancement ideas in Head Start. We first discuss how the strategic implementation of new ideas can provide fairly quick evidence about the ideas’ effectiveness, or about the relative effectiveness of alternative approaches. We then discuss the T/TA system and how it might support strategic implementation to improve our understanding of what works in Head Start, and why. Finally, we discuss how the administrative data collected regularly from all Head Start program grantees might support a field test.
A field test of a Head Start quality enhancement would have many similarities to a national implementation of a new program enhancement. Like a national implementation, a field test ideally would involve implementing the idea in all ACF regions as well as in a large number of diverse programs.
One important aspect of a field test would distinguish it from a national implementation. To measure the impact of the enhancement on children’s progress, a contrast must be established between the enhancement and something else of interest. In most cases, the most interesting question is whether the quality enhancement being tested is an improvement over usual Head Start practice. To answer this question, some Head Start programs or centers must be randomly selected to implement the enhancement, and others must be randomly selected not to implement it. In other cases, policymakers might want to know whether one enhancement approach is more effective than another. To answer this question, some Head Start programs or centers must be randomly selected to implement one enhancement, and others must be randomly selected to implement an alternative enhancement.
A field test of quality enhancements will be large and geographically dispersed, which will pose a challenge for implementation. Ensuring that the enhancement is implemented with high fidelity will require a strong T/TA plan and careful monitoring. In addition, researchers will have to minimize the potential for spillover of the enhancement into control-group sites (or from one enhancement group to another, if alternative enhancements are evaluated) to ensure that a contrast can be made between the enhancement and control group or between different enhancements in a large-scale initiative. Spillover can be minimized by randomly assigning grantees or centers rather than classrooms, because the challenges posed by classroom random assignment in a small-scale evaluation (discussed in Chapters IV and V) will be even more difficult to address in a large-scale evaluation. The scale and geographic dispersion of a large-scale evaluation will make it more difficult to discourage enhancement-group teachers from discussing the enhancement with control-group teachers. It also will be more difficult to monitor the changing configurations of classrooms and the attrition of teachers so that random assignments are preserved from the initial implementation year into the data collection year.
Random assignment of either program grantees or centers to implement an enhancement, an alternative, or neither option must occur independently of any considerations of program desires or initial characteristics. For this reason, random assignment should be conducted and monitored by a research organization that is independent of the programs. That organization would work with the regional offices and T/TA staff to ensure that the enhancements are fully implemented in the selected program grantees or centers.
At the field test stage, enhancements would be implemented using large-scale methods commonly used to implement initiatives program-wide, but implementation would proceed in a structured way: Grantees or centers assigned to the enhancement group would implement the enhancement, and grantees or centers assigned to the control group would not. Implementing the enhancement strategically in randomly selected grantees or centers will enable the Head Start community to measure the impact of the enhancement by comparing the outcomes of children in the enhancement centers (or grantees) with those in control centers (or grantees). Any differences in children’s outcomes could be attributed to the enhancement with a measurable degree of statistical certainty.
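A minimal sketch of this logic follows: centers are randomly split into enhancement and control groups using a recorded seed, so that the assignment can be reproduced and audited by the independent research organization, and the impact is then estimated as the difference in mean center-level outcomes. The center identifiers, the seed, and the outcome values are hypothetical. The sketch gives only a point estimate; a full analysis would also estimate standard errors that reflect the clustering of children within centers and grantees.

```python
import random
from statistics import mean


def assign_centers(center_ids, seed):
    """Randomly split centers into enhancement and control groups.

    Sorting before shuffling makes the result depend only on the seed
    and the set of centers, not on the order in which they were listed.
    """
    rng = random.Random(seed)
    shuffled = sorted(center_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "enhancement": sorted(shuffled[:half]),
        "control": sorted(shuffled[half:]),
    }


def impact_estimate(center_means, assignment):
    """Enhancement-minus-control difference in mean center-level outcomes."""
    enhancement = mean(center_means[c] for c in assignment["enhancement"])
    control = mean(center_means[c] for c in assignment["control"])
    return enhancement - control


# Hypothetical centers and spring outcome means.
centers = ["C01", "C02", "C03", "C04", "C05", "C06"]
assignment = assign_centers(centers, seed=2004)
outcomes = {"C01": 10.5, "C02": 11.0, "C03": 9.5, "C04": 12.0, "C05": 10.0, "C06": 11.5}
print(assignment)
print(f"estimated impact: {impact_estimate(outcomes, assignment):.2f}")
```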
Clearly, an important foundation to guide implementation of quality enhancements on a large scale is documentation about the enhancement and manuals describing how to implement it that have been refined during the first and second stages of research. These materials and a good understanding of key steps in implementing the enhancement are critical to supporting high-fidelity implementation at Stage 3, when the enhancement developer cannot exercise direct control over the day-to-day progress of implementation. T/TA staff will have to be trained to assist with implementation, measure fidelity to implementation, and address issues that arise during implementation.
Training methods used to implement program-wide initiatives could include national or regional training of trainers and distance learning approaches that use web-based lectures with electronic questions and feedback. The T/TA system could provide critical support for implementation. Here, we discuss each of these training resources. In addition to these resources, documentation on the enhancement (printed documents, training videos, and other materials) would be provided to centers that have been randomly assigned to implement the enhancement.
Training of Trainers. In this model, one or two staff members from a local program receive training at a national or regional training conference. Training focuses on the enhancement, the theory behind the enhancement, and strategies for offering practical training on key concepts. The trained staff members then become the “trainers” for all other staff at the local program. This approach was used to implement strategies for enhancing classroom literacy environments (Strategic Teacher Education Program, or STEP training), as well as to train program staff to administer the NRS child assessment. The effectiveness of this approach depends on the skill levels of the local trainers, the complexity of the enhancement, and the extent to which key ideas can be covered during the time allotted for training. After receiving STEP training in summer 2002, the new early literacy specialists were charged with providing training to teachers in their local programs on the techniques they had learned, in preparation for the 2002-2003 program year. Because the early literacy approaches were not fully implemented in all Head Start programs, in November 2002, the early literacy specialists attended a second training to learn mentor-coaching skills that would enable them to observe teachers, and to provide ongoing feedback throughout the year.
Like national and regional training, the train-the-trainers approach has the potential to offer training provided by skilled trainers during the first round of training, and to ensure a high level of consistency across many programs, as all local trainers will have received the same instruction and materials. The weakness of this approach lies in the loss of control over the quality of the second round of training, when the local trainers return to their programs to work with local staff. The skills and qualifications of local trainers are likely to vary widely across programs, as are the intensity and duration of training provided locally. Ultimately, the quality of implementation and degree of fidelity achieved at the local program level depend heavily on the ability of trainees to provide effective local training, and to ensure that implementation proceeds according to the model. The weaknesses of the approach could be addressed by requiring a higher minimum initial skill level for local trainers; ensuring that training is designed so that essential information is covered; “certifying” trainers at the close of training by asking them to demonstrate mastery of essential skills; and prescribing the content, frequency, and intensity of local training. In addition, T/TA staff could monitor implementation and could provide additional assistance, where necessary, to ensure high-fidelity implementation.
Distance Learning. Especially when programs do not have access locally to the training they need, distance learning is an alternative strategy for training teachers to implement specific enhancements in the classroom. One example is HeadsUp! Reading, provided by the National Head Start Association. Under this approach, teachers gather for weekly two-hour classes that take place during a satellite broadcast. Trained facilitators are on site to lead discussions and other classroom activities. One advantage of this approach is that many teachers, even those who live in remote, rural areas, can participate in training provided by highly skilled trainers at lower cost than the cost of traveling to a national or regional conference. HeadsUp! Reading also incorporates call-in segments that enable teachers to interact with instructors.
Distance learning is likely to be more cost-effective than on-site training, because a single, highly skilled trainer can train teachers from many programs simultaneously. However, local programs must have access to satellite equipment, and local facilitators must be recruited and trained. In addition, distance learning may be better suited to lower-intensity, longer-duration training courses in which classes meet for several hours per week over a period of months, rather than to intensive pre-service training in preparation for implementing a new curriculum or enhancement approach.
An alternative to a satellite broadcast is a webcast, which the Head Start Bureau currently is using to communicate with program staff about the progress and results of the NRS assessments, as well as other topics. Compared with satellite equipment, the equipment necessary to participate in a webcast might be more readily available to Head Start program staff, which could make it a more useful vehicle for communicating information about a quality enhancement. However, it would be very difficult to ensure that only the centers or program grantees that had been randomly selected to implement the enhancement participate in the webcasts.
Role of the Head Start T/TA System. Head Start’s new T/TA system, initiated on September 1, 2003, consists of 12 contracted centers that are managed by the ACF regional offices. In the 10 geographically based regions, two or three T/TA managers and several content experts (in such areas as early literacy, disabilities, health, and administration) work out of the regional offices, with the goal of facilitating closer communication and coordination among Head Start offices and staff. In addition, the T/TA specialists work directly with a group of 12 to 15 Head Start grantees, under the supervision of the T/TA managers.
T/TA specialists could play a critical role, if not in actually providing training, then in assessing the fidelity of implementation and in ensuring that additional training and technical assistance are obtained by staff who need it. T/TA staff periodically visit all programs, and this direct contact could be valuable in ensuring that enhancements are implemented with high fidelity, and that random assignment is not compromised. T/TA staff who work with the programs selected for the enhancement group could be trained to assist with implementation, and to collect data periodically on the fidelity of implementation.
Implementing the enhancement with high fidelity is critical if the evaluation is to provide useful information about the effectiveness of the enhancement. A weakly implemented enhancement is unlikely to show impacts, wasting evaluation resources and subjecting the enhancement to an unfair test. Accordingly, implementation will require careful training and monitoring of the T/TA staff and of any other individuals charged with training staff or assisting with full implementation. This effort should be intensive enough to ensure that the enhancement is implemented well across many Head Start programs.
In arguing for an intensive, serious approach to implementing enhancements for a Stage 3 evaluation, we are not suggesting that the effort go beyond a reasonable effort to implement enhancements nationally. However, it is possible that the current approach to national implementation is not sufficiently intense. If resources required for high-fidelity implementation for a Stage 3 evaluation go beyond what is customary for a national implementation, the approach to national implementation might merit further consideration. In a national implementation, we cannot determine whether the enhancement is more effective than usual practice because no rigorous contrast will have been established. Without the prospect of a rigorous basis for comparing children’s outcomes with and without the enhancement, some might assume that implementing any version of the enhancement will improve outcomes for children. That assumption might be called into question when the quality enhancement idea is being evaluated, as planners will more realistically understand that a weak implementation will likely generate weak or no impacts. However, if our approach to implementation nationally is not as intense or serious as that effort would be if we were expecting the enhancement to be evaluated, we risk wasting implementation resources on an activity that induces teachers and programs to act differently but does not ultimately benefit Head Start children.
Collecting data for a field test of Head Start quality enhancements could be greatly simplified if outcomes measured by the NRS can be used. The NRS, first implemented in the 2003-2004 program year, is an ambitious initiative to systematically assess the early literacy, language, and numeracy skills of all four- and five-year-old children enrolled in Head Start. The NRS aims to collect information on a standard set of child outcomes from all Head Start programs in a consistent manner. It includes a 15-minute child assessment battery; a system for training staff from all Head Start grantees to administer the assessment; and a computer-based reporting system that programs use to report information on a limited set of characteristics of participating Head Start programs, teachers, and children. The NRS thus offers the potential to provide measures of outcomes of Head Start quality enhancements.
Even though the Head Start Bureau has decided to report average scores and other information only at the grantee level, the data offer the potential to link an individual child’s outcome data with demographic information about that child, and with data about the child’s teacher (for example, education level), the center, and the grantee. NRS data at the classroom and center levels could provide a basis for estimating the impacts of enhancements that target outcomes measured by the NRS. However, before evaluations of Head Start quality enhancements requiring NRS data can proceed, ACF would have to approve this new use of the data.
Because of the potential usefulness of these data to field tests of Head Start quality enhancements, we provide background on the content of the assessments, the reliability of the data, and the quality of administration of the assessments. We then describe two additional sources of administrative data that can provide background information on programs, centers, classrooms, teachers, and children for the analyses.
Content of the NRS. The current NRS assessment battery includes four components:
A Spanish-language version of the child assessment also was developed. All children whose home language is Spanish are assessed in both English and Spanish, provided that they pass the language screener for each version of the assessment.2
Most Head Start program grantees have decided that the NRS assessments can be conducted most efficiently by someone other than the child’s own Head Start teacher. In the 2003-2004 program year, the child’s own teacher conducted approximately 35 percent of the assessments, other Head Start professionals conducted approximately 40 percent, and about 25 percent were conducted by contractors/consultants or others.
The specific vocabulary and math items in a given round of data collection are drawn from a pool of items whose difficulty is ranked using IRT methods. The item selection work is designed to keep the level of test difficulty constant across the annual administrations (fall to fall and spring to spring). However, the items on the test differ little from year to year. This similarity raises concerns that, over time, Head Start staff could “teach to the test.”
Reliability of the NRS Data in 2003-2004. Analyses of reliability of the 2003-2004 NRS data were based on data submitted on nearly 407,000 children from 1,766 programs.3 The internal consistency reliability of the spring 2004 data at the child level was good (.76 for understanding English and .81 for understanding Spanish; .81 for English vocabulary and .81 for Spanish vocabulary; .93 for English letter naming and .93 for Spanish letter naming; and .82 for English early math skills and .83 for Spanish early math skills). The reliability of the data at higher levels of aggregation, such as at the program level, was higher (more than .90 at this level). Thus far, the NRS data demonstrate that the three assessments reliably tap the targeted child outcome domains.
Quality of Administration of the Assessments. During the 2003-2004 program year, Mathematica Policy Research, Inc. conducted an implementation study of the NRS based on site visits to a nationally representative sample of 35 Head Start programs. Site visitors observed a random sample of approximately 10 child assessments per program, interviewed key Head Start staff about NRS implementation, and held a focus group with staff conducting the assessments to learn about their experiences. Using the certification procedures and criteria as a guide, site visitors coded the observed assessments for errors and computed a certification score for each one. In spring 2004, 87 percent of observed English assessments scored 85 or higher, the minimum score required for certification (93 percent of Spanish assessments scored 85 or above). The study also compared the certification scores on assessments conducted by teachers who assessed children in their own classrooms with those of other assessors, as well as scores for experienced and new assessors in spring 2004. There were no significant differences in mean certification scores for teachers versus others who conducted the assessments. The overall error rate was low in both fall 2003 and spring 2004, with assessors making fewer errors, on average, in the spring than in the fall.
Strengths and Weaknesses of NRS Data for Evaluating Head Start Quality Enhancements. Although the NRS data provide a ready source of data on important child outcomes, the current scope of the assessment is quite narrow. There are plans to field test teacher-reported social-emotional measures next year, but enhancement studies may still need to be augmented with additional outcome measures rather than relying on the NRS as the only source of child outcome data. Two strategies are possible. First, if Head Start program staff are willing, one or two brief outcome measures relevant to a particular enhancement might be included as part of the center’s data collection activities at the time that the NRS is conducted. Second, centers from the enhancement and control groups could be sampled for more-intensive measurement of implementation, interim outcomes, and child outcomes, much as FACES currently does.
One of the challenges of using the NRS in addition to other outcome data collected specifically for an enhancement study is that the data collectors across the two types of measures would be different; Head Start staff would collect the NRS measures, and study staff would collect the additional measures. As described above, the NRS data are collected by a variety of staff members with different levels of familiarity to the child, including the child’s teacher, another Head Start staff member, and an outside consultant hired to conduct the NRS assessments. Study staff members would be “strangers” to the children, which could lead some children to perform differently on the assessments than they would if they knew the staff. Analyses to examine whether children perform differently on the same FACES and NRS assessments could address the question of whether having people with different relationships to the child conduct assessments would affect the quality and comparability of the data.
Plans indicate that the NRS will continue to evolve, and, in the future, it may tap additional Head Start Child Outcomes Framework domain indicators (see Appendix A). Expansion of the NRS, particularly in concert with the directions of major field tests of Head Start quality enhancements, would increase its potential as a source of outcome data in large-scale enhancement studies.
The Computer-Based Reporting System. The Head Start Bureau has implemented the CBRS to collect background information on Head Start programs, teachers, and children; to facilitate the identification of eligible children; and to track completed assessments. The CBRS is a web-based system in which Head Start staff enter program-, classroom-, and child-level data (Table VI.1). After programs enter these data, the CBRS assigns unique identification numbers to Head Start grantees, centers, assessors, classrooms, and eligible children. In a field test evaluation of Head Start quality enhancements, classroom-, teacher-, and child-level information included in the CBRS could be used to improve the precision of impact estimates, and to define subgroups.
The Head Start Family Information System. The Head Start Family Information System (HSFIS) is a computer-based management information system developed by the Head Start Bureau during the mid-1990s. The HSFIS stores information on family and child characteristics relevant to determining eligibility for Head Start, as well as for determining service needs. HSFIS also stores an ongoing record of program services received, including child development services, parent services, health and mental health services, and others. These data have the potential to provide information on intermediate outcomes of Head Start quality enhancements, as well as background variables on families and children that could help to improve the precision of the impact estimates.
The potential for using the HSFIS in an evaluation is diminished by the fact that many programs have not adopted the system. Nevertheless, these programs might be collecting similar data on a small number of alternative management information systems. Thus, it might be possible to use program management information system data in an evaluation if a sufficient number of useful variables can be obtained with reasonable amounts of effort.
In the following sections, we discuss research designs for field tests of two enhancements. The first is an evaluation of T/TA approaches to improving program quality through a year-long self-evaluation process. Because the self-evaluation process involves all levels of program staff, from the director, through education and other service coordinators, to the center and classroom levels, this enhancement would be implemented grantee-wide and would therefore involve random assignment at the grantee level. This research design would be applicable to any enhancement that is implemented grantee-wide because it influences levels of staff beyond the individual center. The second research design is an evaluation of alternative curricula to enhance language and literacy. These curricula could be implemented at the center level, so we describe a center-level random assignment plan. This research design is applicable to any quality enhancement implemented at the center level for which ACF is interested in trying alternative approaches.
With completion of NRS assessments of four-year-old children for fall 2003 and spring 2004, Head Start programs received their first "progress reports." These initial reports to programs on 2003-2004 assessment data provided results only at the program level (grantee or delegate agency); results were not broken out by center, classroom, or individual child. For each subscale of the assessment, programs received mean scores for their program and for all Head Start programs. Summary results for each of the eight assessments (four English and four Spanish) were presented by reporting the distribution of results across six skill levels that ranged from lowest performance (none or the easiest questions answered correctly) to highest performance (many or the most difficult questions answered correctly).
| Program | Center |
|---|---|
| Program nameᵃ | Center name |
| Director nameᵃ | Center contact information |
| Director emailᵃ | Center type (Head Start center, family child care home, home visitor cluster, child care partner) |
| Agency descriptionᵃ | Enrollment year start date |
| Number of delegatesᵃ | Enrollment year end date |
| Number of centersᵃ | Center NRS lead name and contact information |
| Program auspiceᵃ | |
| Number of family child care homes | |
| Number of home visitors | |
| Program NRS lead name and contact information | |

| Classroom | Teacher | Child |
|---|---|---|
| Teacher's name | Teacher's name | Child's name |
| Class session (am, pm, mwf, or tth) | Teacher's language fluency | Date of birth |
| Classroom type (5 days, 4 days, home-based option, family child care, locally-designed option) | Language in which teacher provides instruction | Child entry date into classroom |
| Day option (part or full) | Total years of teaching experience | Child exit date from classroom |
| Total enrollment | Total years of Head Start teaching experience | Child unique ID from center |
| Number of child development staff in addition to the lead teacher | Highest grade or year of school completed | Number of prior years in Head Start |
| Teacher date of entry to classroom | Highest degree in Early Childhood Education or related field | Disability status |
| | | Other languages spoken |
| | | English-speaking ability |
| | | Primary language spoken at home |
| | | Child race/ethnicity |

| Assessor | Assessment Information |
|---|---|
| Assessor's name | Child ID |
| Highest grade or year of school completed | Assessment date |
| Highest degree held in Early Childhood Education or related field | Completion status |
| Assessor's program position | Session (fall or spring) |
| Other comments | Language (English, Spanish, both) |
| | Assessor ID |
| | Whether assessor is child's teacher |

Source: National Reporting System Computer-Based Reporting System: User’s Manual, July 2004.
ᵃ Imported from the Head Start Program Information Report (PIR).
Many programs questioned the meaning and implications of the NRS assessment results. If their program’s average spring scores were lower than the average for all Head Start programs, or if the change in average scores from fall to spring was smaller than the average for all Head Start programs, what might they do to improve children’s performance? Head Start programs collect data from many sources for management purposes, but only a few programs have developed systems and expertise to analyze the data to identify areas for improvement. Many programs would benefit from technical assistance to help them to understand how their data can illuminate program strengths and weaknesses, and how to use that information to identify ways to improve the program. Currently, the T/TA system has begun to work with grantees on a system for program self-evaluation based on a manual developed in Region I. This research design describes how an evaluation could be designed to accompany that effort.
Programs seeking to improve their services have many sources of data to consult in addition to the NRS, as several types of data must be collected regularly:
Those data could be used to identify areas on which the programs could focus to improve the match of services and family needs, or to enhance children’s outcomes in a particular area.
In addition, to identify areas that are strong and areas that could be improved, programs could collect data that are not required, but that would shed light on their operations:
The additional sources of data would assess the quality of particular program service areas that teachers or managers identify as critical to overall service quality.
The idea of collecting and using data to monitor program operations and identify areas for improvement is not a new one. Some grantees already have implemented broad data collection efforts that include the areas identified above, and they have procedures and a schedule for analyzing the data, identifying areas for improvement, and setting priorities for the upcoming year. The process results in consistent measures of program strengths and weaknesses that can be tracked over time, as well as an action plan for improvement for the next year.
Nevertheless, although all Head Start programs collect a substantial amount of data on services and outcomes, many do not fully use the information to identify areas for program improvement. T/TA staff thus may be able to help program staff develop a self-study and individualized improvement process. A recent Quality Research Consortium project involved training Head Start teachers, education coordinators, and the director to conduct regular assessments of children and of classroom quality, and to analyze the data to identify areas in which practice could be improved. An interventionist helped the teachers and managers to understand what the assessment data indicated, and how it could translate into recommendations for improvements. In a broader implementation of this idea, the T/TA system could work with programs to select appropriate assessments, interpret the data collected, and identify areas for improvement. Over time, program staff would gain an understanding of the stages of data collection, analysis, and interpretation, and they would be able to manage the continuous program improvement feedback loop themselves.
An enhancement to help Head Start program grantees to use data to improve program services so that children’s outcomes are measurably improved could be evaluated based on random assignment of grantees and a comparison of children’s progress toward academic and social competence. This section discusses key elements of a research design to evaluate the impacts of this enhancement on children’s outcomes. We begin by discussing the quality enhancement and its counterfactual, the research questions, the sampling strategy, random assignment plan, and sample sizes.
The Quality Enhancement, Counterfactual, and Research Questions
At the core of this quality enhancement is a feedback loop that includes collecting data on program services and outcomes; analyzing the data to summarize performance, and to identify strengths and weaknesses; discussing progress and weaknesses; and identifying priority areas for improvement and methods of addressing those priorities. Ideally, data will be collected at the grantee, center, classroom, and child levels, and measures will be chosen carefully to ensure reliability and relevance. Different program staff members can be responsible for collecting specific types of data. The full range of program stakeholders, including staff, parents, and community partners, should see the results of the performance assessments and should have an opportunity to suggest areas for improvement. The program management team would set priorities based on the ideas submitted by all stakeholders.
A reasonable approach to evaluating this quality enhancement idea would be to contrast the continuous feedback approach to current practice, which will be a minimal self-study process. Because a basic approach to self-study is about to be implemented nationwide, it might be useful to describe how that effort could be evaluated.4 If ACF were interested in evaluating this initiative, a randomly chosen set of grantees in each region could be selected to not implement the procedures so that a contrast could be made in each region between grantees implementing and not implementing the self-study process. The implementation process would likely vary across regions, depending on the T/TA organization and the emphasis placed on this task by the ACF regional office. With sufficient sample, regional subgroup impacts could be estimated to examine these potential differences.
Research questions to be examined as part of the evaluation include whether the enhancement was implemented to high fidelity:
Another set of research questions focuses on the impacts of the enhancement on the quality of program services:
Whether the enhancement yields measurable program improvements is only part of the story. The reason for engaging in the continuous program improvement process is to improve the program in ways that enhance children’s outcomes. Thus, a set of research questions focuses on children’s development:
Finally, although the evaluation will examine the effectiveness of the enhancement for children overall, the Head Start community also will be interested in whether the enhancement is effective across different subgroups of children and families, and across programs with different characteristics:
If many subgroups are examined, some impacts will emerge simply by chance, so some caution must be exercised in examining subgroup impacts. To guard against finding impacts by chance, researchers can implement a Bonferroni correction, which divides the significance level applied to each test by the number of hypothesis tests; in effect, this inflates the standard error of the impact estimates to correspond to the number of tests. The correction sets a higher bar for deciding that a subgroup impact is significantly different from zero.
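The Bonferroni adjustment described above can be sketched in a few lines. This is a minimal illustration, not the procedure used in this report: the subgroup names and p-values are hypothetical, and the overall significance level of 0.05 is an assumption.

```python
from statistics import NormalDist


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level after a Bonferroni correction."""
    return alpha / n_tests


def critical_z(alpha):
    """Two-tailed critical value of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def significant_subgroups(p_values, alpha=0.05):
    """Flag subgroup impacts that clear the Bonferroni-adjusted bar."""
    adjusted = bonferroni_alpha(alpha, len(p_values))
    return {name: p < adjusted for name, p in p_values.items()}


# Hypothetical p-values for five subgroup impact estimates.
p_values = {
    "full_day": 0.004,
    "part_day": 0.030,
    "region_group_1": 0.200,
    "region_group_2": 0.012,
    "region_group_3": 0.650,
}
flags = significant_subgroups(p_values)

# How much higher the bar is: ratio of adjusted to unadjusted critical values.
inflation = critical_z(bonferroni_alpha(0.05, 5)) / critical_z(0.05)
```

With five tests, each impact must reach significance at the 0.01 level rather than 0.05, so only the first hypothetical subgroup is flagged. The ratio of critical values (roughly 1.3 here) is one way to read the statement that the correction "inflates" the standard error.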
In a Stage 3 design that is nationally or regionally representative of all Head Start programs, subgroup analyses should be based on a representative sample of that subgroup in Head Start. Therefore, subgroups of interest must be identified in advance so that statisticians can design a sampling plan for grantees, centers, and children that ensures adequate representative samples of the subgroups of interest.
For an evaluation of a program self-assessment process, the ACF regions would be useful subgroups because the strategies for implementing the self-assessment process will vary somewhat across regions. Because there are 10 geographically based regions, they will have to be grouped for analysis to keep the sample size within reasonable bounds. Obtaining information about each region’s implementation strategies would help to group regions with similar strategies. Because approval from the Office of Management and Budget (OMB) is necessary before contacting regions and T/TA contractors about implementation strategies, the information required to form regional subgroups would be available only after the sample has been drawn. To permit flexibility, sampling would be designed to be representative of each region, but the precision of the estimates would be lower at the individual region level, under the assumption that at least two or, possibly, more than two regions would be grouped for analysis.
Another subgroup of interest is program schedule, particularly full-day versus part-day. Because of the cost differences between the two program types, policymakers are interested in learning how much more full-day programs can contribute to children’s cognitive and social-emotional development relative to part-day programs. A related question is whether quality enhancements can contribute more to children’s development when they are implemented in full-day programs than when implemented in part-day programs. Information about the proportion of children served in part-day and full-day programs is available at the grantee level; these proportions vary considerably by region (for example, 20 percent of children in the western regions are served in full-day programs compared with 74 percent of children in the southern regions; see Table II.1). Unless the sample is very large, it will not be possible to examine subgroups defined by region and program schedule; therefore, we recommend drawing a sample of grantees that is regionally representative, but using program schedule as an implicit stratification variable at a later stage of sampling (center-level).
Sampling, Random Assignment, and Sample Sizes
An enhancement that implements a program self-assessment process would be most efficiently and effectively implemented at the grantee level. Programs would be more likely to improve if all centers are engaged in the continuous feedback process, and staff of the entire program work together. Working with one or two centers per grantee to implement this process center-wide might be useful, but it would fail to generate the synergies of program-wide implementation, which would enhance the quality of implementation.
Because implementation would take place at the grantee level, random assignment to the enhancement or control group also would have to be at that level. Within strata defined by ACF regions, urbanicity, and percentage of children who are African-American and percentage who are Hispanic or Latino (large versus small percentage), a sample of program grantee or delegate agencies would be selected using the most recent Head Start Program Information Report (PIR) data as the sampling frame. The PIR contains data from annual reports submitted by all program grantees and delegate agencies about program size, schedule, and other characteristics. The selected grantees and delegate agencies would then be randomly assigned to the enhancement or control group. This design assumes that ACF is planning to implement this enhancement nationally while retaining a control group for evaluation purposes; therefore, the control group would not receive information or T/TA to implement the self-assessment process, while the enhancement group and any programs other than the control group would implement the enhancement.
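The assignment step can be illustrated with a short sketch. It assumes that the selected grantees are randomized separately within each sampling stratum (blocked random assignment), which the text above does not state explicitly. The field names, stratum labels, and seed are hypothetical.

```python
import random
from collections import defaultdict


def assign_within_strata(grantees, rng):
    """Randomly split the grantees in each stratum between two groups."""
    by_stratum = defaultdict(list)
    for g in grantees:
        # Hypothetical stratum key: region, urbanicity, and two
        # large/small indicators for the share of children in each group.
        key = (g["region"], g["urbanicity"], g["pct_black"], g["pct_hispanic"])
        by_stratum[key].append(g["id"])

    assignment = {}
    for key in sorted(by_stratum):
        ids = by_stratum[key]
        rng.shuffle(ids)
        half = len(ids) // 2
        # In a stratum with an odd count, a coin flip decides which
        # group receives the extra grantee.
        if len(ids) % 2 and rng.random() < 0.5:
            half += 1
        for i, gid in enumerate(ids):
            assignment[gid] = "enhancement" if i < half else "control"
    return assignment


# Eight hypothetical grantees in two strata of four.
grantees = [
    {
        "id": f"G{i:02d}",
        "region": "I" if i < 4 else "IV",
        "urbanicity": "urban",
        "pct_black": "large",
        "pct_hispanic": "small",
    }
    for i in range(8)
]
assignment = assign_within_strata(grantees, random.Random(2005))
```

Randomizing within strata keeps the enhancement and control groups balanced on the stratifying characteristics, which is the usual reason to block.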
The least expensive source of outcome data for an evaluation at the grantee level is NRS data. NRS data are available for all four- and five-year-old children in every grantee; therefore, if they can be obtained and analyzed for the sample of grantees included in the evaluation, all centers and children in those grantees can be included in the evaluation as well. We assume that an average of eight centers per grantee and 45 children per center would be available, using NRS data for the evaluation. Table VI.2 shows the number of grantees required for the evaluation in order to detect impacts with a minimum effect size between 0.1 and 0.2 (that is, the impact as a proportion of the standard deviation of the outcome measure). To detect an effect size of at least 0.15, the evaluation would require either at least 88 grantees in the enhancement group and 88 in the control group (if no baseline data are available) or 70 grantees in each group (if some baseline data are available). Because NRS data are available on all children in the program, the number of centers that would be included in the evaluation is 560 per group (if 70 grantees are used and if each grantee has an average of eight centers) and 25,200 children per group (if each center has an average of 45 children in the follow-up sample).
| MDE | Baseline Data | Total Number of Grantees per Evaluation Group | Total Number of Centers per Evaluation Group | Number of Children in Final Sample per Evaluation Group |
|---|---|---|---|---|
| 0.10 | No baseline | 197 | 1,576 | 70,920 |
| 0.10 | Minimal baseline | 158 | 1,264 | 56,880 |
| 0.15 | No baseline | 88 | 704 | 31,680 |
| 0.15 | Minimal baseline | 70 | 560 | 25,200 |
| 0.20 | No baseline | 49 | 392 | 17,640 |
| 0.20 | Minimal baseline | 39 | 312 | 14,040 |

Note: Sample size calculations assume that grantees are randomly assigned, all centers are included in the evaluation, each grantee includes 8 centers (on average), and each center includes 45 children (on average) in the final sample.
“Minimal baseline” means that demographic information and NRS fall test scores are available so that the R² for the regression adjustment of the impact estimates is .20.
Sample size calculations also assume a two-tailed test with 80 percent power and a 95 percent confidence level. Appendix B provides details about the calculations.
Sample size calculations assume that the sampling strategy used minimizes the design effect of weighting for the source of data. Thus, if the evaluation is based on NRS data on outcomes for all children in the grantee, sampling of grantees would be based on equal probability of selection for all grantees stratified by size. The sample size calculations do not include an adjustment for the design effect of weighting for sample nonresponse.
MDE = minimum detectable effect; NRS = National Reporting System.
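The figures in Table VI.2 can be checked with a short calculation. The sketch below takes the first row (197 grantees for an MDE of 0.10 with no baseline data) as a reference point and applies the standard scaling in which the required number of randomized units is proportional to (1 − R²) / MDE². That scaling is an assumption here, and so is the rounding to the nearest whole grantee; the report's own calculations are in Appendix B, which is not reproduced. Centers and children follow from the stated averages of 8 centers per grantee and 45 children per center.

```python
CENTERS_PER_GRANTEE = 8    # average stated in the table note
CHILDREN_PER_CENTER = 45   # average stated in the table note
REF_GRANTEES = 197         # Table VI.2: MDE = 0.10, no baseline
REF_MDE = 0.10


def grantees_needed(target_mde, r_squared=0.0):
    """Scale the reference row, assuming n is proportional to (1 - R^2) / MDE^2."""
    scale = (REF_MDE / target_mde) ** 2 * (1.0 - r_squared)
    return round(REF_GRANTEES * scale)


def table_row(target_mde, r_squared=0.0):
    """Grantees, centers, and children per evaluation group."""
    grantees = grantees_needed(target_mde, r_squared)
    centers = grantees * CENTERS_PER_GRANTEE
    children = centers * CHILDREN_PER_CENTER
    return grantees, centers, children


for mde in (0.10, 0.15, 0.20):
    for label, r2 in (("No baseline", 0.0), ("Minimal baseline", 0.20)):
        print(mde, label, table_row(mde, r2))
```

Under these assumptions the scaling matches every row of the table, which shows how strongly the required number of grantees depends on the target MDE: halving the MDE roughly quadruples the sample.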
The NRS currently provides only a limited set of outcome measures and no information on the quality of implementation. Consequently, a broader data collection effort that includes measures of children’s social and emotional well-being, as well as the quality of classroom environments and fidelity of implementation, is important for understanding why the enhancement is or is not successful. Collecting additional data will require drawing a more efficient sample from the participating grantees. Head Start classrooms average 17 children. A sample of two or three centers per grantee and 10 children per center would provide a reasonable basis for estimates of the effects of the enhancement in each grantee.
The optimal strategy for sampling grantees is different depending on whether the evaluation is based on NRS data on all children in the grantee or broader data on a sample of centers and children in the grantee. If NRS data are used, the sampling strategy that minimizes the design effect of weighting is equal probability sampling of grantees. If data from a sample of centers and children are used, the sampling strategy that minimizes the design effect of weighting is selecting grantees with probability proportional to size.
To sample centers, grantees and delegate agencies would be contacted for information about the number of centers and, within centers, the number of classes, by age of the children. Sampling would be based on probability proportional to size, with an implicit stratification procedure that would draw from a sorted list of centers, with sorting based on such characteristics as classroom schedule and percentage of non-English-speaking children. Drawing from a sorted list helps the sample to be more proportionally representative of the characteristics used in the sort.
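The selection procedure described above (probability proportional to size, drawn systematically from a sorted list) can be sketched as follows. The center records, the sort keys, and the use of enrollment as the measure of size are hypothetical choices for illustration only.

```python
import random


def systematic_pps(units, n_select, rng):
    """Systematic PPS selection from a list that is already sorted.

    Each unit is a dict with a numeric 'size'. A unit larger than the
    sampling interval can be hit more than once; handling such
    certainty selections is left out of this sketch.
    """
    total = sum(u["size"] for u in units)
    interval = total / n_select
    start = rng.random() * interval          # random start in [0, interval)
    targets = [start + k * interval for k in range(n_select)]

    selected = []
    cumulative = 0.0
    next_target = 0
    for unit in units:
        cumulative += unit["size"]
        # Select the unit whose cumulative size range contains a target.
        while next_target < n_select and targets[next_target] < cumulative:
            selected.append(unit)
            next_target += 1
    return selected


# Hypothetical centers within one grantee.
centers = [
    {"name": "A", "schedule": "part-day", "pct_non_english": 10, "size": 120},
    {"name": "B", "schedule": "full-day", "pct_non_english": 40, "size": 80},
    {"name": "C", "schedule": "part-day", "pct_non_english": 55, "size": 60},
    {"name": "D", "schedule": "full-day", "pct_non_english": 5, "size": 140},
    {"name": "E", "schedule": "part-day", "pct_non_english": 30, "size": 100},
    {"name": "F", "schedule": "full-day", "pct_non_english": 70, "size": 100},
]

# Implicit stratification: sort by schedule, then by language composition.
sorted_centers = sorted(
    centers, key=lambda c: (c["schedule"], c["pct_non_english"])
)
picked = systematic_pps(sorted_centers, 3, random.Random(7))
```

Because the selection points are evenly spaced along the sorted list, the sample spreads across the sort characteristics without forming explicit strata.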
Table VI.3 shows the numbers of grantees, centers, and children included in a sample that includes two centers per grantee and 10 children per center at the final follow-up. We use slightly larger minimum detectable effect (MDE) sizes in this table (0.15 to 0.25) because data collection is considerably more expensive, but also because the measures used may be more sensitive to the enhancement. Notably, the number of grantees that would be required to attain the same precision level as in the preceding example (0.15) is higher (107 per group for this design, compared with 70 per group in the previous design), but the number of centers and children in the evaluation would be much lower. The two designs could be combined by selecting a target MDE level for the in-depth data collection, identifying the target number of grantees, and then using that sample of grantees for the NRS analysis as well, which would have more power to detect impacts because all centers and children would be included in the NRS outcome data. For example, if broader child assessment and classroom observation data are collected, the target MDE size could be set at 0.2, and 60 grantees per group would be required (assuming that the fall NRS data can be used to improve the precision of the impact estimates). If NRS data are obtained for all of these grantees, the precision levels for the impact analyses would be approximately 0.17 because the sample of children and centers within each grantee would be larger. A sample of centers and children could be drawn from these grantees for more-extensive data collection. Of course, if the two designs are combined, a single grantee sampling strategy would need to be chosen that would strike a compromise between the probability proportional to size and the equal probability sampling strategies.
One possibility is to use the square root of the number of children as the measure of size at each stage of sampling, and then select with probability proportional to this measure of size. This strategy would result in some design effect of weighting in analyses based on both the NRS and the broader data collection, and this would decrease the power of the sample to detect impacts relative to the figures presented in Tables VI.2 and VI.3.
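The trade-off among the three selection strategies can be illustrated with Kish's approximation for the design effect of weighting, n·Σw² / (Σw)². The sketch below is a simplification: it holds a hypothetical set of four sampled grantees fixed, ignores the later stages of sampling, and takes each child's weight as the inverse of his or her overall selection probability. The grantee sizes and the figure of 20 sampled children per grantee are made up.

```python
from math import sqrt


def kish_deff(weighted_counts):
    """Kish design effect of weighting for (weight, n_children) pairs."""
    n = sum(count for _, count in weighted_counts)
    sum_w = sum(w * count for w, count in weighted_counts)
    sum_w2 = sum(w * w * count for w, count in weighted_counts)
    return n * sum_w2 / sum_w ** 2


# Measure of size used to select grantees under each strategy.
MEASURES = {
    "equal": lambda size: 1.0,
    "pps": lambda size: float(size),
    "sqrt": lambda size: sqrt(size),
}


def deff_table(sizes, children_per_grantee):
    """Return {strategy: (deff for NRS analysis, deff for subsample)}."""
    out = {}
    for name, mos in MEASURES.items():
        # NRS analysis: every child is included, so weight ~ 1 / P(grantee).
        nrs = [(1.0 / mos(s), s) for s in sizes]
        # Subsample: a fixed number of children per grantee,
        # so weight ~ size / P(grantee).
        sub = [(s / mos(s), children_per_grantee) for s in sizes]
        out[name] = (kish_deff(nrs), kish_deff(sub))
    return out


table = deff_table([100, 400, 900, 1600], children_per_grantee=20)
for name, (nrs_deff, sub_deff) in table.items():
    print(f"{name:5s}  NRS: {nrs_deff:.2f}  subsample: {sub_deff:.2f}")
```

In this toy example, equal-probability selection gives a design effect of 1 for the NRS analysis, PPS gives 1 for the subsample, and the square-root measure gives a moderate design effect for both, which is the compromise the text describes.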
| MDE | Baseline Data | Total Number of Grantees per Evaluation Group | Total Number of Centers per Evaluation Group | Total Number of Classrooms per Evaluation Group | Number of Children per Evaluation Group in Final Sample | Number of Children per Evaluation Group in Initial Sample |
|---|---|---|---|---|---|---|
| 0.15 | No baseline | 134 | 268 | 804 | 2,680 | 2,948 |
| 0.15 | Minimal baseline | 107 | 214 | 642 | 2,140 | 2,354 |
| 0.20 | No baseline | 75 | 150 | 450 | 1,500 | 1,650 |
| 0.20 | Minimal baseline | 60 | 120 | 360 | 1,200 | 1,320 |
| 0.25 | No baseline | 48 | 96 | 288 | 960 | 1,056 |
| 0.25 | Minimal baseline | 39 | 78 | 234 | 780 | 858 |

Note: Sample size calculations assume that grantees are randomly assigned, two centers are selected randomly for the evaluation sample, and centers include an average of three classrooms. Within each center, eleven children are randomly selected for the evaluation sample and approximately ten children are available at follow-up.
“Minimal baseline” means that demographic information and NRS fall test scores are available so that the R² for the regression adjustment of the impact estimates is .20.
Sample size calculations also assume a two-tailed test with 80 percent power and a 95 percent confidence level. Appendix B provides details about the calculations.
Sample size calculations assume that the sampling strategy used minimizes the design effect of weighting for the source of data. Thus, if the evaluation is based on data from a sample of centers and children, sampling of grantees would be based on probability of selection proportional to size. The sample size calculations do not include an adjustment for the design effect of weighting for sample nonresponse.
MDE = minimum detectable effect; NRS = National Reporting System.
Recruiting Selected Grantees
Because the evaluation is based on the random assignment of grantees, sampling of grantees and random assignment can occur simultaneously. Subsequently, if the evaluation will rely on administrative data to measure outcomes, recruitment can involve sending letters to the enhancement grantees to inform them of the opportunity to work with a T/TA specialist to understand the data they collect, and to implement a continuous program improvement process. An enhancement like this one would be welcomed by grantees not already engaged in such a process. Grantees that are already conducting these activities would receive help to move the process to a higher level.
If more-extensive data will be collected from a sample selected from within the grantee, both enhancement and control grantees will have to be recruited to participate in the study, and to provide information on centers to support sampling and recruiting at that level. Grantees can be recruited to participate in the study using letters that ask them directly to participate according to their random assignment status. Grantees selected to implement the enhancement will be informed about the opportunity to learn about a continuous program improvement process and informed that some centers, classes, and children will be selected to participate in a study. Grantees selected for the control group will be informed that, because the Head Start Bureau is interested in learning more about how Head Start programs operate and about how children are faring, some centers, classes, and children will be selected to participate in a study. The control-group letter would be much like the current invitation to participate in FACES. Currently, program cooperation with FACES is very high, so we anticipate no significant difficulties in gaining program participants.
After grantees have been selected and have agreed to participate, they will be asked to provide information on the centers, including contact information for the director; center characteristics (full-day or part-day services, and length of program year); and the number of classrooms, teachers, and students per class. Based on this information, a sample of centers can be selected for data collection beyond the NRS. The centers’ directors would be contacted to arrange the sampling of classrooms and children for the evaluation, and to obtain teachers’ and parents’ consent to participate in the study.
The enhancement would be implemented using the manual developed by Region I, with numerous examples and resources for program staff, as well as on-site assistance from the T/TA system. The manual would provide ideas about measures to use, instructions for administering the measures, and information about how to report the results in light of Head Start Program Performance Standards. T/TA staff could help to orient staff, train the staff how to collect the data, train the staff to score and present results, and work with them to interpret the measures to understand program strengths and weaknesses. T/TA staff could then suggest ideas for program improvement, and program staff would identify high-priority areas. By going through the entire self-evaluation process once with a good set of measures and clear information to guide analysis of the data, program staff can experience how the process works. In the next year, they could either repeat the process with more confidence or modify the process by changing or adding some measures to cover areas in which they would like additional or different types of information.
The amount of time needed for implementation and evaluation of this enhancement is likely to be longer than for other enhancements discussed in this report because one cycle of the information-gathering and program improvement process takes a full program year. Meaningful program changes and effects on children cannot occur until one cycle of self-review has been completed, priorities have been established, and the first set of improvements has been implemented. Two consecutive cohorts of child outcome data collected after the first implementation could be even more useful in demonstrating the effects of this approach, as successive years of the continuous feedback cycle should yield cumulative program improvements. Moreover, program staff will make the process more informative and useful over time as they become more familiar with it, identify new areas to measure, and otherwise tailor the process to their needs. A longer implementation period will be difficult for the control group, which will want to implement the enhancement more quickly. A timeline of activities associated with evaluating this enhancement based on NRS data is shown in Figure VI.1.
Measuring fidelity of implementation is important so that the evaluation measures the impacts of the enhancement as designed, rather than a pale substitute. Because this enhancement has not emerged from a Stage 1 Development phase, measures of fidelity have not been developed. However, measures of adherence to a program measurement, data analysis, stakeholder consultation, and priority-setting process could be developed with reasonable effort. Such measures should tap key steps in the process as well as the intensity and breadth of the process.
T/TA staff can measure these aspects of implementation as they visit programs to provide technical assistance. Results of fidelity measurement should help them to determine priorities for consultation, training, and assistance. Ratings of fidelity to implementation can also be used in the evaluation, if measurement is comparable across raters. Comparability can be improved if T/TA staff are carefully trained to use the assessment protocol and, where items involve some judgment on the part of the rater, if steps are taken to check the reliability of coding. Reliability can be checked initially and over longer periods by asking T/TA staff to videotape the situation they are coding, and to send the videotape and coding results to the research organization to check. Research staff would then view the videotapes, code independently, compare coding, and provide feedback to the T/TA staff to help improve their understanding of any items for which there was substantial disagreement.
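The comparison of T/TA and research staff coding described above could be summarized with a chance-corrected agreement statistic. The following is a minimal sketch using Cohen's kappa, which is one common choice and is not specified in this report. The function name, the three-point rating scale, and the example ratings are all hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who coded the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same code by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal code frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical code for every item
    return (observed - expected) / (1 - expected)


# Hypothetical fidelity ratings (1 = low, 3 = high) for eight videotaped items:
# one list coded by a T/TA specialist, one coded independently by research staff.
tta_codes = [3, 2, 3, 1, 2, 3, 3, 2]
research_codes = [3, 2, 2, 1, 2, 3, 3, 1]
kappa = cohens_kappa(tta_codes, research_codes)
print(f"kappa = {kappa:.2f}")
```

Items on which the two coders disagree (the third and eighth in this made-up example) would be the natural focus of feedback to T/TA staff.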
We discuss two options for measuring impacts of the enhancement on children’s outcomes. First, the evaluation can rely on data from the NRS. This strategy would provide information on vocabulary, letter recognition, and early mathematics skills for four- and five-year-old children in the program grantees that were randomly assigned. Alternatively, the evaluation can include data collected from a sample of centers and children in the participating program grantees.
One challenge facing any measurement strategy for this enhancement is the potential breadth of the strategy’s effects. The continuous program improvement feedback loop will highlight different areas for improvement for different programs. For example, one program might decide that staff education levels need the most attention in the coming year, another might implement a mathematics curriculum, and a third might institute a set of interventions to address children’s social-emotional development and behavioral issues. Thus, programs might make very different decisions about what steps to take to increase quality, and to improve compliance with the Head Start Program Performance Standards. These changes in different program areas will generate impacts on different aspects of child development. As a result, the average impact of the enhancement on any single area of child development might be quite small, as only a subset of programs might be implementing a program change to improve that area. Although the areas addressed by programs might become more similar over time, during the initial years, we would expect several different areas of focus across the grantees that implement the continuous program improvement approach.
Use of Head Start Administrative Data
NRS data on children’s vocabulary, letter recognition, and early mathematics skills can support estimation of the impacts of the enhancement on these outcomes. As discussed in Section A, the NRS measures have good reliability, and they provide a measure of three important areas of children’s academic progress in Head Start. The NRS currently does not measure children’s social-emotional well-being, so any changes in this area would not be measured if the evaluation relied on administrative data.5 Moreover, the NRS data do not cover three-year-olds in the program, even though children in this age group constitute about one-third of all Head Start children.
Information on children, classrooms, teachers, and centers can be taken from the CBRS; for the most part, this information would offer control variables to improve the precision of the impact estimates (see Table VI.1). However, one item (teachers’ education levels) could be considered an intermediate outcome of this enhancement, because programs might decide to focus on increasing the proportion of lead teachers with master’s degrees as one strategy to improve instructional quality. Grantee-level information on location, size of the program, and aggregate child and family characteristics can be obtained from the PIR and can be used as control variables to improve precision, or to define subgroups for analysis. Information about parents’ education levels and learning activities in the home would not be available; nor would any measures of classroom quality.
Broader Data Collection from a Sample of Children Within Grantees
As an option, ACF might decide to collect more-extensive outcome data and information on intermediate outcomes from a sample within the grantees participating in the study. The number of children per center (five) would be much smaller than under the design based on administrative data. The number of centers per grantee (two) and number of classrooms per center (two) also would be much smaller than under the administrative-data design. Table VI.3 summarizes the sample sizes required to detect effect sizes (impacts expressed as a fraction of the standard deviation of the outcome measure) of 0.15, 0.20, and 0.25.
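The link between sample size and detectable effect size can be illustrated with a simplified calculation. The sketch below uses a standard two-level formula for designs that randomly assign whole clusters, with a normal approximation for the power multiplier. It is not the calculation behind Table VI.3: it collapses the grantee, center, and classroom stages into a single clustering level and ignores covariates and nonresponse weighting. The intraclass correlation of 0.10 and the 20 children per grantee are hypothetical inputs chosen only for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def power_multiplier(alpha=0.05, power=0.80):
    """Two-tailed multiplier under a normal approximation (about 2.8 by default)."""
    z = NormalDist()
    return z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)


def mde_cluster(n_clusters, n_per_cluster, icc, p_treat=0.5, alpha=0.05, power=0.80):
    """Minimum detectable effect, in standard deviation units, when clusters are assigned."""
    denom = p_treat * (1 - p_treat) * n_clusters
    # Between-cluster variance term plus within-cluster variance term.
    variance = icc / denom + (1 - icc) / (denom * n_per_cluster)
    return power_multiplier(alpha, power) * sqrt(variance)


def clusters_needed(target_mde, n_per_cluster, icc, p_treat=0.5, alpha=0.05, power=0.80):
    """Smallest number of clusters whose MDE does not exceed target_mde."""
    m = power_multiplier(alpha, power)
    per_cluster_var = icc + (1 - icc) / n_per_cluster
    return ceil((m / target_mde) ** 2 * per_cluster_var / (p_treat * (1 - p_treat)))


# Hypothetical inputs: 20 sampled children per grantee, intraclass correlation 0.10.
for target in (0.15, 0.20, 0.25):
    print(target, clusters_needed(target, n_per_cluster=20, icc=0.10))
```

Because the between-cluster term does not shrink as more children are sampled per cluster, adding clusters reduces the minimum detectable effect far more than adding children within clusters.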
Under this option, researchers would supplement the NRS outcomes with intermediate outcomes measuring directors’, teachers’, and other staff members’ knowledge of the process of gathering information, analyzing program strengths and weaknesses, and recommending steps toward improvement, as well as their knowledge of the management climate (see Table VI.4). Additional intermediate outcomes would focus on the classroom, including the overall quality and the quality of instructional practices in reading, language development, and early mathematics. Measures of children’s development would be supplemented by measures of phonemic awareness and writing, as well as aspects of social-emotional development, including aggressive behavior, hyperactivity, prosocial behavior, and approaches toward learning.
Expanding data collection beyond Head Start administrative data would increase the costs and complexity of the study beyond the obvious need to develop data collection instruments and then visit centers to collect classroom-level and child-level data. The study design and data collection instruments would have to be approved by both OMB and an Institutional Review Board (IRB); centers, classrooms, and children would have to be sampled; and center staff and parents would have to consent to participate. Because of the need to obtain OMB clearance and to recruit centers, teachers, and families, the timeline for the study would have to be longer as well. We discuss these tasks and their timing below. Figure VI.2 summarizes the activities and their timing if additional data are collected from a sample of centers and children within the grantees.
Develop Data Collection Instruments. Data collection instruments will be developed during the first year of the study. They include the consent form with a form requesting demographic information; the director, Head Start staff, and teacher surveys; the child assessment and observation protocol; and the classroom observation protocol. Many of these instruments will include standardized assessments that will have to be formatted to simplify administration by a trained assessor. Others (such as the teacher questionnaire) will rely on questions that have been used in previous studies. After the data collection instruments have been developed, they will be pretested to ensure that respondents understand the questions, that the question flow proceeds logically and smoothly, and that the time required to complete the questions is reasonable.
IRB and OMB Research Review. Research on human subjects must be reviewed by an IRB, which considers the benefits of the research to society, the programs, and the participating families and weighs them against the cost of the research to the families and program staff. The IRB also ensures that research participants are protected from harm by determining that confidentiality is maintained. In addition, if the evaluation is federally funded, data collection instruments and the research plan must be approved by OMB. The data collection instruments are reviewed to ensure that they do not overlap with ongoing federal data collection efforts, and that burden is not excessive. These reviews will be conducted during the first year of the study and will have two parts: the first review will be initiated quickly and will cover the study design, recruiting protocols, and implementation study protocols. The second review will be conducted during the second half of the first year and will pertain to the data collection protocols.
| Outcome Domain | Outcome | Recommended Measure | Type of Measure |
|---|---|---|---|
| Directors’ Knowledge and Practice | Using data to identify program improvements | Questions about key steps in the process; who participated; what recommendations were made; how priorities were set; what changed as a result | Director survey |
| | Management climate | Policy and Program Management Inventory (Lambert, Abbott-Shim, and Oxford-Wright 2004) | Director survey |
| Head Start Staff Knowledge and Practice | Using data to identify program improvements | Questions about key steps in the process; who participated; what recommendations were made; how priorities were set; what changed as a result | Staff survey |
| | Management climate | Policy and Program Management Inventory (Lambert, Abbott-Shim, and Oxford-Wright 2004) | Staff survey |
| Classroom Processes | Materials and teacher activities to promote learning | Early Childhood Environment Rating Scale - Revised (Harms et al. 1998) | Observation and teacher survey |
| | | Teacher Behavior Rating Scale (CIRCLE 2005) | Observation |
| | | Classroom Assessment Scoring System (subscales measuring Emotional Climate and Instructional Climate) (CLASS; LaParo, Pianta, and Stuhlman 2004) | Observation |
| Teachers’ Knowledge and Practice | Attitudes and knowledge about developmentally appropriate practice | Teacher Beliefs Scale (Burts, Hart, Charlesworth and Kirk 1990) | Teacher survey |
| | Using assessment data to individualize instruction | Questions about key steps in the process; what assessments were used; how priorities were set; how instruction is individualized | Teacher survey |
| | Management climate | Policy and Program Management Inventory (Lambert, Abbott-Shim, and Oxford-Wright 2004) | Teacher survey |
| Children’s Development | Phonemic awareness | Woodcock-Johnson III Sound Awareness | Assessment |
| | Writing, small motor skills | Woodcock-Johnson III Spelling | Assessment |
| | Behavioral problems | Child Behavior Checklist (Achenbach and Rescorla 2000) | Teacher report |
| | Social competence | Social Skills Rating Scale (Gresham and Elliott 1990) | Teacher report |
| | Approaches toward learning | Preschool Learning Behaviors Scale (McDermott et al. 2000) | Teacher report |
Timetable for Implementation. The enhancement will be implemented in the beginning of the Head Start year, nine months after the study begins. Implementation requires working with the program for a full Head Start year through the full cycle of data collection, analysis, identification of strengths and weaknesses, and development of recommendations for program improvements in the coming year. During the second year, programs will go through the self-study cycle again but will also implement some of the recommendations for program improvement. Because the self-study process by itself is not expected to produce measurable improvements in children’s well-being, we recommend that data on children’s outcomes be collected starting during the middle of the third year, after programs have had a year to implement program improvements based on their self-study process.
Selection of Centers and Children for the Study. When classrooms and children’s enrollments are established in August of the third year, researchers will work with program grantee directors to obtain a list of centers, teachers, and the number of children enrolled in each classroom. Two centers will be selected for the research sample. Parent consent forms will be distributed to parents of children in those centers. Returned consent forms and associated information sheets will be sent by program staff to the researchers for processing. The researchers will enter data from the forms to classify children by center, by consent status, and by demographic characteristics. A sample of 10 children from the pool of eligible children in each research center will be randomly selected for the data collection.
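The selection steps in this paragraph can be sketched as a two-stage random draw. This is an illustration only: the function name, the roster layout (a mapping from each center to the list of children with returned consent forms), and the identifiers are hypothetical, and an actual sampling plan might stratify by classroom or age.

```python
import random


def select_research_sample(consented_by_center, n_centers=2, n_children=10, seed=None):
    """Randomly pick centers, then randomly pick consented children within each one."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible and auditable
    center_ids = sorted(consented_by_center)
    chosen_centers = rng.sample(center_ids, min(n_centers, len(center_ids)))
    sample = {}
    for center in chosen_centers:
        pool = consented_by_center[center]
        # If fewer eligible children than requested consented, take all of them.
        sample[center] = rng.sample(pool, min(n_children, len(pool)))
    return sample


# Hypothetical roster for one grantee: center ID -> IDs of children with parental consent.
roster = {
    "center_A": [f"A{i:02d}" for i in range(25)],
    "center_B": [f"B{i:02d}" for i in range(8)],
    "center_C": [f"C{i:02d}" for i in range(30)],
}
sample = select_research_sample(roster, seed=1)
for center, children in sample.items():
    print(center, len(children))
```

Recording the seed alongside the roster would let the research team document exactly how each center's sample was drawn.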
Consent. Consent for children to participate in the research must be obtained from parents or guardians. The parent consent form will clearly inform parents (guardians) about the duration of the study, the types of assessments that will be administered, and the voluntary nature of participation. An information sheet will be included in the consent package to collect basic demographic information about the family, such as age, race and ethnicity, and family structure; these variables will be used to improve the precision of impact estimates as well as to define subgroups.
The consent process could proceed more smoothly if it is incorporated into the home visits that many programs make to families. During these visits, which typically occur just before children attend class for the first time, teachers bring forms that parents must complete before the start of the school year. The teacher is available to explain the forms, and to ensure that they are completed correctly. If the study’s consent is part of this process, the teacher would be able to explain the nature of the study, what will happen if the child participates in the research, and the voluntary nature of the child’s participation.
Child Eligibility Criteria for the Study. Because the sample of children is intended to represent all three-, four-, and five-year-old children in each center, all children participating in Head Start should be eligible for the study. Although implementation of the enhancement will occur before the sample of children is drawn, we recommend including all Head Start children in the potential sample for the study, rather than limiting the sample to children new to Head Start during the year in which fall and spring data are collected.
Directors’, Staff, and Teachers’ Self-Administered Questionnaires. All levels of staff will be asked about the continuous improvement process, how the process is working to improve quality of the management climate, and their satisfaction with their work. Teachers’ attitudes and beliefs about developmentally appropriate classroom practices will be tapped, as will ways in which the teachers are intentionally instructing children on language arts, early reading, and early mathematics skills. In addition, the questionnaire will include questions about the teachers’ background. During the spring data collection, the teacher questionnaire will omit the teacher background questions (except in the case of teachers who are new), but it will include a short behavior problems scale, social skills scale, and approaches toward learning scale for each child in the research sample. The behavior problems scale and the social skills rating scale will provide additional information on children’s behavior based on the teachers’ observations during the year. The approaches toward learning scale will help identify positive factors, such as curiosity and attentiveness, that are associated with academic success.
Classroom Observation. Measuring the overall quality of the classroom using a commonly used observational protocol will enable researchers to understand how the continuous program improvement process has contributed to the overall quality of the classroom environment. Thus, to measure classroom quality, we recommend the use of the Early Childhood Environment Rating Scale-Revised (Harms et al. 1998). We also recommend including subscales of the Teacher Behavior Rating Scale (CIRCLE 2005) that tap the quantity and quality of instruction in language development, early reading skills, and early mathematics skills. In addition, we recommend using two subscales of the Classroom Assessment Scoring System (CLASS; LaParo et al. 2004): (1) the Instructional Climate subscale, and (2) the Emotional Climate subscale. We recommend that the classroom observations be conducted during the two weeks before the spring follow-up assessments begin.
Child Assessment. The impacts of the quality improvement process on children will depend on what program improvements are implemented. The measurement plan would include measures of domains not included in the NRS; thus, we recommend measures of phonemic awareness, writing, and social skills to supplement NRS measures of language development, letter recognition, and early mathematics skills.
The importance of obtaining a true baseline child assessment, ideally before the children have had any experience in the Head Start classroom, must be balanced against both the need to control data collection costs by assessing children in the Head Start center and the two- to three-week period required to stabilize enrollment in Head Start classes. We recommend a field period that starts approximately two to three weeks after classes begin at the Head Start center, with a six-week window for data collection in the sites. To assess how children’s outcomes have been influenced by their experiences in the Head Start classroom and by what they have learned during the Head Start year, the follow-up assessment should be conducted as close to the end of the year as possible. We recommend that the follow-up data be collected during a six-week window that ends two weeks before the end of the year, and that data collection be matched to the timing of the fall assessment so that classes assessed early in the fall field period also are assessed early in the spring field period.
Head Start has invested considerable resources into strengthening the support for early literacy instruction in Head Start classrooms. Toward this end, the Head Start Bureau developed the Strategic Teacher Education Program (STEP) to train all Head Start teachers in techniques that support children’s emergent literacy. Research suggests that children progress further in reading if they have strong vocabulary skills (so that the words they read make sense) and strong decoding skills (so that they recognize letters, recognize the sounds of individual letters or groups of letters, and are able to combine the sounds to form words) (Whitehurst and Lonigan 1998). The vocabulary skills are referred to as “outside-in” skills because knowledge of the context and vocabulary supports the ability to read specific words, and the decoding skills are referred to as “inside-out” skills because knowledge of sounds and letters is used to read words and sentences.
Based on the reading research, STEP training and the curricula we discuss below include a combined focus on (1) building children’s vocabulary and familiarity with children’s books; and (2) teaching letter recognition, rhyming, the sounds of words, and the relationship between letters and sounds. Each Head Start grantee identified one staff member to attend intensive STEP training during the summer of 2002. After receiving the training, the 2,500 early literacy specialists returned to their Head Start programs to train teachers locally in preparation for the 2002-2003 program year. A system of mentor-coach training was designed to provide individualized follow-up training for teachers. Early literacy specialists provided one-on-one training to teachers, offering individual assistance to tailor the early literacy teaching techniques to each teacher, and to respond to the diverse backgrounds and needs of children in the classroom. To further reinforce the techniques and provide resource materials, the Head Start Bureau developed a website containing resource materials for teachers.
Given the level of investment in language development and emergent literacy initiatives already made in Head Start classrooms, in-depth examination of alternative approaches in this area might seem unnecessary. However, it is possible that some programs have not fully implemented the STEP techniques, so that following a well-documented, well-implemented approach would make a difference for children. Even if ACF is not interested in alternative literacy curricula at this time, this design offers a template for evaluating competing approaches to enhancing children’s development in the same domains as are covered in STEP (language and early literacy). Many of the early literacy curricula are at a stage of development that would enable them to be widely implemented, making them good candidates for a Stage 3 evaluation. This enhancement example is distinct from the previous one because it can be implemented at the center level, and, because we expect little spillover across centers, random assignment can take place at that level. Other curriculum innovations or changes that are implemented primarily at the classroom level, such as a mathematics curriculum or approaches to enhancing children’s social-emotional development, would be good examples of enhancements that would fit this design.
Several approaches to enhancing preschool children’s language development and early literacy skills have been developed. Some approaches are embedded within a broader preschool curriculum, and some present guides to classroom activities in large or small groups to promote the desired outcomes; another one is a computer-based approach. Manuals, training guides, and materials have been developed for most of the approaches, and most of them have been implemented in many preschool classrooms. We include these as illustrative examples, although they might not all be ready for Stage 3 evaluation:
Systematically implementing these five curricula in different Head Start centers would provide a test of which ones are more effective for Head Start children, and whether any or all of them improve children’s emergent literacy skills relative to current Head Start practice. We describe the study design next.
Several activities must occur to carry out the evaluation:
We recommend that the first three tasks be conducted during the first year of the evaluation, with implementation, sampling of children, and collection of baseline data during the second year, and follow-up data collection, analysis, and reporting during the third year (see Figure VI.3). Because the Head Start year typically runs from August or September through May or June, the timing of activities will proceed most smoothly if the evaluation activities begin in January or February. We discuss the steps in more detail in the rest of this chapter.
The first year of evaluation activities will be dominated by OMB review and by the sampling and recruitment of grantees and centers. OMB review of the study design and protocols for recruiting grantees and centers must be completed before researchers and curriculum developers can discuss the study with grantees, obtain information about prospective centers, or negotiate agreements to participate. We have estimated two months to draft and submit a package to OMB that includes the study design and recruiting protocols, and six months for OMB review. During the OMB review period, researchers can conduct other evaluation-related activities that do not involve data collection from prospective grantees and centers. For example, data collection protocols can be developed and submitted for OMB review, and the study design and data collection plans can be submitted for IRB review. Data systems for tracking children and for managing evaluation data can be developed as well. In addition, ACF can inform the Head Start community about the pending evaluation. After receiving OMB clearance, grantees can be sampled, and the curriculum developer and researchers can contact the selected grantees and centers and can begin the recruiting process. We estimate that recruitment, executing agreements, and randomly assigning centers or classrooms will require approximately five months.
The main evaluation activities during the second year consist of implementing the curricula to high fidelity, conducting the implementation study, obtaining consent for children to participate in the study, sampling children, and collecting fall baseline data. Ideally, implementation will occur during the spring so that the Head Start classes can benefit from a fully implemented curriculum during the entire Head Start year that follows. Third-year evaluation activities will include collecting follow-up data in the spring on classrooms and children, analyzing the data, and reporting the results.
An evaluation of the literacy curricula described here would be based on random assignment of centers and a comparison of children’s progress toward language development and emergent literacy skills. This section discusses key elements of a research design to evaluate the impacts of this enhancement on children’s outcomes. We begin by discussing the quality enhancement and its counterfactual, the research questions, sampling strategy, random assignment plan, and sample sizes.
The Quality Enhancement, Counterfactual, and Research Questions
The quality enhancement to be evaluated is a curriculum to support children’s language development and the acquisition of early literacy skills. ACF could decide to study one curriculum alone or multiple curricula. If more than one curriculum is studied, the outcomes of children using the different curricula can be compared to test whether any of the curricula are more effective than the others. Because implementing each curriculum is likely to increase the focus on language development and early literacy skills, it is likely that the differences in outcomes between children using the alternative curricula will be small. However, the evaluation still can establish whether any (or each) of the curricula is more effective than current practice. For this reason, under any evaluation design, it would be useful to include a control group that does not implement any of the curricula in order to permit comparisons of children under each curriculum with those not implementing any of the curricula.
The Head Start STEP training provided a foundation for language and literacy activities in each classroom. We have no information on how well STEP training supported better language development and early literacy activities in the classroom. Some information on the quality of implementation of the STEP approaches could be obtained from measures of language and literacy activities in the control-group classrooms in an evaluation of alternative curricula.
Research questions to be examined as part of this evaluation include whether each curriculum was implemented to high fidelity:
The most important impacts of the language and literacy curricula are those pertaining to the quality of the classroom language and literacy environment, and to children’s language development and emergent literacy skills. Thus, the following research questions are critical ones:
Finally, although the evaluation will examine the effectiveness of the curricula for children overall, the Head Start community also will be interested in whether the curricula are effective across different subgroups of children and families and across programs with different characteristics:
Some caution must be used when examining subgroup impacts, because, if many subgroups are examined, some impacts will emerge simply by chance. To guard against finding impacts by chance, researchers can implement a Bonferroni correction, which divides the significance level used for each test by the number of hypothesis tests. This method provides a higher bar for deciding that a subgroup impact is significantly different from zero.
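A Bonferroni correction can be applied by comparing each subgroup's p-value with the overall significance level divided by the number of tests. The sketch below shows the mechanics; the function name, subgroup labels, and p-values are hypothetical and do not come from any Head Start data.

```python
def bonferroni_flags(p_values, alpha=0.05):
    """Flag each test as significant only if its p-value is at or below alpha / number of tests."""
    if not p_values:
        raise ValueError("at least one hypothesis test is required")
    threshold = alpha / len(p_values)
    return {name: p <= threshold for name, p in p_values.items()}


# Hypothetical p-values for subgroup impact estimates.
subgroup_p = {
    "full-day programs": 0.004,
    "part-day programs": 0.030,
    "home language English": 0.012,
    "home language not English": 0.200,
}
flags = bonferroni_flags(subgroup_p)
for name, significant in flags.items():
    print(name, significant)
```

In this made-up example, the part-day estimate would pass an unadjusted 0.05 test but fails the adjusted threshold of 0.0125, which is the kind of chance finding the correction is meant to screen out.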
In a Stage 3 design that is nationally or regionally representative of all Head Start programs, subgroup analyses should be based on a representative sample of that subgroup in Head Start. Thus, subgroups of interest must be identified in advance so that statisticians can design a sampling plan for grantees, centers, and children that ensures adequate representative samples of the subgroups of interest. Perhaps the most interesting subgroups in a curriculum study are program schedule (part-day versus full-day), because of the tradeoff between program costs and outcomes; race/ethnicity, because of concerns about gaps in educational progress; and home language, because a curriculum may or may not work well for children whose home language is not English.
Sampling, Random Assignment, and Sample Sizes
Alternative curricula would be most efficiently and effectively implemented at the center level. Implementing the curricula center-wide would enable teachers within each center to discuss how to approach various activities, trade ideas for books to read to the class, and share other good practices. It is unlikely that spillover would occur across centers, as the curricula involve classroom-based activities. However, some communication across centers does occur, and if this could lead to some spillover, it needs to be monitored and measured.
Because implementation would take place at the center level, random assignment to the enhancement or control group should be at the center level as well. However, researchers would first have to draw a sample of program grantees and delegate agencies and then sample centers within these agencies. The sample of grantees and delegate agencies would be drawn within strata defined by Census regions, urbanicity, and percentage of children who are African-American and percentage who are Hispanic or Latino (large versus small percentage). Selection would be based on probability proportional to the number of children in the program, using the most recent PIR data as the sampling frame. The PIR contains data from annual reports submitted by all program grantees and delegate agencies about program size, schedule, and other characteristics.
NRS data are available for all four- and five-year-old children in every center within every grantee. If these data can be obtained and analyzed for the sample of centers included in the evaluation, evaluation costs could be reduced. In the previous example, in which the enhancement is implemented at the grantee level, the availability of NRS data on every child in every center leads to a strategy of minimizing the number of grantees included in the evaluation by including in the evaluation sample every center from the selected grantees. However, in the example of a curriculum implemented at the center level, it makes sense to minimize the number of centers included in the evaluation even if NRS data are available on all children, as implementation requires additional resources per center. Compared to using four centers per group from each grantee, if we randomly assign two centers per grantee to each curriculum group or to the control group, we will slightly increase the number of grantees to be recruited, but substantially decrease the number of centers that would have to implement the curriculum as part of the evaluation. Because many grantees have eight or more centers, assigning two centers from each grantee to each evaluation group also enables the evaluation to expand to include more than one or two curricula, if that is desirable.
We assume that, with NRS data used for the evaluation, an average of two centers per evaluation group from each grantee would produce outcome data on an average of 45 children per center. The sampling strategy that minimizes the design effect of weighting is equal probability sampling of centers. Table VI.5 shows the numbers of grantees and centers per evaluation group that would be necessary to detect impacts with a minimum effect size between 0.10 and 0.20 (that is, the impact as a proportion of the standard deviation of the outcome measure). We present sample size requirements for an effect size of 0.10 because the NRS measures are expected to be less sensitive to intervention and more “noisy” (to have higher measurement variance) than measures we would use as direct assessments. However, to obtain a precision level of 0.10, sample sizes must be very large. To detect an effect size of at least 0.15, the evaluation would require at least 206 centers in each group (if no baseline data are available) or 164 centers in each group (if some baseline data are available). Because NRS data are available on all four- and five-year-old children in the program, the design with baseline data would include 164 centers per group (if 82 grantees are used and each grantee contributes an average of 2 centers to each evaluation group), with 7,380 children per group (if each center has an average of 45 children in the follow-up sample).
| Minimum Detectable Effect | Baseline Data | Total Number of Grantees per Evaluation Group | Total Number of Centers per Evaluation Group | Number of Children per Evaluation Group in Final Sample |
|---|---|---|---|---|
| MDE = 0.10 | No baseline | 232 | 464 | 20,880 |
| | Minimal baseline | 185 | 370 | 16,650 |
| MDE = 0.15 | No baseline | 103 | 206 | 9,270 |
| | Minimal baseline | 82 | 164 | 7,380 |
| MDE = 0.20 | No baseline | 58 | 116 | 5,220 |
| | Minimal baseline | 46 | 92 | 4,140 |

Note: Sample size calculations assume that grantees are randomly selected and two centers per grantee are randomly assigned to each evaluation group. Within each center, 45 children on average are included in the follow-up data collection. “Minimal baseline” means that demographic information and NRS fall test scores are available so that the R² for the regression adjustment of the impact estimates is .20. Sample size calculations also assume a two-tailed test with 80 percent power and a 95 percent confidence level. Appendix B provides details about the calculations. Sample size calculations assume that the sampling strategy minimizes the design effect of weighting for the source of data. Thus, if the evaluation is based on NRS data on outcomes for all children in the center, sampling of grantees would be based on probability proportional to size, but sampling of centers within grantees would be based on equal probability of selection for centers stratified by size. The sample size calculations do not include an adjustment for the design effect of weighting for sample nonresponse. MDE = minimum detectable effect; NRS = National Reporting System.
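The center and child counts in Table VI.5 follow directly from the number of grantees and the two stated assumptions (two centers per grantee per group, 45 children per center). The short sketch below reproduces that arithmetic; the function and constant names are illustrative, and the grantee counts themselves come from the power calculations in Appendix B, which are not reproduced here.

```python
CENTERS_PER_GRANTEE_PER_GROUP = 2  # assumption stated in the table note
CHILDREN_PER_CENTER = 45           # average follow-up sample per center


def nrs_design_counts(grantees_per_group):
    """Return (centers, children) per evaluation group for the NRS-based design."""
    centers = grantees_per_group * CENTERS_PER_GRANTEE_PER_GROUP
    children = centers * CHILDREN_PER_CENTER
    return centers, children


# Grantee counts per evaluation group, copied from Table VI.5.
grantees_by_scenario = {
    ("0.10", "No baseline"): 232,
    ("0.10", "Minimal baseline"): 185,
    ("0.15", "No baseline"): 103,
    ("0.15", "Minimal baseline"): 82,
    ("0.20", "No baseline"): 58,
    ("0.20", "Minimal baseline"): 46,
}

for (mde, baseline), grantees in grantees_by_scenario.items():
    centers, children = nrs_design_counts(grantees)
    print(f"MDE {mde}, {baseline}: {grantees} grantees, "
          f"{centers} centers, {children:,} children")
```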
Because the NRS currently provides only a limited set of outcome measures and no information on the quality of implementation, a broader data collection effort that includes measures of children’s expressive language ability, phonemic awareness, and social-emotional well-being would provide a stronger basis for drawing conclusions about whether the curricula have important impacts on children’s development. Measures of the quality of classroom environments, quantity and quality of language development, and quantity and quality of early literacy activities in the classroom are critical for understanding whether the curricula have impacts on classroom learning environments.
Collecting additional data will require drawing a more efficient sample of children from the participating grantees and centers. To sample centers, grantees and delegate agencies would be asked to provide lists of centers containing information about enrollments, by age of child; the number of classes and class sizes; and directors’ contact information. Sampling would again be based on probability proportional to the number of children in the center and on an implicit stratification procedure that would draw from a sorted list of centers, where sorting is based on characteristics, such as classroom schedule, size, and percentage of non-English-speaking children. Drawing from a sorted list helps obtain a sample that is more proportionally representative of the characteristics used in the sort.
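One common way to combine probability proportional to size with implicit stratification is systematic selection from the sorted list. The sketch below assumes that approach; the report does not specify the exact algorithm, and the center records, field names, and function name are hypothetical.

```python
import random


def systematic_pps_sample(units, n, seed=None):
    """Systematic PPS selection from a list of (unit_id, size) pairs.

    The list is assumed to be sorted already by the stratification
    characteristics, so walking it at a fixed interval spreads the sample
    across those characteristics. A unit larger than the interval could be
    hit more than once (a certainty selection); the example avoids that case.
    """
    rng = random.Random(seed)
    total = sum(size for _, size in units)
    interval = total / n
    next_target = rng.random() * interval  # random start within the first interval
    selected = []
    cumulative = 0
    for unit_id, size in units:
        cumulative += size
        while len(selected) < n and next_target < cumulative:
            selected.append(unit_id)
            next_target += interval
    return selected


# Hypothetical center list (illustrative only).
centers = [
    {"id": "C1", "schedule": "full-day", "children": 120},
    {"id": "C2", "schedule": "part-day", "children": 90},
    {"id": "C3", "schedule": "full-day", "children": 60},
    {"id": "C4", "schedule": "part-day", "children": 45},
    {"id": "C5", "schedule": "full-day", "children": 150},
    {"id": "C6", "schedule": "part-day", "children": 80},
    {"id": "C7", "schedule": "full-day", "children": 40},
    {"id": "C8", "schedule": "part-day", "children": 75},
]

# Implicit stratification: sort by schedule, then by size, before sampling.
ordered = sorted(centers, key=lambda c: (c["schedule"], c["children"]))
sample = systematic_pps_sample(
    [(c["id"], c["children"]) for c in ordered], n=4, seed=1
)
print(sample)
```

Because the selection points are evenly spaced along the sorted list, both full-day and part-day centers appear in the sample in this example, roughly in proportion to their share of enrolled children.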
Table VI.6 shows the number of grantees, centers, and children to be included in a sample consisting of two centers per group from each grantee, two classrooms per center, and six children per class at the final follow-up. We use slightly larger minimum detectable effect sizes in this table (0.15 to 0.25) because data collection is considerably more expensive and the measures used are likely to be more sensitive to the enhancement than in the previous example. Notably, the number of grantees that would be required to attain the same precision level as in the previous example (0.15) would be higher (105 per group for this design, compared with 82 per group for the previous design), but the number of children in the evaluation would be much lower. The strategy of increasing the number of grantees while reducing the number of children in the sample is efficient because assessing children is very expensive. The tradeoff produces a much lower sample size of children for a relatively modest increase in the number of grantees to be recruited. The increase in the number of grantees reduces the effect of clustering in the sample of children. The two designs could be combined by selecting a target MDE level for the in-depth data collection, identifying the target number of grantees and centers, and then using the sample of grantees and centers for the NRS analysis as well (in other words, using all the children in the centers, rather than only the children sampled for broader data collection). A sample of children could then be drawn from the centers for more-extensive data collection. Of course, sampling of centers for this design would need to be based on a compromise between probability proportional to size and equal probability sampling. One possible compromise is to sample based on probability proportional to size, but to use the square root of the sample size as the measure of size.
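To illustrate why a square-root measure of size sits between the two sampling approaches, the sketch below computes selection probabilities for a few centers under three measures of size. It reads "size" as the number of children in the center, which is an assumption, and the enrollment figures are hypothetical.

```python
import math


def selection_probabilities(sizes, n, measure):
    """Probability of selection for each unit when n units are drawn with
    probability proportional to measure(size)."""
    measures = [measure(s) for s in sizes]
    total = sum(measures)
    return [n * m / total for m in measures]


# Hypothetical center enrollments (illustrative only).
enrollments = [25, 100, 100, 225]

pps = selection_probabilities(enrollments, 1, lambda s: s)        # proportional to size
root = selection_probabilities(enrollments, 1, math.sqrt)         # square-root compromise
equal = selection_probabilities(enrollments, 1, lambda s: 1)      # equal probability

for label, probs in [("size", pps), ("sqrt(size)", root), ("equal", equal)]:
    print(label, [round(p, 3) for p in probs])
```

In this example the largest center is nine times as likely to be selected as the smallest under proportional-to-size sampling, three times as likely under the square-root measure, and equally likely under equal probability sampling.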
Recruiting Selected Grantees and Encouraging Centers to Participate
The selected grantees will be directly recruited to participate in the evaluation. However, given that the grantees’ willingness to participate will likely depend on the willingness of the centers’ directors and staff to become involved, meetings should include the directors and key staff fairly early in the process. Researchers should call each grantee to describe the study, and to explain its importance. With support from the Head Start Bureau and regional offices, researchers should then schedule a meeting with the grantee executive staff and with as many center directors as possible to discuss in depth the goals of the study, the way that the study would be conducted in the centers, and the benefits and costs of participating. Ideally, the curriculum developers would accompany the researchers to the meeting to help generate excitement about the enhancement options.
| Minimum Detectable Effect | Baseline Data | Total Number of Grantees per Evaluation Group | Total Number of Centers per Evaluation Group | Total Number of Classrooms per Evaluation Group | Number of Children per Evaluation Group in Final Sample | Number of Children per Evaluation Group in Initial Sample |
|---|---|---|---|---|---|---|
| MDE = 0.15 | No baseline | 131 | 262 | 524 | 3,144 | 3,668 |
| | Minimal baseline | 105 | 210 | 420 | 2,520 | 2,940 |
| MDE = 0.20 | No baseline | 74 | 148 | 296 | 1,776 | 2,072 |
| | Minimal baseline | 59 | 118 | 236 | 1,416 | 1,652 |
| MDE = 0.25 | No baseline | 47 | 94 | 188 | 1,128 | 1,316 |
| | Minimal baseline | 38 | 76 | 152 | 912 | 1,064 |

Note: Sample size calculations assume that grantees are randomly selected for the evaluation and two centers from each grantee are randomly assigned to each evaluation group. Within each center, two classrooms and seven children per classroom are randomly selected for the evaluation sample (six per classroom, on average, are available for the follow-up). “Minimal baseline” means that demographic information and NRS fall test scores are available so that the R² for the regression adjustment of the impact estimates is .20. Sample size calculations also assume a two-tailed test with 80 percent power and a 95 percent confidence level. Appendix B provides details about the calculations. Sample size calculations assume that the sampling strategy used minimizes the design effect of weighting for the source of data. Thus, if the evaluation is based on data from a sample of classrooms and children, sampling of grantees and centers would be based on probability of selection proportional to size. The sample size calculations do not include an adjustment for the design effect of weighting for sample nonresponse. MDE = minimum detectable effect; NRS = National Reporting System.
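The tradeoff between the two designs at an MDE of 0.15 with minimal baseline data can be checked with a few lines of arithmetic. The sketch below uses only the assumptions stated in the table notes (two centers per grantee per group, two classrooms per center, seven children selected and six followed per classroom); the function name is illustrative.

```python
def sampled_design_counts(grantees, centers_per_grantee=2, classrooms_per_center=2,
                          children_selected=7, children_followed=6):
    """Per-group counts for the design that samples classrooms and children."""
    centers = grantees * centers_per_grantee
    classrooms = centers * classrooms_per_center
    return {
        "centers": centers,
        "classrooms": classrooms,
        "initial_children": classrooms * children_selected,
        "final_children": classrooms * children_followed,
    }


# MDE = 0.15 with minimal baseline data, per evaluation group.
sampled = sampled_design_counts(105)   # broader data collection design
nrs_grantees = 82                      # NRS-based design
nrs_children = nrs_grantees * 2 * 45   # two centers per grantee, 45 children per center

print(sampled)
print("Additional grantees to recruit per group:", 105 - nrs_grantees)
print("Fewer children to assess per group:", nrs_children - sampled["final_children"])
```

Under these assumptions, recruiting 23 more grantees per group reduces the follow-up sample by 4,860 children per group.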
The initial contact materials should briefly explain the study’s focus (in this case, the early literacy curricula) and summarize the benefits of participation. The contact materials also might indicate that all centers in the program will have an equal chance of participating in the study, but that only some will be chosen to implement the curricula. Those not chosen to implement the curricula during the first year will be given priority to implement them, if desired, after the follow-up data have been collected. The materials also will explain that the programs will receive information about the study’s findings, and that they will be partners in the research study.
Recruitment of grantees and centers will be easier if their staff believe that they will gain more from participation than they might lose. Accordingly, after the initial contact has been made, the benefits to sites must be explained in detail, and concerns about study burden must be discussed. Curriculum developers and research staff should visit program directors, center directors, and other relevant administrative staff to discuss the curricula, their expected benefits to children, what will be involved in implementing them, and the research aspects of the evaluation. Researchers also will explain the program’s role in ensuring that the study yields useful information about the curricula’s effectiveness. For example, researchers will have to work with program staff to obtain information required to implement random assignment (for example, the number of classes in each center, teachers’ names, class sizes, and children’s ages). Program staff will have to maintain random assignment statuses (for example, by ensuring that teachers in the intervention group do not share information about the curricula with control-group teachers). Researchers will have to monitor the integrity of random assignment over time, a key piece of information to demonstrate the reliability of the study. They also will have to work with program staff to schedule child assessments and classroom observations. These evaluation-related requirements will be balanced by the opportunity to implement new curricula that could benefit children. Benefits to the control group are more challenging to identify, but an important benefit that could be offered for control-group participants is first priority to implement a curriculum after the children participating in the evaluation have finished their Head Start year.
Agreements by grantee and center directors to participate should include a memorandum of understanding that describes the benefits to the program and center, and that specifies the respective responsibilities of the curriculum developer, researchers, and Head Start staff in this joint research undertaking. This level of detail ensures that misunderstandings that have the potential to threaten the success of the research study do not arise after it is too late to resolve them. A memorandum of understanding also offers a useful way of informing new staff about the study.
Gaining cooperation of grantee and center directors over time can become easier as the Head Start community gains experience with the process of evaluating quality enhancements. Head Start directors have had positive experiences participating in FACES, the QRC Consortium, and the Early Head Start evaluation, and these positive experiences have made it easier for researchers to recruit programs to participate in subsequent rounds of FACES and in the NRS Quality Assurance study. Publishing results of the quality enhancement studies in outlets that are read by Head Start staff and presenting findings at professional meetings attended by Head Start staff (such as the National Head Start Association and National Association for the Education of Young Children) as well as at regional gatherings of Head Start staff would help to generate excitement, and to foster support of future research efforts.
After the selected grantee directors have agreed to participate, they will be asked to provide information about the centers, including contact information for the directors; the centers’ characteristics (full-day or part-day services, and program year); and the number of classrooms, teachers, and students per class. Based on this information, a sample of centers will be selected and randomly assigned to one of the curriculum groups or to the control group. Directors of those centers would then be contacted to discuss plans for implementing curricula, obtaining teachers’ and parents’ consent, and sampling classrooms and children for the evaluation.
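A minimal sketch of the center-level random assignment within one grantee follows. It assumes equal-probability selection of centers and two centers per group; the center identifiers, group names, and function name are hypothetical, and an actual study would also document the random seed and the assignment log.

```python
import random


def assign_centers(center_ids, groups, per_group=2, seed=None):
    """Randomly select centers within a grantee and assign them to groups.

    random.sample returns the chosen centers in random order, so slicing
    the result into consecutive blocks yields a random assignment.
    """
    needed = per_group * len(groups)
    if len(center_ids) < needed:
        raise ValueError("grantee has too few centers for this design")
    rng = random.Random(seed)
    chosen = rng.sample(center_ids, needed)
    return {
        group: chosen[i * per_group:(i + 1) * per_group]
        for i, group in enumerate(groups)
    }


# Hypothetical grantee with eight centers and two curricula plus a control group.
grantee_centers = [f"center_{k}" for k in range(1, 9)]
assignment = assign_centers(
    grantee_centers, ["curriculum_A", "curriculum_B", "control"], seed=2024
)
for group, members in assignment.items():
    print(group, members)
```

With eight centers and three groups, six centers are assigned and two remain outside the evaluation, which is consistent with the observation that grantees with eight or more centers can support more than one or two curricula.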
The goal of implementation is to ensure that each curriculum is implemented to high fidelity in the classrooms of the designated Head Start centers. Clearly, high-fidelity implementation at Stage 3 must not require enormous resources; if it does, the curriculum would not be a good candidate for implementation in a large national program such as Head Start. Nevertheless, the task of implementing each curriculum should have strong leadership and the resources and resolve to reach high fidelity. T/TA staff periodically should measure teachers’ practices and classroom environments, using the fidelity measure developed for that curriculum, and should use the measures as the basis for conferring with teachers about successful practices and about how to address weak areas.
Each curriculum will be implemented using the relevant curriculum manual, necessary materials, and T/TA staff who are trained to help teachers implement the curriculum to high fidelity. The manual should provide ideas for arranging materials in the classroom, leading and supporting large- and small-group activities, and individualizing instruction for children having difficulty with any parts of the curriculum. T/TA staff will provide initial training in a workshop format during the spring of the second evaluation year (see Figure VI.3) and over the next two months will regularly observe and measure performance as part of technical assistance visits to improve fidelity. When the Head Start program begins again in the fall, new teachers will receive the initial workshop training while experienced teachers will receive a shorter refresher. T/TA staff will then make periodic visits to centers during the fall and spring, with more frequent visits to teachers having more difficulty reaching or maintaining fidelity to the curriculum.
Measuring fidelity of implementation is important so that the evaluation measures the impacts of the enhancement as designed rather than a pale substitute. One option for obtaining fidelity measures is to obtain the regular measures taken by T/TA staff as they conduct visits to work with teachers on implementation. If measurement is comparable across raters, these ratings of fidelity to implementation can be used in the evaluation. Comparability can be improved if T/TA staff are carefully trained to use the assessment protocol and, where items involve some judgment on the part of the rater, steps are taken to check the reliability of coding. Reliability can be checked initially and over time by asking T/TA staff to videotape the situations they are coding, and to send the videotapes and coding results to the research organization to check. Research staff will view the videotapes, independently code the results, compare coding, and provide feedback to the T/TA staff to help improve their understanding of any items for which the coding was discrepant.
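One way the comparison of codes could be summarized is with simple percent agreement and Cohen's kappa, which adjusts agreement for chance. The report does not name a specific statistic, so kappa is offered here only as a common choice; the ratings below are hypothetical.

```python
from collections import Counter


def percent_agreement(codes_a, codes_b):
    """Share of items on which two coders assigned the same code."""
    return sum(a == b for a, b in zip(codes_a, codes_b)) / len(codes_a)


def cohens_kappa(codes_a, codes_b):
    """Agreement between two coders, adjusted for agreement expected by chance."""
    n = len(codes_a)
    observed = percent_agreement(codes_a, codes_b)
    counts_a, counts_b = Counter(codes_a), Counter(codes_b)
    expected = sum(
        counts_a[c] * counts_b[c] for c in set(codes_a) | set(codes_b)
    ) / (n * n)
    if expected == 1:
        return 1.0  # both coders used a single identical code throughout
    return (observed - expected) / (1 - expected)


# Hypothetical fidelity ratings (1 = low, 3 = high) on eight videotaped items.
tta_codes = [3, 2, 3, 1, 2, 3, 2, 1]
research_codes = [3, 2, 2, 1, 2, 3, 3, 1]

print("Percent agreement:", percent_agreement(tta_codes, research_codes))
print("Cohen's kappa:", round(cohens_kappa(tta_codes, research_codes), 3))
```

Items on which the two sets of codes differ (the third and seventh in this illustration) would be the natural focus of feedback to T/TA staff.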
Concerns that the validity of fidelity measures obtained for T/TA purposes could be compromised if they also are used for research may lead to an alternative strategy. A trained cadre of observers could visit each center once during the Head Start year to obtain fidelity measures. Visiting such a large number of centers and classrooms (assuming three or more curricula are being evaluated) will be fairly expensive. Nevertheless, if policymakers want reliable, valid measures of implementation fidelity so that the results of the evaluation can be trusted as measures of the effectiveness of each curriculum in Head Start, one visit to each center to collect implementation fidelity measures should be strongly considered. If this visit is scheduled for the spring of the evaluation year, the observers also could be trained to collect a set of child outcome measures that reflects the breadth of the early literacy curricula beyond the areas currently covered by the NRS.
We discuss two options for measuring impacts of the literacy curricula on children’s outcomes. Under one option, the evaluation could rely on data from the NRS. This strategy would provide information on vocabulary, letter recognition, and early mathematics skills for all children in the centers that were randomly assigned.6 Under the second, the evaluation could include data collected from a sample of children in the participating centers.
Use of Head Start Administrative Data
NRS data on children’s vocabulary and letter recognition can support estimation of the impacts of the literacy curricula on those outcomes. As discussed in Section A, the NRS measures have good reliability, and they provide a measure of important areas of children’s progress in Head Start. However, the literacy curricula will include a focus on phonological awareness, which is not measured fully by the letter recognition test, and on vocabulary, which could be more fully measured by adding an assessment of expressive vocabulary. The NRS currently does not measure children’s social-emotional well-being, so any changes in this area would not be measured if the evaluation were to rely on administrative data.7
Information on children, classrooms, teachers, and centers can be taken from the CBRS; for the most part, this information would offer control variables to improve the precision of the impact estimates or information required to define subgroups (see Table VI.1). Grantee-level information on location, program size, and aggregate child and family characteristics could be obtained from the PIR and used as control variables to improve precision. Information on the fidelity of implementation obtained from the T/TA staff could be used to distinguish classrooms that reached a threshold for high-fidelity implementation from classrooms that did not reach the threshold.
Broader Data Collection from a Sample of Children Within Grantees
As an option, ACF might decide to collect more-extensive outcome data and information on intermediate outcomes from a sample of children in the centers participating in the study. The number of classrooms per center (two) and children per classroom (six) would be much lower than under the design based on administrative data (shown in Table VI.5 and VI.6). Under this option, the NRS outcomes would be supplemented with intermediate outcomes measuring the overall quality of the classroom and the quality of instructional practices in language development and early literacy (see Table VI.7). Measures of children’s development would be supplemented by measures of expressive vocabulary, phonemic awareness, and writing, as well as aspects of social-emotional development, including aggressive behavior, hyperactivity, prosocial behavior, and approaches toward learning. Because of the high cost of data collection when several curricula are evaluated, we recommend a single visit during the spring of the Head Start year to collect only follow-up child outcome data and classroom observation data. NRS data can serve as the baseline for all children in the sample.
Expanding data collection beyond Head Start administrative data would increase the costs and complexity of the study beyond the obvious need to develop data collection instruments and to then visit centers to collect classroom-level and child-level data. The study design and data collection instruments would have to be approved by OMB and an IRB; classrooms and children would have to be sampled; and center staff and parents would have to consent to participate. Here, we discuss the steps necessary to collect additional data beyond the NRS for this evaluation.
Develop Data Collection Instruments. Data collection instruments will be developed during the first year of the study. They will include the consent form with a form requesting demographic information for parents; the director, Head Start staff, and teacher surveys; the child assessment and observation protocol; and the classroom observation protocol. Many of these instruments will include standardized assessments that will have to be formatted to simplify administration by a trained assessor. Others (such as the teacher questionnaire) will rely on questions that have been used in previous studies. After the data collection instruments have been developed, they will be pretested to ensure that respondents understand the questions, that the question flow proceeds logically and smoothly, and that the time required to complete them is reasonable.
IRB and OMB Research Review. Research on human subjects must be reviewed by an IRB, which considers the benefits of the research to society, the programs, and the participating families and weighs them against the cost of the research to the families and program staff. The IRB also reviews protection of research participants from harm by ensuring that confidentiality is maintained. In addition, if the evaluation is federally funded, data collection instruments and the research plan must be approved by OMB. The data collection instruments are reviewed to ensure that they do not overlap with other federal data collection efforts, and that burden is not excessive. These reviews will be conducted during the first year of the study, in two parts: The first review will be initiated quickly and will cover the study design, recruiting protocols, and implementation study protocols. The second review will be conducted during the second half of the first year and will pertain to the data collection protocols.
| Domain | Outcome | Recommended Measure | Type of Measure |
|---|---|---|---|
| Teachers’ Knowledge and Practice | Attitudes and knowledge about developmentally appropriate practice | Teacher Beliefs Scale (Burts, Hart, Charlesworth and Kirk 1990) | Teacher survey |
| | Curriculum and support for language development and early literacy activities in the classroom | Questions about curricula used in the classroom; instruction on the curricula; materials available to support language development, book reading, and early literacy activities | Teacher survey |
| | Daily activities, language activities, and early literacy activities | Questions about daily schedule and mix of language and literacy-oriented activities | Teacher survey |
| Classroom Processes | Materials and teacher activities to promote learning | Early Childhood Environment Rating Scale - Revised (Harms et al. 1998) | Observation, teacher survey |
| | | Teacher Behavior Rating Scale (CIRCLE 2005) | Observation |
| Children’s Development | Expressive language | Preschool Language Scale, 4th Edition (PLS-4; Zimmerman et al. 2002) or Oral and Written Language Scales (OWLS; Carrow-Woolfolk 1995) | Assessment |
| | Phonemic awareness | Woodcock-Johnson III Sound Awareness (Woodcock, McGrew, and Mather 2001) | Assessment |
| | Letter recognition | Woodcock-Johnson III Letter-Word Identification (Woodcock, McGrew, and Mather 2001) | Assessment |
| | Writing, small motor skills | Woodcock-Johnson III Spelling (Woodcock, McGrew, and Mather 2001) | Assessment |
| | Behavioral problems | Child Behavior Checklist (Achenbach and Rescorla 2000) | Teacher report |
| | Social competence | Social Skills Rating Scale (Gresham and Elliott 1990) | Teacher report |
| | Approaches toward learning | Preschool Learning Behaviors Scale (McDermott et al. 2000) | Teacher report |
Selection of Classrooms and Children for the Study. After classrooms and children’s enrollments have been established (during August of the third year of the study), researchers will work with center directors to obtain a list of teachers and the number of children enrolled in each classroom. Two classrooms from each center will be selected with probability of selection proportional to class size. Parent consent forms will be distributed to parents of children in those classrooms (as detailed in the next paragraph). Returned consent forms and associated information sheets will be sent by program staff to the researchers for processing. The researchers will enter data from the forms to classify children by center, by classroom, by consent status, and by demographic characteristics. A sample of seven children from the pool of eligible children in each research classroom will be randomly selected for the data collection.
Consent. Consent for children to participate in the research must be obtained from parents or guardians. The parent consent form will clearly inform parents (guardians) about the duration of the study, the types of assessments that will be administered, and the voluntary nature of participation. An information sheet will be included in the consent package to collect basic demographic information about the family, such as age, race and ethnicity, and family structure, as these variables are necessary to improve the precision of impact estimates as well as to define subgroups.
The consent process could proceed more smoothly if it is incorporated into the home visits that many programs make to families. During these visits, which typically occur just before children attend class for the first time, teachers bring forms that parents must complete before the start of the school year. The teacher is available to explain the forms, and to ensure that they are completed correctly. If the study’s consent is part of this process, the teacher would be able to explain the nature of the study, what will happen if the child participates in the research, and the voluntary nature of the child’s participation.
Child Eligibility Criteria for the Study. Because children’s language development and early literacy skills grow rapidly between three and five years of age, we probably would have to analyze three- and four-year-olds separately, thus reducing the effective analysis sample. We therefore recommend focusing the research sample on children who will be four years old in the fall of the Head Start year. Although teacher training will begin during the final months of the previous Head Start year, we expect that children attending Head Start that year will have had very little exposure to a fully implemented enhancement. Therefore, we recommend including all four-year-old Head Start children in the potential sample for the study, rather than limiting the sample to children who will be new to Head Start during the year in which fall and spring data are collected.
Teachers’ Self-Administered Questionnaires. Teachers will be asked about the curricula used in the center and the support they receive for language and early literacy activities in the classroom through purchase of books and other materials and training. In addition, teachers’ attitudes about developmentally appropriate classroom practices will be tapped, as will ways in which they are intentionally instructing children in vocabulary, letter recognition, and phonological awareness. Teachers also will complete questionnaires about every research-sample child from their classes; the questionnaires will include a short behavior problems scale, a social skills scale, and a scale measuring approaches toward learning. The behavior problems scale and the social skills rating scale will provide information on children’s behavior based on the teachers’ observations during the year. The scale measuring approaches toward learning will help identify positive factors, such as curiosity and attentiveness, that are associated with academic success.
Classroom Observation. Including a commonly used observational protocol to measure the overall quality of the classroom will enable researchers to understand how the early literacy curriculum may have influenced the overall quality of the classroom environment. Therefore, we recommend using the Early Childhood Environment Rating Scale-Revised to measure classroom quality (Harms et al. 1998). We also recommend including subscales of the Teacher Behavior Rating Scale (CIRCLE 2005) that tap the quantity and quality of instruction in language development and early literacy skills.
Child Assessment. The early literacy curriculum is expected to support improvements in children’s vocabulary, letter recognition, phonological knowledge, and writing ability. We recommend measuring these outcomes with well-established scales that have been used in populations similar to the children in the proposed enhancements. For expressive language ability, we recommend using the Preschool Language Scale (Zimmerman et al. 2002) or the Oral and Written Language Scales (Carrow-Woolfolk 1995). For phonological awareness, we recommend the Woodcock-Johnson III Sound Awareness task; for letter recognition, the Woodcock-Johnson III Letter-Word Identification task; and for writing ability, the Woodcock-Johnson III Spelling task (Woodcock et al. 2001).
For any of the field test designs described in this chapter, the evaluation should culminate in a report that clearly describes the enhancement, the research design, and the estimated impacts of the enhancement on key outcomes. This report should be written for a broad audience. In addition, other reports and papers can be written to address other information needs:
Detailed and summary versions of each report should be written to address the needs of technical audiences as well as the Head Start community and other early childhood practitioners and policymakers.
In this section, we discuss the approach to estimating impacts when grantees are randomly assigned and when centers are randomly assigned. The following section describes the approach to the cost-effectiveness analysis.
Analysis weights will be created to account for variations in the probabilities of selection, eligibility rates, and cooperation rates among those selected. First, the probability of selection should be calculated for each stage of sampling (grantee, center, classroom, and child), and within each explicit sampling stratum (for example, region). The inverse of the probability of selection at each stage is called the sampling weight. This stage of weighting would account for the probability proportional to size (PPS) sampling approach, any certainty selections, and any oversampling. At each stage, the sampling weights will be adjusted by the inverse of the weighted response rate among eligible cases at that stage.
If sampling is based on a PPS method at each stage, the children in the resulting sample will have roughly equal probability of selection. In this case, weights are unlikely to be very different for each child, and, therefore, the weighted analyses will not be very different from the unweighted analyses.
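The weighting steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's weighting procedure: the function name `analysis_weight` and every probability and response rate shown are hypothetical values chosen only to make the arithmetic visible.

```python
from math import prod


def analysis_weight(selection_probs, response_rates):
    """Return a nonresponse-adjusted analysis weight for one sampled child.

    selection_probs: probability of selection at each sampling stage
        (for example grantee, center, classroom, child).
    response_rates: weighted response rate among eligible cases at each stage.
    """
    if len(selection_probs) != len(response_rates):
        raise ValueError("need one response rate per sampling stage")
    # The sampling weight at each stage is the inverse of the selection
    # probability; it is then adjusted by the inverse of that stage's
    # weighted response rate.
    stage_weights = [1.0 / (p * r) for p, r in zip(selection_probs, response_rates)]
    # The overall weight is the product of the stage-level weights.
    return prod(stage_weights)


# Hypothetical values for illustration only.
probs = [0.10, 0.50, 0.50, 0.40]  # grantee, center, classroom, child
rates = [1.00, 1.00, 1.00, 0.80]  # weighted response rate at each stage
weight = analysis_weight(probs, rates)
print(round(weight, 2))
```

Under a PPS design the product of the stage probabilities would be roughly constant across children, which is why the resulting weights would vary little.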
Using a random assignment evaluation design means that fairly simple estimation methods can be used to determine the impacts of the quality enhancements at a point in time. Under random assignment of centers, the center-level mean outcomes are estimated, after which, separately for the enhancement and the control groups, the center mean outcomes are averaged over all centers in that group. An analogous approach is taken when grantees are randomly assigned. The simple difference in the means between enhancement and control centers (grantees) is the impact estimate.
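The two-step calculation under center random assignment can be sketched as follows. The sketch assumes unweighted data and uses made-up outcome values; the function name `impact_from_center_means` and the record layout are illustrative assumptions, not part of the evaluation design.

```python
from statistics import mean


def impact_from_center_means(children):
    """Difference in average center-level mean outcomes.

    children: iterable of (center_id, group, outcome) records, where
        group is "enhancement" or "control".
    """
    by_center = {}
    for center_id, group, outcome in children:
        by_center.setdefault((center_id, group), []).append(outcome)

    # Step 1: estimate the mean outcome within each center.
    center_means = {"enhancement": [], "control": []}
    for (_center_id, group), outcomes in by_center.items():
        center_means[group].append(mean(outcomes))

    # Step 2: average the center means within each group and take the
    # simple difference between the two group averages.
    return mean(center_means["enhancement"]) - mean(center_means["control"])


# Made-up outcome scores for four centers.
sample = [
    ("A", "enhancement", 12.0), ("A", "enhancement", 14.0),
    ("B", "enhancement", 10.0),
    ("C", "control", 9.0), ("C", "control", 11.0),
    ("D", "control", 8.0),
]
impact = impact_from_center_means(sample)
print(impact)
```

Because each center contributes one mean regardless of how many children it serves, the unit of analysis matches the unit of random assignment.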
More-precise estimates can be obtained by estimating regression models. Regression procedures can improve the precision of the estimates and can adjust for any residual differences in the observable characteristics of enhancement and control group members due to random sampling and interview nonresponse. Regression models take the following form:
(1) Y = α + Xβ + γT + ε,
where Y is an outcome variable; X is a vector of explanatory variables; T is an indicator that equals one for members of the enhancement group and zero for members of the control group; α, β, and γ are parameters to be estimated; and ε is a random-error term. The estimate of the parameter γ is the estimated impact of the quality enhancement compared with regular Head Start services.
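As a rough illustration of equation (1), the sketch below fits the model by ordinary least squares with NumPy on simulated, noise-free data so that the known value of γ is recovered. The function name `estimate_impact`, the simulated covariates, and the coefficient values are all assumptions made for the example. The sketch returns point estimates only and does not compute standard errors, which would need to account for clustering.

```python
import numpy as np


def estimate_impact(y, X, T):
    """Fit Y = alpha + X beta + gamma T + e by OLS.

    Returns (alpha, beta, gamma), where gamma is the estimated impact.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(y), -1)
    T = np.asarray(T, dtype=float)
    # Design matrix: intercept, explanatory variables, treatment indicator.
    design = np.column_stack([np.ones(len(y)), X, T])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:-1], coef[-1]


# Simulated data for illustration only (no error term, so the fit is exact).
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))      # two hypothetical explanatory variables
T = np.repeat([0, 1], n // 2)    # 0 = control, 1 = enhancement
y = 1.0 + X @ np.array([0.5, -0.3]) + 2.0 * T
alpha, beta, gamma = estimate_impact(y, X, T)
print(round(float(gamma), 6))
```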
Because random assignment will have been conducted at the grantee or center level (depending on the design) and sampling will have taken place at the grantee, center, and, sometimes, classroom level, the regression adjustment takes the form of a hierarchical linear model (HLM) of child development consisting of three or four nested levels (depending on whether classrooms are sampled). By specifying the model at each level, we can conduct analyses for the appropriate units of analysis and can conduct statistical hypothesis tests that correctly account for the clustering of children within classrooms, classrooms within centers, and centers within grantees. Although not strictly necessary for estimating impacts (because the evaluation is based on a random assignment design), using an HLM to adjust the impacts can help increase the precision of the estimates.
For the design involving random assignment of grantees with sampling of grantees, centers, and children, the analytic model is the following, where the variables are indexed by child (i), centers (j) and grantees (k):
Child-level model
(2a) Y_{ijk}(t) = αY_{ijk}(0) + βX_{ijk} + C_{jk} + ε1_{ijk}.
Center-level model
(2b) C_{jk} = ΩZ_{jk} + g_{k} + ε2_{jk}.
Grantee-level model
(2c) g_{k} = ηT_{k} + δV_{k} + ε3_{k}.
For the design involving random assignment of centers with sampling of grantees, centers, classrooms, and children, the analytic model is the following, where the variables are indexed by child (i), classrooms (h), centers (j), and grantees (k):
Child-level model
(4a) Y_{ihjk}(t) = αY_{ihjk}(0) + βX_{ihjk} + l_{hjk} + ε4_{ihjk}.
Classroom-level model
(4b) l_{hjk} = ΨW_{hjk} + C_{jk} + ε5_{hjk}.
Center-level model
(4c) C_{jk} = λT_{jk} + ΩZ_{jk} + g_{k} + ε6_{jk}.
Grantee-level model
(4d) g_{k} = δV_{k} + ε7_{k}.
In the models, Y(t) is the outcome at follow-up period t; Y(0) is the corresponding baseline measure; X is a set of child characteristics, such as gender; T is a variable indicating whether the child was in the enhancement group or control group; W is a set of classroom variables, such as the literacy environment; Z is a set of center-level variables, such as whether classes are full day; V is a set of grantee-level variables, such as the region or the grantee director’s education level; and ε1 through ε7 are disturbance terms assumed to have a mean of zero and to be uncorrelated with each other. Parameters to be estimated include α and β, vectors of coefficients on the child baseline characteristics; C, the center effect; η and λ, the effect of the quality enhancement in the two models; δ, the coefficients on the grantee variables; Ω, the coefficients on the center variables; and Ψ, the coefficients on the classroom variables.
The statistical techniques used to estimate regression-adjusted impacts in equations (2) and (4) will depend on the form of the outcome, Y. If the dependent variable is continuous (such as the score on the Woodcock-Johnson Letter-Word Identification assessment), ordinary least squares methods produce unbiased estimates of the parameters η or λ. However, if the dependent variable is binary (such as whether the child's score on the Woodcock-Johnson Letter-Word Identification subtest was one or more standard deviations below the norm), logit or probit maximum-likelihood methods will be used to obtain consistent parameter estimates.
The research design is based on the assumption that random assignment will be implemented well so that a large proportion of children in the enhancement group remain in that enhancement center for the Head Start year and all children in control-group centers remain in those centers for the Head Start year. Nevertheless, children who start out in an enhancement center might move during the Head Start year to a different center that has not implemented the enhancement. At the same time, children in a control-group center might move during the Head Start year to a center that has implemented the enhancement.
To the extent that these movements across centers occur, the impact analysis described above will not measure completely the average impact of the quality enhancement on children, but rather the impact of being given the opportunity to attend a center with the enhancement; this is called the “intent to treat” impact estimate. The effect of both types of “crossovers” is to reduce the average impact because some children in the enhancement group did not receive the full effect of the enhancement and some children in the control group did. Researchers can adjust the impact estimates to account for such crossovers. The intent-to-treat estimate can be adjusted by dividing it by the difference between the proportion of students who remained in the enhancement centers and the proportion of students from the control centers who switched to an enhancement center.
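The crossover adjustment described above amounts to a single division. The sketch below is a minimal illustration; the function name `adjust_for_crossovers` and the numbers in the example (an intent-to-treat impact of 0.18, with 95 percent of enhancement-group children remaining and 5 percent of control-group children switching) are hypothetical.

```python
def adjust_for_crossovers(itt_impact, share_enh_stayed, share_control_switched):
    """Scale an intent-to-treat impact to account for crossovers.

    itt_impact: intent-to-treat impact estimate.
    share_enh_stayed: proportion of enhancement-group children who
        remained in an enhancement center.
    share_control_switched: proportion of control-group children who
        switched to an enhancement center.
    """
    difference = share_enh_stayed - share_control_switched
    if difference <= 0:
        raise ValueError("adjustment is undefined unless the difference is positive")
    return itt_impact / difference


# Hypothetical values for illustration only.
adjusted = adjust_for_crossovers(0.18, 0.95, 0.05)
print(round(adjusted, 3))
```

With these values the adjusted impact is 0.18 / 0.90 = 0.20, slightly larger than the intent-to-treat estimate, as expected when crossovers dilute the contrast between groups.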
Because the effectiveness of an enhancement may differ by program setting or by characteristics of the children served, it would be useful to determine the groups for which the enhancement is most effective. This type of subgroup analysis would then enable individual programs to decide which enhancements might be useful to them. For example, analysis may demonstrate that the early literacy curriculum is much more effective in programs that offer full-day services than in those offering part-day services, or that its impacts are stronger when teachers have higher educational credentials at the outset of training.
The subgroups of interest will depend on the quality enhancement to be tested. Examples of center- or grantee-level characteristics that can define subgroups include:
The program subgroups defined by whether the enhancement was implemented with high fidelity or not are perhaps the most important for understanding the impacts of the quality enhancement. High-fidelity implementation of the quality enhancement is critical to providing a fair test of its effectiveness. High-fidelity implementation should also be feasible by Stage 3, as the details of how to implement the quality enhancement in a variety of Head Start settings should have been developed by this point. Nevertheless, rather than assuming that implementation was successful, the Stage 3 evaluation plan should include measurement of fidelity.
If some centers or grantees participating in the evaluation did not reach critical thresholds for high-fidelity implementation, the impact analysis should include an examination of subgroups defined by degree of fidelity to the enhancement model. Constructing subgroups in both the enhancement and control groups for the analysis of impacts by implementation status poses a challenge because the control group should not have implemented the enhancement. Under center random assignment, the best approach is to define the implementation status for the two enhancement centers within a particular grantee. If both centers reached high-fidelity implementation, both the enhancement and control group centers from that grantee would be placed in the high-fidelity implementation group for analysis. An analogous procedure would follow if neither center reached high-fidelity implementation; the enhancement and control centers within that grantee would be classified according to the implementation status of both enhancement centers. If the implementation status of the two enhancement centers is divergent, however, some judgment would need to be made regarding the appropriate placement of the set of centers from that grantee. If both are fairly close to the threshold for full implementation but only one exceeded the threshold, they could both be placed in the high-fidelity category. If the statuses of the two centers are quite different, however, it might be best to omit the enhancement and control centers from that grantee from the analysis.
Under grantee random assignment, each grantee is selected from within sampling strata defined by region, grantee size, and other characteristics. These sampling strata could define “groups” of enhancement and control grantees that could then be classified according to the implementation status of all or most of the enhancement-group grantees.
Examples of categories of child and family characteristics (measured prior to the experience of the quality enhancement) for subgroup analysis include:
Subgroup estimates can be obtained using the same procedures as the ones to be used for calculating overall impacts, but these calculations are made for particular subgroups. Regression-adjusted subgroup estimates can be obtained by introducing an interaction term that is the product of the treatment indicator and an indicator of membership in the subgroup of interest. This term is entered into the model appropriate for the level of random assignment, and for whether the subgroup is defined at the child level or at the classroom, center, or grantee level.
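The interaction-term approach can be sketched with a single-level OLS fit. This is a simplified illustration, not the multilevel model itself: it assumes the subgroup indicator also enters as a main effect (a conventional choice that the text does not spell out), ignores clustering, and uses noise-free made-up data. The name `subgroup_impacts` is hypothetical.

```python
import numpy as np


def subgroup_impacts(y, T, S):
    """Fit Y = a + b*S + gamma*T + delta*(T*S) by OLS.

    T: treatment indicator (1 = enhancement, 0 = control).
    S: subgroup indicator (1 = member of the subgroup of interest).
    Returns the impact outside the subgroup (gamma), the impact inside
    the subgroup (gamma + delta), and their difference (delta).
    """
    y, T, S = (np.asarray(v, dtype=float) for v in (y, T, S))
    design = np.column_stack([np.ones(len(y)), S, T, T * S])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    _, _, gamma, delta = coef
    return {
        "outside_subgroup": float(gamma),
        "in_subgroup": float(gamma + delta),
        "difference": float(delta),
    }


# Made-up data: the impact is 2.0 outside the subgroup and 3.5 inside it.
T = np.array([0, 0, 1, 1, 0, 0, 1, 1])
S = np.array([0, 1, 0, 1, 0, 1, 0, 1])
y = 10.0 + 1.0 * S + 2.0 * T + 1.5 * T * S
result = subgroup_impacts(y, T, S)
print(result)
```

The coefficient on the interaction term is the difference in impacts between the subgroup and everyone else, so testing whether it differs from zero tests whether the impacts differ.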
However, unless the overall sample is very large, it will be possible to detect impacts only for large subgroups of the population, as subgroup estimates, which are based on only part of the full sample, are less precise than are full-sample estimates. For example, in the center random assignment design with minimal baseline data, our sample of 1,416 children (in 118 centers) is sufficient for detecting impacts with effect sizes of .20 (one-fifth of a standard deviation on the outcome measures) or more. However, for a subgroup that includes 50 percent of the children across all the centers (for example, children whose mothers have less than a high school diploma), impacts with effect sizes of .22 or more can be detected. For a subgroup that includes all of the children in half the centers (for example, full-day programs), impacts with effect sizes of .24 or more can be detected.
Program administrators care not only about the effectiveness of the quality enhancements, but also the costs of these enhancements. An enhancement may not be implemented if its impacts are not viewed as commensurate with the amount of program resources it requires. Thus, program administrators must be provided with estimates of the quality enhancement’s costs as well as estimates of the program impacts.
Relevant Costs of the Enhancement. To support analyses of the cost of an enhancement relative to its impacts or benefits, researchers ideally would obtain information during the implementation period on total program costs relative to the cost of the program without that enhancement. Researchers should measure two types of costs: (1) the upfront, one-time costs of beginning implementation; and (2) the ongoing additional costs resulting from the enhancement. Among the first set of costs are the costs associated with initial training, including the T/TA staff’s time, the cost of substitute teachers hired to cover classrooms while regular teachers attend training, and the cost of additional staff days for teachers who are paid for the days that they attend training. The first set of costs also includes costs associated with the training staff’s technical assistance visits and any costs associated with having teachers attend extra training sessions outside their normal classroom duties (for example, periodic group discussions, if any). Time spent by teachers and trainers would be valued at their hourly wage, including fringe benefits. If this training actually supplanted the usual teacher training sessions conducted during the year, those routine training costs should be subtracted. If space for training is rented, the cost of the rental will be included as well. Materials, including documents, videos, and other training materials, should be valued at their cost.
Ongoing additional costs resulting from enhancements would include the costs of materials, any computer-related costs, the cost of ongoing technical assistance, and costs to train new teachers. Teachers will have to maintain a ready supply of materials for any curriculum-related enhancement. Because broken or lost materials must be replaced, the cost of maintaining the supply of curriculum materials during the program year is part of its ongoing cost. Similarly, if a computer-based software program is part of the curriculum, the cost of setting up and maintaining the personal computers in the classroom with appropriate software is an additional cost. The cost of refresher training and ongoing technical assistance to teachers during a normal program year to ensure that the curriculum continues to be implemented with high fidelity is an ongoing cost as well. For example, T/TA staff might make two or three visits to each classroom during the program year to observe, measure fidelity, and meet with the teachers and the education coordinator to discuss classroom practices and respond to questions. Finally, overall teacher turnover in Head Start is approximately 15 percent per year (National Institute for Early Education Research 2003). Accordingly, an ongoing cost of the curriculum is the cost of T/TA provided to 15 percent of the teachers in the enhancement group each year. Assuming that the evaluation includes 20 to 50 classrooms in the enhancement group, with one teacher to be trained per classroom, three to seven teachers would have to be fully trained each year.
Any of three different approaches to estimating these costs of implementing an enhancement could be used:
The first approach is the least burdensome for the enhancement centers or grantees (and eliminates burden for the control centers or grantees) but could fail to account for costs associated with the enhancement. The second approach is more burdensome for the enhancement group than is the first one, and it does not require a response from the control group; however, if anything other than the implementation of the enhancement changes from one year to the next, the estimate of the cost of the enhancement will be inaccurate. The third approach is the most burdensome, involving extensive collection of cost information from both enhancement and control groups, but it would provide the most accurate estimate of the costs of implementing the enhancement.
Obtaining Cost Information in a Field Test Evaluation. Information about costs can be obtained from semistructured interviews with program and center directors, from T/TA supervisory staff, and from program records. For a field test that relies on NRS data without making visits to programs to collect data, cost information could be obtained from telephone interviews with key T/TA supervisory staff about the number of hours spent at each program or center site, and from Head Start program cost records submitted annually to the federal government. However, Head Start programs often receive funding from multiple sources, so a particular program’s federal Head Start budget could be an imperfect measure of program costs. Interviews with program and center directors about costs could be accommodated in a study that includes visits to programs to collect fidelity and child outcome data, but this approach would require extending the visits, thereby adding to evaluation costs.
All cost information must be obtained in dollars. However, to ensure that the dollar values collected at various time points represent the same value, the dollar values collected from informants and records should be either inflated or deflated, using the Consumer Price Index, to represent dollar values in a single target year (for example, the analysis and reporting year).
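The inflation adjustment can be sketched as follows. The index values below are placeholders, not actual Consumer Price Index figures, and the function name `to_target_year_dollars` is hypothetical; a real analysis would substitute published CPI values for the years in which costs were recorded and for the target year.

```python
def to_target_year_dollars(amount, cpi_cost_year, cpi_target_year):
    """Convert a dollar amount to target-year dollars using a price index.

    The amount is inflated when the target-year index is higher than the
    index for the year the cost was incurred, and deflated when it is lower.
    """
    return amount * cpi_target_year / cpi_cost_year


# Placeholder index values for illustration only (not real CPI data).
cpi = {"year_1": 100.0, "year_2": 103.0, "year_3": 105.0}

# Hypothetical costs recorded in two different years.
costs = [("year_1", 2000.0), ("year_2", 1030.0)]

# Express every cost in year_3 dollars before summing.
total = sum(to_target_year_dollars(amount, cpi[year], cpi["year_3"])
            for year, amount in costs)
print(round(total, 2))
```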
Finally, cost information sometimes is obtained using two perspectives. First, the actual dollar costs of the curriculum must be obtained. Second, a measure of cost to society would place a value on any volunteer labor or donated space and materials and would add those costs to the actual expended costs. The latter costs would be difficult to add to a study that relies on data obtained without visiting programs. Both cost perspectives could be used in the analysis of the cost-effectiveness of an enhancement, if researchers visit programs for the purpose of conducting data collection.
Approach to the Cost-Effectiveness Analysis. A benefit-cost analysis allows a direct comparison of impacts and costs by converting the impacts into “benefits” that are valued in dollars. A dollar value is placed on the impacts by estimating the market price for the change in resources that results from the impact. For example, a reduction in the use of special education services in the elementary school that results from the quality enhancement can be valued by estimating the resulting reduction in special education costs.
Impacts on the cognitive, social-emotional, and physical development of children, however, are very difficult to value. Children’s development during the preschool years may affect future employment and earnings, involvement with the criminal justice system, and use of public assistance programs, all of which can be quantified. However, as yet, insufficient evidence exists on the relationships between preschool child outcomes and these later outcomes to be able to put a dollar value on changes in the early outcomes. Accordingly, we do not recommend conducting a benefit-cost analysis based on outcome data collected from a single year in Head Start.
Instead, a cost-effectiveness framework can be used to evaluate the costs and benefits of the enhancement. This type of analysis does not attempt to place a dollar value on impacts. Instead, impacts are measured in a common unit, such as an effect size (the impact divided by the standard deviation of the outcome measure). The impact in effect-size units is compared with costs measured in dollars. For each quality enhancement, an effect size per dollar spent on the enhancement can be calculated. For example, if an early literacy curriculum were to produce an impact on vocabulary of 0.3 in effect-size units, and if the cost were estimated to be $10 per child, then the cost-effectiveness of the curriculum would be 0.03 per dollar. Measuring cost-effectiveness in this way enables program administrators to compare the cost of producing impacts that they consider important, using the same metrics, so that enhancements can be assessed in terms of their ability to provide the most “bang for the buck.”
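The cost-effectiveness calculation can be sketched in a few lines. The effect size here is computed as the impact divided by the standard deviation of the outcome, consistent with the chapter's earlier description of an effect size of .20 as one-fifth of a standard deviation. The function names and the raw impact and standard deviation values are illustrative assumptions; only the 0.3 effect size and the $10 cost per child come from the example in the text.

```python
def effect_size(impact, outcome_sd):
    """Express an impact in standard deviation units of the outcome."""
    return impact / outcome_sd


def effect_size_per_dollar(es, cost_per_child):
    """Cost-effectiveness: effect-size units gained per dollar per child."""
    return es / cost_per_child


# Hypothetical raw impact of 4.5 points on a measure with a standard
# deviation of 15 points gives an effect size of 0.3.
es = effect_size(4.5, 15.0)

# The example from the text: effect size of 0.3 at $10 per child.
ratio = effect_size_per_dollar(0.3, 10.0)
print(round(es, 3), round(ratio, 3))
```

Computing the same ratio for each candidate enhancement puts them on a common footing for comparison.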