Department of Health and Human Services
Administration for Children and Families
          

Office of Planning, Research & Evaluation (OPRE)

Appendix A: Methodology

A.1 Sample Design Overview

The Head Start children for FACES 2000 were selected as a two-stage sample. The first stage sampling units were Head Start programs; the second stage units were classes within sampled programs. In each sampled classroom, all children in their first year of Head Start were included in the sample.

A.1.1 Programs

The sampling frame of eligible Head Start programs was constructed from the 1998-1999 Program Information Report (PIR). Migrant and Seasonal Head Start programs, American Indian/Alaska Native Head Start programs, Early Head Start programs, programs in the territories, and programs that do not serve children directly were excluded, resulting in a frame of 1,675 programs. The programs were stratified by Census region (Northeast, Midwest, South, and West), percent minority (above/below 50 percent), and urban/rural status (Metropolitan Statistical Area (MSA) versus non-MSA). These are the same stratification variables used in sampling programs for the FACES 1997 cohorts.

A sample of 45 programs was selected for FACES 2000. The sample size in each stratum was proportional to the stratum’s first-year Head Start enrollment. Within each stratum, the programs were selected with probability proportional to the program’s first-year enrollment using systematic sampling. The first-year enrollment was calculated from the PIR by subtracting the reported second- and third-year enrollment from the total enrollment. A Keyfitz procedure was used to minimize the overlap with the 40 programs sampled for the FACES 1997 cohorts. As a result, there was no overlap with the previous program sample. Of the 45 programs selected, two were later discovered to be ineligible because they had been defunded, resulting in a total of 43 sampled programs for FACES 2000.
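
The program selection step can be sketched in code. The following is a minimal illustration of systematic sampling with probability proportional to size, not the procedure actually run for FACES: the function name, the toy enrollment figures, and the random seed are assumptions, and neither the stratification nor the Keyfitz overlap control is modeled.

```python
import random
from itertools import accumulate


def pps_systematic_sample(sizes, n, rng):
    """Pick n units by systematic sampling with probability proportional to size.

    sizes: measure of size per unit, in frame order (e.g. first-year enrollment).
    Assumes no size exceeds the sampling interval, so no unit is hit twice.
    """
    interval = sum(sizes) / n
    start = rng.random() * interval          # random start in [0, interval)
    cumulative = list(accumulate(sizes))
    selected, unit = [], 0
    for k in range(n):
        point = start + k * interval
        # Advance to the unit whose cumulative size range contains the point.
        while unit < len(cumulative) - 1 and cumulative[unit] <= point:
            unit += 1
        selected.append(unit)
    return selected


# Hypothetical first-year enrollments for seven programs in one stratum.
enrollment = [120, 80, 200, 50, 150, 100, 300]
sample = pps_systematic_sample(enrollment, 3, random.Random(2000))
print(sample)  # indices of the three selected programs
```

Larger programs occupy a wider stretch of the cumulative enrollment scale, so the evenly spaced selection points are more likely to fall on them.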

A.1.2 Classrooms

In the 43 remaining Head Start programs, lists of the anticipated classes for Fall 2000 were obtained in late Summer 2000. The programs also provided the expected number of first-year Head Start children in each class.

These lists formed the basis for the classroom sampling frame, after excluding classes with no first-year children. Classes with fewer than five first-year children expected were combined with another class in the same center to form a “class group.” The class groups were treated as a single unit for sampling purposes and sample size calculations. The total target sample size of first-year children was 2,825 or 66 per program. In general, the desired sample size of classes in each program was determined as 66/(average class size for the program), where the average class size was in terms of the number of first-year children. The actual initial sample size was increased by 2 classes to allow for a reserve sample in each program. In programs where the total first year enrollment, as obtained from the class rosters, was more than twice the measure of size used to sample the program, the initial class sample size was increased to prevent large variation in class weights. In small programs where the initial sample size exceeded the number of classes available, all classes were taken with certainty.
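
As a rough sketch of the class sample size rule, the snippet below computes 66/(average class size), adds the two reserve classes, and caps the result at the number of classes available. Rounding up to a whole class is an assumption (the text does not state the rounding rule), the function name is hypothetical, and the adjustment for programs whose roster enrollment exceeded twice the measure of size is not modeled.

```python
import math

TARGET_CHILDREN_PER_PROGRAM = 66  # target stated in the text
RESERVE_CLASSES = 2               # reserve sample stated in the text


def initial_class_sample_size(first_year_counts):
    """first_year_counts: expected first-year children in each class or class group."""
    n_available = len(first_year_counts)
    average_size = sum(first_year_counts) / n_available
    # Assumption: round the desired number of classes up to a whole class.
    desired = math.ceil(TARGET_CHILDREN_PER_PROGRAM / average_size)
    initial = desired + RESERVE_CLASSES
    # Small programs: take every class with certainty.
    return min(initial, n_available)


# Hypothetical program with 20 classes of 12 first-year children each.
print(initial_class_sample_size([12] * 20))      # 66/12 = 5.5 -> 6, plus 2 reserve = 8
# Hypothetical small program with only three classes.
print(initial_class_sample_size([15, 14, 16]))   # capped at 3
```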

Classes were sorted by center within program and were sampled with equal probabilities. A subsample of the initial sample was selected with equal probabilities to obtain a main sample of the desired sample size and a reserve sample of two classes in each program. A total of 367 classes was selected: 279 classes for the main sample and 88 for the reserve sample. The number of main sample classes in each program varied from 3 to 15, with an average of 6 classes. (In terms of collapsed classrooms, a total of 252 classroom groups was sampled for the main sample and 82 for the reserve sample, for an average of 6 per program and a range of 3 to 10.)

In Fall 2000, the eligibility status of the main sample classes was determined. One or two reserve classes were added in some programs to prevent a shortfall in the target number of first-year children for the study. The final sample for weighting purposes included all sampled classes where an attempt was made to collect data from the classroom, including those discovered to be ineligible. The rationale is that ineligible classes in the sample represent ineligible classes on the Head Start program frame. A total of 307 main and reserve classes were in the sample in Fall 2000. When the programs were contacted by field staff in Fall 2000, 20 of the 307 classes were discovered to be ineligible because they no longer existed, did not receive Head Start funding, or had no first-year Head Start children. In 286 of the remaining 287 eligible classes, all first-year Head Start children were taken into the sample; in the one other class, the teacher refused to allow the children to be sampled.

A.2 Response Rates

Fall 2000

  • 2,508 child assessments were completed out of 2,790, for a completion rate of 90 percent;

  • 2,488 parent interviews were completed out of 2,790 families selected for the sample (89 percent);

  • Teacher report forms were obtained on 2,532 of the 2,790 sample children (91 percent);

  • Assessment, parent, and teacher data were obtained on 2,396 of the 2,790 sample children (86 percent); and

  • A total of 278 classrooms were observed out of 286 in the sample for a completion rate of 97 percent.

Spring 2001

  • 2,232 child assessments were completed out of 2,288, representing 98 percent of the children who remained in the program, and 80 percent of the original sample of 2,790 children;

  • 2,166 parent interviews were completed out of 2,288, representing 95 percent of the children who remained in the program, and 78 percent of the original sample;

  • Teacher report forms were obtained on 2,236 of the sample children, representing 98 percent of the children who remained in the program and 80 percent of the original sample;

  • Assessment, parent, and teacher data were obtained on 2,115 of the 2,288 sample children who remained in the program (92 percent); and

  • A total of 275 classrooms were observed out of 284 in the sample for a completion rate of 97 percent.

A.3 Sampling Weights

There are two main reasons why weights are desirable and are calculated for FACES data. The most important is that not all units have the same probability of selection. In FACES, probabilities of selection vary among sample programs, among sample centers, among sample classes, and among sample children. The first function of weights is to reflect these probabilities of selection. For example, large programs were selected with higher probabilities than small programs, and large centers within a given program were selected with higher probabilities than small centers. Thus, probabilities of selection among centers vary considerably. If center characteristics differ according to the size of the center or the size of the program the center is in, then weighted estimates may differ from unweighted estimates.

A second important reason for using weights is to reduce bias and sampling error. Inevitably, there is some level of nonresponse in all surveys. Use of nonresponse adjustment factors in the weighting process can reduce the bias caused by nonresponse. For example, response rates in FACES differ by gender of child. To the extent that test scores also differ by gender, forming nonresponse cells based on gender reduces the bias caused by differential gender response rates. Also, known population totals can frequently be used as poststratification controls in the weighting, which reduces both bias and sampling error. Poststratification, however, has not been deemed feasible for FACES.

A.3.1 Program Weights

The program weight was calculated as the inverse of the program’s probability of selection. As mentioned earlier, a Keyfitz procedure was used to minimize the overlap with the program sample drawn for FACES 1997. This procedure involved calculating conditional probabilities of selection which are based on whether the program was sampled previously or not, and whether its probability of selection increased compared with the previous sample. Prior to sampling for the FACES 2000 cohort 3, the unconditional probability of selection for each program on the cohort 3 frame was calculated as 

$$\text{NEWPSEL}_i = \frac{\text{NEWSMPSZ}_h \times \text{FIRSTYR}_i}{\sum_{j=1}^{N_h} \text{FIRSTYR}_j}$$

where $N_h$ is the number of programs on the frame in stratum h, $\text{NEWSMPSZ}_h$ is the sample size for stratum h for the cohort 3 design, and $\text{FIRSTYR}_i$ is the first-year enrollment for program i from the PIR. The probability of selection for each program under the Abt sample design for Cohorts 1 and 2 was also calculated as

$$\text{ORIGPSEL}_i = \frac{\text{SAMPSIZE}_h \times \text{ENRTOT}_i}{\sum_{j=1}^{N_h} \text{ENRTOT}_j}$$

where $N_h$ was the number of programs on the Abt frame in stratum h, $\text{SAMPSIZE}_h$ was the sample size for stratum h under the Abt design, and $\text{ENRTOT}_i$ was the total enrollment for the i-th program from an earlier PIR.

The conditional probability of selection was calculated for each program on the cohort 3 frame according to the Keyfitz procedure as:

[Equation: conditional probability of selection for program i under the Keyfitz procedure]
 

These conditional probabilities of selection were the measures of size used to select the cohort 3 program sample. It can be shown that the Keyfitz procedure preserves the unconditional cohort 3 program probabilities of selection, while at the same time minimizing the overlap. Thus the cohort 3 program weight is the inverse of NEWPSEL, the unconditional probability of selection under the cohort 3 design.
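
To illustrate how a program weight follows from its unconditional probability of selection, the sketch below assumes the usual probability-proportional-to-size form (stratum sample size times the program's share of stratum first-year enrollment). The function name and the enrollment figures are hypothetical, and the Keyfitz conditional probabilities are not computed here.

```python
def pps_selection_probabilities(first_year_enrollment, stratum_sample_size):
    """Unconditional selection probability for each program in one stratum,
    assuming selection proportional to first-year enrollment."""
    stratum_total = sum(first_year_enrollment)
    return [stratum_sample_size * e / stratum_total for e in first_year_enrollment]


# Hypothetical stratum: four programs, two to be sampled.
enrollment = [150, 300, 450, 600]
probabilities = pps_selection_probabilities(enrollment, 2)

# The program weight is the inverse of the probability of selection.
program_weights = [1 / p for p in probabilities]

for e, p, w in zip(enrollment, probabilities, program_weights):
    print(f"enrollment={e:4d}  probability={p:.2f}  weight={w:.2f}")
```

Within a stratum the probabilities sum to the stratum sample size, which is a quick sanity check on any implementation.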

All 43 eligible programs cooperated with the study, so that nonresponse adjustments at the program level were unnecessary.

For each program, a set of 43 jackknife replicate weights was created for calculating standard errors. The replicate weights were created using a standard stratified jackknife procedure. One program at a time was dropped (i.e., given a zero replicate weight) and the weights of the remaining programs in the same stratum were adjusted by a factor of $n_h/(n_h - 1)$, where $n_h$ is the number of sampled programs in stratum h. The program weights in the other strata were left unchanged. Repeating this for each of the 43 programs yielded 43 replicate weights for each program. For estimates involving child or classroom data from all 43 programs, the degrees of freedom for the variance of the estimate is #PSUs - #varstrat = 43 - 12 = 31. (One of the 13 original sampling strata was collapsed with an adjacent stratum for variance estimation purposes because it contained only one eligible sampled program.)
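
The replicate construction just described can be sketched as follows. This is an illustration under stated assumptions rather than the production code: the function name, stratum labels, and weights are hypothetical, and every stratum is assumed to contain at least two programs (which is why single-program strata are collapsed first).

```python
from collections import Counter


def jkn_replicate_weights(strata, weights):
    """Build one replicate per program for a stratified (JKn) jackknife.

    strata[i] and weights[i] describe program i.
    Returns a list of replicates; replicates[g][i] is program i's weight
    in the replicate that drops program g.
    """
    programs_per_stratum = Counter(strata)
    replicates = []
    for dropped, dropped_stratum in enumerate(strata):
        n_h = programs_per_stratum[dropped_stratum]
        factor = n_h / (n_h - 1)  # requires n_h >= 2
        replicate = []
        for i, (stratum, weight) in enumerate(zip(strata, weights)):
            if i == dropped:
                replicate.append(0.0)              # dropped program
            elif stratum == dropped_stratum:
                replicate.append(weight * factor)  # same stratum: inflate
            else:
                replicate.append(weight)           # other strata: unchanged
        replicates.append(replicate)
    return replicates


# Hypothetical example: five programs in two strata.
strata = ["A", "A", "A", "B", "B"]
weights = [10.0, 20.0, 30.0, 40.0, 50.0]
replicates = jkn_replicate_weights(strata, weights)
print(replicates[0])  # [0.0, 30.0, 45.0, 40.0, 50.0]
print(replicates[3])  # [10.0, 20.0, 30.0, 0.0, 100.0]

degrees_of_freedom = len(strata) - len(set(strata))  # programs minus strata
print(degrees_of_freedom)  # 3
```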

A.3.2 Classroom Weighting

Two sets of class weights were produced for classroom level estimation: one set for Fall 2000 cross-sectional estimates and a second set for Fall 2000–Spring 2001 longitudinal classroom analysis. Class base weights were first created that reflected the overall probability of selection for the class, including the program probability of selection. These base weights were adjusted for classroom level nonresponse, using the following criteria for a complete classroom:

    Fall 2000 Cross-Sectional Estimates. Classroom must have complete Fall 2000 observation data. Classroom observation data includes counts of children and adults, Assessment Profile (Scheduling, Learning Environment, and Individualizing), ECERS-R, Arnett Caregiver Interaction Scale, Teacher-Directed Activities Checklist and Wrap-Up measures; and

    Fall 2000–Spring 2001 Longitudinal Analysis. Classroom must have complete observation data for either Fall 2000 or Spring 2001 and child assessment data for both Fall 2000 and Spring 2001.

A.3.2.1 Class Base Weights

A class base weight was created for each of the 367 initially sampled classes in Fall 2000. Fifty-four reserve classes that were never used were given base weights of zero. Six main sample classes were sampled out on an ad hoc basis by field staff to reduce burden and to maintain independence between classes. In each of these cases, a teacher had both a morning and an afternoon class in the sample, and one class out of the morning/afternoon pair was retained. The six dropped classes were assigned base weights of zero, since they were not part of the final sample.

The remaining 307 classes considered to constitute the sample were each assigned a class base weight equal to the inverse of their overall probability of selection. The overall probability of selection is the product of the program probability of selection and the probability of selecting the class within the program. The inverse of the overall probability of selection can also be written as the product of the program weight and the within-program class weight:

Class Base Weight = Program Weight * (Total # Classes in Program / # sampled classes fielded).

Collapsed classrooms were counted as one classroom in the base weight calculations, since they were treated as a single unit in sampling. The ad hoc subsampling was reflected by multiplying the base weight of the retained class in the morning/afternoon pair by a factor of 2 and that of the dropped class by zero. One class that had merged with another was given a zero base weight, and the newly merged class had its base weight multiplied by a factor of 0.5 to reflect its increased probability of selection.

Forty-three jackknife class replicate base weights were created from the program replicate weights:

Class Replicate Base Weight j = Program Replicate Weight j * (Total # Classes in Program / # sampled classes fielded); j = 1, 2, …43.
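
A small worked example of the class base weight formula, including the ad hoc adjustment factors, may help. The function name and all numeric inputs below are hypothetical; only the formula and the factors of 2, 0.5, and zero come from the text.

```python
def class_base_weight(program_weight, total_classes, fielded_classes, adjustment=1.0):
    """Program weight times the inverse of the within-program class sampling
    rate, times an optional ad hoc adjustment factor."""
    return program_weight * (total_classes / fielded_classes) * adjustment


# Hypothetical program: weight 40, 24 classes on the frame, 6 fielded.
ordinary = class_base_weight(40.0, 24, 6)                      # 160.0
retained_of_pair = class_base_weight(40.0, 24, 6, 2.0)         # 320.0 (morning/afternoon pair)
dropped_of_pair = class_base_weight(40.0, 24, 6, 0.0)          # 0.0
merged_class = class_base_weight(40.0, 24, 6, 0.5)             # 80.0

print(ordinary, retained_of_pair, dropped_of_pair, merged_class)
```

The same function applied to a program replicate weight in place of the full-sample program weight gives the corresponding class replicate base weight.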

A.3.2.2 Cross-Sectional Fall 2000 Class Weights

Of the 307 sampled classes that were fielded in Fall 2000, 279 were eligible and had complete classroom data, 8 were eligible but did not complete data collection, and 20 were discovered to be ineligible. A class nonresponse adjustment factor was applied to the class base weights of the 279. The nonresponse adjustment factor was computed separately by program. Both the 8 incomplete and the 20 ineligible classes were given a zero final class weight. The classroom replicate base weights were also adjusted for nonresponse by program, so that the sampling variability in the nonresponse adjustments was reflected in the standard error estimates.
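
The text does not spell out the adjustment factor itself. A common choice, assumed here, is the ratio of the summed base weights of all eligible classes in a program to the summed base weights of its eligible completes, so that the completes carry the weight of the incompletes. The function name, status labels, and weights are hypothetical.

```python
def nonresponse_adjusted_weights(base_weights, statuses):
    """Adjust base weights within one adjustment cell (here, one program).

    statuses[i] is 'complete', 'incomplete', or 'ineligible'.
    Assumed factor: (weight of eligible units) / (weight of eligible completes).
    """
    eligible_total = sum(w for w, s in zip(base_weights, statuses) if s != "ineligible")
    complete_total = sum(w for w, s in zip(base_weights, statuses) if s == "complete")
    factor = eligible_total / complete_total
    # Completes are inflated; incompletes and ineligibles get a zero final weight.
    return [w * factor if s == "complete" else 0.0
            for w, s in zip(base_weights, statuses)]


# Hypothetical program with four fielded classes.
base = [100.0, 100.0, 100.0, 100.0]
status = ["complete", "complete", "incomplete", "ineligible"]
print(nonresponse_adjusted_weights(base, status))  # [150.0, 150.0, 0.0, 0.0]
```

Under this assumption the adjusted weights of the completes sum to the base-weight total of the eligible classes in the cell.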

The sum of the nonresponse-adjusted Fall 2000 classroom weights is 34,638. The unweighted and weighted completion rates are both 97 percent, excluding ineligibles from both numerator and denominator. The unweighted and weighted eligibility rates are both 94 percent. The class base weight was used in calculating the weighted rates.

A.3.2.3 Longitudinal Fall 2000–Spring 2001 Class Weights

Of the 286 eligible classes in Fall 2000, 280 completed data collection in Spring 2001. Note that the 279 Fall 2000 classroom completes are not a subset of the 280 Spring 2001 completes. Five classes that completed Fall 2000 data collection did not complete Spring 2001 data collection, and six classes that completed Spring 2001 data collection did not complete Fall 2000 data collection. There were 79 new classes added in Spring 2001 because children who switched classes after the Fall 2000 data collection were followed to the new class. However, no classroom observations were done at these new classes, so they were not considered to be part of the classroom sample and were assigned a zero base weight.

A class nonresponse adjustment factor was applied to the class base weights of the 280 eligible completes. The nonresponse adjustment factor was computed separately by program. The classroom replicate base weights were also adjusted for nonresponse by program, so that the sampling variability in the nonresponse adjustments was reflected in the standard error estimates. The incomplete and ineligible classes, along with the 79 new classes, were given a final class weight of zero.

The sum of the nonresponse-adjusted Fall 2000 – Spring 2001 classroom weights is 34,768. The unweighted and weighted completion rates are both 98 percent, excluding ineligibles from both numerator and denominator. Both unweighted and weighted eligibility rates are 94 percent. The class base weight was used in calculating the weighted rates.

A.3.3 Child Weights

Three sets of child weights were produced: a cross-sectional set for Fall 2000 estimates, a Fall 2000 – Spring 2001 set for base year longitudinal analyses, and a Fall 2000 – Spring 2003 set for longitudinal analysis including the kindergarten school year. Child base weights were first created that reflected the overall probability of selection for the child, including the program and classroom stages of sampling. These base weights were adjusted for child nonresponse, using the following criteria for a complete child case:

    Fall 2000 Cross-Sectional Analysis. A child is considered a complete case if the child has a parent interview from either Fall 2000 or Spring 2001 and a Fall 2000 child assessment or teacher rating;

    Fall 2000 – Spring 2001 Longitudinal Analysis. A child is considered a complete case if the child has either a Fall 2000 or Spring 2001 parent interview and one of the following data pairs: a child assessment for both Fall 2000 and Spring 2001, or a teacher rating for both Fall 2000 and Spring 2001; and

    Fall 2000 – Spring 2003 Longitudinal Analysis. A child is considered a complete case if there was at least one parent interview, an assessment while the child was in Head Start, either in Fall 2000 or Spring 2001, and an assessment while the child was in kindergarten, either Spring 2002 or Spring 2003.

A.3.3.1 Child Base Weights

In 286 eligible Fall 2000 classes, all eligible children in their first year of Head Start were included in the sample with certainty. A base weight was created for each child as the product of their program weight and nonresponse-adjusted classroom weight. Note that these nonresponse adjusted class weights are not the same as those described earlier, which were designed for use in classroom level analyses. The creation of special classroom weights for the child weights was necessary because there were eligible classrooms that did not have complete classroom observations, but did allow their children to be sampled, and vice versa. To create this special classroom weight, the classroom base weight was adjusted for classes which had eligible children but where “sampling” of children did not take place. This nonresponse-adjusted classroom weight was then used in calculating the child base weight. Since there was no subsampling of children within classrooms, the within-classroom child weight is equal to one and the overall child weight can be written as:

Child Base Weight = Program Weight * Nonresponse-adjusted Classroom Weight.

A set of 43 jackknife (JKn) replicate base weights was also created for each child using the program replicate weights and the special full-sample nonresponse-adjusted classroom weight:

Child Replicate Base Weight j = Program Replicate Weight j * Nonresponse-adjusted Classroom Weight; j = 1, 2, …43.

A.3.3.2 Child Fall 2000 Cross-Sectional Weights

Of the 3,100 children in the Fall 2000 sample, 2,535 were considered complete for the Fall 2000 data collection, 251 were eligible but incomplete (30 of these had assessments but no parent interview), and 314 were ineligible. Children could be ineligible if either they came from classrooms that were ineligible or they were discovered to be in their second year of Head Start or were otherwise ineligible when Fall 2000 data collection began.

The child base weights of the eligible, complete children in each classroom were adjusted for nonresponse separately by classroom. The ineligible and incomplete children were given a zero final child weight and were dropped from the sample for the Spring 2001 data collection. The replicate child base weights were also adjusted for nonresponse by classroom, so that the sampling variability in the nonresponse adjustments was reflected in the standard error estimates.

The sum of the nonresponse-adjusted Fall 2000 child weights is 337,247. The unweighted and weighted completion rates are both 91 percent, excluding ineligibles from both the numerator and denominator. The unweighted and weighted eligibility rates are 90 percent and 91 percent, respectively. The child base weight was used in calculating the weighted rates.
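
The unweighted rates quoted above can be reproduced from the Fall 2000 case counts given earlier in this section (2,535 complete, 251 eligible but incomplete, 314 ineligible). The function below is a sketch with a hypothetical name; passing child base weights instead of ones would give the weighted rates, which cannot be reproduced here because the individual weights are not part of this appendix.

```python
def completion_and_eligibility_rates(cases):
    """cases: iterable of (weight, status) pairs, status being
    'complete', 'incomplete', or 'ineligible'.
    Returns (completion rate, eligibility rate)."""
    totals = {"complete": 0.0, "incomplete": 0.0, "ineligible": 0.0}
    for weight, status in cases:
        totals[status] += weight
    eligible = totals["complete"] + totals["incomplete"]
    completion = totals["complete"] / eligible           # ineligibles excluded
    eligibility = eligible / (eligible + totals["ineligible"])
    return completion, eligibility


# Unweighted: every sampled child counts once.
counts = {"complete": 2535, "incomplete": 251, "ineligible": 314}
unweighted_cases = [(1.0, status) for status, n in counts.items() for _ in range(n)]
completion, eligibility = completion_and_eligibility_rates(unweighted_cases)
print(f"{completion:.0%} {eligibility:.0%}")  # prints 91% 90%
```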

A.3.3.3 Child Fall 2000 – Spring 2001 Longitudinal Weights

In Spring 2001, the eligible first-year children were again given assessments and a teacher rating, and an attempt was made to interview the child’s parent(s). Of the 2,535 eligible children who had completed Fall 2000 data collection, 2,359 were eligible, complete cases for the Fall 2000 – Spring 2001 data collection; 171 were eligible incompletes; and five became ineligible because they moved out of the area.

Children who had switched to new classes in the Spring 2001 were followed up, but classroom observations were not done at the new classes. There were 91 children from the Fall 2000 sample who were followed to 79 new classrooms in Spring 2001. In calculating their base weights, these children were given the classroom probability of selection associated with the classroom from which they were originally sampled in Fall 2000.

The child base weights of the eligible, complete children in each classroom were adjusted for nonresponse separately by classroom. The ineligible and incomplete children were given a zero final child weight. The replicate child base weights were also adjusted for nonresponse by classroom, so that the sampling variability in the nonresponse adjustments was reflected in the standard error estimates.

The sum of the nonresponse-adjusted Fall 2000 – Spring 2001 child weights is 338,047. The unweighted and weighted conditional Spring 2001 completion rates are both 93 percent. The conditional rate is the percent of Fall 2000 eligible completes who also completed the Spring 2001 data collection. The overall (unconditional) completion rate is the product of the completion rates for the Fall 2000 and Spring 2001 data collections: 91 percent * 93 percent = 85 percent. This rate is the percent of eligible, sampled children in Fall 2000 that completed the Spring 2001 data collection.
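
The unweighted conditional and overall rates can be checked from the case counts reported in this section. The variable names are illustrative only, and the five children who became ineligible are excluded from the conditional denominator, as the text implies.

```python
# Counts taken from the text.
fall_complete = 2535      # Fall 2000 eligible completes
fall_incomplete = 251     # Fall 2000 eligible incompletes
spring_complete = 2359    # complete for Fall 2000 - Spring 2001
spring_incomplete = 171   # eligible incompletes in Spring 2001

fall_rate = fall_complete / (fall_complete + fall_incomplete)
conditional_rate = spring_complete / (spring_complete + spring_incomplete)
overall_rate = fall_rate * conditional_rate

print(f"Fall 2000 completion:    {fall_rate:.0%}")         # 91%
print(f"Conditional Spring 2001: {conditional_rate:.0%}")  # 93%
print(f"Overall (unconditional): {overall_rate:.0%}")      # 85%
```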

A.3.3.4 Child Fall 2000 – Spring 2003 Longitudinal Weights

For the year in which the children were in kindergarten, either Spring 2002 or Spring 2003, the children were again given assessments and a teacher rating, and an attempt was made to interview the child’s parent(s). No classroom observations were done for these children. Of the 2,535 eligible children who had completed Fall 2000 data collection, 1,895 were respondents again for kindergarten data collection, either Spring 2002 or Spring 2003, and 640 were nonrespondents.

The child base weights of the eligible, responding children in each classroom were adjusted for nonresponse separately by classroom. Classrooms were collapsed within centers and within programs when necessary to prevent excessively large nonresponse adjustment factors. The ineligible and nonresponding children were given a zero final child weight. The replicate child base weights were also adjusted for nonresponse by classroom.

The sum of the nonresponse-adjusted Spring 2002-2003 child weights is 337,247. The unweighted completion rate is 68 percent, i.e., 68 percent of eligible sampled children were Spring 2002-2003 (kindergarten) respondents. The weighted completion rate is 69 percent.

A.4 Variance Estimation

Estimates obtained from the FACES sampled children will differ from the true population parameters because they are based on a randomly chosen subset of the population, rather than on a complete census of all Head Start children. This type of error is known as sampling error or variance. The precision of an estimate is measured by the standard error (defined as the square root of the variance). The calculation of the standard error must reflect not only the sample size on which the estimate is based, but the manner in which the sample was drawn. Otherwise, the standard errors can be misleading and result in incorrect confidence intervals and p-values in hypothesis testing. The FACES sampling involved stratification, clustering, and unequal probabilities of selection, all of which must be reflected in the standard error calculations.

Jackknife replication is a commonly used variance estimation method for complex surveys involving multi-stage sampling (Wolter, 1985). Replication methods work by dividing the sample into subsample replicates that mirror the design of the sample. A weight is calculated for each replicate using the same procedures as for the full-sample weight. This produces a set of replicate weights for each sampled child. To calculate the standard error of a survey estimate, the estimate is first calculated for each replicate using the replicate weight and the same form of estimator as for the full sample. The variation among the replicates is then used to estimate the variance for the full sample estimate. Replication has the advantage that it can reflect the different features of the weighting and estimation by simply repeating all steps separately for each replicate.

For each child, a set of 40 jackknife replicate weights was created for calculating standard errors. The replicate weights were created using a standard stratified jackknife (“JKn”) procedure. One Head Start program at a time was dropped (i.e., given a zero replicate weight) and the weights of children in the remaining programs in the same stratum were adjusted by an inflation factor to account for the reduction of the sample. The weights in the other strata were left unchanged. By repeating this for each of the 40 sampled programs, 40 replicate weights were obtained for each child. The replicate weights are used in the formula below to calculate the variance:

$$v(\hat{\theta}) = \sum_{g=1}^{G} \frac{n_h - 1}{n_h} \left( \hat{\theta}_{(g)} - \hat{\theta} \right)^2$$

where $\hat{\theta}_{(g)}$ is the estimate of θ based on the observations included in the g-th replicate (i.e., using the g-th replicate weight),
$\hat{\theta}$ is the estimate of θ based on the full sample, G is the total number of replicates formed, $n_h$ is the number of programs in stratum h (the stratum of the program dropped in the g-th replicate),
and $(n_h - 1)/n_h$ is the “JKn” factor. When creating a var file in WesVar, the JKn factors must be imported using the “Attach Factors” option in the Data menu. For estimates involving data from all 40 programs, the degrees of freedom for the variance of the estimate is # programs - # strata = 40 - 10 = 30. (Six of the original 14 sampling strata were collapsed for variance estimation purposes because they contained only one eligible sampled program, resulting in 10 pseudo strata.)
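
A minimal sketch of the JKn variance calculation for a weighted mean is given below. It is not WesVar's implementation: the function names and toy data are hypothetical, and each of the four illustrative programs contributes a single observation so that the arithmetic is easy to follow.

```python
import math


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def jkn_variance(values, full_weights, replicate_weights, jkn_factors):
    """Return (full-sample estimate, JKn variance estimate).

    replicate_weights[g] is the g-th set of replicate weights and
    jkn_factors[g] is (n_h - 1) / n_h for the stratum of the dropped program.
    """
    theta = weighted_mean(values, full_weights)
    variance = 0.0
    for weights_g, factor_g in zip(replicate_weights, jkn_factors):
        theta_g = weighted_mean(values, weights_g)
        variance += factor_g * (theta_g - theta) ** 2
    return theta, variance


# Toy data: two strata with two programs each, full-sample weights of 1.
values = [10.0, 20.0, 30.0, 40.0]
full_weights = [1.0, 1.0, 1.0, 1.0]
replicate_weights = [
    [0.0, 2.0, 1.0, 1.0],  # drop program 0, inflate its stratum mate by 2/1
    [2.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 2.0],
    [1.0, 1.0, 2.0, 0.0],
]
jkn_factors = [0.5, 0.5, 0.5, 0.5]  # (2 - 1) / 2 in every stratum

theta, variance = jkn_variance(values, full_weights, replicate_weights, jkn_factors)
print(theta, variance, math.sqrt(variance))  # estimate 25.0, variance 12.5
```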

A.5 Data Collection Instruments

A.5.1 Direct Child Assessment

The child assessment was an essential component of FACES. It provided direct measures of how well Head Start programs were achieving the goals of assisting children to be physically, socially, and educationally ready for success in kindergarten. The assessment battery was composed of a short series of tasks that were feasible and interesting for preschoolers and kindergartners to carry out, and that have been shown to be predictive of later school achievement or learning difficulties. The areas of the FACES assessment battery included vocabulary development, emerging literacy (recognizing letters of the alphabet, hearing similarities in the sounds at the beginnings or ends of different words, showing familiarity with printed words and story books), emerging numeracy (counting, adding, or taking away blocks to show a given number), perceptual-motor development (drawing copies of simple geometric figures), and social and communicative competence (telling basic facts about self and family to another person). These tasks were drawn or adapted from well-established and widely used instruments.

A Spanish version of the assessment battery was also developed for assessing children whose primary language was Spanish. Spanish versions of the measures, when available (e.g., Test de Vocabulario en Imagenes Peabody, Woodcock-Muñoz Pruebas de Aprovechamiento-Revisada), were employed in the Spanish battery. Otherwise, the English versions of the measures (e.g., one-to-one counting, social awareness, color names) were directly translated.

A screener was used to determine whether English-language learners were to be administered the direct child assessment battery in English. The screener drew on information provided by teachers and assessors to determine the language of administration. In Fall 2000, English-language learners who were determined to be primarily Spanish-speaking received the entire direct child assessment battery in Spanish (e.g., TVIP, Woodcock-Muñoz Letter-Word Identification, Applied Problems, Dictation). They were also administered the PPVT and Woodcock-Johnson Letter-Word Identification in English. In Spring 2001 and Spring 2002 (for children who were in Head Start for 2 years), these same children received the entire direct child assessment battery in English. They were also administered the TVIP and Woodcock-Muñoz Letter-Word Identification in Spanish for the purpose of comparison. The children who had been administered assessments in Spanish and English in Fall 2000, and in English with some Spanish sections in Spring 2001 and Spring 2002 (Head Start only), were administered the entire assessment in English during the spring of their kindergarten year (either Spring 2002 or Spring 2003).

In Fall 2000, English-language learners who were determined to primarily speak a language other than Spanish did not receive any portion of the direct child assessment battery in their native languages and were assessed in English, if possible. In Spring 2001, Spring 2002, and Spring 2003 (if applicable), these same children received the entire direct child assessment battery in English, if possible.

Norm Referenced Cognitive Tests

In these assessment tasks, norms are available for a nationally representative sample of U.S. children of the same age (including children from all family income groups). Scores that factor in the age of the children, known as standard or scale scores, are reported throughout this report. These standard scores are adjusted for the child’s age and constructed to have an overall mean and standard deviation of specific values, e.g., a mean of 100 and a standard deviation of 15. This makes it possible to compare children from Head Start programs with other children nationwide on specific school readiness measures, and to gauge the growth and progress of Head Start children in these domains relative to all others in the nation. However, most of the results detailed in Appendix A, Tables A-2 through A-18, are for raw scores only.
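
To make the idea of a standard score concrete, the sketch below rescales a raw score against the mean and standard deviation of a child's age group in the norming sample. This linear conversion is a simplification assumed for illustration; published tests typically supply age-specific lookup tables, and the function name and norm values here are hypothetical.

```python
def standard_score(raw, norm_mean, norm_sd, scale_mean=100.0, scale_sd=15.0):
    """Rescale a raw score so the norming age group has the given mean and SD."""
    z = (raw - norm_mean) / norm_sd
    return scale_mean + scale_sd * z


# Hypothetical norms for one age group: raw mean 50, raw SD 10.
print(standard_score(50, 50, 10))  # 100.0, exactly at the national average
print(standard_score(40, 50, 10))  # 85.0, one standard deviation below it
```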

A.5.1.1 Peabody Picture Vocabulary Test – Third Edition – Revised

The Peabody Picture Vocabulary Test (PPVT-III) (Dunn and Dunn, 1997) is designed to assess children's knowledge of the meaning of words by asking them to say or indicate by pointing which of four pictures best shows the meaning of a word that is said aloud by the assessor. A series of words is presented, ranging from easy to difficult for children of a given age, each accompanied by a picture plate consisting of four line drawings. The test takes about 10 minutes to administer. It is suitable for a wide range of ages from 2½ through adulthood and has established age norms based on a national sample of 2,725 children and adults tested at 240 sites across the U.S.

The PPVT-III has been extensively revised from earlier versions of the test. These improvements were undertaken to promote easier testing and more accurate scoring. Also, new drawings have been added and dated illustrations dropped so as to achieve better gender and ethnic balance. Individual test items that showed statistical bias by race or ethnicity, gender, or region were deleted from the item pool for the scale prior to standardization.

PPVT-III scores have high reliability, with the test publisher reporting internal-consistency reliability (alpha) coefficients ranging from .92 to .98, with a median of .95, and test-retest reliability ranging from .91 to .94. The alpha coefficients for the PPVT-III results from FACES were .97 for Fall 2000, Spring 2001, and Spring 2002/2003 (Kindergarten), and .96 for Spring 2002 (Head Start).
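For readers unfamiliar with the internal-consistency (alpha) coefficients reported throughout this appendix, the following is a minimal sketch of the standard Cronbach's alpha formula. The function name and the toy data are illustrative assumptions; this is not the code used to produce the FACES estimates.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows (one row of item scores per child)."""
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across children.
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of the total (summed) score across children.
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Toy data (invented): three items that rank children identically.
toy = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
alpha = cronbach_alpha(toy)
```

Items that order children identically, as in the toy data, give an alpha of approximately 1; the coefficient falls as the items agree less with one another.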

A Spanish-language test, the Test de Vocabulario en Imágenes Peabody (TVIP), is also available, but has not been updated to be directly comparable to the PPVT-III. For FACES, the TVIP was used with children whose primary language was Spanish. In the FACES data, the TVIP was highly reliable, with internal-consistency alpha coefficients of .92 for both Fall 2000 and Spring 2001, and .94 for Spring 2002 (Head Start).

A.5.1.2 Woodcock-Johnson Psycho-Educational Battery – Revised

The updated edition of the Woodcock-Johnson Battery (WJ-R) is a carefully constructed and widely used test battery. The set of individually administered tests is designed to assess the intellectual and academic development of individuals from preschool through adulthood (Woodcock and Johnson, 1989; Salvia and Ysseldyke, 1991). FACES used three subtests from the Achievement Battery that together constitute an "Early Development -- Skills" cluster, according to the test developers. The cluster is comprised of the Letter-Word Identification, Applied Problems, and Dictation tests. The same three subtests of the Spanish version (Woodcock-Muñoz Pruebas de Aprovechamiento-Revisada) were used in the Spanish version of the FACES assessment battery.

Letter-Word Identification. The first five Letter-Word Identification items involve symbolic learning, or the ability to match a rebus (pictographic representation of a word) with an actual picture of the object. The remaining items measure children's reading identification skills in identifying isolated letters and words that appear in large type on the pages of the test book. As well as being part of the Early Development cluster, this subtest is also part of the Basic Reading Skills cluster. The internal reliability of the Letter-Word Identification subtest with preschool age children averages .92 (Woodcock and Johnson, 1989). The internal reliability of this subtest with FACES children averaged .84 for Fall 2000, and .86 for Spring 2001 and Spring 2002 (Head Start). The internal reliability of the Spanish version of this subtest (Woodcock Munoz) was .75 for Fall 2000, .78 for Spring 2001, and .83 for Spring 2002 (Head Start).

Applied Problems. This subtest measures children's skill in analyzing and solving practical problems in mathematics. In order to solve the problems, the child must recognize the procedure to be followed and then perform relatively simple counting, addition or subtraction operations. Because many of the problems include extraneous stimuli or information, the child must also decide which data to include in the count or calculation. As well as being part of the Early Development cluster, the subtest is also part of a Broad Mathematics cluster. The internal reliability of the Applied Problems subtest with preschool age children averages .91 (Woodcock and Johnson, 1989). The internal reliability of this subtest with FACES children averaged .90 for Fall 2000, .91 for Spring 2001, .89 for Spring 2002 (Head Start), and .88 for Spring 2002/2003 (Kindergarten). The internal reliability of the Spanish version of this subtest (Woodcock Munoz) was .85 for Fall 2000.

Dictation. The first six items in this subtest measure prewriting skills, such as drawing lines and copying letters. The remaining items measure the child's skill in providing written responses when asked to write specific upper- or lower-case letters of the alphabet. Later parts of the test ask the child to write specific words and phrases, punctuation, and capitalization. The internal reliability of the Dictation subtest with preschool age children averages .90 (Woodcock and Johnson, 1989). The internal reliability of this subtest with FACES children averaged .77 for Fall 2000, Spring 2001, and Spring 2002/2003 (Kindergarten), and .71 for Spring 2002 (Head Start). The internal reliability of the Spanish version of this subtest (Woodcock Munoz) was .77 for Fall 2000.

A.5.1.3 Leiter International Performance Scale – Revised (Leiter-R) – Attention Sustained

The Leiter-R by Roid and Miller (1997) assesses cognitive function in children and adolescents. The battery includes measures of nonverbal intelligence in fluid reasoning and visualization, as well as appraisals of visuospatial memory and attention. In Spring 2001, the Leiter-R Attention Sustained (AS) Subtest was added to the FACES direct child assessment battery to permit assessments of children’s visuospatial memory and attention. The subtest is primarily nonverbal and is administered in two subsections – the first being for those 2-3 years of age and the second being for those 4-5 years of age. Assessors provide minimal instructions throughout the administration of the Leiter-R AS. Children are presented with a series of pages containing pictures and are instructed to mark off all pictures that resemble a reference picture. The assessor times the child, with times ranging from 30 seconds to 120 seconds allotted for completion of the tasks. The internal reliability of the Leiter-R Attention Sustained subtest for 2-3 year olds and 4-5 year olds is .83 (Roid and Miller, 1997). The internal reliability of this subtest with FACES children, by age groupings, averaged .71 for 3 year olds and .81 for 4-5 year olds in Spring 2001 and .80 for 4-6 year olds in Spring 2002 (Head Start).

A.5.1.4 Test of Language Development (TOLD) – Primary – 3rd Edition – Phonemic Analysis (Kindergarten Only)

The Phonemic Analysis test from the Test of Language Development (TOLD; Newcomer and Hammill, 1997) is a supplemental subtest designed to assess children’s awareness of phonemes, that is, the significant speech sounds that comprise words. For this test, the child is presented with a compound word, and then asked to repeat part of the word’s component phonemes back to the assessor (e.g., “Say ‘popcorn.’ Now say it again without ‘pop’.”). The internal reliability of this supplemental subtest with FACES children averaged .96 for Spring 2002/2003 (Kindergarten).

A.5.1.5 ECLS-K Reading and General Knowledge Scales (Kindergarten Only)

In the Early Childhood Longitudinal Study-Kindergarten cohort (ECLS-K), the Reading scale taps a variety of skills that indicate reading ability, including familiarity with print, recognition of letters and phonemes, vocabulary, and reading comprehension skills (e.g., children’s understanding of the text), as well as their personal reflection and critical evaluation of the text. The General Knowledge scale taps skills in the natural sciences (e.g., their conceptual understanding of why things occur as they do, and their ability to pose questions and investigate answers in the natural sciences) and social studies (e.g., their basic knowledge of History, Government, and Culture). Both scales follow the guidelines of the 1996 National Assessment of Educational Progress, have been reviewed by curriculum experts, as well as elementary school teachers, and have been found to be both reliable and valid measures of reading achievement and basic knowledge acquisition.1

The Reading assessment was administered in two stages. First, a routing test was administered to estimate the child's reading ability. Based on his/her performance on the routing test (either “high,” “medium,” or “low”), an appropriate “second stage” test was administered. The Reading assessment had three levels of second stage tests: low (red), medium (yellow), and high (blue). For the General Knowledge assessment, each child was administered only the routing test. Estimates of reliability with FACES data, as measured by Cronbach’s coefficient alpha, were as follows: 1.) for Reading – Spring 2002/2003 (Kindergarten) – Routing = .87, Red (Low Form) = .95, Yellow (Middle Form) and Blue (High Form) = .94, and 2.) for General Knowledge – Spring 2002/2003 (Kindergarten) = .77.
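The two-stage design described above can be sketched as a simple routing rule. The cut scores in this example are hypothetical placeholders, since the actual routing thresholds are not given in this appendix, and the function name is likewise illustrative.

```python
def second_stage_form(routing_score, low_cut, high_cut):
    """Pick the second-stage Reading form from a routing-test score.

    low_cut and high_cut are hypothetical cut scores; the actual routing
    thresholds are not stated in this appendix.
    """
    if low_cut >= high_cut:
        raise ValueError("low_cut must be below high_cut")
    if routing_score < low_cut:
        return "red"      # low form
    if routing_score < high_cut:
        return "yellow"   # medium form
    return "blue"         # high form


# Hypothetical cut scores of 5 and 10 on the routing test.
form = second_stage_form(routing_score=7, low_cut=5, high_cut=10)
```

Routing of this kind keeps the second-stage items close to the child's ability level, which shortens the assessment without requiring every child to attempt every item.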

Criterion Referenced Cognitive Tasks

These tasks cover areas of basic knowledge, verbal, mathematical, and perceptual-motor skills that children typically learn in the preschool and kindergarten years and are often included in assessments of children’s school readiness and progress. These tasks do not have national norms. All results referenced in Table A-2a through A-3c and throughout the entire Report for these Tasks pertain to raw scores.

A.5.1.6 McCarthy Scales of Children’s Abilities

The McCarthy Scales of Children's Abilities is a widely used and well-documented test battery. FACES employed one subtest from the battery, the Draw-A-Design Task. The Draw-A-Design Task was used to assess children's perceptual-motor skills. This task asks the child to draw copies of a series of increasingly complex geometric figures. For FACES, this task was directly translated as part of the Spanish version of the assessment. The FACES reliabilities for the McCarthy Draw-A-Design measure were .58 for Fall 2000, .70 for Spring 2001, and .72 for Spring 2002 (Head Start). The FACES reliability for the Spanish version of this measure for Fall 2000 was .57.

A.5.1.7 Story and Print Concepts

The Story and Print Concepts task was an adaptation of earlier prereading assessment procedures developed by Marie Clay (1979), William Teale (1988, 1990), and Mason and Stewart (1989). In these procedures, a child is handed a children’s storybook (FACES Battery - Where’s My Teddy? (Alborough, 1992) or ¿Dónde Está Mi Osito? (Alborough, Castro, Trans. 1992)) upside down and backwards. The assessor asks a series of questions designed to test the child's knowledge of books. These include questions regarding the location of the front of the book, the point at which one should begin reading, and information relating to the title and author of the book. The assessor reads the story to the child and asks basic questions about both the mechanics (print conventions) of reading and the content (comprehension) of the story. The print conventions questions pertain to children's knowledge of the left-to-right and up-and-down conventions of reading, while the comprehension questions pertain to children's recall of key facts from the story. Additionally, for FACES, questions were added tapping rhyming awareness (e.g., "I'll say some words from the story and you tell me whether they rhyme, OK - bawl and small, etc.") and phonological awareness (e.g., "What word would be left if I took “teh” away from Ted?"). These additions were only included in the Fall 2000 direct child assessment battery. The FACES reliabilities for Fall 2000, Spring 2001, and Spring 2002 (Head Start) were as follows: Book Knowledge (.57, .59, and .61); Print Conventions (.73, .75, and .84); and Comprehension (.43, .42, and .40). The reliabilities for the Spanish version of these measures for Fall 2000 were .43 for Book Knowledge, .59 for Print Conventions, and .39 for Comprehension.

A.5.1.8 Social Awareness

This measure was adapted from a subtest of the Comprehensive Assessment Program (CAP) Early Childhood Diagnostic Instrument used by Snow et al. (1995) among others to test children’s general knowledge and awareness of the social environment. The child is asked to give his/her “full name,” which includes both first and last name, his/her age (either verbally, which is given full credit or by holding up the correct number of fingers, which is given partial credit) and month/day of birth. The FACES reliabilities for the Social Awareness measure were .63 for Fall 2000, .62 for Spring 2001, .62 for Spring 2002 (Head Start), and .65 for Spring 2002/2003 (Kindergarten). The FACES reliability for the Spanish version of the Social Awareness measure for Fall 2000 was .36.

A.5.1.9 Color Names and One-to-One Counting

This was also a subtest of the CAP Early Childhood Diagnostic Instrument used by Snow et al. (1995) and developed by Marie Clay (1979), William Teale (1988, 1990) and Mason and Stewart (1989) as a battery of emergent literacy and school readiness measures. For the FACES battery, 10 teddy bears of different colors are presented randomly arranged on a page and the child is asked to point to each in turn and name the color. Following the Color Names task, the child is asked to count the bears and the assessor marks the final number the child arrives at when finished counting (correct answer is “10”). After this, the child is asked to report the total number of bears. The verbatim response is then recorded. Following these questions, the assessor must rate the child’s one-to-one counting performance using a 5-point scale. At the extremes, a score of 5 indicated that the child made no mistakes and a score of 1 indicated that the child could not count or did not try to count. The FACES reliabilities for the Color Naming task were .95 for Fall 2000, .94 for Spring 2001, and .90 for Spring 2002 (Head Start). The FACES reliability for the Spanish version of the Color Naming task for Fall 2000 was .92.

A.5.1.10 Writing Name (Kindergarten Only)

The Writing Name task was designed to assess the child’s ability to write his or her first or last name correctly.

A.5.1.11 Interviewer Ratings

At the end of the one-on-one testing sessions with the children, the assessor completes a set of rating scales evaluating the child’s behavior in the test situation, including the child’s approaches to learning and problem behaviors. There are two sections to these ratings. The first consists of eight scales rating the child’s response during the assessment on eight different domains: task persistence, attention span, body movement, attention to directions, comprehension of directions, verbalization, ease of relationship, and the child’s level of confidence. Ratings use 4-point scales with descriptive anchors at each point. For example, the “task persistence” scale consists of the following anchor points: persists with task (4), attempts task briefly (3), attempts task after much encouragement (2), refuses (1). The FACES reliabilities for the Interviewer Ratings were .82 for Fall 2000, .80 for Spring 2001, .70 for Spring 2002 (Head Start), and .75 for Spring 2002/2003 (Kindergarten). The FACES reliability for the Spanish version of the Interviewer Ratings for Fall 2000 was .77.

The second section asks the assessor to indicate any special concerns regarding the child’s ability to complete the assessment: responding nonverbally, using nonstandard English such as dialect, speaking English as a second language, having limited English proficiency, experiencing difficulty hearing or seeing the assessor/test materials, or reporting the child’s speech was difficult to understand. These items use 3-point ratings to indicate the degree to which the child displayed any of these characteristics (i.e., “not at all,” “somewhat,” and “very much”).

Table A-1 summarizes the modifications to the assessment battery from the Fall 2000 through the Spring 2002-2003 kindergarten followup.

A.5.2 Classroom Observation Instruments

In FACES 2000, quality was considered to include not only the number of children and adults in each classroom, but process factors such as the availability of learning materials, the types of classroom activities, scheduling and the variety of learning opportunities provided to all children. Lead teachers2 in Head Start classrooms were also interviewed to collect teacher background information (experience and qualifications) as well as more detailed information about their curriculum, classroom activities, and attitudes and knowledge about early childhood education practices.

A.5.2.1 Counts of Children and Adults, Child-Adult Ratio

The Counts of Children and Adults provide information needed to calculate child-adult ratios and for other calculations to be used in assessing specific measures of classroom quality. Classroom observers counted the number of children, the number of adults and the number of paid staff at two separate time periods during the classroom day. The two occasions were separated by at least one hour and involved one structured (teacher-directed) and one unstructured activity. The child-adult ratio is calculated as the average number of children per adult (both paid and volunteer) across the two observations. A related measure, the child-staff ratio, was calculated using only the number of paid staff across the two observations. Higher child-adult or child-staff ratios are indicative of lower quality.
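The two ratios can be illustrated with a short sketch. It assumes that "average across the two observations" means the mean of the per-observation ratios (the text could also be read as a ratio of averaged counts); the class name, field names, and counts are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    children: int
    adults: int       # all adults present, paid and volunteer
    paid_staff: int   # paid staff only


def mean_ratio(observations, adult_field):
    """Average the children-per-adult ratio over the observation periods."""
    ratios = [obs.children / getattr(obs, adult_field) for obs in observations]
    return sum(ratios) / len(ratios)


# Two invented observation periods for one classroom.
visits = [Observation(18, 3, 2), Observation(16, 4, 2)]
child_adult_ratio = mean_ratio(visits, "adults")       # (18/3 + 16/4) / 2
child_staff_ratio = mean_ratio(visits, "paid_staff")   # (18/2 + 16/2) / 2
```

Because volunteers count toward the first ratio but not the second, the child-staff ratio is never lower than the child-adult ratio for the same classroom.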

A.5.2.2 Assessment Profile for Early Childhood Programs: Research Edition I

The Assessment Profile for Early Childhood Programs: Research Edition I (Abbott-Shim and Sibley, 1987) is a structured observation guide designed to provide a quantitative assessment of classrooms and teaching practices that facilitate the learning and development of children. Three subscales were used in FACES: Scheduling, Learning Environment, and Individualizing.

Table A-1. Summary of measures administered from Fall 2000 through Spring 2003

Fall 2000 (Head Start) | Spring 2001 and Spring 2002 (Head Start) | Spring 2002 and Spring 2003 (kindergarten)
Social Awareness | Social Awareness | Social Awareness
PPVT-III / TVIP | PPVT-III / TVIP | PPVT-III
McCarthy Draw-A-Design | McCarthy Draw-A-Design | TOLD – Primary – 3rd: Phonemic Analysis
Color Names and Counting | Leiter-R AS (Attention Sustained) Subtest | Reading: ECLS-K Routing and Second Stage Sections
Woodcock Johnson R (Munoz): Letter-Word Identification | Color Names and Counting | Woodcock Johnson R: Applied Problems
Woodcock Johnson R (Munoz): Applied Problems | Woodcock Johnson R (Munoz): Letter-Word Identification | Woodcock Johnson R: Dictation
Woodcock Johnson R (Munoz): Dictation | Woodcock Johnson R: Applied Problems | Writing Name
Story and Print Concepts | Woodcock Johnson R: Dictation | General Knowledge: ECLS-K Routing
Interviewer Rating: Assessment Behavior | Story and Print Concepts | Interviewer Rating: Assessment Behavior
(none) | Interviewer Rating: Assessment Behavior | (none)

The Scheduling subscale assesses the written plans for classroom scheduling and how classroom activities are implemented. The appropriateness and completeness of the classroom activity plan are also noted. The subscale also assesses the balance and variety of learning contexts (e.g., individual, small group, and large group) and learning opportunities (i.e., child- vs. teacher-directed and active vs. quiet activities). The 14 observation items are scored in a yes/no format. High scores on this measure are indicative of a teacher who uses a “planful” approach to classroom activities. The reliability of the Scheduling subscale was reported as .89 for Fall 2000, .87 for Spring 2001, and .82 for Spring 2002 (Head Start).

The Learning Environment subscale focuses on the accessibility of a variety of learning materials to children in the classroom. Variety is assessed across various conceptual areas, such as science, math, language, fine motor, etc. and also within each conceptual area. The subscale also assesses how classroom space is arranged to determine whether the classroom encourages independence (e.g., whether the learning materials are located on low shelves and clearly labeled) and reflects the child as an individual. When materials are both available and accessible, and in sufficient numbers (typically a minimum of three in each group), the item is given a positive score. High scores on this 7-item measure indicate a “learning rich” environment filled with toys and learning materials that address a variety of developmental domains. The reliability of the Learning Environment subscale was reported as .68 for Fall 2000, .77 for Spring 2001, and .65 for Spring 2002 (Head Start).

The Individualizing subscale is based on a scale from the Assessment Profile for Early Childhood Programs. For FACES 2000 it was shortened to five observational items measuring whether the teacher plans classroom activities to meet the varying learning needs of each child, how the teacher keeps track of the children's work during the year through the use of individual child portfolios and whether the teacher accommodates children with disabilities through an inclusionary approach. A high score indicates that teachers are able to adjust classroom activities to meet the learning needs of individual children. The reliability of the Individualizing subscale was reported as .50 for Fall 2000, .54 for Spring 2001, and .44 for Spring 2002 (Head Start).

A.5.2.3 Early Childhood Environment Rating Scale – Revised (ECERS-R)

The Early Childhood Environment Rating Scale (ECERS) is a global rating of classroom quality based on structural features of the classroom (Harms and Clifford, 1980). It has been widely used in child development research and has predicted optimal child outcomes in a number of studies (e.g., Phillips, Voran, Kisker, Howes, and Whitebook, 1994). The revised version of the ECERS (ECERS-R) provides improvements to the items and allows for a more standardized approach to assigning scores. In addition, it is easier to train observers on the ECERS-R and to achieve inter-rater reliability with it. The ECERS-R contains 37 items representative of classroom quality. Each item is coded on a 7-point scale with a score of 1 representing “inadequate,” a score of 3 representing “minimal quality,” a score of 5 representing “good quality,” and a score of 7 representing “excellent quality.” The internal consistency of the ECERS-R mean score for all combined items was .92 for both Fall 2000 and Spring 2001, and .89 for Spring 2002 (Head Start).

The ECERS-R items were grouped into seven subscales for usage in analyses of FACES classroom quality, each pertaining to different elements of classroom quality.3 These are as follows:

  • Personal Care Routines are measured using six items: greeting/departing, meals/snacks, nap/rest, toileting/diapering, health practices, and safety practices;

  • Furnishings is measured using four items: indoor space, furniture for routine care, play, and learning, furniture for relaxation and comfort, and room arrangement for play;

  • Language Skills are measured using four items: books and pictures, encouraging children to communicate, using language to develop reasoning skills, and informal use of language;

  • Motor Skills are measured using four items: space for gross motor play, gross motor equipment, fine motor activities, and supervision of gross motor activities;

  • Creativity is measured using six items: child-related display, art, music/movement, blocks, sand/water, and dramatic play;

  • Social Skills are measured using four items: supervision other than gross motor activity, discipline, staff-child interactions, and interactions among children; and

  • Program Structure is measured using four items: space for privacy, schedule, free play, and group time.

Five items were not incorporated into any of the subscales: nature/science; math/numbers; use of TV, video, and/or computers; promoting acceptance of diversity; and provisions for children with disabilities. Thus only 32 of the 37 available items were included in the subscales.
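The grouping of ECERS-R items into subscales can be written out as a lookup table, which also makes the 32-of-37 item count easy to check. The sketch below assumes that a subscale score is the mean of its item ratings; the appendix does not state how subscale scores were computed, and the variable and function names are illustrative.

```python
# Item-to-subscale grouping as listed in the text.
ECERS_R_SUBSCALES = {
    "Personal Care Routines": [
        "greeting/departing", "meals/snacks", "nap/rest",
        "toileting/diapering", "health practices", "safety practices",
    ],
    "Furnishings": [
        "indoor space", "furniture for routine care, play, and learning",
        "furniture for relaxation and comfort", "room arrangement for play",
    ],
    "Language Skills": [
        "books and pictures", "encouraging children to communicate",
        "using language to develop reasoning skills", "informal use of language",
    ],
    "Motor Skills": [
        "space for gross motor play", "gross motor equipment",
        "fine motor activities", "supervision of gross motor activities",
    ],
    "Creativity": [
        "child-related display", "art", "music/movement",
        "blocks", "sand/water", "dramatic play",
    ],
    "Social Skills": [
        "supervision other than gross motor activity", "discipline",
        "staff-child interactions", "interactions among children",
    ],
    "Program Structure": ["space for privacy", "schedule", "free play", "group time"],
}

# Items rated but not assigned to any subscale.
EXCLUDED_ITEMS = [
    "nature/science", "math/numbers", "use of TV, video, and/or computers",
    "promoting acceptance of diversity", "provisions for children with disabilities",
]


def subscale_means(ratings):
    """ratings maps each item name to its 1-7 rating for one classroom."""
    means = {}
    for subscale, items in ECERS_R_SUBSCALES.items():
        values = [ratings[item] for item in items]
        if any(not 1 <= v <= 7 for v in values):
            raise ValueError(f"ratings for {subscale} must be between 1 and 7")
        # Assumption: the subscale score is the simple mean of its items.
        means[subscale] = sum(values) / len(values)
    return means


all_items = [item for items in ECERS_R_SUBSCALES.values() for item in items]
example_ratings = {item: 5 for item in all_items}   # invented: every item rated "good"
example_means = subscale_means(example_ratings)
```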

The Language Skills subscale was used as a key classroom quality measure in many of the analyses, and also contributed to the Quality Factor score (described in Chapter 4). The subscale was devised to assess the quality of the language environment in Head Start classrooms. A high score indicates a classroom with a rich language environment, in terms of the availability and use of books and printed materials, receptive and expressive language activities, language to engage logical and reasoning skills, and the informal use of language throughout the classroom day.

A.5.2.4 Classroom Observation of Teacher-Directed Activities

The Classroom Observation of Teacher-Directed Activities is a checklist completed by classroom observers of observed teacher-directed activities in 21 specific areas, e.g., reading stories, singing songs, etc. The classroom observer indicates whether observed activities were directed toward individual children (Individual Attention), a small group of children (Small Group = 3 to 8 children), or a whole group of children (Whole Group = entire classroom). Observers were instructed to mark down, only once for any item, any teacher-directed activities observed throughout the course of the classroom observation and whether these observed activities were directed toward individuals, a small group of children, or the entire classroom. This checklist was introduced in Spring 2001.

A.5.2.5 Arnett Caregiver Interaction Scale

The Arnett Caregiver Interaction Scale (Arnett, 1989) is a rating scale of teacher behavior towards the children in the classroom. The Arnett Caregiver Interaction Scale consists of 30 items and five subscales labeled Sensitivity, Harshness, Detachment, Permissiveness, and Independence. At the end of the observational period, the observer completes the scale for an individual teacher, typically the lead teacher in the classroom. For example, in evaluating whether the teacher “speaks warmly to the children,” the observer will assign a rating indicating the extent to which the statement is characteristic of the teacher, from 1 “never seen” to 4 “always or almost always.” A high score indicates greater teacher sensitivity, responsiveness and encouragement of children’s independence and self-help skills, and lower levels of punitiveness and detachment. The Cronbach Coefficient Alphas for all of the items were .94 for both Fall 2000 and Spring 2001, and .93 for Spring 2002 (Head Start).

A.5.2.6 Teacher Backgrounds, Qualifications, and Attitudes

The Lead Teacher Background Information is based on individual interviews with the lead teacher of the classrooms that were being observed. The interviews collected extensive information about the teachers’ backgrounds (e.g., age, ethnicity), experience (e.g., total years teaching, years teaching Head Start), and qualifications (e.g., whether the teacher has a BA or AA, whether the teacher had a graduate degree or some graduate school education, whether the teacher has a Child Development Associate certificate, teaching certificate, or whether the teacher took any early childhood education or development courses). Ethnicity was included in these analyses because it may be related to differences in teacher qualifications and experience and because the types of teachers in the classrooms may be influenced by the backgrounds of the families and children attending the Head Start program as well as the larger community served by the program.4 There were also questions about the nature of the curriculum used, the training and resources provided to support the curricula used, as well as questions about how teachers monitor the progress of individual children, and what accommodations the teacher makes to meet the learning needs of each student, including those with special needs.

The 24-item Teacher Beliefs Scale (Burts, Hart, Charlesworth, and Kirk, 1990) was included in the teacher interview, and consists of statements worded to reflect positive attitudes and knowledge of generally accepted practices in preschool settings, or to reflect a lack of these attitudes and knowledge. In FACES 2000, one factor comprising 9 items that explained most of the variation in scores for the entire scale was used. A high score indicates higher positive attitudes and knowledge about early childhood education practices.

A.5.2.7 Quality Composite

As a result of principal components analyses of the quality measures, several of the measures–the ECERS-R Language subscale, the Assessment Profile Scheduling subscale and the Assessment Profile Learning Environment subscale–were found to be highly correlated with each other, suggesting that a greater amount of variation in quality can be explained by reducing these three quality indicators to one measure. Additional principal components analyses which included other quality measures (e.g., the Arnett Caregiver Interaction Scale, the child:adult ratio and even the ECERS-R total score) found that these measures did not add significantly to the explained variation and in some cases may have detracted from it. Thus, these other quality measures were used independently in the analyses of classroom quality. Scores from the three measures that formed the quality composite were combined to form a single factor score for quality. A higher score indicates higher levels of quality.
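One way to reduce three correlated quality measures to a single score is to take the first principal component of the standardized measures. The sketch below does this with power iteration on the correlation matrix. It is an illustration under stated assumptions, not the procedure used for FACES: the appendix does not give the exact factor-scoring method, and the function names and classroom values are invented.

```python
import math
from statistics import mean, pstdev


def standardize(values):
    """Convert a column to z-scores (population SD)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def first_component_scores(columns, iterations=200):
    """columns: equal-length lists, one per quality measure.

    Returns the first principal component weights and one score per classroom.
    """
    z = [standardize(col) for col in columns]
    n, k = len(z[0]), len(z)
    # Correlation matrix of the standardized measures.
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / n for j in range(k)]
            for i in range(k)]
    # Power iteration converges to the leading eigenvector.
    w = [1.0] * k
    for _ in range(iterations):
        w = [sum(corr[i][j] * w[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    scores = [sum(w[j] * z[j][r] for j in range(k)) for r in range(n)]
    return w, scores


# Invented values for four classrooms on the three composite measures.
language = [3.0, 4.5, 5.0, 6.0]
scheduling = [8, 10, 11, 13]
learning_env = [3, 4, 5, 6]
weights, quality_scores = first_component_scores([language, scheduling, learning_env])
```

Because the scores are built from standardized measures, they are centered at zero, and a classroom that is above average on all three measures receives a positive composite score.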

A.5.3 Teacher's Child Reporting Form

Teacher ratings of children were important sources of information about children’s learning and behavior because teachers see children over extended periods of time and in a variety of settings. Using a rating form known as the Teacher's Child Report (TCR), teachers were first asked to rate each child on a set of behaviors that assessed the child’s basic social skills and classroom behavior. In these two sections, the teacher is asked to indicate the extent to which a given statement (e.g., “follows the teacher’s directions”) is characteristic of the child, from 1 “never” to 3 “very often.” The items making up these ratings form two scales.

A.5.3.1 Cooperative Classroom Behavior

There are 12 rating items for the teacher to indicate how often the child engages in cooperative classroom behaviors such as following teacher’s directions, helping put things away, complimenting classmates, and following rules when playing games. The ratings include items drawn from the Personal Maturity Scale (Alexander and Entwisle, 1988) and the Social Skills Rating System (Elliott, Gresham, Freeman, and McCloskey, 1988) to assess positive behavior such as cooperation, sharing, and expression of feelings. A summary score is created from the 3-point scale items with a range from 0 to 24, with high scores indicating more frequent cooperative behavior. The internal consistency for this measure was .88 for Fall 2000, Spring 2001, and Spring 2002/2003 (Kindergarten), and .87 for Spring 2002 (Head Start).

A.5.3.2 Total Behavior Problems

The Behavior Problems scale is based on measures of negative child behaviors that are associated with learning problems and later grade retention. Items come from an abbreviated adaptation of the Personal Maturity Scale (Alexander and Entwisle, 1988), the Child Behavior Checklist for Preschool-Aged Children, Teacher Report (Achenbach, Edelbrock, and Howell, 1987) and The Behavior Problems Index (Zill, 1990). The items ask about the frequency of aggressive behavior (e.g., hits/fights with others), hyperactive behavior (e.g., is very restless), and anxious or depressed and withdrawn behavior (e.g., is unhappy). The summary score from the scale’s 14 behavior items ranges from 0 to 28, with higher scores representing more frequent or severe negative behavior.

The Total Problem Behavior items were grouped into three subscales for usage in analyses of children’s behavior, each pertaining to different types of problem behavior: Aggressive, Hyperactive, and Withdrawn behavior. The Aggressive behavior subscale, comprised of four items, assesses the frequency of disobeying rules or requests, disrupting ongoing activities, hitting or fighting with others, and having temper tantrums. The Aggressive behavior scale score ranges from 0 to 8, with higher scores representing more frequent or severe aggressive behavior. The Hyperactive behavior subscale, comprised of three items, assesses the frequency of being unable to concentrate or pay attention for long; being nervous, high strung, or tense; and being very restless. The Hyperactive behavior scale score ranges from 0 to 6, with higher scores representing more frequent or severe hyperactive behavior. The Withdrawn behavior subscale, comprised of seven items, assesses the frequency of acting too young for age; being hard to understand; keeping to self; lacking confidence in learning new things or trying new activities; often seeming sleepy or tired in class; often seeming unhappy, sad, or depressed; and worrying about things for a long time. The Withdrawn behavior scale score ranges from 0 to 14, with higher scores representing more frequent or severe withdrawn behavior.
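The score ranges given above (0 to 8, 0 to 6, 0 to 14, and 0 to 28 overall) are consistent with each of the 14 items contributing 0, 1, or 2 points. The following sketch sums items on that assumed 0-2 coding; the coding, the function name, and the example ratings are assumptions made for illustration.

```python
# Number of items in each subscale, as stated in the text.
SUBSCALE_ITEM_COUNTS = {"Aggressive": 4, "Hyperactive": 3, "Withdrawn": 7}


def score_behavior_problems(item_scores):
    """item_scores maps each subscale name to its list of item ratings.

    Assumes each item is coded 0, 1, or 2 (implied by the stated ranges).
    """
    totals = {}
    for name, count in SUBSCALE_ITEM_COUNTS.items():
        scores = item_scores[name]
        if len(scores) != count:
            raise ValueError(f"{name} needs exactly {count} items")
        if any(s not in (0, 1, 2) for s in scores):
            raise ValueError("item ratings must be 0, 1, or 2")
        totals[name] = sum(scores)
    totals["Total"] = sum(totals.values())   # sum of the three subscales
    return totals


# Invented ratings for one child.
child = {
    "Aggressive": [1, 0, 2, 0],
    "Hyperactive": [1, 1, 0],
    "Withdrawn": [0, 0, 1, 0, 0, 2, 0],
}
child_scores = score_behavior_problems(child)
```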

The reliabilities (internal consistency) for these measures for Fall 2000, Spring 2001, Spring 2002 (Head Start), and Spring 2002/2003 (Kindergarten) were as follows: Total Problem Behaviors - .86, .86, .87, and .86; Aggressive behavior - .83, .85, .83, and .85; Hyperactive behavior - .72, .72, .75, and .74; and Withdrawn behavior - .77, .76, .76, and .77.
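For readers unfamiliar with internal consistency coefficients, the following is a minimal sketch of how Cronbach's alpha can be computed from item-level ratings. It uses the standard formula with sample variances; it is not the FACES team's own code, and the toy ratings are made up for illustration.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Made-up ratings for five children on a three-item subscale (assumed 0-2 coding).
toy = [[0, 0, 1], [1, 1, 1], [2, 1, 2], [0, 1, 0], [2, 2, 2]]
print(round(cronbach_alpha(toy), 2))
```

Values closer to 1 indicate that the items in a scale vary together, which is the sense in which the coefficients above are reported.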

The teacher is then asked to rate the child's problem solving skills and initiative, social relationships, creative representations, music/movement skills, and language/math skills. For each of these domains, the teacher rates the child’s highest level of behavior observed in the past week. Scale points for each item are described on paper, and a glossary provides concrete examples of each anchor point. For FACES, 14 items were selected from the Child Observation Record (COR; High/Scope Educational Research Foundation, 1992); these items showed a reliability of .94 for both Fall 2000 and Spring 2001, and .93 for Spring 2002 (Head Start). The 14 items were further divided into the following scales: social relationships, creative representations, music and movement, and cognitive.

A.5.3.3 Social Relationships (3 Items)

A composite score was based on the teacher’s ratings of how well the child makes friends, works with other children, and understands and expresses feelings. Each item is rated on a five-point scale with higher scores representing greater skill in coping with social situations and expressing feelings appropriately. The summary score is the average of the three items and ranges from one to five. The measure shows good reliability in the FACES study, with alpha coefficients of .83 for both Fall 2000 and Spring 2001, and .80 for Spring 2002 (Head Start).

A.5.3.4 Creative Representations (3 Items)

A composite score was based on the teacher’s ratings of how well the child uses creative materials for self-expression in making and building things, drawing and painting, and engaging in pretend play. Each item is rated on a five-point scale with higher scores representing greater proficiency. The summary score is the average of the three items and ranges from one to five. The measure shows good reliability in the FACES study, with alpha coefficients of .80 for both Fall 2000 and Spring 2002 (Head Start), and .81 for Spring 2001.

A.5.3.5 Music and Movement (4 Items)

A composite score was based on the teacher’s ratings of how well the child can imitate movements to a steady beat, follow music and movement directions, exhibit body coordination, and manipulate small objects and perform precise actions. Each item is rated on a five-point scale with higher scores representing greater proficiency. The summary score is the average of the four items and ranges from one to five. The measure shows good reliability in the FACES study, with alpha coefficients of .88 for both Fall 2000 and Spring 2001, and .86 for Spring 2002 (Head Start).

A.5.3.6 Cognitive (4 Items)

A composite score was based on the teacher’s ratings of how well the child can solve problems, engage in complex play, show interest in reading, and exhibit classification skills by sorting objects. Each item is rated on a five-point scale with higher scores representing greater proficiency. The summary score is the average of the four items and ranges from one to five. The measure shows good reliability in the FACES study, with alpha coefficients of .82 for Fall 2000, .83 for Spring 2001, and .80 for Spring 2002 (Head Start).
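The composite scores for the four COR scales all follow the same rule: the average of the scale's items, each rated from one to five. A minimal sketch of that rule is shown here. The dictionary keys and the function name `composite_score` are hypothetical, and the sketch assumes no item is missing, since this section does not describe how missing ratings were handled.

```python
# Number of items per COR scale as given in the section headings (3 + 3 + 4 + 4 = 14).
COR_SCALES = {
    "social_relationships": 3,
    "creative_representations": 3,
    "music_and_movement": 4,
    "cognitive": 4,
}


def composite_score(scale, ratings):
    """Average the item ratings for one scale; each rating must be 1 to 5."""
    expected = COR_SCALES[scale]
    if len(ratings) != expected:
        raise ValueError(f"{scale} expects {expected} items, got {len(ratings)}")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("each rating must be between 1 and 5")
    return sum(ratings) / len(ratings)


print(composite_score("cognitive", [3, 4, 4, 5]))  # 4.0
```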

A.5.4 Parent Interview

Data from the FACES Parent Interview, administered in Fall 2000, Spring 2001, Spring 2002, and Spring 2003, provide Head Start with a comprehensive understanding of the families it serves or served, including the characteristics of households and household members, levels and types of participation in the program and in other community services, involvement with their children, and understanding of their children's development.

Parents were also asked to rate their child on a set of behaviors that assessed the child’s basic social skills and behavior problems. In this section, the parent is asked to indicate the extent to which a given statement (e.g., “makes friends easily”) is characteristic of the child, from 1 “not true” to 3 “very true or often true.” The items making up these ratings were drawn from two well-known measures of children's positive behavior and behavior problems: the Entwisle Scale of Personal Maturity (Entwisle, Alexander, Cadigan, and Pallis, 1987) and the Child Behavior Checklist for Preschool-Aged Children (Achenbach, Edelbrock, and Howell, 1987). Two scales were formed to assess children’s social competence.

A.5.4.1 Social Skills and Positive Approaches to Learning

Parents were asked to rate their child’s social skills and positive approaches to learning by describing their children’s skills in making friends and accepting friends’ ideas, as well as enjoying learning and trying new things. A summary score based on the scale’s seven items ranges from 0 to 14, with higher scores representing more positive behavior. Tables A-23 through A-30 show the reliabilities for the Social Skills measure for Fall 2000, Spring 2001, Spring 2002 (Head Start), and Spring 2002/2003 (Kindergarten).

A.5.4.2 Total Problem Behaviors

Parents were also asked to rate their children on negative behaviors that are relatively common among preschool children and that are associated with adjustment problems in elementary school. Parents were asked about three domains of problem behavior: hyperactive behavior, aggressive behavior, and depressed or withdrawn behavior. The 12 behavior items were combined in a summary score ranging from 0 to 24, with higher scores representing more frequent or severe negative behavior.

The 12 total problem behavior items were grouped into three subscales for use in analyses of children’s behavior, each pertaining to a different type of problem behavior: Aggressive, Hyperactive, and Withdrawn behavior. The Aggressive behavior subscale, composed of four items, assesses the frequency of having temper tantrums, hitting or fighting with others, not getting along with other kids, and being disobedient at home. The Aggressive behavior scale score ranges from 0 to 8, with higher scores representing more frequent or severe aggressive behavior. The Hyperactive behavior subscale, composed of three items, assesses the frequency of being unable to concentrate or pay attention for long; being very restless; and being nervous, high strung, or tense. The Hyperactive behavior scale score ranges from 0 to 6, with higher scores representing more frequent or severe hyperactive behavior. The Withdrawn behavior subscale, composed of five items, assesses the frequency of being unhappy, sad, or depressed; worrying about things for a long time; and feeling worthless and inferior. The Withdrawn behavior scale score ranges from 0 to 10, with higher scores representing more frequent or severe withdrawn behavior.

A.5.4.3 Other Parent Interview Scales/Measures Referenced in the Report

Other parent interview scales/measures referenced in the report are listed below.

Names and sources for other parent interview scales/measures referenced in the report

Name: Pearlin Mastery Scale (Locus of Control)
Source: Pearlin, L.I. and Schooler, C. (1978). The structure of coping. Journal of Health and Social Behavior, 19, 2-21.
Description: Seven items measured the degree to which parents feel they have control over their own lives and their self-confidence in their abilities to solve life’s problems.

Name: CES-D Depression Scale
Source: Radloff, L.S. (1977). The CES-D: A self-report depression scale for research in the general population. Applied Psychological Measurement, 1, 385-401.
Description: Twelve items measured levels of depression among primary caregivers.

Name: Family Activities with Children
Source: National Household Education Survey - FACES Research Team
Description: Eleven items asked respondents to indicate which family activities (such as telling a story; teaching letters, words, or numbers; teaching songs; or going on errands) were undertaken by the family members with the children in the past week. Another set of 11 items asked respondents to indicate which types of outings, such as visiting a library, a zoo, or a mall, they participated in with their children in the past month.

Name: Parental Involvement in Head Start
Source: Head Start Quality Research Consortium (QRC)
Description: Asked in the Spring parent interview, 15 items asked parents to indicate how often they participated in various activities (e.g., volunteered in classroom, prepared food or materials for special events) at their child’s Head Start center.

Name: Exposure to Violence
Source: FACES Research Team
Description: Five items tapped the extent to which the respondent has been exposed to violent crime in his/her neighborhood or home.

Name: Domestic Violence Screener
Source: Feldhaus, K.M., Koziol-McLain, J., Amsbury, H.L. et al. (1997). Accuracy of 3 brief screening questions for detecting partner violence in the emergency department. JAMA, 277(17), 1357-1361.
Description: Three items ask the respondent to indicate whether he/she has been a victim of domestic violence, and whether the child has either been a victim or witness of domestic violence.

Name: Substance Abuse Screener
Source: Administration for Children and Families (1997). National Impact Evaluation of the Comprehensive Child Development Program. Washington, DC: U.S. Department of Health and Human Services.
Description: Three items ask the respondent to indicate the number of times a member of the household who uses alcohol (including the respondent) has gotten into trouble with family or friends, has gotten in trouble with the police, or has missed work or school because of alcohol use. The same three items are asked about drug use if a member of the household uses drugs.

Name: Involvement with Criminal Justice System
Source: FACES Research Team
Description: Three items ask the respondent to indicate whether he/she or another household member has been arrested or charged with any crime since the child was born, and whether he/she or another household member has spent time in jail.

Name: Parenting Style
Source: National Longitudinal Survey of Youth (NLSY), Early Head Start Evaluation (EHS), QRC
Description: Parents were also asked to rate a series of statements about how they were raising their child at home. Four statements formed a scale that assesses whether the parent has an authoritative parenting style. Three statements formed a scale that assesses whether the parent has an authoritarian parenting style.

Tables A-23, A-27, and A-28 show the reliabilities for both parent and teacher reported behavior problem measures for Fall 2000, Spring 2001, Spring 2002 (Head Start), and Spring 2002/2003 (Kindergarten).

A.6 Field Staff Training

A weeklong training was conducted prior to each data collection period to prepare field staff for successful completion of data collection. The field staff were experienced, professionally trained data collectors working on the FACES data collection efforts. The training included a wide variety of activities covering all the procedures, techniques, and content required to carry out successful data collection in the Head Start centers, over the telephone, and in other settings such as homes:

  • Lecture, incorporating slides, overheads, and videotapes;

  • Exercises that simulate various procedures such as assessing classroom scheduling;

  • Video demonstration of assessment techniques and components of classroom scoring procedures;

  • Exercises to achieve pre-established levels of inter-rater reliability;

  • Participatory involvement of all trainees in small groups so that trainers may evaluate individual performance;

  • Multiple occasions of practice in real classroom settings that simulate what they are expected to do in the field, with the presence of a trainer and a small group of trainees to discuss the classroom ratings and provide valuable guidance on scoring reliability and agreement; and

  • One-on-one practice and role-play in the administration of child assessment procedures under supervision of training staff.

The field procedures manual contained information about working with a research team, appropriate behaviors within Head Start classrooms and children’s homes, and how to orchestrate Head Start center or other visits. Moreover, the manual covered an overview of all data collection instruments and administrative procedures. Complete scoring rules and question-by-question specifications for the child assessment, parent interviews, and classroom observation instruments were also discussed in the manual.

During the training, trainees were introduced to the purpose and goals of the study and provided background information on Head Start. Trainees were also introduced to the data collection materials and general issues regarding children and early childhood learning environments. Each day of training included a question and answer period. For administering child assessments in Spanish, a special training for English-Spanish speaking bilingual trainees was held. The bilingual trainees had an opportunity to practice assessments with Spanish-speaking children.

A.7 Data Collection Procedures

A.7.1 Site Visit Arrangements

The FACES research team obtained feasible dates for the 2-week site visit from each of the sampled Head Start programs. Site visit dates for each program were coordinated within the data collection period and programs were notified about the visit dates. Three weeks before the site visit, a scheduling packet was sent to the on-site coordinator (OSC). The packet contained the final visit schedule, a list of sampled children by classroom, a reminder list, and a request for maps and directions to aid the research team. OSCs are members of the Head Start program staff specially designated to coordinate the data collection efforts by scheduling parent interviews, arranging classroom visits with teachers, and obtaining consent forms.

When a sampled child was a kindergartner (either Spring 2002 or Spring 2003), a field staff member would contact the parent(s) in advance to arrange a time to conduct the Parent Interview over the telephone. An arrangement also would be made for conducting the child assessment at a location chosen by the parent(s), usually the home. If the Parent Interview could not be conducted over the telephone, for whatever reason, the field staff member would try to administer the Parent Interview during the visit to conduct the child assessment. Teacher-Child Report and Lead Teacher questionnaires would be sent to the child’s school to be filled out and returned by the child’s lead teacher.

A.8 Quality Control Visits

In FACES, Quality Control (QC) visits were built into every step of the data collection to ensure the highest quality data possible. The QC visitors consisted of the FACES research project staff who had been involved in designing the instruments, preparing the training materials, and conducting the training. The QC visitors were trained in both observation and assessment data collection and also served as technical consultants in the field. During the Fall 2000 data collection, one 3-day QC visit to several selected program sites was made.

A.9 Data Preparation and Data File Creation

A.9.1 Data Entry

Key entry and verification were performed on the study instruments using a sophisticated production data entry system. This system provides entry form layout, application of edit specifications, and data verification control, and it produces data entry quality and production reports.

A.9.2 Frequency Review

The frequencies of responses to all data items (both individually and in conjunction with related data items) were reviewed to ensure that appropriate skip patterns were followed. Members of the data preparation team checked each item to make sure the correct number of responses was represented for all items. If a discrepancy was discovered, the problem case was identified and reviewed.
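A frequency review of this kind can be sketched as follows. The sketch is illustrative only: the record layout, the item names `q1` and `q1a`, and the helper functions are hypothetical, and the actual review relied on the project's own data preparation software. It tabulates responses to one item and flags cases in which a follow-up item was answered although it should have been skipped, or was left blank although it should have been answered.

```python
from collections import Counter


def frequencies(records, item):
    """Count each response to one item; None means no response was recorded."""
    return Counter(rec.get(item) for rec in records)


def skip_violations(records, gate, gate_value, follow_up):
    """Return the IDs of cases that break a skip pattern.

    The follow-up item should be answered only when the gate item
    equals gate_value; any other combination is flagged for review.
    """
    flagged = []
    for rec in records:
        should_answer = rec.get(gate) == gate_value
        answered = rec.get(follow_up) is not None
        if should_answer != answered:
            flagged.append(rec["case_id"])
    return flagged


# Made-up cases: q1a should be answered only when q1 is "yes".
records = [
    {"case_id": 1, "q1": "yes", "q1a": 2},
    {"case_id": 2, "q1": "no", "q1a": None},
    {"case_id": 3, "q1": "no", "q1a": 1},      # answered but should be skipped
    {"case_id": 4, "q1": "yes", "q1a": None},  # required follow-up is missing
]
print(frequencies(records, "q1"))
print(skip_violations(records, "q1", "yes", "q1a"))
```

Flagged cases correspond to the "problem cases" that would then be identified and reviewed individually.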

A.9.3 Data Edit

To code and edit questionnaire data, an integrated collection of software was utilized. Through this system of software, coding manuals and codebooks were developed, data editing was performed, and SAS source code was generated.

A.9.4 Data File Creation

Data files were created and analyses performed to provide summaries and assessments of Head Start children and their families during this period and to assess the reliability and validity of information contained within the data collection instruments. Numerous derived variables were created to increase the magnitude and scope of analytical capabilities. The coding for these derived variables may be obtained upon request.

A.10 Reliability and Data Summary

In FACES, various data collection instruments were used to assess the accomplishments and behaviors of children in Head Start programs, as well as the educational and familial support that is provided to them. As noted in Section IV: Data Collection Instruments, these instruments are widely used and report mostly high reliabilities. The reliabilities and data summaries for each data collection instrument are provided in Appendix A, Tables A-2 through A-30.

Table A-2. Reliability of Fall 2000 FACES child assessment data-English language and Spanish language assessments
Scales | Number of items (English) | Number of cases (English) | Cronbach alpha (English) | Number of items (Spanish) | Number of cases (Spanish) | Cronbach alpha (Spanish)
Social Awareness | 5 | 2,068 | .63 | 5 | 385 | .36
PPVT-III*¹ | 144 | 2,508 | .97 | - | - | -
TVIP* | - | - | - | 125 | 392 | .92
McCarthy: Draw-A-Design | 9 | 2,068 | .58 | 9 | 375 | .57
Color Names | 10 | 2,055 | .95 | 10 | 378 | .92
WJR: Letter-Word Identification* † | 23 | 1,273 | .84 | - | - | -
WM: Letter-Word Identification* | - | - | - | 18 | 219 | .75
WJR: Applied Problems* | 23 | 1,054 | .90 | - | - | -
WM: Applied Problems* | - | - | - | 23 | 219 | .85
WJR: Dictation* | 12 | 1,054 | .77 | - | - | -
WM: Dictation* | - | - | - | 12 | 219 | .77
Story and Print Concepts: Print Conventions | 2 | 2,116 | .73 | 2 | 392 | .59
Story and Print Concepts: Book Knowledge | 5 | 2,116 | .57 | 5 | 392 | .43
Story and Print Concepts: Comprehension | 2 | 2,116 | .43 | 2 | 392 | .39
Interviewer Rating: Assessment Behavior | 8 | 2,021 | .82 | 8 | 372 | .77
* Raw scores were used.

¹ For Fall 2000, Spanish-speaking English Language Learners were administered the PPVT-III in English and their results are incorporated with those of other children as reflected under Fall 2000 (English Language Assessments).

† For Fall 2000, Spanish-speaking English Language Learners were administered the WJR: Letter-Word Identification in English and their results are incorporated with those of other children as reflected under Fall 2000 (English Language Assessments).

Table A-3. Reliability of Spring 2001 and Spring 2002 (Head Start) FACES child assessment data-English language assessments including additional Spanish language measures for Spanish-speaking English language learner children
Scales | Number of items (Spring 2001) | Number of cases (Spring 2001) | Cronbach alpha (Spring 2001) | Number of items (Spring 2002, Head Start) | Number of cases (Spring 2002, Head Start) | Cronbach alpha (Spring 2002, Head Start)
Social Awareness | 5 | 2,304 | .62 | 5 | 931 | .62
PPVT-III* | 144 | 2,344 | .97 | 144 | 956 | .96
TVIP*¹ | 125 | 364 | .92 | 125 | 113 | .94
McCarthy: Draw-A-Design | 9 | 2,298 | .70 | 9 | 928 | .72
Leiter-R AS - Age 3 | 4 | 406 | .71 | - | - | -
Leiter-R AS - Ages 4 to 5 | 4 | 1,763 | .81 | - | - | -
Leiter-R AS - Ages 4 to 6 | - | - | - | 4 | 920 | .80
Color Names | 10 | 2,298 | .94 | 10 | 942 | .90
WJR: Letter-Word Identification* | 23 | 1,902 | .86 | 23 | 955 | .86
WM: Letter-Word Identification* † | 18 | 307 | .78 | 18 | 113 | .83
WJR: Applied Problems* | 23 | 1,902 | .91 | 23 | 955 | .89
WJR: Dictation* | 12 | 1,902 | .77 | 12 | 955 | .71
Story and Print Concepts: Print Conventions | 2 | 2,344 | .75 | 2 | 956 | .84
Story and Print Concepts: Book Knowledge | 5 | 2,344 | .59 | 5 | 956 | .61
Story and Print Concepts: Comprehension | 2 | 2,344 | .42 | 2 | 956 | .40
Interviewer Rating: Assessment Behavior | 8 | 2,254 | .80 | 8 | 914 | .70
* Raw scores were used.

¹ For Spring 2001 and Spring 2002 (Head Start), Spanish-speaking English Language Learners were administered the TVIP in Spanish to gauge progress in Spanish receptive vocabulary knowledge from earlier data collection periods.

† For Spring 2001 and Spring 2002 (Head Start), Spanish-speaking English Language Learners were administered the WM: Letter-Word Identification in Spanish to gauge progress in Spanish letter/word knowledge from earlier data collection periods.

Table A-4. Reliability of Spring 2002/2003 (kindergarten) FACES child assessment data-English language assessments
Scales Spring 2002/2003 (kindergarten)
Number of items Number of cases Cronbach alphas
Social Awareness 7 1,833 .65
PPVT-III* 204 1,913 .97
TOLD – Phonemic Analysis 14 1,913 .96
Reading – ECLS-K Routing* 20 1,913 .87
Red (Low Form)* 18 1,913 .95
Yellow (Middle Form)* 29 1,913 .94
Blue (High Form)*