Appendix 1.2: Calculating Analytical Sampling Weights for Fall 2002 and Spring 2003
Overview
Sampling weights were calculated for each child and parent to allow estimates based on the sample to represent the population of newly entering Head Start participants. Because children were randomly assigned to Head Start and non-Head Start groups within each Head Start center, each group represents the same Head Start population of newly entering children when appropriately weighted. The only difference, theoretically, is that the Head Start group was assigned to attend Head Start at the time of random assignment, while the non-Head Start group was not. Children who were sampled as Head Start group members or non-Head Start group members were assigned base weights that reflected their overall probability of selection, including the sampling of broad geographic areas used as primary sampling units (PSUs), Head Start grantees/delegate agencies, and centers. These base weights were adjusted for omission of programs and centers in communities saturated by Head Start and nonresponse to the fall 2002 and spring 2003 child assessment and parent interview separately to produce fall 2002 and spring 2003 child and parent weights, respectively. The nonresponse-adjusted weights of children in the 4-year-old group were poststratified to the Head Start National Reporting System (HSNRS) newly entering enrollment totals for 4-year-olds (comparable totals for 3-year-olds were not available). Extremely large weights were then trimmed for both age groups. The final child and parent weights are the product of the overall base weight, a nonresponse adjustment factor, a poststratification factor, and a trimming factor. For variance estimation, a set of 76 jackknife replicate weights was created for each child and parent.
Spring 2003 weights are used for most analyses in this report; the analyses focus on impacts at that time and include only children and families for whom spring data are available. Fall 2002 weights are used to examine distributions of child and family characteristics at the beginning of the analysis period, in fall 2002.
Primary Sampling Unit (PSU) Weights
The frame of 161 PSUs, or geographic clusters, was classified into 25 approximately equal-sized strata based on the level of services for low-income preschool children in the state, percentage of minority Head Start enrollment in the PSU, Head Start region, and percentage of Head Start enrollment in an MSA (a U.S. Census Bureau metropolitan statistical area). One PSU in each stratum was sampled with probability proportional to the total Head Start enrollment of 3- and 4-year-olds in the PSU. The source of enrollment was the 1999-2000 PIR. The PSU weight is the inverse of the PSU probability of selection:
PSU weight = (Total Age 3 & 4 Enrollment in Stratum h) / (Total Age 3 & 4 Enrollment in PSU),
where h = 1, 2, …, 25. There was one certainty PSU whose probability of selection was 1 due to its large Head Start enrollment.
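As a rough illustration of the PSU weight formula, the sketch below computes the inverse of a probability-proportional-to-size selection probability when one PSU is drawn per stratum. The function name and the enrollment figures are hypothetical and are not taken from the study.

```python
def psu_weight(stratum_enrollment, psu_enrollment, certainty=False):
    """Inverse of the PPS selection probability for one PSU drawn per stratum.

    Hypothetical helper: with one PSU sampled per stratum, the selection
    probability is psu_enrollment / stratum_enrollment, so the weight is
    the reciprocal of that ratio. A certainty PSU has probability 1.
    """
    if certainty:
        return 1.0
    if psu_enrollment <= 0 or psu_enrollment > stratum_enrollment:
        raise ValueError("PSU enrollment must be positive and no larger than the stratum total")
    return stratum_enrollment / psu_enrollment


# Hypothetical stratum with 40,000 enrolled 3- and 4-year-olds,
# of which 8,000 are in the sampled PSU.
example_psu_weight = psu_weight(40000, 8000)
certainty_psu_weight = psu_weight(40000, 40000, certainty=True)
```

Under these made-up figures the sampled PSU stands in for five PSUs' worth of enrollment, while the certainty PSU represents only itself.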
Head Start Program Weights
Program Sampling
There were two stages of sampling within most PSUs, and three stages within three extremely large PSUs. Prior to sampling, small programs were collapsed into groups consisting of two to four programs. These were sampled as a unit; thus, the within-PSU probability of selection for each program in a given group is the same.
Prior to telephone screening, programs and program groups (referred to henceforth simply as program groups, although most “groups” consisted of a single grantee or delegate agency) were sampled within the three large PSUs to reduce screening costs. In each of these three PSUs, 12 program groups were sampled with probability proportional to total age 3 and 4 enrollment from the 1999-2000 PIR. All programs in the sample PSUs underwent screening, during which study staff collected information on additional characteristics of each program and its community (except in the three large PSUs, where only the 12 sampled program groups were screened). A major purpose of this screening was to identify situations in which Head Start “saturated” the community, i.e., where the local program was large enough that all of the interested and eligible families in the community could be enrolled, making selection of a non-Head Start study group impossible without simultaneously leaving some of the program’s capacity unused. After screening, program groups were sampled within the 25 PSUs from among those determined to be neither “saturated” nor closed. Within each PSU, four program groups were sampled with probability proportional to the total newly entering children ages 3 and 4 enrollment. From these, three program groups were subsampled with equal probabilities to be the main sample, and the remaining program group was assigned as a reserve sample. The main sample consisted of 76 program groups, which comprised 90 individual programs. The reserve sample consisted of 30 programs.
Program Base Weights, Adjustments for Saturation, Raking
Each of the 90 programs in the main sample received a base weight. The program base weight is the inverse of the overall probability of selection for the program, including the PSU probability of selection and the sampling of program groups within the PSU.
The base weights were adjusted for undercoverage due to the deletion from the frame of eight Head Start programs involved in the most recent FACES study and 28 programs discovered to be “saturated” during the screening. Because these programs had no chance of selection, an undercoverage adjustment was needed to correct for bias, in case the deleted programs were systematically different from those retained on the frame (see Appendix 2.1 for an examination of this question) and to prevent weighted enrollment totals from the sample from being too low. The undercoverage adjustment factor was calculated as the ratio of the estimated total newly entering enrollment in the PSU to the estimated newly entering enrollment from the sampled programs in the PSU, using enrollment information collected during the telephone screening. This adjustment corrected for differences between saturated and non-saturated programs on broad geographic factors but not for differences between the two types of programs within PSUs—differences that could result in larger or smaller Head Start impacts in the studied sites than in the nation as a whole.
The adjusted program weights for all 90 main sample programs were raked to marginal ages 3 and 4 enrollment totals from the 1999-2000 PIR. The raking dimensions were urban status (central city, noncentral city, rural), Head Start region (Northeast, North Central, South, Plains, West), and level of pre-K services in the state (state has Head Start-like programs, state has other types of programs, state has no programs). This procedure served to further match the analysis sample to the full national Head Start program on these factors. Since the number of sampled programs in each cross-classification is generally small, raking, or iterative proportional fitting (Oh & Scheuren, 1987), rather than poststratification was used. In raking, the weights are consecutively ratio-adjusted to marginal control totals until the resulting weighted totals converge to the control totals for each dimension. The adjustment factor at each iteration is the ratio of the PIR control total for the marginal dimension to the sample estimate of the same total, where the weight in the sample estimate is the program weight from the previous raking iteration. This ratio adjustment reduces the sampling error associated with the sampling of PSUs and programs for estimates of Head Start children by urban status and Head Start region (Cochran, 1977). However, it is not intended to result in sample estimates that will agree with control totals of newly enrolled Head Start children, since no such counts exist.
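The raking procedure described above can be sketched in a few lines. This is a minimal illustration of iterative proportional fitting, not the study's implementation: the function name, the two-dimension example, and the convergence tolerance are all assumptions, and for simplicity the example applies control totals to the sum of weights rather than to weighted enrollment.

```python
def rake(records, weights, margins, max_iter=100, tol=1e-8):
    """Iterative proportional fitting of weights to marginal control totals.

    records : list of dicts mapping dimension name -> category
    weights : starting weights, one per record
    margins : dict mapping dimension name -> {category: control total}
    """
    w = list(weights)
    for _ in range(max_iter):
        max_change = 0.0
        for dim, targets in margins.items():
            # Weighted total per category under the current weights.
            totals = {cat: 0.0 for cat in targets}
            for rec, wi in zip(records, w):
                totals[rec[dim]] += wi
            # Ratio-adjust each record by control total / current estimate.
            for i, rec in enumerate(records):
                factor = targets[rec[dim]] / totals[rec[dim]]
                w[i] *= factor
                max_change = max(max_change, abs(factor - 1.0))
        if max_change < tol:  # all margins already match
            break
    return w


# Hypothetical example with two raking dimensions.
example_records = [
    {"urban": "city", "region": "North"},
    {"urban": "city", "region": "South"},
    {"urban": "rural", "region": "North"},
    {"urban": "rural", "region": "South"},
]
example_margins = {
    "urban": {"city": 60.0, "rural": 40.0},
    "region": {"North": 50.0, "South": 50.0},
}
raked_weights = rake(example_records, [10.0] * 4, example_margins)
```

Raking only requires the marginal totals for each dimension, which is why it remains usable when the cross-classified cells contain too few sampled programs for poststratification.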
After these undercoverage and raking adjustments were performed, the program weights in two PSUs were further adjusted to compensate for dropping two eligible programs from the sample because of their participation in a QRC study and for dropping three programs because they were found to be saturated after sampling. Another program was discovered to have closed, reducing the number of participating programs to 84. The adjustment factor was calculated as the ratio of estimated total newly entering enrollment in the PSU based on the entire sample of programs in the PSU to the weighted newly entering enrollment for the sampled nonsaturated, non-QRC programs in the PSU. None of the programs refused to participate, thus no nonresponse adjustment or reserve programs were needed.
Final Program Weight
Eighty-four programs received a final program weight. The final program weight can be written as:
Final program weight = PSU weight x (1/P1) x (1/(1 - PFACES)) x (1/P2) x (1/P3) x FSat1 x FRK x FQRC,Sat2,
where
PFACES = probability of selection in FACES,
P1 = probability of being subsampled prior to telephone screening in three large PSUs,
P2 = probability of being sampled in PSU,
P3 = probability of being subsampled for main sample,
FSat1 = adjustment factor for dropping 28 saturated programs from frame before sampling,
FRK = raking adjustment factor to reduce sampling error,
FQRC,Sat2 = adjustment factor for dropping two programs participating in QRC and three saturated programs from the sample,
and
P1 = 12 x (Total Age 3 & 4 Enrollment in Program Group) / (Total Age 3 & 4 Enrollment in PSU),
P2 = 4 x (1st Yr Age 3 & 4 Enrollment in Program Group) / (1st Yr Age 3 & 4 Enrollment in PSU).
The final program weights for the sample of 84 programs sum to 1,216 with a 95% confidence interval of [959, 1,472].
Head Start Centers
Center Sampling
Within each program, a list of the centers was obtained, and the centers were screened using a Center Information Form to collect various statistical data. The centers that were determined to be “saturated” were dropped from the frame in each program. Prior to sampling, small centers were combined into groups that ranged from two to eight centers and were treated as a unit for sampling purposes. Therefore, each center in a given group has the same probability of selection, namely that of the group. An initial sample of center groups was selected with probability proportional to newly entering age 3 and 4 enrollment in the center group. The initial sample of center groups was then subsampled with equal probabilities. The subsample was retained as the main sample in each program, while the remaining center groups formed a reserve sample. In general, three center groups per program (or program group) were selected for the main sample and two for the reserve. However, in very large programs four to six center groups were allocated for the main sample and three for the reserve. Within a program group, the total number of centers was allocated proportionally to the programs based on their newly entering enrollments. A total of 448 main sample and 237 reserve centers were selected in this way.
Center Base Weights and Adjustments for Saturation and Nonresponse
The center base weight is calculated as the inverse of the overall probability of selection for each center, including the sampling of PSUs, programs, and centers within programs. The center base weights were adjusted for deleting 161 saturated centers and 2 centers participating in a QRC study from the frame prior to center sampling. These adjusted weights were further adjusted for the refusal of 5 sampled centers to participate in the study, and for the loss of 56 centers discovered to be saturated after sampling. In these centers, no sampling of children was possible. In addition, 6 centers had closed, and 13 were ineligible for other reasons, such as merging with another center. For the merged centers, where appropriate, an adjustment was made to the base weight of the newly merged center to account for its increased probability of selection, since the individual centers had been listed separately on the center frame.
The adjustment factor for dropping saturated centers from the frame was calculated as the ratio of the estimated total newly entering enrollment in the program to the newly entering enrollment estimated from the sampled centers in the program. The newly entering enrollment was collected on the Center Information Form during center screening and updated during October through December 2002 for all centers where possible. The adjustment factor was calculated separately for each program, unless this resulted in a very large adjustment, in which case the factor was calculated for the PSU.
The adjustment factor for the loss of five refusing and 56 saturated centers was calculated as the ratio of the weighted newly entering enrollment for the entire center sample in the program (excluding those that had closed or merged) to the weighted newly entering enrollment for the nonsaturated, cooperating centers in the program. Overall, these procedures adjusted for differences between included and excluded centers that emanate from the particular grantee or delegate agency that runs the excluded centers but not for other differences across centers that might lead to different-sized impacts in the omitted sites.
Final Center Weight
The final center weight can be written as:
Final Center Weight = Final Program Weight x (1/PC1) x (1/PC2) x FQRC x FSat1 x FRefusal,Sat2,
where
PC1 = probability of selection for initial center sample (both main and reserve),
PC2 = probability of selection for main center sample,
FQRC = adjustment factor for dropping two centers participating in QRC from frame,
FSat1 = adjustment factor for dropping 161 saturated centers from frame,
FRefusal,Sat2 = adjustment factor for dropping 56 saturated centers and 5 refusing centers from sample,
PC1 = (Newly Entering Age 3 & 4 Enrollment in Center Group) / [(Newly Entering Age 3 & 4 Enrollment in Program for Eligible, Nonsaturated Centers) / nM+R],
PC2 = (nM)/(nM+R) = (# center groups subsampled for main sample in the program) / (# center groups sampled for both main and reserve in the program),
and the final program weight reflects the PSU and program probabilities of selection. In four programs, all reserve centers were brought into the sample when the original centers were found to be saturated or partially saturated and hence unable to provide the planned number of non-Head Start sample children. In these centers, PC2 was set to one in the above formula. When this resulted in a census of eligible centers in the program, both PC1 and PC2 were set to one. In six programs where some, but not all, of the reserve centers were activated to offset saturation in the main sample, nM includes the reserves that were activated as well as the main sample centers. In this situation, centers were randomly subsampled from among the reserve centers selected for that particular program or program group. The total number of centers in the final sample, including main sample and activated reserves is 458. The sample was reduced to 378 after losing 19 centers identified following selection as ineligible (closings, mergers), 5 identified as noncooperating, and 56 found to be saturated.
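The two center-level selection probabilities can be illustrated with a short sketch. The function name and the figures are hypothetical; capping PC1 at one is an assumption added here for safety and is not stated in the text, and the special cases described above (PC2 or both probabilities set to one) are not handled.

```python
def center_selection_probabilities(center_group_enrollment, program_enrollment,
                                   n_main, n_reserve):
    """Return (PC1, PC2) for a center group, following the formulas above.

    PC1: PPS probability of entering the initial sample of n_main + n_reserve
         center groups (capped at 1.0, an assumption of this sketch).
    PC2: equal-probability subsampling of the main sample from the initial sample.
    """
    n_total = n_main + n_reserve
    p_c1 = min(1.0, n_total * center_group_enrollment / program_enrollment)
    p_c2 = n_main / n_total
    return p_c1, p_c2


# Hypothetical program: 400 newly entering children in eligible, nonsaturated
# centers; a center group with 40 of them; 3 main and 2 reserve center groups.
p_c1, p_c2 = center_selection_probabilities(40, 400, n_main=3, n_reserve=2)
within_program_center_factor = 1.0 / (p_c1 * p_c2)
```

With these made-up inputs the center group has a 0.5 chance of entering the initial sample and a 0.6 chance of being retained for the main sample, so the within-program factor applied to the final program weight is 1/0.3.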
Because reserve centers were picked at random from the same pool as the main sample centers, utilization of the reserve sample will bias study results only to the extent that the centers they replaced were atypical. Hence, recourse to reserve sampling represents another part of the study’s overall undercoverage of communities saturated by Head Start.
The final center weights for the 378 centers sum to 12,705 with a 95% confidence interval of [10,290, 15,119].
Child Weights
Random Assignment of Children Within Centers
Children were sampled in two stages within each center. At the first stage, the applicant list was sorted based on child need, and the list was truncated at exactly the number of children needed to both fill the center’s slots and supply a non-Head Start group sample of the desired size for the study. A sample of children was then randomly selected with equal probabilities from the truncated list to fill the center’s slots. Those not selected to fill a slot were assigned to the non-Head Start group. At the second stage, the children sampled to fill the center’s slots were subsampled to obtain the targeted number of Head Start group children. Thus, there were four categories of children: 1) those sampled to attend the Head Start program but not for participation in the study, 2) those sampled for the study’s Head Start group, 3) those sampled for the study’s non-Head Start group, and 4) those on the waiting list who had no chance of selection for either study sample but who could enter the Head Start program later (once sampling ended) to replace children who dropped out of the program over the course of a year. The targeted number of Head Start and non-Head Start group children was 16 and 11, respectively, at most centers and center groups, cumulating to an average of 48 Head Start group members and 32 non-Head Start group cases for each sampled program group. In center groups, the 16 Head Start and 11 non-Head Start were proportionally allocated to the centers in the group based on newly entering enrollment. In 3 of the 84 programs, children applied directly to the program rather than the center, so it was necessary to randomly assign children at the program level and sample 48 Head Start and 32 non-Head Start cases to obtain 80 children for the program in total. The total target sample size was approximately 3,600 Head Start and 2,400 non-Head Start children.
The random assignment of children was spread out over the summer/fall 2002, because most centers took applicants on a flow basis and preferred to let their families know soon whether their child had been accepted to attend the Head Start program. This meant children were sampled in batches or rounds, and the two-stage sampling process described above took place more than once in most centers. An additional complication was that stratification by program option was used in many centers. The allocation of the total number of Head Start and non-Head Start children across program options and rounds at each center was approximately proportional to the newly entering enrollment in each program option and the number of slots filled in each round. The actual probabilities of selection for each child were stored electronically for weighting purposes. However, the probabilities can vary greatly because of the difficulty in allocating across rounds. There were many rounds where children were sampled to fill slots but no Head Start or non-Head Start children were selected because the target sample sizes of Head Start and non-Head Start children had already been obtained. None of these children had a chance of selection for the study, meaning child weights based on the actual probabilities of selection would underestimate the size of the first year Head Start population.
Child Base Weights
The within-center child base weight was calculated as:
(Newly Entering Age 3 & 4 Enrollment in Center) / (# Head Start group children sampled in center)
for the sampled Head Start group children, and as
(Newly Entering Age 3 & 4 Enrollment in Center) / (# non-Head Start group children sampled in center)
for the non-Head Start group children. Note that the numerator is the same for both groups, since estimates are to be made for the universe of newly entering Head Start children using either sample. For centers where the updated fall 2002 newly entering enrollment was not obtained, the newly entering enrollment figure for the previous program year was used. When this was missing, and for three programs where children were randomly assigned at the program level rather than at the center level, the inverse of the actual probability of selection for children in the center was used as the base weight.
The overall child base weight reflecting all stages of sampling can be written as:
Overall Child Base Weight = (Final Center Wt) x (Within-Center Child Base Wt.)
where the final center weight reflects the PSU and program probabilities of selection and includes an adjustment for centers where no children were sampled because of center noncooperation or saturation.
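The child base weight arithmetic can be sketched as follows. The function names, the center enrollment of 48, and the center weight of 30 are hypothetical; the group sizes of 16 and 11 are the per-center targets mentioned earlier.

```python
def within_center_child_base_weight(newly_entering_enrollment, n_sampled_in_group):
    """Center enrollment divided by the number of children sampled into one group."""
    return newly_entering_enrollment / n_sampled_in_group


def overall_child_base_weight(final_center_weight, newly_entering_enrollment,
                              n_sampled_in_group):
    """Final center weight times the within-center child base weight."""
    return final_center_weight * within_center_child_base_weight(
        newly_entering_enrollment, n_sampled_in_group)


# Hypothetical center with 48 newly entering children and a final center weight of 30.
head_start_weight = overall_child_base_weight(30.0, 48, 16)
non_head_start_weight = overall_child_base_weight(30.0, 48, 11)

# Each group's weights sum to the same weighted enrollment for the center.
head_start_total = 16 * head_start_weight
non_head_start_total = 11 * non_head_start_weight
```

Because the numerator is the same for both groups, each group's weights sum to the same weighted center enrollment, which is what allows either group to represent the full population of newly entering children.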
Nonresponse Adjustments
Nonresponse adjustments were performed separately for fall 2002 and spring 2003, using three definitions of a respondent for the fall 2002 data collection and two definitions for spring 2003. The three definitions for fall 2002 were (1) child is considered a complete for the fall 2002 child assessment, (2) child has a complete fall 2002 parent interview, and (3) child is considered a complete for both the fall 2002 child assessment and parent interview. The two definitions for spring 2003 were (1) child is considered complete for the spring 2003 child assessment and (2) child has a completed spring 2003 parent interview. This resulted in three nonresponse-adjusted child weights for fall 2002 and two for spring 2003.
The nonresponse adjustment helps control nonresponse bias by compensating for different data collection response rates across various demographic and geographic groups of children. This is because the nonresponse adjustment factor is calculated within nonresponse adjustment cells formed by the demographic and geographic variables. The nonresponse adjustment factor spreads the weight of the nonresponding children over the responding children in that cell, so that they represent not only children who were not sampled, but also the nonresponding sampled children. This maintains the same mix of the sample across cells as would have been present had there been no nonresponse.
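A minimal sketch of a weighting-class nonresponse adjustment of this kind is shown below. The record layout, cell labels, and weights are hypothetical; the factor in each cell is the weighted total for all sampled children divided by the weighted total for respondents.

```python
from collections import defaultdict


def nonresponse_adjust(children):
    """Spread nonrespondents' base weight over respondents within each cell.

    children: list of dicts with keys 'cell', 'weight', and 'responded'.
    Returns adjusted weights in the same order (0.0 for nonrespondents).
    """
    total_by_cell = defaultdict(float)
    respondent_by_cell = defaultdict(float)
    for child in children:
        total_by_cell[child["cell"]] += child["weight"]
        if child["responded"]:
            respondent_by_cell[child["cell"]] += child["weight"]

    adjusted = []
    for child in children:
        if child["responded"]:
            cell = child["cell"]
            factor = total_by_cell[cell] / respondent_by_cell[cell]
            adjusted.append(child["weight"] * factor)
        else:
            adjusted.append(0.0)
    return adjusted


# Hypothetical cells: one nonrespondent in cell A, none in cell B.
example_children = [
    {"cell": "A", "weight": 10.0, "responded": True},
    {"cell": "A", "weight": 10.0, "responded": True},
    {"cell": "A", "weight": 10.0, "responded": True},
    {"cell": "A", "weight": 10.0, "responded": False},
    {"cell": "B", "weight": 20.0, "responded": True},
    {"cell": "B", "weight": 20.0, "responded": True},
]
adjusted_weights = nonresponse_adjust(example_children)
```

In cell A the factor is 40/30, so the three respondents carry the full weighted total of 40; cell B is unchanged. The weighted total in every cell is preserved, which is the sense in which the mix of the sample across cells is maintained.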
To capture the variation in response rates, we form cells based on characteristics that correlate with response rates. For the fall 2002 nonresponse adjustments, a nonresponse analysis using chi-square tests and logistic regression in WesVar showed that response rates were highly correlated with Head Start versus non-Head Start assignment and, within the non-Head Start group, with program option. This result, combined with a desire to capture individual Head Start program differences as much as possible, led to nonresponse adjustment cells formed by crossing PSU x state x program for the Head Start group, and PSU x program option x state x program for the non-Head Start group. Collapsing across program and state was done as needed to prevent weight adjustment factors of 2.0 or larger.
To determine the nonresponse adjustment cells for spring 2003, an unweighted nonresponse analysis was done using a software package called CHAID (Chi-squared Automatic Interaction Detector), to determine what variables are correlated with propensity to respond. The following variables were used as candidates in the analysis:
- Head Start versus non-Head Start group,
- Child race,
- Child language,
- Language spoken at home,
- Child’s gender,
- Program option applied for (full-day, part-day, both, home-based),
- Child’s age,
- Metro status for county containing Head Start program office,
- Level of pre-K services in the state,
- Head Start region,
- State,
- Response status for fall 2002 child assessment,
- Response status for fall 2002 parent interview,
- Program, and
- PSU.
A small number of missing values for the variables used in the nonresponse analysis were imputed via hot deck imputation using procedures described in Appendix 4.1. Variables with missing values were child language, home language, child race, and gender. Weighted logistic regression and chi-square tests were also run in WesVar to confirm the CHAID results.
The tree structure identified by CHAID was used in creating the nonresponse adjustment cells for spring 2003. For the child assessment nonresponse adjustment, CHAID used the following variables to create nonresponse adjustment cells:
- Head Start versus non-Head Start indicator,
- Fall 2002 child assessment response status,
- Level of pre-K services in state,
- PSU,
- Head Start region,
- Child’s gender,
- Metro status, and
- Child’s race.
For the parent interview, the nonresponse adjustment cells were created using:
- The Head Start versus non-Head Start indicator,
- Fall 2002 parent interview response status,
- Level of pre-K services in state,
- PSU,
- Head Start region,
- Child’s gender,
- Metro status,
- Child’s age, and
- Child’s race.
Some collapsing of cells was required to prevent excessively large nonresponse adjustment factors, which cause the weights to become more variable and the variance of most estimates from the data to increase. The coefficient of variation of the nonresponse-adjusted child weights was computed under various cell-collapsing scenarios for the child assessment and parent interview nonresponse adjustment for spring 2003. A final set of collapsed cells for each nonresponse adjustment was chosen based on a compromise between limiting the increase in weight variability and the need to control for nonresponse bias by limiting the amount of cell collapsing.
Poststratification
To reduce the sampling error for estimates of the newly entering Head Start population, the nonresponse-adjusted child weights for children in the 4-year-old group were poststratified to fall 2003 HSNRS newly entering enrollment totals by race/ethnicity. (The HSNRS is a census of Head Start programs, so there should be no sampling error associated with its enrollment totals. However, race reporting may differ somewhat between the HSNRS and the current study, as the Head Start programs were given no specific instructions on how to code the variable in the HSNRS.) Comparable enrollment totals were not available for 3-year-olds. The three race/ethnicity categories were Hispanic, non-Hispanic Black, and White/other. An adjustment factor was calculated for each category, and the appropriate factor applied to each child weight depending on the race of the child, as reported on the NHIS child roster. The numerator of each factor was the proportion of HSNRS total newly entering age 4 enrollment in the race/ethnicity category; the denominator was the sample estimate of this proportion using the 84 programs sampled for the current study, the final program weight, and the HSNRS first year age 4 enrollment reported for each program. The poststratification factors were 0.80 for Hispanic, 1.45 for Black, and 1.036 for White/other, indicating an overrepresentation of Hispanic children and underrepresentation of Black children in the current study sample as compared to the HSNRS. Appendix 2.3 provides a detailed analysis of the race/ethnicity composition of the sample and its comparison to national Head Start data.
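The poststratification factor is a ratio of two proportions, which the sketch below illustrates. The shares used here are invented for the example and are not the HSNRS or study values; the function name is also hypothetical.

```python
def poststratification_factors(control_shares, sample_shares):
    """Factor per category = control proportion / sample-estimated proportion."""
    return {category: control_shares[category] / sample_shares[category]
            for category in control_shares}


# Hypothetical proportions of newly entering 4-year-olds by category.
control_shares = {"Hispanic": 0.30, "Black": 0.35, "White/other": 0.35}
sample_shares = {"Hispanic": 0.40, "Black": 0.25, "White/other": 0.35}
factors = poststratification_factors(control_shares, sample_shares)

# Apply the factor for each child's category to the nonresponse-adjusted weight.
example_children = [("Hispanic", 40.0), ("Black", 25.0), ("White/other", 35.0)]
poststratified = [(category, weight * factors[category])
                  for category, weight in example_children]
```

A factor below one scales down a category that is overrepresented in the sample relative to the control distribution, and a factor above one scales up an underrepresented category.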
Trimming
A final trimming adjustment was made for inordinately large child weights. Very large weights can substantially increase sampling error, so weights were trimmed back to four times the average weight to avoid large sampling errors, even though this introduces a small amount of bias into the survey estimates. For the fall 2002 child weights, 76 weights (2.0%) were trimmed for the child assessment completes, 79 (2.0%) for the parent interview completes, and 75 (2.0%) for children having both a complete child assessment and parent interview. For the spring 2003 child weights, 84 weights (2.2%) were trimmed for the child assessment completes and 86 (2.2%) for the parent interview completes. An analysis of the trimmed cases showed that most of the extremely large weights were due to some large centers being undersampled, i.e., only a few children were sampled, perhaps due to near-saturation.
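The trimming rule can be sketched as a simple cap. Two details are assumptions of this sketch, since the text does not specify them: the average is computed once over all weights before trimming, and the trimmed-off weight is not redistributed to other cases. The function name and weights are hypothetical.

```python
def trimming_factors(weights, multiple=4.0):
    """Return (trimmed_weights, factors) with weights capped at multiple x mean.

    The trimming factor is trimmed weight / original weight, so it equals 1.0
    for every weight at or below the cap.
    """
    cap = multiple * sum(weights) / len(weights)
    trimmed = [min(w, cap) for w in weights]
    factors = [t / w for t, w in zip(trimmed, weights)]
    return trimmed, factors


# Hypothetical weights: nine typical cases and one very large weight.
example_weights = [10.0] * 9 + [210.0]
trimmed_weights, factors = trimming_factors(example_weights)
```

Here the mean weight is 30, so the cap is 120 and only the last case is trimmed, receiving a factor of 120/210.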
The final child weight can be written as:
Final Child Weight = (Overall Child Base Wt) x (Child Nonresponse Adjustment Factor) x (Poststratification Factor) x (Trimming Factor)
where the overall child base weight reflects the probability of selecting the PSU, program, center, and child within center. When the final child weight is applied, the Head Start and non-Head Start groups each separately represent the entire first year Head Start population. Sample estimates of the size of the first year Head Start population are given in Exhibit A.1.2.1 in the “Sum of Final Weights” column.
| Weight | Respondent Definition | Group | Number of Respondents | Sum of Final Weights | 95 Percent Confidence Interval | Coefficient of Variation of Final Weights |
|---|---|---|---|---|---|---|
| Final Fall 2002 Child Weights | Child Assessment | Head Start | 2,360 | 422,686 | (352,936, 492,437) | 0.860 |
| Final Fall 2002 Child Weights | Child Assessment | Non-Head Start | 1,363 | 413,258 | (345,160, 481,356) | 0.770 |
| Final Fall 2002 Child Weights | Parent Interview | Head Start | 2,489 | 423,086 | (353,623, 492,548) | 0.850 |
| Final Fall 2002 Child Weights | Parent Interview | Non-Head Start | 1,526 | 414,214 | (346,413, 482,016) | 0.780 |
| Final Fall 2002 Child Weights | Both Child Assessment and Parent Interview | Head Start | 2,339 | 422,818 | (353,030, 492,606) | 0.860 |
| Final Fall 2002 Child Weights | Both Child Assessment and Parent Interview | Non-Head Start | 1,361 | 413,064 | (345,221, 480,907) | 0.770 |
| Final Spring 2003 Child Weights | Child Assessment | Head Start | 2,441 | 426,834 | (357,492, 496,177) | 0.860 |
| Final Spring 2003 Child Weights | Child Assessment | Non-Head Start | 1,457 | 418,907 | (352,648, 485,166) | 0.880 |
| Final Spring 2003 Child Weights | Parent Interview | Head Start | 2,404 | 427,536 | (358,628, 496,444) | 0.860 |
| Final Spring 2003 Child Weights | Parent Interview | Non-Head Start | 1,483 | 419,772 | (353,164, 486,381) | 0.880 |
Reweighting Non-Head Start Group Observations After Deleting Crossovers
A crossover is defined as a child who was randomly assigned to the non-Head Start group but participated in Head Start. Of the 227 crossovers in the sample, 212 were respondents for the spring 2003 child assessments, and 211 had a completed parent interview in spring 2003. To develop alternative “crossover-adjusted” estimates of Head Start’s impact to supplement the main findings, these cases were dropped from the analysis sample, and weights for the remaining non-Head Start group members were recalculated. In effect, this procedure treated crossovers as a second set of nonrespondents to the spring 2003 data collection.
This additional nonresponse adjustment took as its starting point the previously nonresponse-adjusted child assessment and parent interview spring 2003 child weights. It then inserted an additional stage of nonresponse adjustment for the crossovers just prior to the poststratification to the HSNRS totals. A CHAID analysis was run using demographic characteristics of the child and parents, household income variables, and health-related questions from the parent interview as inputs. A minimum cell size of 30 and a Bonferroni-adjusted p-value of 0.05 or smaller were required for retention in the tree. At the top of the tree, age group (age 3 or 4) was forced to be the first variable because the cohorts were analyzed separately and because a logistic regression analysis of crossover patterns indicated a significant age by gender interaction.
For the 3-year-old group, CHAID identified five groupings of PSUs with similar unweighted crossover rates. It then split one of these groupings by father’s immigration status, another by parent-reported emergent literacy scale for the child in fall 2002 and food stamp receipt, a third by mother’s employment status in the fall, and a fourth by teen birth status of the mother—creating 10 cells in total. For the 4-year-old group, CHAID identified four groupings of PSUs with similar unweighted crossover rates. It then split one of these groupings by child’s gender, creating a total of 5 cells. (No other correlates with crossover rate were identified for the remaining PSU groups.)
A crossover “nonresponse” adjustment factor was then calculated for each cell to spread the weight of deleted crossover cases over the remaining non-crossover observations in that cell, so that the latter could represent crossover-like children in a “non-treated” state. For each non-crossover non-Head Start group child in the cell, the crossover adjustment factor was multiplied by the pre-existing nonresponse-adjusted weight for that child. The resulting weights were then poststratified and trimmed as before. Separate crossover nonresponse adjustments were done in this manner for spring 2003 child assessment outcomes and spring 2003 parent interview outcomes.
Analysis weights for randomly assigned Head Start group children remained unchanged when conducting the analysis of crossover-adjusted impacts.
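The cell-level crossover adjustment described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the study's actual code: the factor is assumed to be the usual weighting-class ratio (sum of weights of all non-Head Start cases in the cell divided by the sum of weights of the non-crossover cases), and the record layout, cell labels, weights, and function name are all hypothetical. Poststratification and trimming are not shown.

```python
from collections import defaultdict


def crossover_adjusted_weights(children):
    """Return crossover-adjusted weights for non-Head Start group children.

    `children` is a list of dicts with hypothetical keys:
      'cell'      - CHAID cell label
      'weight'    - pre-existing nonresponse-adjusted weight
      'crossover' - True if the child is a deleted crossover case
    """
    total = defaultdict(float)  # weight of all cases in the cell
    kept = defaultdict(float)   # weight of non-crossover cases in the cell
    for child in children:
        total[child["cell"]] += child["weight"]
        if not child["crossover"]:
            kept[child["cell"]] += child["weight"]

    adjusted = []
    for child in children:
        if child["crossover"]:
            adjusted.append(0.0)  # crossover cases are dropped
            continue
        if kept[child["cell"]] == 0.0:
            raise ValueError("cell has no non-crossover cases to carry the weight")
        # Assumed factor: total cell weight over retained cell weight.
        factor = total[child["cell"]] / kept[child["cell"]]
        adjusted.append(child["weight"] * factor)
    return adjusted


# Hypothetical data: one crossover in cell A, none in cell B.
sample = [
    {"cell": "A", "weight": 100.0, "crossover": False},
    {"cell": "A", "weight": 100.0, "crossover": True},
    {"cell": "A", "weight": 200.0, "crossover": False},
    {"cell": "B", "weight": 50.0, "crossover": False},
]
adjusted = crossover_adjusted_weights(sample)
```

Under this assumed factor, the weight total within each cell is preserved: the 100 units of weight carried by the deleted crossover case in cell A are shared among the two remaining cases in proportion to their existing weights, and cell B is unchanged.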
Importance of Using Weights
The formulas for producing weights are quite complex and can result in substantial differences in weights among sample children. If certain types of children tend to have much larger weights than other types of children, and if the weights are not used in the analysis, then the types of children with large weights will be underrepresented in the analysis relative to the population of all newly entering Head Start children. This can lead to serious bias in impact estimates. Thus, we strongly recommend that weights be used in all analyses.
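A small numerical illustration may help show how ignoring weights can distort an estimate. The scores and weights below are invented for the example and are not taken from the study; the function name is likewise hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum of value * weight divided by sum of weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical sample: three children with small weights and high scores,
# and one child with a large weight and a lower score.
scores = [90.0, 90.0, 90.0, 70.0]
weights = [50.0, 50.0, 50.0, 350.0]

unweighted = sum(scores) / len(scores)     # treats every child equally
weighted = weighted_mean(scores, weights)  # each child counts by its weight
```

In this invented example the unweighted mean is 85.0, while the weighted mean is 76.0, because the child with the large weight stands in for 350 of the 500 children represented.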
Calculating Correct Standard Errors
Estimates obtained from the Head Start Impact Study will differ from the true population parameters because they are based on a randomly chosen subset of the population, rather than on a complete census of all newly entering Head Start children. This type of error is known as sampling error or variance. Differences between the estimates and the true population values can also be caused by nonsampling error. Nonsampling errors can result from many causes, such as measurement error, nonresponse, sampling frame errors, respondent error, and differences among interviewers. In general, the magnitude of nonsampling error is difficult to assess from the sample. The precision of an estimate is measured by the standard error (defined as the square root of the variance). The calculation of the standard error must reflect not only the sample size on which the estimate is based, but also the manner in which the sample was drawn. Otherwise, the standard errors can be misleading and result in incorrect confidence intervals and p-values in hypothesis testing. The study’s sampling involved stratification, clustering, and unequal probabilities of selection, all of which must be reflected in the standard error calculations.
Two commonly used variance estimation methods for complex surveys involving multistage sampling are replication and linearization (Wolter, 1985). Replication methods work by dividing the sample into subsample replicates that mirror the design of the sample. A weight is calculated for each replicate using the same procedures as for the full-sample weight. This produces a set of replicate weights for each sampled child. To calculate the standard error of a survey estimate, the estimate is first calculated for each replicate using the replicate weight and the same form of estimator as for the full sample. The variation among the replicates is then used to estimate the variance for the full sample estimate. In the linearization approach, a nonlinear estimator is approximated by a linear function and a formula derived for the variance of the linear approximation. Replication has the advantage that it can reflect the different features of the weighting and estimation by simply repeating all steps separately for each replicate. For linearization, a specific formula is needed for each estimator, and the formula will differ depending on the type of estimator and sample design. On the other hand, finite population correction factors are often easier to account for using linearization estimators. However, for linear estimators, or nonlinear estimators that are formed by combinations of linear functions, replication variance estimators are often little different numerically from linearization variance estimators.
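The replication approach described above can be written in a generic form. The notation here is introduced for illustration and is not the report's: $\hat{\theta}$ is the full-sample estimate, $\hat{\theta}_{(r)}$ is the same estimator computed with the $r$-th set of replicate weights, $R$ is the number of replicates, and $c_r$ is a constant that depends on the replication method.

```latex
\[
\hat{v}(\hat{\theta}) \;=\; \sum_{r=1}^{R} c_r \left( \hat{\theta}_{(r)} - \hat{\theta} \right)^2,
\qquad
\widehat{\mathrm{se}}(\hat{\theta}) \;=\; \sqrt{\hat{v}(\hat{\theta})}
\]
```

For a stratified delete-one jackknife, the standard choice is $c_r = (n_h - 1)/n_h$, where $n_h$ is the number of drop units in the stratum from which the unit dropped in replicate $r$ was taken.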
For the current study, a set of jackknife replicate weights was created for each child for use in the calculation of standard errors. Normally, stratified jackknife replicate weights are created by dropping out one PSU at a time, setting the replicate weights for sampled units in the dropped PSU to zero, multiplying the full-sample weights of sampled units in the remaining PSUs in the stratum by a factor of $n_h / (n_h - 1)$, where $n_h$ is the number of PSUs in the $h$-th stratum, and leaving the full-sample weights for sampled units in the remaining strata unchanged. However, because only 25 PSUs were sampled at the first stage (one per stratum), only 27 replicate weights could be created (in the one certainty PSU, two additional replicates could be formed based on program groups). To improve the stability of the variance estimates, the second-stage sampling units, namely Head Start program groups, were used as the “drop unit” in creating replicates. This resulted in 76 replicate weights per child and 51 degrees of freedom for variance estimation (i.e., 76 program groups – 25 strata). Because the between-PSU component of variance is ignored in doing this, the resulting variance estimates will be underestimates, though only slight ones if the between-PSU variability is small relative to the within-PSU variability.
The validity of this hypothesis was investigated by creating a second set of 27 replicate weights based on the 25 PSUs, which includes the between-PSU component but has fewer degrees of freedom. By calculating the average ratio of the variance from the set of replicate weights based on the 25 PSUs to the variance from the set based on the 76 program groups, we were able to estimate the relative size of the between-PSU component. The ratio of variances was calculated for several child assessment means (PPVT, Elision, Woodcock-Johnson Applied, Oral Comprehension, Spelling, and Letter-Word) by age and gender within the test language groups English and Spanish, and the ratios were then averaged. (However, spring 2003 variance estimates could not be produced separately for the Spanish group because the completed Spanish assessments were all from only three sampled Head Start programs, resulting in insufficient degrees of freedom to estimate the variance.) For fall 2002 scores, the between-PSU component was estimated to be 15 percent of the total variance, and for spring 2003 scores, this component was estimated to be 28 percent of the total variance. Therefore, estimates of fall 2002 standard errors from the fall 2002 replicate weights should be multiplied by the square root of 1.15 (approximately 1.07) to prevent underestimates of the variance. Similarly, standard errors for spring 2003 based on the 76 spring 2003 replicates should be multiplied by the square root of 1.28 (approximately 1.13).
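The replicate construction and the standard error adjustment described above can be sketched in Python. This is a minimal illustration, not the study's production code. The four-record data set, the field names (`stratum`, `drop_unit`, `weight`), and the function names are hypothetical, and the variance formula is the standard stratified delete-one jackknife estimator, assumed here. Only the inflation factors 1.15 and 1.28 come from the text.

```python
import math

# Variance inflation factors reported in the text; all else is illustrative.
VARIANCE_INFLATION = {"fall 2002": 1.15, "spring 2003": 1.28}


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def jackknife_replicates(records):
    """Build one replicate per drop unit.

    Returns a list of (n_h, replicate_weights) pairs, where n_h is the
    number of drop units in the stratum containing the dropped unit.
    """
    units_by_stratum = {}
    for rec in records:
        units_by_stratum.setdefault(rec["stratum"], set()).add(rec["drop_unit"])

    replicates = []
    for stratum in sorted(units_by_stratum):
        n_h = len(units_by_stratum[stratum])
        if n_h < 2:
            continue  # a stratum with one drop unit cannot form replicates
        for dropped in sorted(units_by_stratum[stratum]):
            weights = []
            for rec in records:
                if rec["stratum"] != stratum:
                    weights.append(rec["weight"])  # other strata unchanged
                elif rec["drop_unit"] == dropped:
                    weights.append(0.0)            # dropped unit gets zero
                else:
                    weights.append(rec["weight"] * n_h / (n_h - 1))
            replicates.append((n_h, weights))
    return replicates


def jackknife_se(values, full_weights, replicates):
    """Standard error from the variation of replicate estimates."""
    full = weighted_mean(values, full_weights)
    variance = sum(
        (n_h - 1) / n_h * (weighted_mean(values, w) - full) ** 2
        for n_h, w in replicates
    )
    return math.sqrt(variance)


def corrected_se(se, wave):
    """Inflate a program-group-based standard error for the ignored
    between-PSU variance component."""
    return se * math.sqrt(VARIANCE_INFLATION[wave])


# Hypothetical data: two strata, two drop units each, one child per unit.
records = [
    {"stratum": 1, "drop_unit": "a", "weight": 10.0},
    {"stratum": 1, "drop_unit": "b", "weight": 10.0},
    {"stratum": 2, "drop_unit": "c", "weight": 10.0},
    {"stratum": 2, "drop_unit": "d", "weight": 10.0},
]
values = [1.0, 3.0, 2.0, 6.0]
full_weights = [rec["weight"] for rec in records]

replicates = jackknife_replicates(records)
se = jackknife_se(values, full_weights, replicates)
se_spring = corrected_se(se, "spring 2003")
```

With these four hypothetical records there are four replicates and two degrees of freedom (four drop units minus two strata), mirroring the 76 minus 25 calculation in the text on a much smaller scale.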
Incorporating Weights and Standard Errors in the Impact Analyses
The easiest way for analysts to incorporate the weights and correct standard errors into their analyses will be to use software designed for analysis of complex survey data. Such software packages include WesVar, SUDAAN, Stata, and the new survey procedures (proc surveymeans, proc surveyreg) in SAS version 8. SAS version 9 will add a logistic regression procedure for survey data. Most estimation and modeling can be done with one of these packages, with the possible exception of hierarchical linear modeling (HLM). WesVar uses replication methods (jackknife, BRR), and Stata and SAS version 8 use linearization. SUDAAN uses both linearization and replication.
REFERENCES
Cochran, W. G. (1977). Sampling Techniques. New York: Wiley & Sons, ch. 6.
Oh, H.L. & F.J. Scheuren. (1987). “Modified raking ratio estimation.” Survey Methodology, 13, 209-219.
Research Triangle Institute. (2001). SUDAAN User's Manual, Release 8.0. Research Triangle Park, NC: Author.
StataCorp. (2001). Stata Statistical Software: Release 7.0. College Station, TX: Author.
Wolter, K. (1985). Introduction to Variance Estimation. New York: Springer-Verlag.
WesVar. (2003). WesVar™ 4.2 User’s Guide. Rockville, MD: Westat.
1 The overall weighted crossover rate for the non-Head Start group was 17.6 percent.

