Technical Report #6
Missing Data within the National Transition Demonstration Study
A CONSIDERATION OF MISSING DATA
“... estimation techniques ... assume missing at random data, tests of the assumption would seem necessary. However, an exact test would require the missing values to determine if indeed there was an association between the values of a variable and whether or not it was missing. If this information were available, then, the data would not be missing.” Rovine, M.J. and Delaney, M., 1990
Introduction
Even in the best designed and most closely monitored study, inadvertent omissions occur. Individual data points go missing: a child may be absent on the date of testing, a respondent may inadvertently skip a question, an examination booklet may be ruined, or a subject may be unwilling to respond to a particular item or series of items. For a repeated measures analysis, this failure to follow a subject at all time points means that missing values may arise at any of the cross-sectional time points of measurement. In other words, missing values arise whenever one or more of the sequences of measurements from study units are incomplete, in the sense that intended measurements are unavailable. The loss of these intended measures results in an incomplete or unbalanced data matrix with unequal numbers of measures for each subject. An unbalanced data matrix raises a number of technical difficulties, depending on the criteria underlying a particular analysis. Both parameter estimation and tests of linear hypotheses become more difficult because bias may be introduced. In addition to the statistical definition of bias, other technical and conceptual issues must be considered. From a data analysis perspective, a series of questions should be asked: What are the patterns of missing data? Is the missingness random or nonrandom? Why are the data missing? We will begin to answer these questions following some background material.
Background
Until recently, available analytical methods focused on the removal of missing values. This removal was accomplished in one of two ways. The standard method used by most statistical packages is complete-case analysis, also called listwise deletion: only those cases where the data matrix is complete are analyzed, while the remaining cases with missing data values are discarded. As noted above, bias may be introduced when the subjects with complete cases are not representative of the sample. More on this will be said later. The alternative procedure is to substitute reasonable values for the missing items. The classic substitution is that of the mean. This method is frequently used to complete missing scale items when the items are assumed to be drawn from a single construct domain. In addition to the mean, regression predictions are often used. Both these methods have the advantage of easy implementation; however, they have well-known disadvantages (Little & Rubin, 1987). In multivariate analyses using a large number of variables, rejecting incomplete cases under complete-case analysis could result in losing an extraordinarily high number of cases. Consider the following scenarios. Suppose you have measured 10 variables on 100 subjects and wish to perform an analysis requiring all 10 variables. If 5 percent of the sample are missing data for a single variable, we still have 95 percent of the cases with complete data. On the other hand, if 5 percent of the sample have missing data on each of the 10 variables, and the pattern of missing data is such that no one subject is missing data on more than a single variable, then this lack of overlap in missing values causes half the cases (10 × 5 percent) to become candidates for deletion. This appears to be an extreme, with the number of variables equal to one-tenth the number of subjects; but is it so extreme?
What happens when we are missing only 2 percent of the data, yet we have 20 variables measured on 5,000 subjects? Applying the same condition of mutual exclusivity, in which no subject is missing data on more than one variable, this minimization of overlap causes the loss of 2,000 subjects, or 40 percent of the cases. The following table summarizes these two extreme facets of the missing data problem. Please note that had the rightmost column of Table 1 been allowed to reach 5 percent, the entire data sample would have been lost.

Table 1. Complete cases remaining, by initial sample size and rate of missing data

| Initial sample size (N) | Single variable, 1% missing | Single variable, 2% missing | Single variable, 5% missing | 20 variables (no overlap), 1% missing on each | 20 variables (no overlap), 2% missing on each | 20 variables (no overlap), 3% missing on each |
|---|---|---|---|---|---|---|
| 1000 | 990 | 980 | 950 | 800 | 600 | 400 |
| 3000 | 2970 | 2940 | 2850 | 2400 | 1800 | 1200 |
| 5000 | 4950 | 4900 | 4750 | 4000 | 3000 | 2000 |
| 7000 | 6930 | 6860 | 6650 | 5600 | 4200 | 2800 |
| 9000 | 8910 | 8820 | 8550 | 7200 | 5400 | 3600 |
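
The arithmetic behind the two scenarios tabulated in Table 1 can be sketched in a few lines. This is only an illustration of the worst-case assumption stated above (no subject is missing data on more than one variable); the function name `complete_cases` and its parameters are introduced here for the example and are not part of the original report.

```python
def complete_cases(n_subjects, pct_missing_per_variable, n_variables):
    """Cases left for complete-case analysis under the worst-case
    assumption that no subject is missing more than one variable."""
    # With no overlap, every missing value removes a different subject.
    lost = n_subjects * pct_missing_per_variable * n_variables // 100
    return max(n_subjects - lost, 0)


for n in (1000, 3000, 5000, 7000, 9000):
    single = [complete_cases(n, pct, 1) for pct in (1, 2, 5)]
    twenty = [complete_cases(n, pct, 20) for pct in (1, 2, 3)]
    print(n, single, twenty)
```

Under this assumption the loss grows linearly with the number of variables, which is why a per-variable rate of 5 percent across 20 variables leaves no complete cases at all.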
With these worst case inefficiencies of complete case analysis firmly in mind, let’s look toward determining the patterns of missing data. First, we will examine formal definitions of missing data, and their associated mechanisms, and then look at illustrative examples from the Transition data.
Formal Definitions of Randomly Missing Data
A formal mathematical statement of random missingness has been elegantly made (Rubin, 1976; Little & Rubin, 1987; Little, 1988). The following notation and definitions are taken from Little (1988). Let $y$ denote an $(n \times p)$ data matrix of $n$ observations on $p$ variables, and let $r$ denote an $(n \times p)$ missingness indicator matrix such that $r_{ij} = 1$ if $y_{ij}$ is missing, and 0 otherwise. A complete model for the data and the missing-data mechanism specifies a distribution $f(y \mid \theta)$ for $y$, indexed by unknown parameters $\theta$, and a distribution $f(r \mid y, \psi)$ for $r$ given $y$, indexed by unknown parameters $\psi$. Write $y = (y_{obs}, y_{mis})$, where $y_{obs}$ represents the observed values of $y$ and $y_{mis}$ represents the missing values. Rubin (1976) defined the missing data as missing completely at random (MCAR) if $f(r \mid y_{obs}, y_{mis}, \psi) = f(r \mid \psi)$ for all $y_{obs}$ and $y_{mis}$; that is, missingness does not depend on the observed or missing values of $y$. Rubin also defined a weaker condition for the missing-data mechanism, calling the missing data missing at random (MAR) if $f(r \mid y_{obs}, y_{mis}, \psi) = f(r \mid y_{obs}, \psi)$ for all $y_{mis}$; that is, missingness may depend on the observed values in the data set but not on the missing values.
Patterns of Missing Data
Regression is one of the most powerful tools available to the applied researcher. Given the prominence of this statistical tool, at least four explicit patterns of missing data have been identified (Little, 1992). Suppose a random sample of $N$ individuals is selected. For each of these individuals, $p+1$ observations are desired: $X_1, X_2, \ldots, X_p, Y$. However, for some of the individuals, one or more, but not all, of the $X$’s are missing. The following is assumed: $E(Y \mid X_1, X_2, \ldots, X_p) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$, and estimates of $\beta_j$ and $E(Y \mid X_1, X_2, \ldots, X_p)$ are desired.
It is legitimate to ask whether there is a pattern to the missing data. Consistent with the discussion of the left-hand side of Table 1 above, the following pattern of univariate missing data (see Fig. 1) is offered. In this case, the missing data are confined to a single variable, $X_1$.
This is a special instance of monotone or nested data, where the columns can be arranged so that $X_{j+1}$ is observed for every case where $X_j$ is observed, for all j = 1, ..., p. Figure 3 displays a pattern where two (possibly more) of the variables are never observed together. This condition arises when two different files are merged. A specific Transition example would be where the Teacher’s Rating of Academic Competence from the Teacher Questionnaire, Part B is merged into a single file with a child’s scores for the Woodcock-Johnson Broad Reading and Broad Mathematics.
Finally, Figure 4 is presented. What you see is a generic pattern with no apparent structure underlying the arrangement of the observed and missing data.
Integral to these notes on missing data patterns is whether missingness is related to the data values. For example, given the univariate missing data in Figure 1, the probability that $X_1$ is missing for a particular case may (a) be independent of data values, (b) depend on the value of $X_1$ for that case, or (c) depend on the values of $X_2, \ldots, X_p$ for that case. This list of possibilities is by no means exhaustive; it is meant to be illustrative. If the variable were a test score, the child might have been absent. In that case, (a) might apply. If the variable happened to be education level, and persons who dropped out of school are less likely to respond, then (b) might apply.
At this point, we restate the mathematical definitions in light of the
above discussion of patterns of missing data. Data are missing at random
(MAR) when the distribution of missing data indicators depends on the data
only through the observed values. Data are missing completely at random
(MCAR) if the distribution of missing data indicators does not depend on
either the observed or missing data. Thus, mechanism (a) is MCAR, while
mechanisms (a) and (c) are MAR. Finally, mechanism (b) is not-MAR since
the variable’s missingness is a result of its value. With a solid grasp
of the missing-data mechanism, attention will now be focused on a review
of methods available.
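
To make mechanisms (a), (b), and (c) concrete, the following is a minimal simulation sketch using only the Python standard library. Everything numeric in it is an assumption made for illustration: the two standard-normal variables, the 0.5 dependence of X1 on X2, and the missingness probabilities of 0.3 and 0.6. None of these values come from the Transition data.

```python
import random


def missing_indicator(x1, x2, mechanism, rng):
    """Return r = 1 if X1 is missing for this case, else 0."""
    if mechanism == "a":    # independent of all data values (MCAR)
        p = 0.3
    elif mechanism == "b":  # depends on the value of X1 itself (not MAR)
        p = 0.6 if x1 > 0 else 0.0
    elif mechanism == "c":  # depends only on the always-observed X2 (MAR)
        p = 0.6 if x2 > 0 else 0.0
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    return 1 if rng.random() < p else 0


def simulate(mechanism, n=10000, seed=0):
    """Generate (x1, x2, r) triples; x1 is correlated with x2."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x2 = rng.gauss(0, 1)
        x1 = 0.5 * x2 + rng.gauss(0, 1)
        rows.append((x1, x2, missing_indicator(x1, x2, mechanism, rng)))
    return rows


def mean(values):
    return sum(values) / len(values)


for mech in "abc":
    rows = simulate(mech)
    full = mean([x1 for x1, _, _ in rows])
    observed = mean([x1 for x1, _, r in rows if r == 0])
    print(mech, round(full, 3), round(observed, 3))
```

In a run of this sketch, the complete-case mean of X1 should stay close to the full-sample mean under (a), and should be pulled downward under (b) and (c), because high values of X1 (or of the correlated X2) are the ones more likely to be lost. The practical difference between (b) and (c) is that under (c) the variable driving the missingness is itself observed, so a model can condition on it.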
Currently, there are at least four broad categories of methods for handling
missing data (Timm & Mieczkowski, 1997). Three of these are based, in
part, on either implicit or explicit models for the data and missing-data
mechanism (Little, 1992). The four broad classes of method may be summarized
as follows:
- refined least squares methods
- multiple imputation (MI) techniques
- methods using maximum likelihood (ML) or restricted maximum likelihood (REML)
- Bayesian methods
A common theme to the latter three methods is that they are model-based. The approach taken in the remainder of this report will be one of application to research; a review of the mathematical theory of these methods would be beyond the scope of this technical report.
With recent advances in software packages and the power of desktop computing, the applied researcher now has new tools to aid in the analysis of unbalanced longitudinal data. These tools have only become available since the beginning of the Transition Project. Two such advanced tools are PROC MIXED (SAS, 1992) and SPSS Missing Value Analysis (Hill, 1997). They reflect a subtle shift of emphasis in statistical viewpoint. This shift might be verbalized as: “It used to be missing data was something undesirable to be removed (listwise or pairwise deletion); now, missing data are actually information that may be averaged across (expectation maximization).” The SPSS Missing Value Analysis module contains multiple ways to deal with missing data. It allows the researcher to perform three distinct tasks. First, it allows the researcher to describe the patterns of missing data. (The word pattern as used here indicates the dichotomized version of a random variate -- that is, a binary distribution where each value is missing or present.) Second, estimates of means, standard deviations, covariances, and correlations are computed within the chosen method of analysis. These estimates may be computed using one or more methods: all values, listwise deletion, pairwise deletion, linear regression, or expectation maximization (EM). Third, imputation of missing values may be accomplished using either the EM algorithm or the least squares method. In addition, by following a specific process of least squares regression, the researcher may also emulate multiple imputation (Hill, 1997). Only recently have new computational algorithms and software become available to implement multiple imputation in complex multivariate settings (Schafer & Olsen, 1999).
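
The first task, describing patterns of missing data through a dichotomized missing/present indicator, can be sketched without any statistical package. The example below is not the SPSS module's procedure; it is a small standard-library illustration, and the toy records and variable names (`x1`, `x2`, `y`) are hypothetical.

```python
from collections import Counter


def missingness_patterns(records, variables):
    """Count each distinct pattern of observed (0) / missing (1) values."""
    patterns = Counter()
    for record in records:
        pattern = tuple(1 if record.get(v) is None else 0 for v in variables)
        patterns[pattern] += 1
    return patterns


# Hypothetical toy records; None marks a missing value.
records = [
    {"x1": 4.2, "x2": 1.0, "y": 10.0},
    {"x1": None, "x2": 0.5, "y": 9.0},
    {"x1": None, "x2": 0.7, "y": 8.5},
    {"x1": 3.9, "x2": None, "y": None},
]
variables = ["x1", "x2", "y"]

patterns = missingness_patterns(records, variables)
for pattern, count in patterns.most_common():
    print(pattern, count)

# Cases with the all-observed pattern are the ones listwise deletion keeps.
listwise_n = patterns[(0, 0, 0)]
```

Tabulating the patterns before choosing a method shows at a glance how many cases listwise deletion would retain and whether the missing values are concentrated in one variable or spread across several.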
Description of the Data Set
Two cohorts of former Head Start children and families were recruited
at 31 sites to participate in the study. Cohort I includes 3,540 children
and their families, enrolled shortly before or after entry into kindergarten
in the Fall of 1992; Cohort II includes 3,975 children and their families
whose children entered kindergarten in the Fall of 1993.
As noted, the data sets for this study are derived from source documents providing data on either children or families, or both children and families. Toward this end, there are four documents used as data sources. These documents are the Family Interview, the Child Instrument, the School Archival Records Search (SARS), and the Teacher Questionnaire, Part B. Data from these four sources were merged on a key variable representing the child’s identification. A single record’s source could be as few as one of these files, or as many as all four, or any combination of the four sources. This provides a total of 14 component parts to the analytical construct of a family unit. The family unit may be visualized as a tetrahedron, i.e., a four-sided closed figure whose faces are equilateral triangles. Further, the volume of the figure would be the tetrahedron itself. The component parts of the tetrahedron and their mapping to the pieces of the family unit are shown in Table 2.
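
The count of 14 component parts follows from enumerating the combinations of the four sources: 4 single sources, 6 pairs, and 4 triples, with the full set of four corresponding to the volume. A brief sketch of that enumeration is given here; the mapping of subset size to surface, edge, vertex, and volume follows the tetrahedron description above, and the variable names are introduced only for this example.

```python
from itertools import combinations

SOURCES = [
    "Child Instrument",
    "Family Interview",
    "SARS",
    "Teacher Questionnaire, Part B",
]
# Subset size -> attribute of the tetrahedron, as described in the text.
ATTRIBUTE = {1: "surface", 2: "edge", 3: "vertex", 4: "volume"}

parts = {size: list(combinations(SOURCES, size)) for size in range(1, 5)}
for size, combos in parts.items():
    print(ATTRIBUTE[size], len(combos))

# Surfaces, edges, and vertices together give the component parts.
n_components = sum(len(parts[size]) for size in (1, 2, 3))
print("component parts:", n_components)
```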
Search for Patterns and At-Random Missingness
As a principle, missing values of key variables are reviewed prior
to any and all analyses of data from the National Transition Demonstration
Study. These analysis-specific reviews concentrate on those variables chosen
for inclusion in the specific analysis, including both dependent and independent
variables. A more general review of missing data was undertaken to identify (1)
the quantity of missing values on a set of key variables and (2) the patterns
of missing values on those variables. The key variables reviewed included:
(1) child gender (male, female); (2) child has a health condition that interferes
with school attendance (yes, no); (3) child’s receptive language ability,
as measured by the Peabody Picture Vocabulary Test (Rasch-Wright score;
continuous); (4) child’s social skills, as rated by the primary caregiver
using the Social Skills Rating System (standard score; continuous); (5)
caregiver ethnicity (white/non-Hispanic, African American, Hispanic/Latino,
Asian/Pacific Islander, other); (6) caregiver is high school graduate (yes,
no); (7) caregiver is foreign-born (not born in US; yes, no); (8) caregiver
reports having a chronic health condition that interferes with ability to
care for the child (yes, no); (9) caregiver employed full-time (yes, no);
(10) father or father figure (male stepparent or grandfather) is present
in home (yes, no); (11) mother is present in home (yes, no); (12) family
mobility (0, 1-2, or 2 or more moves within past year); (13) family receives
AFDC (yes, no); (14) family receives SSI (yes, no); and (15) family speaks
a language other than English in the home (yes, no). These variables were
chosen because they were either (1) critical outcome variables for primary
analyses or (2) thought to represent important characteristics of the child,
caregiver, or family that could influence program participation or benefit.
They are the same variables utilized in assessments of differential attrition
patterns and to review the comparability of treatment groups and cohorts
at baseline.
Descriptive analyses were completed in two phases. In the first phase, the sources (site, cohort, treatment condition) of missing interview forms were identified to determine whether differential patterns of missing forms were evident. In the second phase, individual variables were reviewed to determine the quantity of missing data in kindergarten and in third grade. For those variables with more than one percent of the data missing at a given time point, additional review identified the patterns of missing values across treatment groups, across sites, and across cohorts. No statistical comparisons were completed.

Table 2. Component parts of the tetrahedron mapped to sources of data records, with record counts by years in school

| Attribute of Tetrahedron | Source of Data Record | Year 1 | Year 2 | Year 3 | Year 4 | Years 1 and 3 or 4 |
|---|---|---|---|---|---|---|
| 4 surfaces | Child Instrument (ci) | 7325 | 6065 | 5788 | 5740 | 6193 |
| | Family Interview (fi) | 7078 | 5493 | 5284 | 5196 | 5903 |
| | SARS | 6535 | 5675 | 5383 | 5390 | 5900 |
| | Teacher, Part B (tb) | 6501 | 5331 | 4821 | 4572 | 5445 |
| 6 edges | child-family | 6894 | 5289 | 5037 | 4907 | 4276 |
| | child-SARS | 6435 | 5460 | 5153 | 5145 | 4606 |
| | child-teacher b | 6412 | 5188 | 4714 | 4484 | 3854 |
| | family-SARS | 6253 | 4945 | 4616 | 4583 | 3901 |
| | family-teacher b | 6208 | 4688 | 4229 | 4015 | 3308 |
| | SARS-teacher b | 6054 | 5037 | 4620 | 4438 | 3800 |
| 4 vertices | ci-fi-SARS | 6154 | 4859 | 4539 | 4452 | 5200 |
| | ci-fi-tb | 6124 | 4609 | 4177 | 3962 | 4886 |
| | ci-SARS-tb | 5991 | 4947 | 4560 | 4385 | 5206 |
| | fi-SARS-tb | 5798 | 4492 | 4091 | 3929 | 4807 |
| Volume | Total Family Units | 7515 at Baseline | | | | |

Results
As can be seen in Table 3, a total of 437 families (5.8%) who were enrolled in the study as the children entered kindergarten did not complete a family interview in either the fall or the spring of the kindergarten year. Of these 437 families, 44 percent were enrolled in the demonstration (treatment) group and 56 percent in the comparison (control) group. Slightly more than half (53%) of the families were enrolled in Cohort 1. Each of the 30 sites had at least one family with a missing family interview, but the bulk of the missing family interviews (40%) were concentrated in four sites, each of which accounted for approximately 10 percent of the missing interviews. Of these four sites, one was Florida, where Hurricane Andrew seriously disrupted both program and evaluation activities for a period of more than one year. The other three sites were Nevada, New York, and Virginia. Five additional sites -- Alabama, Massachusetts, Maryland, Ohio, and Oregon -- account for another 35 percent of the missing family interviews. Thus, while every site had some families who did not receive an interview in the kindergarten year, the majority of the missing interviews (75%) can be attributed to only nine sites.

Table 3. Missing forms in kindergarten, by cohort and treatment group

| n = 7,515 | Missing Family Interview | Missing Child Assessment | Missing School Archival Records Search |
|---|---|---|---|
| Total Missing | 437 | 190 | 980 |
| Total Missing (%) | 5.8% | 2.5% | 13.0% |
| Cohort affiliation | | | |
| Cohort 1 | 231 | 99 | 495 |
| Cohort 1 (%) | 52.9% | 52.1% | 50.5% |
| Cohort 2 | 206 | 91 | 485 |
| Cohort 2 (%) | 47.1% | 47.9% | 49.5% |
| Treatment Group assignment | | | |
| Demonstration | 194 | 87 | 491 |
| Demonstration (%) | 44.4% | 45.8% | 50.1% |
| Comparison | 243 | 103 | 489 |
| Comparison (%) | 55.6% | 54.2% | 49.9% |

A total of 190 families (2.5%) did not successfully complete a child assessment in either fall or spring of kindergarten. Of these families, slightly more than half (54%) were enrolled in the comparison group, and approximately half (52%) were enrolled in the first cohort. Approximately 72 percent of the families were located in only five sites -- Alaska (14.2%), Massachusetts (19.0%), Nevada (13.7%), New Jersey (7.4%), and Virginia (17.9%). In the majority of those five sites, missing interviews were equally distributed across both treatment conditions. In two sites (Alaska and Virginia), missing interviews were primarily comparison children. (Note: Virginia did not administer the academic standard instruments, but did administer the interview for children. Thus, child assessment forms exist for Virginia but include a larger percentage of missing data than other sites.) Additionally, in Alaska and Nevada, missing interviews were predominantly Cohort 1 children, while in Massachusetts and Virginia missing interviews were primarily Cohort 2 children.
A total of 980 families (13.0%) did not have a School Archival Records Search form in kindergarten. These families were equally distributed across treatment condition (50.1% demonstration, 49.9% comparison) and across cohorts (50.5% Cohort 1, 49.5% Cohort 2). Approximately half (49.6%) of the missing forms were located in four sites: Massachusetts (8.4%), New York (17.2%), Texas (18.1%), and Ohio (5.9%). The remainder were distributed across the remaining 26 sites.
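
The percentages for the three forms can be recomputed directly from the reported counts, which is a useful check when such tables are rebuilt. The sketch below uses the counts given in Table 3 and the baseline total of 7,515 family units; the dictionary layout and the helper `pct` are assumptions made for this example.

```python
N_ENROLLED = 7515  # total family units at baseline

# Counts as reported in Table 3: (total missing, Cohort 1, demonstration group)
missing_forms = {
    "Family Interview": (437, 231, 194),
    "Child Assessment": (190, 99, 87),
    "School Archival Records Search": (980, 495, 491),
}


def pct(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


for form, (total, cohort1, demo) in missing_forms.items():
    print(
        form,
        pct(total, N_ENROLLED),  # share of all enrolled families
        pct(cohort1, total),     # share of the missing forms in Cohort 1
        pct(demo, total),        # share in the demonstration group
    )
```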
Table 4 summarizes the quantity of missing data found for specific variables after taking into account (deleting from the denominator) the number of observations without a valid interview or assessment form. As can be seen, nine of the 16 key variables have missing values for less than one percent of the study participants with valid interview forms.
Further investigation of the seven variables with more than one percent missing values indicated substantial differences in proportions of missing values between demonstration and comparison groups for only one variable – i.e., child has current IEP. No apparent explanation exists for this difference. It is noted that the majority (139 of 238, 58%) of the missing values for this variable are found in the New York site. Another 14 percent (33) of the missing values are associated with the Arkansas site, and eight percent (18 values) were located in the Michigan data. The remaining missing values show no clear pattern of association with any one site. In both Arkansas and Michigan, the missing values were all in the comparison group, although there is no obvious explanation for this occurrence.
The majority of the missing values for the Peabody Picture Vocabulary Test (PPVT) scores are attributable to the Virginia site, in which neither the PPVT nor the Woodcock-Johnson Tests of Achievement were administered in any data collection period. This circumstance accounts for 344 (96%) of the 358 missing values. Excluding the records from Virginia, there are only 14 children with missing values for the PPVT score (0.2% of the total number with existing kindergarten child assessments).
The greatest proportion of missing values was found on the school impairment variable (child has a health condition that interferes with school attendance). Thirty percent of the 571 missing values were found in three sites (Nevada, New York, and Ohio). Another 13 percent were found in two additional sites (Indiana and Maryland). No specific circumstances were identified that might explain the omission of these data in any of those sites. The remaining missing values were distributed across the remaining 24 sites, with no discernible patterns of loading.
The missing values associated with the social skills rating by the caregiver were also distributed across a total of 29 sites (only Tennessee had no missing values). Approximately 30 percent of the missing values were found in three sites (Nevada, New York, and Ohio). Another 22 percent of the missing values were located in four additional sites -- Arizona, Indiana, Maryland, and Massachusetts -- but no specific explanation of the clustering of missing values in these seven sites is apparent.
Caregiver ethnicity was missing on nearly three percent of the family interviews. Missing values were distributed across 27 of the 30 sites. Six sites -- Indiana, New York, Ohio, Oregon, Wisconsin, and Maryland -- accounted for nearly half (49.0%) of the missing values, and three sites (Illinois, Michigan, and Texas) contributed another 12 percent of the missing values.

Table 4. Missing values for key variables in kindergarten, overall and by treatment group

| Variable | Total Missing, n (%) | Missing Demonstration, n (%) | Missing Comparison, n (%) |
|---|---|---|---|
| Child characteristics | | | |
| Gender* | 81 (1.10%) | 49 (1.30%) | 32 (1.00%) |
| Health impairment that interferes with school attendance* | 571 (8.10%) | 320 (8.60%) | 251 (7.50%) |
| Receptive language ability (PPVT Rasch-Wright score)** | 358 (4.90%) | 187 (4.90%) | 171 (4.90%) |
| Social skills (Social Skills Rating System, Standard score)* | 364 (5.10%) | 193 (5.20%) | 171 (4.90%) |
| Child has current IEP*** | 238 (3.60%) | 79 (2.30%) | 159 (5.10%) |
| Caregiver characteristics | | | |
| Ethnicity* | 192 (2.70%) | 111 (3.00%) | 81 (2.40%) |
| High school graduate* | 545 (7.70%) | 304 (8.10%) | 241 (7.20%) |
| Born outside the US* | 241 (3.40%) | 136 (3.60%) | 105 (3.10%) |
| Caregiver has chronic health condition that interferes with ability to care for child* | 10 (0.1%) | -- | -- |
| Family characteristics | | | |
| Caregiver employed full-time* | 40 (0.6%) | -- | -- |
| Father figure present in home* | 9 (0.1%) | -- | -- |
| Mother present in home* | 8 (0.1%) | -- | -- |
| Family mobility* | 17 (0.2%) | -- | -- |
| Family receives AFDC* | 21 (0.3%) | -- | -- |
| Family receives SSI* | 22 (0.3%) | -- | -- |
| Family speaks language other than English in home* | 9 (0.10%) | -- | -- |

* n = 7,078 (records with a valid family interview instrument)
** n = 7,325 (records with a valid child assessment instrument)
*** n = 6,535 (records with a valid School Archival Records Search form)

Educational level was missing for eight percent (545) of the families. Of those families, 82 percent were interviewed only in the spring, when the educational variable was not included in the interview. Slightly more than 65 percent of the fall interviews with missing values were clustered in a single site -- Minnesota -- where there was a large minority (Hmong) population. Each of those records with missing values for education was associated with an informant who was born outside the U.S. This suggests that these caregiver informants may have been educated outside of the United States and unable to answer the question (even though an appropriate response option was available for this circumstance).
Concerning the immigrant status of the primary caregiver, more than three percent of the data were missing. Approximately 50 percent of these missing values were clustered in five sites: Arizona (10.0%), Indiana (15.4%), New York (10.4%), Ohio (6.2%), and Wisconsin (7.9%). No other patterns of distribution were noted -- i.e., missing values on U.S. births were not associated with particular ethnic groups or any other characteristic of the family or caregiver.
Second and third grade data were combined into a single endpoint data set and missing values of the identified key variables were reviewed in a similar manner. The results of this review are summarized in Table 5 below.
The results of these variable reviews indicate that, after taking into account (deleting from the denominator) the number of observations without a valid interview or assessment form, eight of the 15 variables have missing values for less than one percent of the study participants with valid interview forms.
Further investigation of the seven variables with more than one percent missing values indicated that none of the variables had important differences between demonstration and comparison groups, in terms of the percentages of missing values. The largest percentages of missing values were noted for the caregiver ethnicity and foreign birth (approximately 85% of the existing forms had missing values). The largest percentage (90%) of these missing values are attributable to a defined skip pattern for persons who had been interviewed previously. Thus, the data are available from previous rounds but are not available on this single interview form. The remaining 10 percent of the missing values are also thought to be related to the defined skip pattern, since the “interviewed previously” variable is missing for those records but all of the variables in the skip pattern also show missing values. Thus, it is hypothesized that in those 485 records, the interviewer failed to complete the “previously interviewed” variable but skipped the relevant items because the respondent had been previously interviewed. Approximately 60 percent of these “errors” were located in seven sites (Alabama, Illinois, Massachusetts, Maryland, New York, Rhode Island, and Texas).

Table 5. Missing values for key variables at the combined second and third grade endpoint, overall and by treatment group

| Variable | Total Missing, n (%) | Missing Demonstration, n (%) | Missing Comparison, n (%) |
|---|---|---|---|
| Child characteristics | | | |
| Gender* | 8 (0.10%) | 5 (0.20%) | 3 (0.10%) |
| Health impairment that interferes with school attendance* | | | |
| Receptive language ability (PPVT Rasch-Wright score)** | 310 (5.00%) | 153 (4.80%) | 157 (5.30%) |
| Social skills (Social Skills Rating System, Standard score)* | 152 (2.60%) | 88 (2.80%) | 64 (2.30%) |
| Child has current IEP*** | 32 (0.50%) | 19 (0.60%) | 13 (0.50%) |
| Caregiver characteristics | | | |
| Ethnicity* | 4965 (84.10%) | | |
| Born outside the US* | 4983 (84.40%) | | |
| Caregiver has chronic health condition that interferes with ability to care for child* | 4 (0.10%) | 2 (0.10%) | 2 (0.10%) |
| Family characteristics | | | |
| Caregiver employed full-time* | 41 (0.70%) | 10 (0.60%) | 22 (0.80%) |
| Father figure present in home* | 64 (1.10%) | 38 (1.20%) | 26 (0.90%) |
| Mother present in home* | 63 (1.10%) | 37 (1.20%) | 26 (0.90%) |
| Family mobility* | 17 (0.30%) | 8 (0.30%) | 9 (0.30%) |
| Family receives AFDC* | 3 (0.05%) | 2 (0.06%) | 1 (0.04%) |
| Family receives SSI* | 3 (0.05%) | 2 (0.06%) | 1 (0.04%) |
| Family speaks language other than English in home* | 17 (0.30%) | 7 (0.20%) | 10 (0.40%) |

* n = 5,903 (records with a valid family interview instrument)
** n = 6,193 (records with a valid child assessment instrument)
*** n = 5,900 (records with a valid School Archival Records Search form)

The next most commonly missing variable was the receptive language score, with five percent of the values missing overall. Of the 310 missing values, 270 (87%) were attributable to the Virginia site, where the Peabody Picture Vocabulary Test was not administered at all (in any round of data collection). With that site deleted from the dataset, there were only 40 missing values for the PPVT score (0.7% of the total), and the missing values were equally distributed across demonstration and comparison conditions (15 demonstration, 0.5%; 25 comparison, 0.9%).
The social skills rating by the caregiver was missing for slightly more than two percent of the respondents in third grade. The missing values were equally distributed across demonstration and comparison groups. Some 60 percent of the missing values were located in four sites -- Rhode Island (23.0%), New York (15.8%), Idaho (13.8%), and Arkansas (7.9%). Another 4.6 percent of the missing values were associated with the Arizona site. The remaining 35 percent of the missing values were distributed across the remaining 25 sites.
The variables indicating the presence of the mother and/or a father figure in the home showed virtually identical patterns of missing values. Slightly more than one percent of the data were missing (1.1% overall), and these missing values were located in a total of 10 sites. Five sites -- Arizona, Florida, Idaho, South Dakota, and Oregon -- accounted for slightly more than 75 percent of the missing values, but no differential pattern across treatment conditions was noted overall or for any one site. There is no apparent explanation for the clustering of missing values in these sites.
Discussion
The results of these investigations reveal two important points: first, the quantity of missing data within the National Transition Demonstration Study is well within the parameters of what might be expected within a complex, multi-site, longitudinal study; and second, data are clearly not missing at random. Thus, analyses must proceed carefully, and imputation techniques must be chosen with consideration of the patterns of missing data. For the statistical analyses reported in this overall report of study findings, no imputed data have been included. Future analyses will include data imputed using EM algorithm methods, and the results will be compared.