Technical Report #3
Psychometric Analyses of Instruments Used
PSYCHOMETRIC REVIEW OF SCALES
Scott Snyder
M. Lee Van Horn
In this psychometric review we have included all of the scales in the National Transition Project data set that are being used as measures of the project’s outcomes or predictors of outcomes. Only instruments for which a summary scale score or subscale scores are available are included in this review. The review focuses on the development and validation of the scoring methods for each of the instruments reviewed. With the exception of three nationally developed and standardized assessments, all of the instruments we review here have been validated on the National Transition Project data. As part of the psychometric review, the reliabilities of each instrument were also assessed. Internal reliability (usually Cronbach’s alpha) was computed for all scales or subscales using National Transition data. Test-retest reliabilities were not available from the Transition data; these reliabilities from the original development of each instrument are reported when they are available.
THE ASSESSMENT PROFILE FOR EARLY CHILDHOOD PROGRAMS: RESEARCH VERSION
Abbott-Shim & Sibley, 1998
The classroom environment is one of the areas that was targeted for change via the Transition Demonstration Project. Emphasis was placed on creating classroom environments and implementing classroom practices that were optimal for children’s social and academic development. The quality of classroom environments and practices was considered an important dependent and mediating variable in the National Transition Demonstration Project. The Assessment Profile for Early Childhood Programs: Research Edition (Abbott-Shim & Sibley, 1987, 1992) was selected as the major tool for documenting classroom practices in this study. Although this tool was originally designed for use in preschool programs, it had been adapted for use in kindergarten classrooms (Abbott-Shim & Sibley, 1992). This adaptation, the Assessment Profile for Early Childhood Programs: Research Version, served as the foundation instrument for assessment of classroom practices in this study.
The Assessment Profile consists of 87 dichotomous judgments (observed, not observed) made during a single observation within a classroom. Judgments are made in the following areas: Learning Environment (concerning the availability, variety and appropriateness of learning materials and space), Scheduling (concerning the evidence of a schedule that balances a variety of activities), Interacting (concerning the quality of teacher-child interactions and the nature of classroom management), Curriculum (concerning the nature of instructional delivery), and Individualizing (concerning the nature and use of assessment).
To establish content validity, the Assessment Profile was cross-referenced with the Accreditation Criteria of the National Association for the Education of Young Children (NAEYC) and the Early Childhood Environment Rating Scale (Harms & Clifford, 1980). Reported reliabilities were in the .90 range and inter-rater reliabilities ranged from .85 to 1.00. Trainers of raters from each site were trained by the developers of the Assessment Profile to an .85 agreement criterion before being qualified to train local observers (also to the .85 criterion).
During the second year of the project, several items were added to the instrument in order to be more sensitive to characteristics and expectations of primary grade classrooms. Preliminary factor analyses of the Assessment Profile using Headstart Transition data suggested minor to moderate variations from the initial factor structure across grade levels. These variations reflected differences between the initial calibration sample and the primary grade classrooms of the Headstart Transition sample, a lack of variability for several items, and the inclusion of the new field test items. While the national evaluation team and the developers of the Assessment Profile recognized such variation as meaningful, there was agreement that for the purposes of longitudinal analysis a factor structure should be selected that can be used across grade levels. That is, while cross-sectional analyses reveal moderate differences in factor structures for kindergarten and third grade classrooms, a common factor solution across grade levels was needed in order to address core research questions.
With approval of the project officer and monitoring by the directors and staff of the transition study, the developers of the Assessment Profile: Research Version used item response theory strategies and factor analyses to generate a five factor solution that was applicable to longitudinal comparisons. The resulting solution preserved the structure of the Assessment Profile: Research Version. The following text and tables are from the Psychometric Report of the Assessment Profile for Early Childhood Programs: Research Version for the National Transition Demonstration Project (Abbott-Shim, Sibley & Neel, 1998).
Psychometric Report of the Assessment Profile for Early Childhood Programs: Research Version for the National Transition Demonstration Project
March 1998
Martha Abbott-Shim
Annette Sibley
John Neel
Historical Development of the Assessment Profile for Early Childhood Programs
The development of the Assessment Profile for Early Childhood Programs (Abbott-Shim & Sibley, 1987) began in 1975 as a formative assessment measure of the effectiveness of a child care teacher training project. The intent of the instrument was to document the application of teacher training in classroom settings. The original Assessment Profile contained 147 items across six dimensions: Health & Safety (24 items), Learning Environment (18 items), Scheduling (23 items), Curriculum (28 items), Interacting (32 items), and Individualizing (22 items). These dimensions were chosen to represent the training content and simultaneously represented a logical conceptual organization of the elements of classroom practices. Training experience demonstrated that the dimensions were inter-related and that changes in one dimension generally resulted in changes in other dimensions. The correlation coefficients of the scales also supported the inter-relatedness of the dimensions (Abbott-Shim, Sibley & Neel, 1992).
The Assessment Profile for Early Childhood Programs: Research Version (Abbott-Shim & Sibley, 1992) was developed in response to the interest of researchers who were seeking an efficient, objective, observational measure of classroom practices. In an effort to respond to researchers who wanted to eliminate redundancies and reduce the number of items to “critical criteria”, each dimension of the Assessment Profile was factor analyzed to determine if there was sufficient common variance to meet the requirements of Item Response Theory (IRT) to form scales. The items in the Health & Safety scale did not share sufficient variance to meet the criteria of IRT. Therefore the Health & Safety scale was dropped from the Assessment Profile: Research Version. The remaining five scales (Learning Environment, Scheduling, Curriculum, Interacting, and Individualizing) had sufficient variance and were retained.
The National Transition Demonstration Project data set has provided a substantial number of classrooms and was, therefore, used to confirm the original analyses of the five scales. These analyses revealed modest changes in the factor structures. The original norming sample included 401 preschool, child care, Head Start and kindergarten classrooms (Abbott-Shim, Sibley & Neel, 1992). Since the Transition sample is larger and included primary grade classrooms, it is understandable that a slightly different factor structure might emerge. The analysis also examined a number of field test items that were included with the Transition sample.
Initial Analyses of Factor Structures of the Assessment Profile: Research Version with Transition Data
In conducting the analyses and examining the scales for the Assessment Profile: Research Version, several assumptions were taken into account. First, the authors of the Assessment Profile: Research Version recognize their predisposition to preserve the original scales of the instrument. However, in examining the analyses, the authors utilized objective criteria throughout the scale revision process. Second, the 87 items that were common across the data sets for all grade levels were used in the analyses. Therefore, additional field test items which had been added to the instrument and collected at some of the data collection periods were eliminated from these analyses because they were not available for kindergarten through third grade data sets. Third, it was decided that there would be the same number of items for each of the revised scales on the Assessment Profile: Research Version. The item pools for each of the scales had differing numbers of items and some scales had greater numbers of items that were more acceptable than other scales. Therefore, a few good items were eliminated from some scales because item selection was needed to obtain an equal number of items for each scale. Finally, IRT was used as the primary selection criterion since IRT scoring is used for the instrument.
The National Transition Assessment Profile: Research Version data included kindergarten, first, second, and third grade classrooms. All factor analyses reported here used tetrachoric correlations. As seen in Table 1, separate factor analyses for kindergarten, first, second, and third grade levels were conducted to determine if there were differences in factor structures across these grade levels. Table 1 reports the findings for 933 kindergarten, 935 first grade, 762 second grade, and 820 third grade classrooms. Factor loadings of .40 and above are reported for the items represented in the first five factors by grade level.
The factors and items are fairly stable across the different grade levels with the exception of first grade in which the Interacting and Curriculum items merge into one factor. Although additional factors emerged in these analyses, none of these were stable across the different grade levels.
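Because the Assessment Profile items are dichotomous (observed, not observed), the factor analyses start from tetrachoric rather than Pearson correlations. The report does not say how the tetrachoric correlations were estimated. As a rough illustration only, the sketch below uses the classical cosine-pi closed-form approximation computed from a 2x2 table of two items; statistical packages typically use iterative maximum likelihood instead. The function name and the example cell counts are hypothetical.

```python
import math


def tetrachoric_cos_pi(a, b, c, d):
    """Cosine-pi approximation to the tetrachoric correlation.

    a = classrooms where both items were observed
    d = classrooms where neither item was observed
    b, c = the two discordant cells (one item observed, the other not)
    """
    if b * c == 0 and a * d == 0:
        raise ValueError("degenerate table: correlation is undefined")
    if b * c == 0:
        return 1.0   # no discordant classrooms in one direction
    if a * d == 0:
        return -1.0  # no concordant classrooms in one direction
    odds_ratio = (a * d) / (b * c)
    return math.cos(math.pi / (1.0 + math.sqrt(odds_ratio)))


# Hypothetical counts for two items across 100 classrooms.
r_independent = tetrachoric_cos_pi(25, 25, 25, 25)  # odds ratio of 1
r_positive = tetrachoric_cos_pi(40, 10, 10, 40)     # odds ratio of 16
```

An odds ratio of 1 gives a correlation of essentially zero, and larger odds ratios push the approximation toward 1.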
The results of these factor analyses across grade levels were similar enough that a factor analysis for the combined data set, including 2,630 classrooms across kindergarten, first and second grades, was conducted. Five factors in this analysis accounted for 71% of the variance. These five factors were clearly the five scales of the Assessment Profile: Research Version. The factor structure of the combined groups was similar to the factor structures of the separate groups. In addition to these five factors, there were eight other factors, each factor consisting of one to five items. Items measuring factors other than the five factors of the Assessment Profile: Research Version were prime candidates for removal from the scales. Since we were creating IRT based scales, we decided to use IRT procedures to select items.
Item Response Theory (IRT) Analyses for the Assessment Profile: Research Version
We were encouraged by the five factors to try an IRT scaling for the five scales of the Assessment Profile: Research Version for the combined data sets, 2,630 kindergarten, first, and second grade classrooms. First we conducted separate factor analyses on the items for each of the scales and found that the primary factor accounted for 59%, 63%, 41%, 71%, and 45% of the variance of the items on the Learning Environment, Scheduling, Curriculum, Interacting, and Individualizing scales respectively. These percentages substantially exceed Reckase’s (1979) 25% requirement for essential unidimensionality for the creation of an IRT based scale. Unidimensionality was supported and creation of IRT based scales was thus justified.
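The unidimensionality check described above can be sketched as follows. This is a simplified stand-in, not the authors' procedure: it takes the share of variance carried by the largest eigenvalue of a correlation matrix (a principal component share, found by power iteration) and compares it with the 25% threshold attributed to Reckase (1979). The toy correlation matrix and the function name are assumptions; the reported percentages are taken from the paragraph above.

```python
def first_factor_share(corr, iterations=500):
    """Share of total variance carried by the largest eigenvalue of `corr`.

    Uses power iteration starting from a vector of ones, which works when
    the items are mostly positively correlated.
    """
    n = len(corr)
    vec = [1.0] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(corr[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in product) ** 0.5
        vec = [x / norm for x in product]
        eigenvalue = norm  # converges to the largest eigenvalue
    return eigenvalue / n  # the trace of a correlation matrix equals n


RECKASE_THRESHOLD = 0.25

# Toy 3-item correlation matrix (hypothetical values).
toy_corr = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.5, 1.0],
]
toy_share = first_factor_share(toy_corr)

# First-factor shares reported in the text for the full item pools.
reported = {
    "Learning Environment": 0.59,
    "Scheduling": 0.63,
    "Curriculum": 0.41,
    "Interacting": 0.71,
    "Individualizing": 0.45,
}
all_pass = all(share > RECKASE_THRESHOLD for share in reported.values())
```

Even the weakest scale (Curriculum, 41%) clears the 25% threshold with room to spare.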
It was our initial expectation that we would base the scales on a two-parameter IRT model because the scores are based on observation and there is not a possibility of guessing. However, there was enough difference in the fit statistics of the items to justify a three-parameter rather than a two-parameter model. The difference in the fit statistics occurred because the lower asymptote for some item characteristic functions was obviously non-zero. Figure 1 illustrates the difference in fit as plotted by the BILOG (Mislevy & Bock, 1990) computer program. Figure 1 shows the fit of a three-parameter model of an item with lower asymptote of .50. In Figure 1 the dots represent the three-parameter item characteristic curve found for this item. The x’s represent found proportions of positive responses to the item, and the asterisks represent what the lower asymptote might look like if the two-parameter model were used with the required lower asymptote of zero. As can be seen from the figure by the greater distance from the x’s to the asterisks than from the x’s to the dots, the lower asymptote of zero does not fit the observed proportion well. It is this lack of fit which leads to a three-parameter model rather than a two-parameter model. Examination for item fit yielded 12 usable items on each scale. The IRT reliabilities of the Learning Environment, Scheduling, Curriculum, Interacting, and Individualizing scales were .89, .95, .80, .93, and .90, respectively.
We interpreted the non-zero lower asymptotes under the three-parameter model to mean that even for a lower functioning classroom, there was still a positive probability that the item would be observed. For example, item #12 in the Scheduling scale [Classroom activities reflect variety: There is daily time when Teacher works with a small group of three to eight children.] has an asymptote of .50, while item #9 in the Curriculum scale [Curriculum is individualized: Activities that involve children of differing skill levels are modified to accommodate variation within the group] has an asymptote of .11. These asymptotes reflect that about half of the lower functioning classrooms were observed to have a teacher working with a small group, while in about one of ten lower functioning classrooms, the teacher was observed to modify activities for different skill levels.
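The role of the lower asymptote can be made concrete with a minimal sketch of the three-parameter logistic item characteristic function. The asymptote values (.50 and .11) come from the text; the discrimination `a` and difficulty `b` values are hypothetical, and any scaling constant a particular program applies is assumed to be absorbed into `a`.

```python
import math


def icc_3pl(theta, a, b, c):
    """Three-parameter logistic item characteristic function.

    theta = classroom trait level, a = discrimination,
    b = difficulty, c = lower asymptote.
    Setting c = 0 gives the two-parameter model.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Asymptotes from the text; a and b are made up for illustration.
low_theta = -10.0
p_scheduling_12 = icc_3pl(low_theta, a=1.2, b=0.0, c=0.50)  # stays near .50
p_curriculum_9 = icc_3pl(low_theta, a=1.2, b=0.0, c=0.11)   # stays near .11
p_two_param = icc_3pl(low_theta, a=1.2, b=0.0, c=0.0)       # falls toward 0
```

For very low trait levels the three-parameter curve levels off at `c`, which is why a model forced to a zero asymptote fits such items poorly.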
Finally, we conducted separate factor analyses on the 12 items for each of the scales and it was found that the primary factor accounted for 75%, 75%, 55%, and 58% of the variance of the items on the Learning Environment, Scheduling, Curriculum, Interacting, and Individualizing scales, respectively. The median factor loadings were .82, .97, .75, .84, and .82, respectively.
Factor Structures of the Revised Scales on the Assessment Profile: Research Version
Having selected the 12 items with the best fit for each of the five scales, we ran a factor analysis to examine the resulting factor structure. Using the 2,630 classrooms across kindergarten, first and second grades, five factors accounted for 78% of the variance. Table 2 reports the factor loadings of .30 and above for 57 items on the five scales of the Assessment Profile: Research Version. Only three items did not load on the first five factors. However, the IRT analysis was able to use these items in forming the scales, since all items fit well. Since IRT requires only essential unidimensionality, as opposed to strict unidimensionality, for creation of a scale, it’s not surprising that a few items did not load on the five factors.
The five scales of the Assessment Profile: Research Version have met the unidimensionality criteria for IRT creation of scales, shown strong fit to a three-parameter IRT model, and have strong IRT reliability estimates for the sample used. In addition, the factor analysis of the revised scales, which included 60 items, confirmed the factor structure. The Assessment Profile: Research Version is thus shown to be a useful set of scales for measuring developmentally appropriate teaching practices.
References
Abbott-Shim, M.S. & Sibley, A.N. (1987). Assessment Profile for Early Childhood Programs. Atlanta: Quality Assist, Inc.
Abbott-Shim, M.S. & Sibley, A.N. (1992). Assessment Profile for Early Childhood Programs: Research Version. Atlanta: Quality Assist, Inc.
Abbott-Shim, M.S., Sibley, A.N., & Neel, J. (1992). Research Manual, Assessment Profile for Early Childhood Programs. Atlanta: Quality Assist, Inc.
Mislevy, R.J., & Bock, R.D. (1990). BILOG 3 [Computer software]. Chicago: Scientific Software.
Reckase, M.D. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 4 (3), 207-230.

Table 1. Factor loadings of .40 and above by grade level

| Item Number (Interacting) | Kindergarten (n=933), Factor 1 | 1st Grade (n=935), Factor 1* | 2nd Grade (n=762), Factor 2 | 3rd Grade (n=820), Factor 5 |
|---|---|---|---|---|
| A1 | 0.88 | 0.91 | 0.77 | |
| A2 | 0.84 | 0.9 | 0.78 | |
| A3 | 0.81 | 0.86 | 0.58 | |
| A4 | 0.71 | 0.9 | 0.62 | |
| B1 | 0.82 | 0.83 | 0.81 | 0.61 |
| B2 | 0.73 | 0.83 | 0.72 | |
| B3 | 0.75 | 0.84 | 0.65 | |
| C1 | 0.67 | 0.41 | 0.61 | 0.49 |
| C2 | 0.94 | 0.8 | 0.92 | 0.83 |
| C3 | 0.83 | 0.68 | 0.74 | 0.73 |
| C4 | 0.76 | 0.73 | 0.82 | 0.87 |
| C5 | 0.89 | 0.84 | 0.87 | 0.8 |
| D1 | 0.72 | 0.84 | 0.54 | |
| D2 | 0.69 | 0.74 | 0.51 | |
| D3 | 0.43 | 0.57 | ||
| Sch A3 | 0.63 | |||
| Sch C2 | 0.62 | |||
| Cur D1 | 0.6 | 0.65 | ||
| Cur D2 | 0.62 |

| Item Number (Learning Environment) | Kindergarten (n=933), Factor 2 | 1st Grade (n=935), Factor 3 | 2nd Grade (n=762), Factor 1 | 3rd Grade (n=820), Factor 1 |
|---|---|---|---|---|
| A1 | 0.96 | 0.89 | 0.96 | 0.87 |
| A2 | 0.71 | 0.54 | 0.7 | 0.81 |
| A3 | 0.96 | 0.88 | 0.94 | 0.9 |
| A4 | 0.97 | 0.85 | 0.84 | 0.58 |
| A5 | 0.93 | 0.93 | 0.84 | |
| A6 | 0.94 | 0.91 | 0.92 | 0.92 |
| A7 | 0.96 | 0.88 | 0.94 | 0.92 |
| A8 | 0.82 | 0.58 | 0.82 | 0.75 |
| A9 | 0.7 | 0.55 | 0.93 | 0.88 |
| B1 | 0.48 | 0.55 | 0.69 | 0.43 |
| B3 | 0.88 | 0.72 | 0.82 | 0.84 |
| B5 | 0.44 | |||
| C2 | 0.57 | 0.62 | 0.45 | |
| Cur A1 | 0.6 | |||
| Cur B2 | 0.64 | |||
| Cur B3 | 0.61 | 0.76 | ||
| Cur B4 | 0.81 | 0.82 | ||
| Cur C1 | 0.56 | 0.6 | ||
| Cur C2 | 0.86 | 0.66 | ||
| Cur C5 | 0.69 | 0.9 | 0.81 | |
| Cur C6 | 0.57 | 0.61 | ||
| Int D3 | 0.84 | 0.59 | ||
| Sch C3 | 0.69 | 0.7 | ||
| Ind C2 | 0.54 | |||
| Int D3 | 0.87 |

| Item Number (Curriculum) | Kindergarten (n=933), Factor 5 | 1st Grade (n=935), Factor 1 | 2nd Grade (n=762), Factor 5 | 3rd Grade (n=820), Factor 5 |
|---|---|---|---|---|
| A3 | 0.59 | |||
| B1 | 0.7 | 0.59 | ||
| B2 | 0.58 | 0.57 | ||
| B3 | 0.53 | 0.75 | ||
| B4 | 0.58 | 0.58 | ||
| B5 | 0.65 | 0.71 | 0.55 | 0.67 |
| B6 | 0.78 | 0.49 | ||
| B7 | 0.7 | 0.71 | ||
| C1 | 0.78 | 0.63 | ||
| C2 | 0.72 | 0.47 | ||
| C3 | 0.77 | 0.79 | 0.4 | |
| C4 | 0.73 | 0.53 | ||
| C5 | 0.46 | |||
| D1 | 0.48 | 0.64 | ||
| D2 | 0.59 | |||
| D3 | 0.46 | |||
| D4 | 0.67 | 0.72 | ||
| D5 | 0.42 | 0.83 | ||
| LE A10 | 0.49 | |||
| Ind D2 | 0.5 | |||
| Sch A3 | 0.57 | |||
| Sch C1 | 0.79 | |||
| Sch C4 | 0.74 |

| Item Number (Scheduling) | Kindergarten (n=933), Factor 3 | 1st Grade (n=935), Factor 2 | 2nd Grade (n=762), Factor 3 | 3rd Grade (n=820), Factor 3 |
|---|---|---|---|---|
| A1 | .93 | .98 | .94 | .98 |
| A2 | .59 | .46 | ||
| B1 | .97 | .99 | .92 | .97 |
| B2 | .97 | .97 | .94 | .96 |
| B3 | .96 | .96 | .96 | |
| B4 | .85 | .89 | .84 | .80 |
| B5 | .97 | .99 | .94 | .98 |
| B6 | .84 | .94 | .85 | .83 |
| B7 | .95 | .96 | .96 | .95 |
| B8 | .96 | .99 | .95 | .99 |
| C3 | .46 | |||
| LE C2 | .52 | |||
| Cur A1 | .52 |

| Item Number (Individualizing) | Kindergarten (n=933), Factor 4 | 1st Grade (n=935), Factor 4 | 2nd Grade (n=762), Factor 4 | 3rd Grade (n=820), Factor 2 |
|---|---|---|---|---|
| A1 | .76 | .72 | .72 | .73 |
| A2 | .69 | .75 | .65 | .59 |
| A3 | .79 | .66 | .81 | .79 |
| B1 | .90 | .80 | .79 | .73 |
| B2 | .92 | .87 | .92 | .89 |
| B3 | .86 | .79 | .90 | .95 |
| B4 | .90 | .89 | .92 | .94 |
| B5 | .59 | .67 | .67 | .75 |
| C1 | .44 | |||
| E4 | .64 | |||
| Cur C4 | .49 | |||
| Cur D5 | .40 | |||
| Cur D6 | .87 | .71 | .76 | .89 |
| Cur E4 | .52 | .64 | .60 |
To put the synopsis from Quality Assist in perspective, a report by a member of the Transition Demonstration Project who monitored the work of Quality Assist follows:
SUMMARY FEEDBACK FROM THE MEETING WITH QUALITY ASSIST TO DISCUSS THE LONGITUDINAL REVISION OF THE ASSESSMENT PROFILE
Scott Snyder
The development of the Assessment Profile (2nd revision) by the staff of Quality Assist for use in the longitudinal evaluation of the Headstart Transition project was based on several critical decisions. First, a decision was made to preserve, to the extent possible, the initial five-factor framework of the original Assessment Profile. Second, as the Assessment Profile had undergone revision during the course of the longitudinal study, a decision was made to base longitudinal instrument development only on those 87 items common to all grades and cohorts. Third, in order to avoid interdependencies in the data due to classrooms being used for both cohorts, each classroom was analyzed only once. Fourth, it was decided to generate scales with equal numbers of items. Finally, a decision was made to use item response theory (IRT) as the major analytical procedure for constructing scales and for reporting results. The fourth and fifth decisions are psychometrically related (i.e., it is the use of IRT that enables scales of equal numbers of items to be generated without substantial loss of validity within any single scale).
The research done by Quality Assist appears to conform to technical standards of instrument development. The development of a longitudinal scale presents unique psychometric challenges associated with selecting scale structures that are both reasonably stable across grade levels and also interpretable. Through the use of IRT, factor analysis, qualitative and quantitative examination of item performance and qualitative interpretation of scale composition, the researchers have generated a psychometrically defensible and acceptable instrument for monitoring classroom characteristics at distinct grade levels.
The researchers at Quality Assist acknowledge that an independent group of researchers analyzing the data may, based on different decisions, generate an instrument comprised of different scales or items. However, it appears, based on my review of the analyses presented by Quality Assist, that an alternate structure capable of accommodating longitudinal data would not vary greatly from the proposed measure. It should not be surprising that an approximation of the original 5 factors would emerge from the analysis by Quality Assist, given that the item set that was analyzed was selected to reflect those 5 factors.
I have asked for some analyses to be performed before we formally recommend the generation of local and national data using the revised measure. First, I have asked for an investigation of the item variance across grade levels. This analysis is important in validating the stability of the scales across time. Second, I have asked that a quantitative demonstration be developed of the power of IRT to yield approximately equal scores based on 10, 12, or 14 items (or a similar distribution of items) within each scale of the Assessment Profile. Finally, I have asked for an analysis of the unidimensionality of the revised scales within each grade level. As there is sufficient evidence of unidimensionality across grade levels to justify use of the instrument for longitudinal analysis, unidimensionality within grade level would serve to support the use of the IRT scores for analysis within grade level.
It is important for users of the revised measure to examine the items which comprise each of the revised scales. While Quality Assist has elected to preserve the original labels for the factors, the constructs reflected by the remaining items are somewhat different than those in the original instrument.
Two features of the addendum by Dr. Snyder warrant comment. First, it is important to note that while the work of Quality Assist is psychometrically sound, an independent group of researchers may, using the available data, generate an alternative and equally defensible factor structure that does not match the five factor solution proposed by Quality Assist. Second, reductions and transpositions of items comprising subscales of the instrument alter, to some extent, the constructs represented by such subscales. Specifically the following changes are notable:
Learning Environment: A disproportionate number of items relating to the arrangement of classroom space to encourage independence and reflect the individuality of the child have been removed. Therefore, the Learning Environment subscale now primarily reflects (9 of 12 items) the nature and accessibility of instructional materials.
Scheduling: The changes in this scale involve the deletion of items concerning the variety of classroom activities. To some extent such items are implicit in other items, and therefore may not represent a substantial change in the underlying construct measured by the scale.
Curriculum: All items relating to multicultural sensitivity and appreciation were excluded due to the misfit of item performance with the expectations of the psychometric model used to generate the scale. More than half of the items concerning the ability of children to guide their own learning were eliminated. Half of the items relating to the individualization of curriculum were removed. Therefore, the Curriculum scale now focuses more on teacher instructional behaviors than previously.
Interacting: Due to their absence from kindergarten observations, all items dealing with the teacher providing support for student self-regulation were omitted. Furthermore, only one item (out of four) concerning child engagement within the instructional context was retained. Items concerning children’s affect were deleted or moved to another subscale.
Individualizing: All three items concerning the inclusion of, and accommodations for, children with special needs have been omitted. More than half of the items concerning parent-teacher communication were also omitted. These changes focus the revised scale on assessment practices.
The aforementioned changes highlight: (a) the need to limit interpretation of the Assessment Profile: Research Version for the Transition Study to the modified constructs discussed above, and (b) the need to examine not only scale scores from the Assessment Profile but also the individual items regarding multicultural sensitivity and inclusive practices for their value in evaluating the quality of classroom practices and as mediators of child outcomes.
Analyses of internal consistency of the resulting scales yielded coefficient alphas ranging from .78 to .91 across scales and grades. Moderate intercorrelations were evident amongst the scales within grade level, suggesting that while the scales are not interdependent, they do tend to covary.
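For readers who want to reproduce an internal consistency estimate of this kind, the following is a minimal sketch of coefficient (Cronbach's) alpha: alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), where k is the number of items. The function name and the four-classroom data set are hypothetical and are not Transition data.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Coefficient alpha for complete data.

    rows: one list of item scores per classroom, all the same length,
    with no missing values.
    """
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1.0 - sum(item_variances) / total_variance)


# Hypothetical dichotomous ratings: 4 classrooms by 3 items.
ratings = [
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
]
alpha = cronbach_alpha(ratings)
```

The function assumes the total scores vary across classrooms; if every classroom has the same total, the denominator is zero and alpha is undefined.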
A DEVELOPMENTALLY APPROPRIATE PRACTICES TEMPLATE (ADAPT)
Gottlieb, 1995
In addition to the Assessment Profile for Early Childhood Programs: Research Version (Abbott-Shim & Sibley, 1992), a second measure was developed to evaluate the success of transition services in influencing effective classroom practices and to examine the influence of classroom characteristics and practices on academic and social outcomes. A group of investigators involved in the Transition Study was interested in developing a second measure of classroom practice that they felt was more closely aligned with the guidelines for developmentally appropriate practice proposed by the National Association for the Education of Young Children (Bredekamp, 1993). The resulting measure, A Developmentally Appropriate Practice Template (ADAPT) (Gottlieb, 1995), involves observational ratings of eighteen attributes of classrooms. Attributes are nested within three scales: Curriculum and Instruction, Interaction, and Classroom Management. A composite rubric is also used to generate an overall rating for each classroom. The scale was initially developed on a sample of first and second grade classrooms. Exact inter-rater reliability values for an initial subset of 68 classrooms across 21 schools ranged from .69 to .78. These values are somewhat less than the standard criterion of 85% agreement. However, as this instrument was optional, local sites were not required to establish higher levels of reliability. It should also be noted that exact agreement is a stringent criterion for Likert items. Gamma coefficients are often reported as alternative indices of interrater reliability with such scales.
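The contrast between exact agreement and a gamma coefficient can be illustrated with a short sketch. Exact agreement counts only identical ratings, while the Goodman-Kruskal gamma asks whether two raters order classrooms the same way and ignores tied pairs. The ratings below are hypothetical and are not ADAPT data.

```python
def exact_agreement(rater_a, rater_b):
    """Proportion of classrooms given identical ratings by both raters."""
    matches = sum(x == y for x, y in zip(rater_a, rater_b))
    return matches / len(rater_a)


def goodman_kruskal_gamma(rater_a, rater_b):
    """(concordant - discordant) / (concordant + discordant) over all pairs.

    Pairs tied on either rater are ignored. Undefined (division by zero)
    if every pair is tied.
    """
    pairs = list(zip(rater_a, rater_b))
    concordant = discordant = 0
    for i in range(len(pairs)):
        for j in range(i + 1, len(pairs)):
            direction = (pairs[i][0] - pairs[j][0]) * (pairs[i][1] - pairs[j][1])
            if direction > 0:
                concordant += 1
            elif direction < 0:
                discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical 5-point ratings: rater B is consistently one point higher,
# except at the top of the scale.
rater_a = [1, 2, 3, 4, 5]
rater_b = [2, 3, 4, 5, 5]
agreement = exact_agreement(rater_a, rater_b)
gamma = goodman_kruskal_gamma(rater_a, rater_b)
```

In this constructed case the raters agree exactly on only one of five classrooms, yet they rank the classrooms identically, so gamma reaches its maximum.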
While the ADAPT was not required, it was administered in a majority of the study sites. Because the scale was developed and adopted after the first-year of the study, it was not administered until 1995. Therefore, it was available in first-grade classrooms for the second cohort only. The ADAPT was administered to 491 first grade classrooms, 1,083 second grade classrooms, and 1,177 third grade classrooms. Factor analyses yielded a two factor solution in first grade (classroom rules/structure, classroom climate) accounting for 64% of the variance. Only a single factor emerged for second and third grade (accounting for approximately 60% of the variance). Confirmatory factor analyses did not reveal that a two factor solution was an improvement over a one factor solution for any grade level. Given the truncated sample for first grade and the results of the confirmatory factor analyses, a single factor solution was adopted for this instrument. Coefficient alphas exceeded .95 at each grade level. Therefore, the summated total score will serve as the primary outcome index for the ADAPT.
Authors of the ADAPT and the Assessment Profile: Research Version are working conjointly with members of the National Transition Research team to understand the functions of both scales in describing the status and changes of classroom practices. Based on preliminary cluster analyses and correlational studies, it appears that the ADAPT may be tapping a general disposition of teachers toward implementing the guidelines for developmentally appropriate practices proposed by the National Association for the Education of Young Children while the Assessment Profile may isolate more discrete classroom components. A number of studies examining the convergent and divergent validities of these scales will be conducted.
FAMILY RESOURCE SCALE (FRS)
Dunst & Leet, 1987
The Family Resource Scale (FRS) was used in the transition study as a measure of the respondent’s perception of the adequacy of their family resources. This measure of family resources provides a unique assessment of how well-off a family is from their own perspective, which is potentially very different from more standard comparisons such as poverty status. The FRS also provides a broader assessment of family resources, including their ability to meet their basic needs, and an assessment of how much time the family has. This scale assesses families’ perceptions of their needs with the expectation that families will direct their energies to fulfilling their most basic needs first. The instrument has been shown to be related to families’ commitment to intervention (Dunst, Leet, & Trivette, 1988).
The FRS was administered in the fall of kindergarten and again in third grade. Respondents rate thirty items on a five point scale ranging from ‘not at all adequate’ to ‘almost always adequate,’ with ‘does not apply’ being an additional choice. The items are ordered starting with those that are most basic (e.g., ‘Food for 2 meals a day’ and ‘Enough clothes for your family’) to those that are the least basic (‘Money for family entertainment’ and ‘Travel/vacation’). The FRS was developed in consultation with a group of 28 professionals, and was tested on a group of 45 low to middle SES mothers of preschool-aged children (Dunst & Leet, 1987). The instrument had an internal consistency of .92 in the development data set, and a test-retest reliability of .52. A principal components analysis, using varimax rotation, of the development data set yielded an 8 factor solution which accounted for 75% of the variance of the test. This factor set was not used for further analyses.
Because a review of the FRS concluded that the validity of the subscales was not adequately demonstrated (McGrew, 1992), further analyses of the FRS subscales were performed for the Transition study before their use in any analyses. Cohort I data were used in exploratory principal components analyses, reserving Cohort II data as a confirmatory sample. The psychometric analyses began by examining the ‘does not apply’ (N/A) response. For the item ‘Good job for yourself or spouse,’ N/A responses were recoded as ‘1’ because 90% of those responding ‘N/A’ were not employed. The remaining items with over 10% ‘N/A’ responses were dropped from the analyses because they clearly were not relevant to a large portion of the sample. One additional item, ‘Money to buy special equipment/supplies for child(ren),’ was also dropped because it applied to only a subset of the sample. In the remaining 26 items, ‘N/A’ responses were recoded as missing, and the items were entered into principal components analyses using listwise deletion of all missing values. This resulted in a sample of 2,321 Cohort I kindergarten families and 1,883 Cohort I third grade families. A parallel series of principal components analyses (using polychoric correlation matrices and varimax rotation) was performed on the Cohort I kindergarten and third grade data. Complex items that loaded on more than one factor at .45 or greater were dropped. This process resulted in a set of three highly intuitive subscales -- Basics, Money, and Time -- that included 22 of the 30 items on the FRS (see Table 1).
The principal components analyses were followed by confirmatory factor analyses on Cohort II family data from kindergarten and third grade. Following listwise deletion of missing data, 2,688 kindergarten families and 2,101 third grade families were included in the confirmatory analyses. Using LISREL 8 with a CSM estimation procedure (weighted least squares estimation of the polychoric matrices; Kaplan, 1990), the three-correlated-factor model was imposed on the two Cohort II samples. Fit indices were adequate, suggesting that the three-correlated-factor model fit the data reasonably well. Additionally, these fit indices reflected superior model fit relative to competing models.
For further analyses, subscale scores were computed by taking the mean of all the items in each subscale to form three new variables: Basics, Money, and Time. The mean was used so that the three subscales would be in the same metric and would be comparable to the original responses (i.e., a score of 1.5 is halfway between “Not at all adequate” and “Seldom adequate”). As part of this process, the “Does not apply” response category was addressed. For subjects with 20% or fewer N/A responses on a subscale, each N/A response was replaced with a value imputed from the FRS items they did answer (imputed via an EM algorithm). This method allowed us to include respondents who knew how to answer most of the items on the subscale. It is the least biased method for dealing with these responses in that the imputed values have very little effect on any given respondent’s subscale scores, and many fewer respondents are eliminated from the analyses. Dropping respondents from the analyses reduces statistical power and also biases the sample if the respondents who are dropped differ at all from those who remain.
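The scoring rule described above can be illustrated with a short sketch. It is not the project’s scoring code: the function name `score_subscale`, the use of `None` to mark a “Does not apply” response, the five-item example, and the imputed values are all assumptions made for illustration. The EM imputation itself is not implemented here; imputed values are simply passed in by the caller.

```python
def score_subscale(responses, imputed=None, max_na_share=0.20):
    """Return the mean item score for one respondent on one subscale.

    responses    -- item ratings (1-5), with None marking "Does not apply"
    imputed      -- hypothetical dict {item position: imputed value}, standing
                    in for the output of a separate EM imputation step
    max_na_share -- largest share of N/A items for which a score is assigned
    """
    imputed = imputed or {}
    n_items = len(responses)
    na_positions = [i for i, r in enumerate(responses) if r is None]

    # Too many N/A responses: the subscale score stays missing.
    if len(na_positions) / n_items > max_na_share:
        return None
    # An N/A item without an imputed value cannot be scored.
    if any(i not in imputed for i in na_positions):
        return None

    filled = [imputed[i] if r is None else r for i, r in enumerate(responses)]
    return sum(filled) / n_items


# One N/A out of five items (20%), with an imputed value supplied.
print(score_subscale([1, 2, 2, None, 1], imputed={3: 1.5}))  # 1.5
# Two N/A out of five items (40%) exceeds the cutoff, so no score is assigned.
print(score_subscale([1, None, None, 2, 3], imputed={1: 2.0, 2: 2.0}))  # None
```

Because the score is a mean rather than a sum, the result stays on the original 1 to 5 response metric regardless of how many items a subscale contains.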
| Basic Needs | Money | Time | |
|---|---|---|---|
| Food for 2 meals a day | X | ||
| House or apartment | X | ||
| Enough clothes for your family | X | ||
| Heat for your house or apartment | X | ||
| Indoor plumbing/water | X | ||
| Medical care for your family | X | ||
| Furniture for your home or apartment | X | ||
| Telephone or access to a phone | X | ||
| Dental care for your family | X | ||
| Good job for yourself or spouse | X | ||
| Money to buy things for self | X | ||
| Money for family entertainment | X | ||
| Money to save | X | ||
| Travel/vacation | X | ||
| Time to get enough sleep/rest | X |
In summary, our psychometric review of the FRS found the originally proposed subscales to be inadequate. They were modified through a series of exploratory principal components analyses and then verified with confirmatory factor analyses. This process resulted in three highly intuitive subscales: Basics, Money, and Time. Scores for these subscales were computed for all respondents; for those with a small number of “Does not apply” responses, imputation was used to replace those responses. These subscales were also found to have adequate internal reliabilities (Cronbach’s alphas were between .72 and .87).
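For readers unfamiliar with the internal reliability statistic reported throughout this review, the sketch below computes Cronbach’s alpha from its standard formula, alpha = k/(k-1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The function name and the tiny data sets are made up for illustration; the report does not state which software was used to compute the alphas.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for complete data.

    rows -- one list of item scores per respondent (no missing values)
    """
    k = len(rows[0])                      # number of items
    items = list(zip(*rows))              # one tuple of scores per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Three items that rank respondents identically: alpha is 1.
consistent = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
print(round(cronbach_alpha(consistent), 6))

# Two items that are unrelated across respondents: alpha is 0.
unrelated = [[1, 2], [2, 1], [1, 1], [2, 2]]
print(round(cronbach_alpha(unrelated), 6))
```

The two examples mark the ends of the usual range: alpha approaches 1 when items move together and falls toward 0 when they do not.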
FAMILY ROUTINES INVENTORY (FRI)
Jensen, James, Boyce, & Hartnett, 1983
The family context has been seen as providing order and stability in the lives of children (Boyce, Jensen, James, & Peacock, 1983). The Family Routines Inventory (FRI) is an instrument developed to measure differences between families in the ordering of their everyday lives. The presence of routines is expected to buffer against the stressors that families and children experience.
The FRI was administered as part of the family interview in the fall of kindergarten and again as an optional instrument in third grade. Initial examination of the FRI in the Transition data focused on possible subscales. The FRI was examined separately in Cohorts I and II for kindergarten and third grade using exploratory principal components analyses (with polychoric correlation matrices and varimax rotation). The results of these analyses were inconsistent across cohorts and time periods, and they also failed to show theoretically significant subscales. Consequently, the authors’ suggested scoring protocol of using one frequency score (a summation of the ordinal values) is used for all further analyses.
The FRI was originally validated on a sample of 307 mothers who represented diverse ethnic and socioeconomic backgrounds (Jensen, James, Boyce, & Hartnett, 1983). The instrument was found to have acceptable test-retest reliability (.79) and was also validated by comparison with the subscales of the Family Environment Scale that measure similar constructs. Scores were found to be moderately related to the Family Environment Scale subscales, to family income, and to the age of the oldest child in the family. The FRI scores in the Transition data were found to have acceptable internal consistency (Cronbach’s alphas: kindergarten = .71, third grade = .77).
NEIGHBORHOOD SCALES
Furstenberg, Cook, Eccles, Elder, & Sameroff, 1990
The Neighborhood Scales measure positive and negative dimensions of the neighborhood a family lives in, from the perspective of the respondent (Furstenberg, Cook, Eccles, Elder, & Sameroff, 1990). This measure allows family units in the National Transition Project to be placed in the broader context of their neighborhoods. The Neighborhood Scales consist of six scales, which measure neighborhood cohesiveness, barriers to services, negative effects, social control in the neighborhood, probability of success for children in the neighborhood, and a global rating of the neighborhood.
The Neighborhood Scales were administered as part of the family interview in the fall of kindergarten and again in first grade. In third grade, one of the six subscales was part of the National Core; the remaining five subscales were optional and were administered in 10 sites. As part of the psychometric review, the six scales of this instrument were tested with two confirmatory factor analyses (CFAs) using kindergarten and first grade data. The CFAs (using polychoric correlations and the WLS estimation procedure) showed strong support for the existence of six correlated subscales (all fit indices were above .90) and rejected the alternative model in which all items load on a single construct. The six subscales were moderately correlated (from -.49 to .67) in each grade. The internal reliability of the scales was adequate (Cronbach’s alphas ranged from .74 to .87 in kindergarten and from .76 to .88 in first grade).
PEABODY PICTURE VOCABULARY TEST-REVISED (PPVT-R)
Dunn & Dunn, 1981
Due to the importance of communication and comprehension, receptive language is considered an important factor in a child’s successful transitions in school. The Peabody Picture Vocabulary Test-Revised (PPVT-R) was used as the measure of receptive vocabulary. The test is not considered to be a general test of intelligence, and it was not used as a proxy measure of general or verbal intelligence for this study. It was hypothesized that classroom quality would be related to PPVT-R scores. Furthermore, because of the importance of verbal comprehension as a factor in school success, performance on the PPVT-R was expected to be an important predictor of other academic outcomes.
The PPVT-R is individually administered and requires approximately ten to fifteen minutes to complete. The scale requires a child to point to a picture that represents the word spoken by the examiner. The PPVT-R was standardized nationally on a representative sample of 5,028 persons. One hundred children of each gender at each age level were used in the standardization sample. Rasch analysis was used to equate the two forms of the scale. A non-technical discussion of Rasch scaling is provided at the end of this document. Internal consistency values (split-half reliabilities) for children and youth ranged from .61 to .88. The median test-retest reliability was approximately .78 for an interval of 31 days or less, indicating adequate short-term stability. It should be noted that a relatively lower level of stability was found for children between the ages of five years and eight years eleven months than for older children.
While reviewers are generally positive about the technical merits of the PPVT-R, and researchers use the instrument frequently as a measure of receptive vocabulary, concern has been expressed about the adequacy of sampling in terms of geographic, linguistic, ethnic, and socioeconomic representation (Wiig, 1985). A Spanish version of the instrument was made available and was used as a secondary measure with Spanish-speaking children.
SCHOOL CLIMATE SURVEY
Kelley, Glover, Keefe, Halderson, Sorenson, & Speth, 1986
The School Climate Survey (Kelley, Glover, Keefe, Halderson, Sorenson, & Speth, 1986) was created to measure individuals' perceptions of how the community feels about the school. Respondents are asked to rate how much ‘most people’ would agree with specific statements. The statements are meant to be school-wide, not specific to a given classroom. The nature of this instrument opens the possibility of creating composite ratings for each school, based on the ratings of all respondents from that school.
The School Climate Survey was administered to parents, teachers, and principals in all five years of the study. The scale was modified for use in the Transition study, and 9 items of the original instrument were deleted. This resulted in a 46-item instrument with 9 subscales (Teacher-Student Relationships; Security and Maintenance; Administration; Student Academic Orientation; Student Behavioral Values; Student Peer Relationships; Parent and Community-School Relationships; Instructional Management; Student Activities). There are six possible responses to each item, ranging from ‘strongly disagree’ to ‘strongly agree,’ and a response of ‘don’t know.’ The psychometric review for this scale focuses on validation of the 9 subscales in the National Transition Project data set and on resolving questions about the use of the ‘don’t know’ response.
Initial validation of the School Climate subscales involved exploratory principal components analyses using teacher and family responses, with listwise deletion of the ‘don’t know’ responses. All teacher responses were included in one analysis for this purpose, and family responses were analyzed by grade level of the child. These analyses were not conclusive. Use of an eigenvalue of 1 for determining the number of factors to be retained suggested a nine-factor solution that was similar but not identical to that proposed by the authors. However, each factor beyond the fifth accounted for less than 3% of the variance, suggesting that these factors were not very useful. Because no theoretically meaningful solution containing fewer than 9 factors was found, the decision was made to use a 9-factor solution. CFAs were then conducted using polychoric correlations and the WLS estimation procedure. These analyses rejected a one-factor solution but found very few differences between the new 9-factor solution suggested by the principal components analyses and the original one proposed by the authors. Without a strong reason to modify the original factor structure, the authors' original factors are used for the rest of the study. The factors have adequate internal reliability (Cronbach’s alphas range from .73 to .93 in kindergarten) and are highly correlated (between .73 and .93 for families, and between .58 and .88 for teachers, using the correlation estimates derived from the CFAs).
One other issue with the School Climate Survey was also addressed. A large number of respondents answered ‘don’t know’ for one or more items. This is especially true of responses to the family interview, in which some items had over 25% ‘don’t know’ responses. The scoring protocol for the School Climate Survey calls for treating “don’t know” responses as missing and then computing a mean score for each subscale based on the responses that were present. This is mathematically equivalent to replacing “don’t know” responses with the average of all other items in that subscale for a given subject. We, however, find this option to be problematic because individual items have different distributions and are conceptually different from one another; thus replacing a given item with the mean of different items gives a poor estimate of the value of the missing data. This also does not address what happens if a given item has a large number of “don’t know” responses.
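The equivalence noted above can be checked directly. If a respondent answered p items with sum S and left m items as “don’t know,” filling each gap with S/p gives a total of S + m(S/p) = S(p + m)/p, and dividing by the p + m items returns S/p, the mean of the answered items. The sketch below confirms this numerically; the function names and the example ratings are illustrative only, with `None` standing in for “don’t know.”

```python
def mean_of_present(items):
    """Mean of the items the respondent actually answered."""
    present = [x for x in items if x is not None]
    return sum(present) / len(present)


def mean_after_person_mean_fill(items):
    """Replace each missing item with the respondent's own mean, then average all items."""
    fill = mean_of_present(items)
    filled = [fill if x is None else x for x in items]
    return sum(filled) / len(filled)


ratings = [4, None, 5, 3]  # one "don't know" among four items
print(mean_of_present(ratings))              # 4.0
print(mean_after_person_mean_fill(ratings))  # 4.0
```

The two numbers always agree, which is why the protocol’s scoring implicitly assumes that an unanswered item would have been rated like the respondent’s other items.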
In our analyses of the “don’t know” responses, we first looked at the percentage of respondents in each group that cited “don’t know” for each year. This clearly showed that “don’t know” is an often cited response for families, although with the exception of one subscale it was rarely cited by the teachers and principals. Consequently, we recommend not using the one subscale, Student Activities, that all three groups of respondents tend to answer “don’t know” to. The remaining eight subscales do not appear to have a problem with teacher or principal scores.
Further review of patterns of “don’t know” responses shows that, for the family respondents, other subscales also have large percentages of “don’t know” responses. This appears to indicate that the questions are difficult for parents to assess. We decided that a subscale score should be used only if the items on the subscale average no more than 10% “don’t know” responses. Using this criterion, the Administration, Student Behavioral Values, Parent and Community-School Relationships, and Instructional Management subscale scores would not be used for the family respondents. That leaves 4 subscales for the families: Teacher-Student Relationships, Security and Maintenance, Student Academic Orientation, and Student Peer Relationships.
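The 10% screening rule can be expressed as a small sketch. The subscale labels and “don’t know” rates below are invented placeholders, not values from the Transition data, and the function name is an assumption.

```python
def usable_subscales(dont_know_rates, max_mean_rate=0.10):
    """Return the subscales whose items average no more than the cutoff rate.

    dont_know_rates -- dict mapping a subscale name to a list of per-item
                       "don't know" proportions (0.0 to 1.0)
    """
    keep = []
    for name, rates in dont_know_rates.items():
        mean_rate = sum(rates) / len(rates)
        if mean_rate <= max_mean_rate:
            keep.append(name)
    return keep


# Placeholder rates for two hypothetical subscales.
example_rates = {
    "Subscale A": [0.02, 0.04, 0.06],  # averages 4%, retained
    "Subscale B": [0.20, 0.30, 0.25],  # averages 25%, excluded
}
print(usable_subscales(example_rates))  # ['Subscale A']
```

Note that the rule averages over items, so a subscale can be retained even if a single item exceeds 10%, provided the other items are answered by nearly everyone.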
For the four remaining subscales of the family respondent, we recommend treating the “don’t know” responses as missing data, and imputing their values using the EM algorithm. We only recommend using this solution, however, when the respondent answers most (75%) of the items on a given subscale. So, if a respondent replied “don’t know” to one or two items in the Teacher-Student subscale, those responses would be replaced with imputed values; however, if they responded “don’t know” to 3 or more items, their subscale score would remain missing. A review of the missing data patterns in kindergarten shows that this procedure would greatly increase the amount of data available for those four subscales in which it is used (with the imputation procedure, valid data is obtained for 93% to 96% of the subjects).
This recommendation is conservative in that it requires respondents to have answered most of the items before we assign them a score for that subscale. However, some conceptual issues have been raised with the use of any method of replacing “don’t know” responses with data. Imputation, or any other method of replacing these values, essentially takes a respondent’s answer and changes it. That is, the data are not actually missing; consequently, replacing them with other values is not imputation. In our view, the approach we are using provides the most conservative and least biasing method of solving this problem. That is, we replace “don’t know” responses only when a subject knew enough of the items for a subscale that we can be reasonably confident we are having little effect on what their subscale score would have been had they known the answer. Also, we are suggesting a technique for replacing the “don’t know” responses that we expect to provide the best possible estimates of the items that are missing.
SOCIAL SKILLS RATING SYSTEM (SSRS)
Gresham & Elliott, 1990
The Social Skills Rating System (SSRS) is an instrument, administered to parents and teachers, that measures different aspects of children’s social skills. Respondents are asked how often the child exhibits a behavior and how important that behavior is. However, the scoring protocol addresses only the assessments of a child’s behavior. While the teacher and parent forms of the SSRS are different, about half of the items on them are identical. The SSRS was developed on a nationally representative sample of 4,170 children, and standard scores, based on population norms that take the sex of the child into account, are available for the different SSRS scales.
In the Transition project, the SSRS is the primary non-academic child outcome. The Social Skills scale of the SSRS was administered to parents and teachers in kindergarten through third grade, and the Problem Behavior scale was administered to parents and teachers in second and third grades. These scales were all found to have adequate internal reliability (coefficient alphas from .87 to .94) and test-retest reliability (correlations from .65 to .87) in the national sample on which they were tested, with the caveat that the test-retest reliability for parents' ratings was a little low (.65). Each of the SSRS scales administered is composed of a number of subscales. The Social Skills scale has three subscales (Cooperation, Assertion, and Self-control) in the teacher version and one additional subscale (Responsibility) in the parent version. The Problem Behavior scale has three subscales (Externalizing, Internalizing, and Hyperactivity) in both versions. Both of the scales and their subscales had very high factor loadings for both parents and teachers (the lowest being .51).
In summary, the SSRS is a widely used, nationally standardized instrument which measures children’s social behaviors as rated by multiple observers. Two of the SSRS scales, Social Skills and Problem Behaviors, were administered in the National Transition Project to parents and teachers. The testing manual reported adequate internal reliability and test-retest reliability for the scales used. Because of the rigorous development undergone by the SSRS, no further review was conducted in the National Transition Project.
WOODCOCK-JOHNSON PSYCHO-EDUCATIONAL BATTERY-REVISED (WJ-R)
Woodcock & Johnson, 1990
The Woodcock-Johnson Psycho-Educational Battery-Revised is a set of individually administered tests for assessing a variety of academic and cognitive skills. Two scales from the WJ-R were used as part of the core battery for the Transition Demonstration Study. The Reading and Mathematics scales of the battery were administered annually to children in order to provide a standardized measure of individual progress over time in two academic areas viewed as critical indicators and predictors of children’s success in school.
The Reading cluster is comprised of two separate tests, letter-word identification and passage comprehension. Letter-Word Identification requires the child to identify letters or words that are presented to them. Passage Comprehension requires the child to identify a picture represented by a phrase or to provide a word that would appropriately complete a sentence within the context of a passage. For children between 6 and 9 years old, the Examiner’s Manual (Woodcock & Mather, 1990) reports internal consistency reliabilities of .96 and .94 for Letter-Word Identification and .95 and .88 for Passage Comprehension.
The Mathematics cluster is comprised of two separate tests, calculation and applied problems. Calculation requires the child to perform basic mathematical computations (addition, subtraction, multiplication, division). The Applied Problems test requires the child to determine the appropriate mathematical procedure needed to solve a problem, identify necessary information to apply, and perform simple calculations. For children between 6 and 9 years old, the Examiner’s Manual (Woodcock & Mather, 1990) reports internal consistency reliabilities of .93 and .89 for Calculation and .84 and .90 for Applied Problems.
When the test was restandardized in 1986-1989, a stratified national sampling design included 3,245 subjects between kindergarten and 12th grade. Sampling and norming procedures meet high technical standards. Critiques of the instrument support its technical quality (e.g., Cummings, 1995). Rasch scaling procedures were used during the norming of the battery. A nontechnical discussion of the Rasch approach and its implications for subsequent analysis is provided at the end of this document. Due to the integrity of sampling procedures and the psychometrics of the instrument, there was no need for supplemental psychometric analysis of these scales by the National Transition Project.
RASCH-WRIGHT (W-ABILITY) SCORES: AN INTRODUCTION
A central feature of the proposed analyses of the Woodcock-Johnson Psycho-Educational Battery-Revised (WJ-R) and Peabody Picture Vocabulary Test-Revised (PPVT-R) is the use of the Rasch-Wright (W-ability) scores as an outcome metric for longitudinal study. The purpose of this document is to provide a brief and non-technical overview of the Rasch-Wright scores and their applications to the two primary academic measures.
The W-ability scores proposed for use in longitudinal analyses based on the WJ-R and PPVT-R are based on a general measurement model known as latent trait analysis or item response theory. A specific application of the model is the Rasch-Wright approach, based on the work of Georg Rasch and Benjamin Wright. The model is useful in the development of new tests, the analysis of existing tests, and the interpretation of test performance for individuals or groups. As is explicated in the examiner's manuals, both the WJ-R and the PPVT-R used latent trait analysis in item selection and score development. Richard Woodcock (of the WJ-R) was one of the first researchers to apply Rasch-Wright scaling to the development of an academic achievement test -- the initial 1977 version of the Woodcock-Johnson. The model provides a cost-efficient tool for: (a) identifying appropriate items, (b) calibrating item difficulties (which are calculated on the same W scale as person ability) used to determine item ordering, and (c) generating W-ability scores that were in turn translated to other normative derived scores (e.g., normal curve equivalents, percentiles, age equivalents).
In the case of both instruments, raw scores (the number of items passed) were first converted to W-ability scores. The W-ability score associated with each raw score is generated from a probabilistic model that takes into account the difficulty of items on the test. The W-ability scores have several critical advantages that make them particularly well-suited to monitoring and analyzing change of individuals and groups across time. First, W-ability scores represent a unidimensional continuum (e.g., growth model) that is not referenced to a particular subsample (e.g., age-group). In other words, improvements in performance along the achievement trait (i.e., learning) are reflected in gains on W-ability scores. Performance is referenced to the underlying dimension rather than to a normative comparison group. Low scores reflect the lowest levels of educational attainment in a given domain (e.g., mathematical calculations) and high scores reflect the highest levels of educational attainment. Second, W-ability scores maintain equal-interval characteristics. That is, a gain of 3 W-ability points from 84 to 87 on the PPVT-R represents the same amount of gain as a 3-point gain from 102 to 105 on the same test. This characteristic of interval-level scaling is an important assumption of parametric inferential statistics. When coupled with the growth continuum features discussed first, W-ability scores are particularly desirable for longitudinal analyses. Third, for considerations of individual-level scores, each W-ability score is associated with a unique standard error of measurement. Therefore, separate confidence intervals can be computed for each W-ability score. Finally, as W-ability scores are the first scores converted from raw scores, they contain the least amount of unintended error variance contributed by transformations to supplemental age-based derived scores.
For example, the PPVT-R normalized standard scores (with a mean of 100 and standard deviation of 15) are based on area transformations of W-ability scores within each age group followed by interpolations between the 25 age-groups tested. Such transformations and interpolations may present a degree of error that would not be present using the W-ability scores alone. The use of percentile scores or age- or grade-equivalents presents even greater interpretive and analytical challenges.
The WJ-R and the PPVT-R have generated W-ability scores with different ranges and centers. For example, the PPVT-R W-ability scale ranges from 20 to 180 (centered on 100), while the WJ-R scores range from 300 to 700 (centered on 500). These scales are arbitrary and were set by the developers of the tests. They do, however, retain the properties of unidimensionality and equal-interval scaling which, as indicated before, make the ability scores particularly desirable for longitudinal and comparative analyses.
Another commonly computed metric is the logit estimate of item difficulty and person ability. Logit ability estimates describe performance in natural-logarithm (log-odds) units. Such scores have properties that may be even more desirable for longitudinal research than W-ability scores (e.g., ratio-level measurement, unbounded estimates for regression-based analyses). The publishers of the PPVT-R have made conversion tables for such scores available. Discussion of the problems and prospects of logits is beyond the scope of this paper.
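The equal-interval property can be illustrated with the basic Rasch model, in which the probability of passing an item depends only on the difference between person ability and item difficulty. The sketch below assumes both quantities are expressed on the same log-odds scale; the function name and the example values are illustrative, and the sketch does not reproduce the publishers' raw-score-to-W conversion tables or the particular rescaling each test uses.

```python
import math


def rasch_p_correct(ability, difficulty):
    """Probability of passing an item under the Rasch model.

    Both arguments are assumed to be on the same log-odds (logit) scale.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# A person whose ability equals the item's difficulty passes half the time.
print(rasch_p_correct(0.0, 0.0))  # 0.5

# Only the ability-difficulty gap matters, not where it falls on the scale:
# a 1-unit advantage gives the same probability low or high on the continuum.
print(rasch_p_correct(1.0, 0.0) == rasch_p_correct(3.0, 2.0))  # True
```

Because the model depends only on differences, adding a constant to every score or multiplying the scale by a constant changes the reported range and center without changing what a given gap means. This is the sense in which the choice of scale by the test developers is arbitrary.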
W-ability scores are not without problems. For example, the meaning of the scores is not immediately evident to consumers. To understand a W-ability score, the reader/researcher must understand the possible range of values and the nature of the growth curve. Percentiles, IQ-type standard scores, normal curve equivalents, and age equivalents are more common metrics in educational research than are W-ability scores. Furthermore, the relationship between W-ability scores and age is typically not linear. For example, the latent trait being assessed by the PPVT-R, hearing vocabulary, is represented by a decelerating curve across the age of the respondent in the standardization group (Dunn & Dunn, 1981). Therefore, while W-ability is scaled at equal intervals, the actual dimension as expressed within the population may be curvilinear. It should be noted that the PPVT-R growth curve does not show marked deceleration for the age group of students involved in the transition study.
The nature of the application of the W-ability scores to the national questions depends on the validity of two assertions made by the developers of the two tests. First, the authors of the PPVT-R argue that, due to the properties of the scores, the W-ability estimates of the L and M forms of the test may be treated as equivalent and therefore directly compared (whereas the raw scores cannot). This assumes, however, appropriate horizontal equating of the forms. While examinations of the calibration tables seem to support the validity of the claim, further inquiry will be made to ensure a valid understanding of the equating procedures and their consequences. Second, the authors of the WJ-R argue that W-ability scores for subtests within a common domain can be averaged (yielding a cluster score representing the domain as a whole). A potential concern about this approach is the implied assumption of equal standard deviations between the subtests across time. These assertions are important for the analyses of the national questions. While the Rasch-Wright scores have psychometric properties that make them desirable for longitudinal research, further inquiry regarding the validity of the aforementioned assertions is needed before confidence can be placed in the integrity of the resulting analyses.
REFERENCES
Abbott-Shim, M., Sibley, A., & Neel, J. (1998). Psychometric report of the Assessment Profile for Early Childhood Programs: Research Version for the National Transition Demonstration Project. Atlanta: Quality Assist, Inc.
Abbott-Shim, M.S. & Sibley, A.N. (1992). Assessment Profile for Early Childhood Programs: Research Version. Atlanta: Quality Assist, Inc.
Abbott-Shim, M.S., Sibley, A.N., & Neel, J. (1992). Research Manual, Assessment Profile for Early Childhood Programs. Atlanta: Quality Assist, Inc.
Abbott-Shim, M.S. & Sibley, A.N. (1987). Assessment Profile for Early Childhood Programs. Atlanta: Quality Assist, Inc.
Boyce, W. T., Jensen, E. W., James, S. A., & Peacock, J. L. (1983). The Family Routines Inventory: Theoretical origins. Social Science & Medicine, 17, 193-200.
Cummings, J. (1995). Review of the Woodcock-Johnson Psycho-Educational Battery-Revised. In J.C. Conoley & J.C. Impara (Eds.), The Twelfth Mental Measurements Yearbook (1113-1116). Buros Institute of Mental Measurements: Lincoln, NE.
Dunn, L. M. & Dunn, L. M. (1981). Peabody Picture Vocabulary Test-Revised. Circle Pines, MN: American Guidance Service.
Dunst, C. J., & Leet, H. E. (1987). Measuring the adequacy of resources in households with young children. Child: Care, Health and Development, 13, 111-125.
Dunst, C. J., Leet, H. E., & Trivette, C. M. (1988). Family resources, personal well-being, and early intervention. The Journal of Special Education, 22, 108-116.
Furstenberg, F. F., Cook, T. D., Eccles, J. P., Elder, G. H., & Sameroff, A. J. (1990). Neighborhood Scales. Unpublished Manuscript.
Gottlieb, M. (1995). A Developmentally Appropriate Practice Template (ADAPT).
Gresham, F. M., & Elliott, S. N. (1990). Social Skills Rating System. Circle Pines, MN: American Guidance Service, Inc.
Harms, T. & Clifford, R. M. (1980). Early Childhood Environment Rating Scale. New York: Teachers College Press.
Jensen, E. W., James, S. A., Boyce, W. T., & Hartnett, S. A. (1983). The Family Routines Inventory: Development and validation. Social Science & Medicine, 17, 201-211.
Kaplan, D. (1990). Evaluating and modifying covariance structure models: A review and recommendation. Multivariate Behavioral Research, 25, 137-155.
Kelley, E. A., Glover, J. A., Keefe, J. W., Halderson, C., Sorenson, C., & Speth, C. (1986). School Climate Survey (Modified) Form A. Reston, VA: National Association of Secondary School Principals.
McGrew, K. S., Gilman, C. J., & Johnson, S. (1992). A review of scales to assess family needs. Journal of Psychoeducational Assessment, 10, 4-26.
Wiig, E. (1985). Review of Peabody Picture Vocabulary Test-Revised. In J. Mitchell, Jr. (Ed.), The Ninth Mental Measurements Yearbook, Vol. 2 (1127-1128). Buros Institute of Mental Measurements: Lincoln, NE.
Woodcock, R. & Johnson, M. (1990). Woodcock-Johnson Psycho-Educational Battery-Revised. Allen, TX: DLM Teaching Resources.
Woodcock, R. & Mather, N. (1990). Examiners' Manual: Woodcock-Johnson Tests of Achievement. Allen, TX: DLM Teaching Resources.

