Section 3
INFORMATION INCLUDED FOR EACH INSTRUMENT
The purpose of this resource document is to provide information, in one place, about existing screening and assessment instruments designed for use with children under age 3 and their families, as well as instruments designed for assessing services provided by programs serving them. Thus, we cast a broad net and include a wide range of screening and assessment tools of potential use to programs. Many of the instruments described are established instruments that yield a standard score that places the child’s performance in the context of other children of the same age. We also include some data collection tools that may be useful, such as implementation rating scales and questionnaires that include questions on family practices, health, and health care receipt from the national Early Head Start Research and Evaluation Project (EHSRE).
We did not set strict inclusion criteria, but tried to provide information on a range of features for each instrument so programs can make informed decisions in selecting instruments. Each program must determine the purposes for considering a particular instrument and evaluate how well the instrument fulfills those purposes. In general, because of their limited applicability for programs serving infants and toddlers, we did not include measures for which the lowest appropriate age for administration was older than 2 years. We made an exception for certain instruments, such as the Woodcock Johnson III and Peabody Picture Vocabulary Test, that Head Start programs sometimes use and that may be helpful for continuity when children go on to Head Start.

We consulted multiple sources of information to identify instruments for inclusion in this resource document. We looked at the National Early Head Start Research and Evaluation Project (EHSRE) to identify instruments used by the national and local research teams and instruments that research programs used. We held group discussions with Early Head Start program staff at the 2002 Birth to Three Institute, where staff provided information about the screening and assessment tools their programs are currently using. We consulted with researchers and technical assistance experts. Finally, we conducted a literature review to identify instruments that are used widely and have been developed and/or normed within the past 15 years, that is, since 1987.

The instruments included in this document were developed for a variety of purposes and by individuals from different disciplines. Thus, you may find that some instrument names are overly technical or offensive. In these cases, you may want to present the instruments to parents using a less technical name that describes what the instrument measures in terms that parents will understand.
For example, you might want to refer to the Parent-Child Conflict Tactics Scale as a questionnaire on discipline and responses to children’s behavior.

The screening and assessment instruments in this resource document are presented in three groups: (1) instruments for measuring child development; (2) instruments for measuring parenting, the home environment, and parent well-being; and (3) instruments for measuring program implementation and quality. Within each group, instruments are in alphabetical order. Summary tables listing the instruments are presented at the beginning of each group of instruments (see footnote 3). This resource document is intended to be a living document that will be updated as new screening and assessment instruments are identified or become available.

We gathered information about each instrument from different sources, depending on the type of instrument. For the more formal, copyrighted instruments, we relied primarily on the manuals or Web-based information available from the authors or their publishers. If we found a key research article about a formal instrument, we also reviewed it and included the pertinent information. For the more experimental, less formal instruments, we reviewed the instrument itself and the supporting material we were able to locate, such as research reports and published articles, and reviews conducted by others. Each entry includes a reference section that identifies the sources of information we used.

Many of these instruments are grounded in developmental theory and research. Developers of standardized tests for children usually begin with their theory of how abilities develop and identify areas to be assessed. Then they create items to measure the identified areas and try them with children to determine whether the items discriminate among children by age.
After a core set of items is identified, test developers often launch a large, nationally representative study to test the items and obtain statistical information about how the study participants performed on each item. From the study findings, the test developers determine the best set of items, develop rules about where to begin and end the test, and decide on procedures for converting raw scores (based on summing the number of items answered correctly or on the average rating across items on a rating scale) to norm-referenced scores. The norm-referenced scores take advantage of the nationally representative study and allow comparisons between how an individual child performed on the test and how children of the same age in the study performed. The nationally representative study also provides information about how the instrument works with diverse and low-income populations.

Other types of research also provide important information about a screening or assessment instrument. Studies that use a new instrument in conjunction with established instruments that measure the same ability or skill provide information about whether the new instrument measures what it was intended to measure. Other studies compare how well the new instrument predicts children’s performance in a given skill area many years later. Because they take a long time to conduct, these studies are not available for very new instruments, but they can be valuable in evaluating an instrument administered when children are young.

No screening or assessment instrument performs perfectly across all the dimensions practitioners and researchers believe are important (such as the statistical properties of the instrument or how easily the resulting information feeds back into individualized intervention planning) and for all the purposes for which the instrument may be used. We encourage you to weigh the information described for each instrument according to your program’s theory of change, your comprehensive plan for gathering and analyzing data, and the purposes for which you will use the information. Consultation with an expert may help you sort through this information and select screening and assessment instruments.

The language that describes screening and assessment instruments is filled with jargon. Box 4 defines the key terms used in this document. The rest of this chapter includes a summary of what you will find described for each instrument included in this resource document. Each entry includes a summary table and a more detailed description of the topics we identified as most useful for making comparisons across instruments. The topics in the summary table include:
- Initial material cost: 1 (under $100); 2 ($100 to $200); 3 (more than $200).
- Reliability: 1 (none described); 2 (all or mostly under .65); 3 (all or mostly .65 or higher). See Box 4 for a brief definition of the various types of reliability. We chose these groupings based on the prevalent rule of thumb researchers and assessment developers use. Other things being equal, the higher the reliability is, the better the instrument is.
- Validity: 1 (none described); 2 (all or mostly under .5 for concurrent; all or mostly under .4 for predictive); 3 (all or mostly .5 or higher for concurrent; all or mostly .4 or higher for predictive). See Box 4 for a brief definition of the various types of validity. We chose these groupings based on the prevalent rules of thumb researchers and instrument developers use. Generally, the higher the validity is, the better. It is especially challenging to create instruments for infants and toddlers that strongly predict how the children will do as preschoolers. Therefore, the grouping for predictive validity reflects a less stringent criterion for the highest grouping.
- Norming sample characteristics: 1 (none described); 2 (older than 15 years, not nationally representative or representative of the low-income population enrolled by Head Start programs serving infants and toddlers); 3 (normed within past 15 years, nationally representative or representative of the low-income population enrolled by Head Start programs serving infants and toddlers). See Box 4 for a brief definition of representativeness of the norming sample. This section also includes information on the date that the norming sample was obtained. The more time that has elapsed since the norming sample was obtained, the less likely it is to be representative. Many authors/publishers re-norm their assessments every 10 to 12 years to keep them up-to-date. We chose 15 years as the critical time here.
- Ease of administration and scoring: 1 (not described); 2 (self-administered or administered and scored by someone with basic clerical skills); 3 (administered and scored by a highly trained individual). The administration and scoring requirements vary by instrument, and these descriptors help you determine what these steps involve.

The other topics included for each instrument are:
- Measures of internal consistency (split-half reliability, internal consistency reliability) that indicate the extent to which the items in the instrument “hang together” and tell a coherent story about the child or adult’s functioning
- Measures of stability (test-retest reliability, alternate form reliability) that indicate the extent to which the instrument yields the same results when used at different times or using a different form of the instrument (for those that have multiple forms)
- Measures of the reliability of administration (inter-rater reliability) that indicate the extent to which two different observers or instrument administrators would interpret and record the information in the same way
- Content validity, which relies on expert judgment to determine that an instrument actually measures what it is intended to measure
- Criterion-related validity, including concurrent validity, which indicates how well the instrument results relate to other information collected at the same time, and predictive validity, which indicates the extent to which the instrument results are related to later functioning
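As an illustration only, the summary-table groupings for reliability and validity described above could be applied as in the short Python sketch below. The names `CUTOFFS` and `summary_rating` are hypothetical, and reading “all or mostly” as “more than half of the reported coefficients” is an assumption made for this example, not a rule stated in this resource document.

```python
# Cutoffs taken from the summary-table groupings described above.
CUTOFFS = {
    "reliability": 0.65,
    "concurrent_validity": 0.50,
    "predictive_validity": 0.40,
}


def summary_rating(kind, coefficients):
    """Return the summary grouping (1, 2, or 3) for a list of coefficients.

    1 = none described
    2 = all or mostly under the cutoff
    3 = all or mostly at or above the cutoff

    Assumption: "mostly" means more than half of the reported values.
    A tie is placed in grouping 2 here; that choice is ours.
    """
    cutoff = CUTOFFS[kind]
    if not coefficients:
        return 1
    at_or_above = sum(1 for c in coefficients if c >= cutoff)
    return 3 if at_or_above > len(coefficients) / 2 else 2


if __name__ == "__main__":
    # Hypothetical coefficients, not taken from any real instrument.
    print(summary_rating("reliability", [0.72, 0.81, 0.60]))
    print(summary_rating("predictive_validity", [0.35, 0.38]))
```

Keeping the cutoffs in one table makes it easy to see that the predictive validity criterion (.40) is deliberately less stringent than the concurrent validity criterion (.50).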
The entries are organized alphabetically in three groups: (1) measures of child development; (2) measures of parenting, the home environment, and family well-being; and (3) measures of program implementation and quality. In front of each group of entries is a summary table that lists the instruments profiled in that section and summarizes their main features.

3 THE INCLUSION OF AN INSTRUMENT IN THIS RESOURCE DOCUMENT DOES NOT CONSTITUTE ENDORSEMENT OF THE INSTRUMENT BY THE AUTHORS, MATHEMATICA POLICY RESEARCH, OR THE U.S. GOVERNMENT.
Box 4: BRIEF DEFINITIONS OF KEY TERMS

Assessment. Assessment is a generic term referring to a variety of procedures for obtaining systematic information on a child’s, parent’s, family’s, or program’s strengths or needs. As noted in Chapter I, the Head Start Program Performance Standards focus on the child and family assessment purposes of identifying “(i) the child’s unique strengths and needs and the services appropriate to meet those needs; and (ii) the resources, priorities, and concerns of the family and the supports and services necessary to enhance the family’s capacity to meet the developmental needs of their child.” These two major purposes of assessment are sometimes described as providing information for individual diagnosis and program planning.

The purposes of a diagnostic assessment are to (1) identify whether an individual has special needs, (2) determine what the problems are, (3) suggest the cause of the problems, and/or (4) propose strategies to address the problems (Meisels and Provence 1992). The purposes of an assessment for program planning are to (1) learn about an individual’s ability to perform particular tasks or achieve mastery of particular skills, and (2) design intervention activities for the individual that support the completion of tasks and mastery of skills over time.

Depending on the purpose of the assessment process, it may include norm-referenced tests; observations in the home, child care, early intervention, program, or school setting; interviews with family members, child care providers, or others who may provide important information about the individual; and ratings by adults knowledgeable about the child (including a parent, caregiver, or teacher) (Sattler 1992).
The performance standards also require programs to conduct an “assessment of community strengths, needs, and resources,” as well as an annual program self-assessment of “effectiveness and progress in meeting program goals and objectives and in implementing federal regulations.”

Screening. Screening is made up of a set of activities designed to identify individuals who have a high probability of exhibiting delayed, abnormal, or problematic development. The screening is intended to identify problems at an early stage and to use this information to flag individuals for further, in-depth assessment activities.

Basal. A basal is established on a standardized test when the individual demonstrates that he or she successfully completes the first few items administered. On most standardized tests, the tester begins administering the items based on how old the individual is, starting later if the individual is older. If the individual passes the number of items specified in the test manual for establishing a basal, the tester is able to assume that the individual would have gotten all of the previous items correct and adds in the number of untested items to the correctly passed items administered to the individual. If the individual does not pass the specified number of items, the tester would administer earlier items until the prescribed number of items are passed or the tester reaches the start of the test. Using a basal rule saves time during the testing session and reduces fatigue.

Ceiling. A ceiling is established on a standardized test when the individual demonstrates that he or she fails a few of the later items administered. On most standardized tests, the tester continues administering the items until a certain number (either in a row or a proportion, such as six out of eight in a row) are failed.
If the individual fails the number of items specified in the test manual for establishing a ceiling, the tester ends the test and is able to assume that all later test items would be failed by that individual as well. This saves time during the testing session and reduces fatigue.

Criterion-Referenced Test. This type of test compares an individual’s performance to an established measure of performance rather than to the performance of others. Criterion-referenced tests will usually include a measure of mastery, or how well a child is able to complete a task. For example, if a test required that a child identify all of the letters of the alphabet, that would be a criterion-referenced test. We would be able to describe the child’s mastery of the test by using statements such as, “The child is able to identify 80 percent of the letters in the alphabet.”

Norm-Referenced Test. This type of test compares an individual’s performance to the performance of others on the same measure. Usually, the norms are developed from data collected from a large, nationally representative group of individuals.

Reliability. Indicators of reliability tell how dependable an assessment or screening tool is for the purpose for which it is used. Reliable tools are stable over time and include items that measure the same thing in different ways. For tools that require standardized observation (for example, child care quality observations or ratings of children’s behavior), the scores obtained by two different, well-trained observers must be similar to be considered reliable. Statistical measures of reliability are typically reported as correlation coefficients, which range from 0 to 1.0, with a higher value reflecting greater reliability. Many researchers and test developers require that assessment and screening tools have reliability values of 0.7 or higher. For our summary descriptors, we adopted a criterion of 0.65, which reflects a rule of thumb commonly used in the field.
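To make the idea of a reliability coefficient more concrete, the Python sketch below computes a Pearson correlation between two sets of scores and compares it with the 0.65 criterion used for our summary descriptors. It is a minimal illustration only: the function names `pearson_r` and `meets_criterion` are hypothetical, the scores are made up, and the use of the Pearson correlation is an assumption, since test manuals report several different kinds of coefficients.

```python
import math


def pearson_r(first, second):
    """Pearson correlation between two equal-length lists of scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two lists of the same length (at least 2 scores)")
    mean_1 = sum(first) / len(first)
    mean_2 = sum(second) / len(second)
    covariation = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second))
    spread_1 = sum((a - mean_1) ** 2 for a in first)
    spread_2 = sum((b - mean_2) ** 2 for b in second)
    if spread_1 == 0 or spread_2 == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return covariation / math.sqrt(spread_1 * spread_2)


def meets_criterion(coefficient, criterion=0.65):
    """True if the coefficient is at or above the chosen rule of thumb."""
    return coefficient >= criterion


if __name__ == "__main__":
    # Hypothetical scores for five children tested on two occasions.
    time_1 = [10, 12, 15, 20, 22]
    time_2 = [11, 13, 14, 21, 20]
    r = pearson_r(time_1, time_2)
    print(round(r, 2), meets_criterion(r))
```

A stricter user could pass `criterion=0.7` to apply the higher threshold that many researchers and test developers require.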
Typical indicators of reliability include measures of consistency of results and stability over time:
Stability. By this measure, an assessment is reliable to the extent the procedure yields the same result on two different occasions. Test-retest reliability involves testing the same group of individuals at least twice, with a relatively short interval between assessments, usually no longer than a few days or weeks apart. The higher the test-retest reliability, the more stable the assessment tool is considered to be. Longer periods between administrations of the same assessment will reduce the reliability, partly because the individual’s situation (for example, skill) can be expected to change. Some assessment tools have two versions of the same test so that the same skills or behaviors can be assessed a second or third time (as in a pre-post or longitudinal study). In such cases, test developers include information on alternate form reliability. To demonstrate that both forms of the test are essentially equivalent, a random half of a large group of individuals is given one form of the test and the other half is given the other form. Alternate form reliability is demonstrated if the scores of the two groups are highly correlated.

Reliability of administration. Another reliability consideration applies to assessment tools that require an observer to score a child’s or parent’s behavior or complete a rating or checklist describing the behavior observed. To use such assessments in evaluation, researchers and test developers want to be sure that these ratings can be made consistently. One index of consistency is the extent to which two trained observers obtain the same scores when they do their observations at the same time, although independently. This index is referred to as inter-rater reliability. It is usually reported either as the correlation between the scores or ratings obtained by the two observers or as the percentage of items on which the two agree.

Representativeness of Norming Sample. Standardized screening and assessment tools provide information about how the children and parents in your program are doing compared to the group (or sample) of individuals the test developers or researchers included in their norming group. Knowing whether the norming sample was nationally representative or representative of the children or parents in your program is important in deciding whether to use a screening or assessment tool. Most test authors include this information in their manuals. In general, it is better if the norming sample includes individuals of the same age group that you will be assessing, as well as geographic and racial/ethnic diversity, so that the assessment results will be relevant to the families in your program.

Validity. Indicators of a screening or assessment tool’s validity provide information about whether the tool measures what it is supposed to measure for the purpose for which it is being used. Several types of validity are commonly used:
- To establish concurrent validity, test developers and researchers administer the new screening or assessment tool as well as a similar, established tool to the same individuals within a few hours or days. If the correlation between the two measures is high, concurrent validity is established. Strict interpretations require concurrent validity to reach levels of .70 or higher, but as a rule of thumb, many researchers accept .50 or higher as acceptable. Sometimes concurrent validity is expressed in terms of percent agreement between the two measures. In this compendium, we consider 80 percent agreement or higher as acceptable.
- To establish predictive validity, researchers and test developers determine whether the screening or assessment tool conducted at one time point with a group of individuals is correlated with later functioning (these studies are often conducted over two to five years or more). If the correlation between the two measures obtained across the time interval is high, predictive validity is established. If, for example, a measure of vocabulary at age 3 is highly correlated with a test of reading ability in second grade, the vocabulary test could be said to have predictive validity. In some cases, researchers use other activities or events as the criterion, rather than another assessment. For example, predictive validity might be established by correlating age 3 vocabulary with children’s second-grade language report card grades. In general, the younger the child being assessed, the poorer the predictive validity. There is a long history of poor predictive validity among infant tests, with almost none meeting high levels of validity, such as .80. Researchers have advanced many explanations for this, including the important contributions of the different environments to which children are exposed. Because we know the predictive validity of infant and toddler assessment tools is low, in this compendium, we consider a correlation of .40 to be adequate for establishing predictive validity.

Scoring. Alone, the scores from screening and assessment instruments (raw scores) have limited value. It is only when they are compared against a similar group (or norming sample) of children with known characteristics that a child’s score becomes meaningful. Because of this, instrument developers often provide the user with tables for converting raw scores into scores that are normed to a comparison sample. Below are some of the more frequently used normative scores:
1 This discussion is important for interpreting scores from standardized instruments. Scores from other instruments can also be interpreted meaningfully if you can compare the performance of children or parents across two points in time (such as comparing scores at the beginning and end of their program experience).

2 A standard deviation is a measure of the dispersion or variability of scores in a sample. The proportion of scores within a standard deviation unit of the mean score is known. For example, in a normal distribution, about 68 percent of all the scores fall between one standard deviation below and one standard deviation above the mean. Thus, scores expressed in standard deviation units enable the user to understand how a child has performed relative to other children in the sample.
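As a rough illustration of footnote 2, the Python sketch below expresses a raw score in standard deviation units and then rescales it. The function names `z_score` and `standard_score` are hypothetical, the norming-sample mean and standard deviation are made-up numbers, and the scale with a mean of 100 and a standard deviation of 15 is an assumed convention chosen for the example; each instrument’s manual defines its own conversion, usually through lookup tables.

```python
from statistics import NormalDist


def z_score(raw, norm_mean, norm_sd):
    """Distance of a raw score from the norming-sample mean, in SD units."""
    if norm_sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (raw - norm_mean) / norm_sd


def standard_score(raw, norm_mean, norm_sd, scale_mean=100.0, scale_sd=15.0):
    """Rescale a raw score to an assumed standard-score scale (mean 100, SD 15)."""
    return scale_mean + scale_sd * z_score(raw, norm_mean, norm_sd)


# Share of a normal distribution within one SD of the mean (about 68 percent).
share_within_one_sd = NormalDist().cdf(1) - NormalDist().cdf(-1)

if __name__ == "__main__":
    # Hypothetical norming sample: mean raw score 24, standard deviation 6.
    print(z_score(30, 24, 6))
    print(standard_score(30, 24, 6))
    print(round(share_within_one_sd, 3))
```

In this made-up case, a child with a raw score of 30 is one standard deviation above the norming-sample mean, which places the child above roughly 84 percent of a normally distributed sample.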


