Attachment B
Reliability of the Measures
Classroom Observation Measures: Observation Measures of Language and Literacy Instruction (OMLIT)
The observations focused on literacy instructional processes and environments in the classrooms, specifically on aspects of classroom practice that have been shown in empirical research to support children’s language development and acquisition of early literacy skills. The complete battery of observation measures includes five instruments from the Observation Measures of Language and Literacy Instruction (OMLIT; Goodson, Layzer, Smith, and Rimdzius, 2004) battery and the Arnett Caregiver Rating Scale (Arnett, 1989).
The Snapshot of Classroom Activities (OMLIT-Snapshot)
The OMLIT-Snapshot is a description of classroom activities and groupings, integration of literacy in other activities, and language in the classroom. It has two sections. The Environment section describes the number of children and adults present, as well as the type of adult (staff, parents, etc.). The Activities section describes activities that are taking place. For each activity, the observer records the number of children and adults in that activity, whether any adult or child is talking, whether they are speaking English or another language, and whether literacy materials are used (text, writing, letters, singing).
The Read Aloud Profile (OMLIT-RAP)
The OMLIT-RAP is a description of staff behavior when reading aloud to children (in CLIO, the RAP was completed when an adult was reading to at least two children). The RAP records adult behavior during the read-aloud session in four categories: (a) pre-reading (set-up) behavior, (b) behavior while reading the book, (c) post-reading behavior, and (d) the language the adult uses when talking to children during the read-aloud. The RAP also records characteristics of the adult, the children, and the book itself in three categories: (a) role of the adult involved in the read-aloud (e.g., teacher, aide, etc.), (b) characteristics of the book being read, and (c) number of children involved in the read-aloud. Finally, the RAP includes five quality indicators that summarize particular aspects of the read-aloud: (a) the degree to which the adult introduces and contextualizes new vocabulary to support children’s learning, (b) the depth of the discussion related to the story that the adult facilitates with the children before, during, and after the read-aloud, (c) the extent to which the adult uses open-ended questions that invite children to engage in prediction, imagination, and/or rich description, (d) the depth of children’s engagement with the read-aloud activity, and (e) the quality of any post-reading book-related activities that the adult organizes (beyond oral discussion).
The Classroom Literacy Opportunities Checklist (OMLIT-CLOC)
The OMLIT-CLOC is an inventory of classroom literacy resources. It provides an overall rating of the extent to which a classroom is a literacy-rich environment and delineates eight aspects of the literacy environment: (a) physical layout of the classroom, (b) the text or print environment, (c) books and reading or listening areas, (d) writing resources, (e) literacy-related materials and toys, (f) cultural diversity in literacy materials, (g) literacy integrated in classroom areas or learning centers, and (h) the richness and integration of a curriculum theme.
The Classroom Literacy Instruction Profile (OMLIT-CLIP)
The OMLIT-CLIP involves a two-stage coding protocol. The observer first determines whether any classroom staff member is involved in a literacy activity. If so, the observer codes seven characteristics of the literacy activity: the type of activity, the literacy knowledge being afforded to the children, the adult’s level and type of participation in the activity, any text support, the languages spoken by staff and by children, and the number of children involved. If the literacy activity involves adult-child discussion, the quality of this discussion is rated on three characteristics: the cognitive challenge in the discussion, the extensiveness of the discussion, and the level of abstraction of the discussion.
The Quality of Language and Literacy Instruction (OMLIT-QUILL)
The OMLIT-QUILL is an overall evaluation of the quantity and quality of the instructional practices that build children’s print awareness and oral language skills, expose children to a rich and varied vocabulary, and build children’s phonological awareness. These practices are predictors of better reading outcomes for children once they are in school; this is particularly true of those at risk for reading difficulties (Dickinson and Tabors, 2001; Lonigan, Burgess, and Anthony, 2000; NICHD, 2000; Snow, Burns, and Griffin, 1998; Whitehurst and Lonigan, 1998). In addition, the QUILL evaluates instructional practices with English language learners.
Reliability of the OMLIT
Two kinds of reliability have been established for the OMLIT measures, based on data from two national observation studies:
- Inter-rater reliability: the degree of agreement between two trained observers administering the observation measures at the same time in the same classroom.
- Agreement with a criterion: the extent of agreement between coding by trained observers and “master” or “correct” coding by experts of a standardized stimulus (e.g., a videotape of a classroom, written examples, etc.).
The discussion below presents preliminary data on the first two types of reliability. Future waves of observations will provide additional data to increase the accuracy of our estimates of the reliability. The third type of reliability will depend on different data collection designs planned to occur in the near future.
1. Agreement with Criterion Coding
Paper and Pencil Tests
Reliability was assessed via paper and pencil tests on two of the OMLIT measures—the Snapshot and the QUILL. Written scenarios describing classroom events were prepared and coded in advance by the OMLIT developers (the “criterion” coding). The accuracy of observers’ coding of written scenarios was determined by comparing it to the criterion coding of the same scenarios. Although this type of paper-and-pencil test does not simulate the “live” action in a classroom, it does provide a measure of how well observers understand the coding definitions for the various activities and specialized literacy data.
On the Snapshot, observers coded 15 written scenarios, and their coding was compared to criterion coding of the same scenarios done in advance by three of the OMLIT developers. Agreement between the observers’ coding and the criterion coding was high (Exhibit B-1). On average, the observers’ coding of the written scenarios agreed almost perfectly with the criterion coding. Further, each individual observer scored 95% or higher on agreement with the criterion coding.
On the QUILL, the agreement ranged from 69% to 84% when agreement was defined as an exact match in ratings (Exhibit B-1). The agreement was substantially higher when the definition of agreement was expanded to agreement within a point.
| Codes/Variables | Average % Agreement with Criterion Coding | |
|---|---|---|
| Snapshot | % | |
| Environment (all codes) | 98% | |
| | 93% | |
| | 98% | |
| | 99% | |
| | 99% | |
| Activities (all codes) | 98% | |
| | 98% | |
| | 99% | |
| | 99% | |
| | 98% | |
| | 99% | |
| | 96% | |
| | 96% | |
| All categories on Snapshot | 98% | |
| QUILL | Exact Agreement % | +/- 1 Pt on Scale % |
| Overall average quality | | |
| | 79% | 83% |
| | 70% | 76% |
| | 69% | 73% |
| | 71% | 76% |
| | 82% | 85% |
| | 84% | 88% |
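The comparison of observers' coding with criterion coding described above can be sketched as a simple percent-agreement calculation. This is a minimal illustration, not the procedure the OMLIT developers actually used: the function name, the code labels, and the data are hypothetical, and it assumes the reported average is a plain mean of per-observer percentages.

```python
from statistics import mean


def agreement_with_criterion(observer_codes, criterion_codes):
    """Percent of items on which an observer's code matches the criterion code."""
    if len(observer_codes) != len(criterion_codes):
        raise ValueError("observer and criterion must code the same items")
    if not criterion_codes:
        raise ValueError("at least one coded item is required")
    matches = sum(o == c for o, c in zip(observer_codes, criterion_codes))
    return 100.0 * matches / len(criterion_codes)


# Hypothetical criterion codes for five items in a written scenario.
criterion = ["book", "writing", "none", "singing", "book"]

# Hypothetical codes assigned by two observers to the same five items.
observers = {
    "obs_1": ["book", "writing", "none", "singing", "book"],
    "obs_2": ["book", "writing", "none", "book", "book"],
}

# Agreement for each observer, then the average across observers.
per_observer = {
    name: agreement_with_criterion(codes, criterion)
    for name, codes in observers.items()
}
average = mean(per_observer.values())
```

Reporting both the per-observer values and their average mirrors the text, which gives an average level of agreement and also a floor that every individual observer met.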
Coding Videotapes
Observers coded two videotape recordings of teachers reading aloud to a group of children using the RAP. The agreement between the observers’ coding and the criterion coding by the developers was assessed in four areas:
- Instructional behavior in the pre-reading (set-up) period.
- Instructional behavior while reading the book.
- Post-reading instruction.
- Quality ratings on (a) introduction of new vocabulary, (b) depth of story-related discussion, including use of open-ended questions that invite children to engage in prediction, imagination, and/or rich description, and (c) the depth of any post-reading book-related activities that the adult organizes (beyond oral discussion).
Agreement between the observers and the criterion coding was computed as the average agreement across the two videotapes. The average percent agreement was very high on coding the instructional strategies used by the teacher during the read-aloud (Exhibit B-2). Average percent agreement on the Quality Indicators also was high (88%).
| Codes on the RAP | Average % Agreement with Criterion Coding |
|---|---|
| Instructional Behavior | % |
| | 96% |
| | 95% |
| | 98% |
| All Pre-reading, reading, post-reading codes | 96% |
| Quality Indicators | % |
| | 100% |
| | 94% |
| | 91% |
| All Quality Indicators | 92% |
2. Inter-Rater Agreement from Live Observations
Inter-rater agreement on the OMLIT was assessed as part of the training process (14 paired observations), and, subsequent to training, as part of the actual data collection (17 paired observations). The calculation of inter-rater reliability used data from both of these sources.
Classroom Literacy Opportunities Checklist (CLOC)
Scores on the CLOC include an average score across all items and average scores on each of six components of literacy resources. Inter-rater agreement on the CLOC was based only on data from the double-coding in 17 Even Start classrooms.
The average CLOC rating by the two observers agreed exactly in 80% of the pairs (Exhibit B-3). Nine of the ten sections on the CLOC had reliabilities above 70%; the ratings on “literacy materials in other centers” had a lower reliability of 59%. Discussions with observers suggest that the low reliability was attributable to the difficulty of noticing individual literacy resources (a book, pencils and paper) in other centers. We will strive to increase the reliability of this section through (a) improving the definition of the item to help observers understand what they are looking for, and (b) focusing training on these items to heighten observer awareness of isolated materials in different areas of the classroom.
| CLOC Codes(a) (# items) | Average % Agreement(b) | Range in % Agreement Across Observer Pairs |
|---|---|---|
| Total across all items (56) | 80% | 57% - 100% |
| | 91% | 20% - 100% |
| | 77% | 38% - 100% |
| | 78% | 50% - 100% |
| | 81% | 25% - 100% |
| | 82% | 25% - 100% |
| | 73% | 19% - 100% |
| | 71% | 20% - 100% |
| | 76% | 10% - 100% |

(a) Each item rated on a scale of 1-3.
(b) Based on exact agreement between the ratings assigned to CLOC items by paired observers.
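The two summary columns reported for the CLOC (an average percent agreement and a range across observer pairs) can be illustrated with a short sketch. The data and names below are hypothetical, and the sketch assumes the average is the mean of the per-pair percentages; the source does not say whether agreement was instead pooled across all items and pairs.

```python
from statistics import mean


def pair_agreement(items_a, items_b):
    """Percent of items given exactly the same rating by the two observers in a pair."""
    if len(items_a) != len(items_b) or not items_a:
        raise ValueError("both observers must rate the same non-empty set of items")
    matches = sum(a == b for a, b in zip(items_a, items_b))
    return 100.0 * matches / len(items_a)


# Hypothetical ratings (1-3 scale) from three observer pairs on five items each.
pairs = [
    ([1, 2, 3, 2, 1], [1, 2, 3, 2, 1]),
    ([3, 3, 2, 1, 2], [3, 2, 2, 1, 2]),
    ([2, 1, 1, 3, 3], [2, 2, 1, 1, 3]),
]

per_pair = [pair_agreement(a, b) for a, b in pairs]

# Average agreement and the lowest and highest agreement across pairs.
summary = {
    "average": mean(per_pair),
    "low": min(per_pair),
    "high": max(per_pair),
}
```

With only a handful of items in a section, a single disagreement moves a pair's percentage a long way, which is one reason the range across pairs can be wide even when the average is acceptable.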
Quality of Language and Literacy Instruction (QUILL)
Inter-rater reliability on the frequency of the different types of language/literacy activities was defined as two observers selecting the exact same rating (“none,” “one,” “a few,” or “many” instances of the literacy activity). On the quality ratings, agreement was defined as two observers selecting a quality rating that was within one point (on the 5-point scale).
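The two definitions of agreement above (an exact match, and a match within one scale point) differ only in the tolerance allowed between paired ratings. The following is a minimal sketch with hypothetical ratings; the function name and the `tolerance` parameter are illustrative and do not come from the OMLIT documentation.

```python
def percent_agreement(ratings_a, ratings_b, tolerance=0):
    """Percent of paired ratings that differ by at most `tolerance` scale points."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both observers must rate the same non-empty set of items")
    hits = sum(abs(a - b) <= tolerance for a, b in zip(ratings_a, ratings_b))
    return 100.0 * hits / len(ratings_a)


# Hypothetical quality ratings on a 5-point scale from two paired observers.
observer_1 = [3, 4, 2, 5, 1, 3, 4, 2]
observer_2 = [3, 5, 2, 3, 1, 4, 4, 2]

# Exact agreement: both observers chose the same rating.
exact = percent_agreement(observer_1, observer_2)

# Agreement within one point: ratings differ by no more than 1.
within_one = percent_agreement(observer_1, observer_2, tolerance=1)
```

Frequency ratings such as “none,” “one,” “a few,” and “many” could be handled by the same function after mapping them to integers (for example 0 to 3) and leaving `tolerance` at 0, since agreement on frequency was defined as an exact match.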
Inter-rater agreement on the frequency of literacy activities ranged from 67% to 83%, with average agreement of 76% (Exhibit B-4). Coders agreed least often on the frequency of activities that promoted oral language and that called children’s attention to the functions and features of print. On the quality ratings, agreement ranged from 68% to 94%.
| QUILL Codes | Average % Agreement |
|---|---|
| Frequency of literacy activities | Exact |
| All literacy/language activities | 82% |
| Writing activities | 88 |
| Activities to promote letter/word knowledge | 82 |
| Activities to promote oral language | 67 |
| Activities to promote functions/features of print | 67 |
| Activities to promote understanding of sounds | 71 |
| Quality of instruction in literacy | +/- 1 Pt |
| All language and literacy activities | 94% |
| Writing activities | 85 |
| Activities to promote letter/word knowledge | 85 |
| Activities to promote oral language | 87 |
| Activities to promote functions/features of print | 68 |
| Activities to promote understanding of sounds | 69 |
Read Aloud Profile: The RAP
The rate of agreement on the RAP when coding read-alouds in actual classrooms was quite high, even though most 3-hour paired observations involved only 1 or 2 read-alouds. Agreement on instructional behavior before, during, and after reading a book ranged from 85% to 97%, with an overall average of 90% (Exhibit B-5). (The inter-rater agreement on individual instructional codes during reading ranged from 53% to 93%.) The overall quality ratings also had high inter-rater agreement. The inter-rater agreement was around 85% when agreement was defined as within one point; it dropped substantially when agreement required both coders to assign exactly the same quality rating.
| RAP Codes | Average % Agreement | Range in % Agreement Across Observers | |
|---|---|---|---|
| Adult Behavior | | | |
| Pre-reading strategies used by teacher | 89% | 73% - 100% | |
| Reading strategies used by teacher | 85% | 64% - 100% | |
| Post-reading strategies used by teacher | 97% | 73% - 100% | |
| Pre-reading, Reading, Post-reading codes combined | 90% | 76% - 98% | |
| Quality Indicators | +/- 1 pt | Exact | |
| Vocabulary links | 83% | 76% | NA(a) |
| Adult use of open-ended questions | 83% | 64% | NA |
| Depth of post-reading activity | 85% | 76% | NA |

(a) An observer either agreed or not with the rating on the criterion coding, which means there is not a continuous range of agreement.
Classroom Literacy Instruction Profile: The CLIP
Inter-rater agreement on the CLIP was based only on data from the double-coding in 17 Even Start classrooms.
The CLIP measure involves a two-stage coding protocol. First, the observer determines if any of the classroom staff are involved in a literacy activity. If so, then the observer codes seven characteristics of the literacy activity. If no staff member is involved in a literacy activity, the observer records only the type of non-literacy activity that the classroom is involved in. The first aspect of inter-rater reliability that was computed for the CLIP was the extent to which the two coders agreed on whether or not a staff member was involved in a literacy activity during the CLIP coding period. For observation segments where the two raters agreed that the teacher was involved in a literacy activity, the percent agreement was computed on the seven characteristics of the literacy activity.
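The two-stage agreement calculation described above can be sketched as follows. The record layout is an assumption made for illustration: each coder's record for an observation segment is either `None` (no staff member involved in a literacy activity) or a dictionary of characteristic codes. The characteristic names and values are hypothetical, and only two of the seven characteristics are shown.

```python
def clip_agreement(segments):
    """Return (occurrence %, characteristics %) for paired CLIP codings.

    `segments` is a list of (coder_a, coder_b) records, one pair per segment.
    """
    if not segments:
        raise ValueError("at least one observation segment is required")
    occurrence_hits = 0
    char_hits = 0
    char_total = 0
    for a, b in segments:
        # Stage 1: did the coders agree on whether a literacy activity occurred?
        if (a is None) == (b is None):
            occurrence_hits += 1
        # Stage 2: characteristics are compared only when both coders
        # identified a literacy activity in the segment.
        if a is not None and b is not None:
            for key, value in a.items():
                char_total += 1
                char_hits += int(value == b.get(key))
    occurrence_pct = 100.0 * occurrence_hits / len(segments)
    characteristics_pct = 100.0 * char_hits / char_total if char_total else None
    return occurrence_pct, characteristics_pct


# Hypothetical paired codings for five observation segments.
segments = [
    ({"type": "read_aloud", "text_support": "book"},
     {"type": "read_aloud", "text_support": "book"}),
    (None, None),
    ({"type": "writing", "text_support": "none"}, None),
    ({"type": "writing", "text_support": "none"},
     {"type": "writing", "text_support": "print"}),
    (None, None),
]

occurrence_pct, characteristics_pct = clip_agreement(segments)
```

Because the second stage conditions on both coders having identified a literacy activity, agreement on characteristics can be high even when agreement on occurrence is lower.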
On average, the inter-rater agreement on the occurrence of a literacy event was 85% (Exhibit B-6). When both observers identified a literacy activity, there was very high agreement on the characteristics of that activity. The two most critical categories are the type of literacy activity and the literacy knowledge afforded, and the inter-rater agreement on these codes was above 95%. The inter-rater agreement on the quality ratings was also very high.
| CLIP Codes | | Average % Agreement | Range in % Agreement Across Pairs |
|---|---|---|---|
| Occurrence of literacy event | Staff involved in literacy event or not | 85% | 50% - 100% |
| | Rate of literacy activities (total # literacy events/# CLIPs) | 94% | 76% - 100% |
| Characteristics of literacy events | Type of literacy activity | 98% | 50% - 100% |
| | Number of children involved | 96% | 0/1 |
| | Language spoken by teacher | 97% | 71% - 100% |
| | Language spoken by children | 97% | 67% - 100% |
| | Instructional style | 97% | 57% - 100% |
| | Text support | 98% | 61% - 100% |
| | Literacy knowledge afforded | 96% | 56% - 100% |
| Quality ratings | Cognitive challenge | 92% | NA |
| | Depth of discussion | 93% | NA |
Snapshot of Classroom Activities—The Snapshot
High inter-rater agreement was not expected for many of the Snapshot codes, since the allocation of children to activities could vary depending on the direction of rotation of the observer’s scan of the classroom. For this reason, while we expected that observers might agree on the activities taking place in the classroom, they were much more likely to differ on the number of children they assigned to each activity. This also leads us to believe that the inter-rater reliability estimates for the Snapshot present an underestimate of the true level of agreement across trained observers in how they would code an idealized “stationary” classroom.
The Environment section on the Snapshot includes a count of the numbers of children and adults present in the classroom. There was a high level of agreement—above 80%—on all codes on the Environment (Exhibit B-7). On the Activities section of the Snapshot, children and adults are allocated into activities. This is the part of the Snapshot where small differences in timing between observers could adversely affect their agreement. As predicted, the inter-rater agreement was lowest for the categories involving numbers of children in an activity. The level of agreement on the numbers of adults in each activity also was low. On the other hand, the types of activities that each observer coded had higher inter-rater agreement (82%), as did the integration of literacy in activities (88%). Although the level of agreement at the activity level on whether or not children or adults were talking was only 71%, agreement was very high—100%— on whether or not there were any adults or children talking in any of the activities coded on a Snapshot.
| Snapshot Codes | | Average % Agreement | Range in % Agreement Across Pairs(b) |
|---|---|---|---|
| Environment | Total # children present | 88% | 71% - 100% |
| | Type of adults present: teachers/aides | 81% | 71% - 100% |
| | Type of adults present: other | 87% | 71% - 100% |
| | All codes on Environment | 85% | 65% - 100% |
| Activities on Snapshot | Type of activity | 82% | 79% - 100% |
| | Number of children in activity | 57% | 33% - 79% |
| | Number of teachers in activity | 80% | 33% - 78% |
| | Number of aides in activity | 81% | 55% - 92% |
| | Number of other adults in activity | 91% | 60% - 100% |
| | Literacy in other activities | 89% | 76% - 100% |
| | Any language by child/adult in each activity | 71% | 51% - 84% |
| Snapshot-level Codes | Any adult talk in Snapshot | 100% | NA |
| | Any child talk in Snapshot | 100% | NA |
| | Any adult/child talk in Snapshot | 100% | NA |
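The contrast in the Snapshot results between activity-level agreement on talk and Snapshot-level agreement on talk follows from aggregation: two observers can disagree about individual activities and still agree that some talk occurred. The sketch below illustrates this with hypothetical data. It assumes, for simplicity, that both observers coded the same activities in the same order, which a live observation would not guarantee.

```python
def talk_agreement(snapshots):
    """Return (activity-level %, Snapshot-level %) agreement on whether talk occurred.

    Each element of `snapshots` is a pair of lists, one list per observer,
    holding one boolean per activity (True if any talk was coded).
    """
    activity_hits = 0
    activity_total = 0
    snapshot_hits = 0
    for coder_a, coder_b in snapshots:
        if len(coder_a) != len(coder_b):
            raise ValueError("observers must code the same activities")
        # Activity level: compare the two observers activity by activity.
        activity_hits += sum(a == b for a, b in zip(coder_a, coder_b))
        activity_total += len(coder_a)
        # Snapshot level: compare only whether any activity had talk.
        snapshot_hits += int(any(coder_a) == any(coder_b))
    return (
        100.0 * activity_hits / activity_total,
        100.0 * snapshot_hits / len(snapshots),
    )


# Hypothetical paired codings for three Snapshots.
snapshots = [
    ([True, False, True], [True, True, True]),
    ([False, True], [True, True]),
    ([True, True, False], [True, True, False]),
]

activity_pct, snapshot_pct = talk_agreement(snapshots)
```

In this example the observers disagree on two of eight activities, yet every Snapshot contains at least one activity that both coded as involving talk, so Snapshot-level agreement is complete.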
3. IRT Scaling
The QUILL ratings and CLOC constructs have undergone IRT scaling by Futoshi Yamoto, a psychometrician at Abt, which shows these constructs to have very high reliability. A separate technical report has been prepared on the IRT scaling, and this will be available soon.
Reliability of the TOPEL
Psychometric Properties
Exhibit B-8 shows the internal consistency reliabilities for each subtest of the TOPEL at three ages in the standardization sample.
| Subtest | 3 years | 4 years | 5 years |
|---|---|---|---|
| Definitional Vocabulary | .91 | .92 | .91 |
| Phonological Awareness | .85 | .86 | .88 |
| Print Knowledge | .89 | .94 | .95 |

Note: Reliabilities computed from data from the national standardization sample (n=700).
Training on the TOPEL
Child assessors were trained on the three TOPEL subtests over a three-day period, including actual administration of the TOPEL on non-study children. During the training, assessors were trained to a standard of 95% agreement on coding of standardized test protocols and use of appropriate probes. Each trainee practiced test administration using practice scripts (designed to test administration rules) while trainers observed and provided feedback.
Trainees then practiced administering the test with children in volunteer sites while trainers observed and coded children’s responses while monitoring administration to ensure that standardized procedures were followed. Each trainee’s record booklets were then compared with trainers’ simultaneously coded booklets to check for variance immediately following each administration, and feedback was provided as necessary. Trainees continued practice administration until no variances occurred. Prior to working with study children, each trainee was “tested-out” by administering the complete battery to a trainer, who followed a script designed to test administration rules.
Finally, initial data collection was conducted under the immediate supervision of trainers; that is, each assessor was observed by a trainer while conducting assessments with actual study children. Any deviation from standardized procedures or variance in recorded scoring was grounds for termination. Thus, all assessors who were invited to continue data collection had demonstrated mastery of standard administration.