Department of Health and Human Services
Administration for Children and Families
          
Office of Planning, Research & Evaluation (OPRE)

Attachment B

Reliability of the Measures

Classroom Observation Measures: Observation Measures of Language and Literacy Instruction (OMLIT)

The observations focused on literacy instructional processes and environments in the classrooms, specifically on aspects of classroom practice that have been shown in empirical research to support children’s language development and acquisition of early literacy skills. The complete battery of observation measures includes five instruments from the Observation Measures of Language and Literacy Instruction (OMLIT; Goodson, Layzer, Smith, and Rimdzius, 2004) battery and the Arnett Caregiver Rating Scale (Arnett, 1989).

The Snapshot of Classroom Activities (OMLIT-Snapshot)

The OMLIT-Snapshot is a description of classroom activities and groupings, integration of literacy in other activities, and language in the classroom. It has two sections. The Environment section describes the number of children and adults present, as well as the type of adult (staff, parents, etc.). The Activities section describes activities that are taking place. For each activity, the observer records the number of children and adults in that activity, whether any adult or child is talking, whether they are speaking English or another language, and whether literacy materials are used (text, writing, letters, singing).

The Read Aloud Profile (OMLIT-RAP)

The OMLIT-RAP is a description of staff behavior when reading aloud to children (in CLIO, the RAP was completed when an adult was reading to at least two children). The RAP records adult behavior during the read-aloud session in four categories: (a) pre-reading (set-up) behavior, (b) behavior while reading the book, (c) post-reading behavior, and (d) the language the adult uses when talking to children during the read-aloud. The RAP also records characteristics of the adult, the children, and the book itself in three categories: (a) role of the adult involved in the read-aloud (e.g., teacher, aide, etc.), (b) characteristics of the book being read, and (c) number of children involved in the read-aloud. Finally, the RAP includes five quality indicators that summarize particular aspects of the read-aloud: (a) the degree to which the adult introduces and contextualizes new vocabulary to support children’s learning, (b) the depth of the discussion related to the story that the adult facilitates with the children before, during, and after the read-aloud, (c) the extent to which the adult uses open-ended questions that invite children to engage in prediction, imagination, and/or rich description, (d) the depth of children’s engagement with the read-aloud activity, and (e) the quality of any post-reading book-related activities that the adult organizes (beyond oral discussion).

The Classroom Literacy Opportunities Checklist (OMLIT-CLOC)

The OMLIT-CLOC is an inventory of classroom literacy resources. It provides an overall rating of the extent to which a classroom is a literacy-rich environment and delineates eight aspects of the literacy environment: (a) physical layout of the classroom, (b) the text or print environment, (c) books and reading or listening areas, (d) writing resources, (e) literacy-related materials and toys, (f) cultural diversity in literacy materials, (g) literacy integrated in classroom areas or learning centers, and (h) the richness and integration of a curriculum theme.

The Classroom Literacy Instruction Profile (OMLIT-CLIP)

The OMLIT-CLIP involves a two-stage coding protocol in which the observer first determines whether any classroom staff member is involved in a literacy activity and, if so, codes seven characteristics of the literacy activity: the type of activity, the literacy knowledge being afforded to the children, the adult’s level and type of participation in the activity, any text support, languages spoken by staff and children, and the number of children involved. If the literacy activity involves adult-child discussion, the quality of this discussion is rated on three characteristics: the cognitive challenge in the discussion, the extensiveness of the discussion, and the level of abstraction of the discussion.

The Quality of Language and Literacy Instruction (OMLIT-QUILL)

The OMLIT-QUILL is an overall evaluation of the quantity and quality of the instructional practices that build children’s print awareness and oral language skills, expose children to a rich and varied vocabulary, and build children’s phonological awareness. These practices are predictors of better reading outcomes for children once they are in school; this is particularly true of those at risk for reading difficulties (Dickinson and Tabors, 2001; Lonigan, Burgess, and Anthony, 2000; NICHD, 2000; Snow, Burns, and Griffin, 1998; Whitehurst and Lonigan, 1998). In addition, the QUILL evaluates instructional practices with English language learners.

Reliability of the OMLIT

Two kinds of reliability have been established for the OMLIT measures, based on data from two national observation studies:

  • Inter-rater reliability: the degree of agreement between two trained observers administering the observation measures at the same time in the same classroom.

  • Agreement with a criterion: the extent of agreement between coding by trained observers and “master” or “correct” coding by experts of a standardized stimulus (e.g., a videotape of a classroom, written examples, etc.).

The discussion below presents preliminary data on these two types of reliability. Future waves of observations will provide additional data to increase the accuracy of our estimates of the reliability. Other types of reliability will depend on different data collection designs planned for the near future.

1. Agreement with Criterion Coding

Paper and Pencil Tests

Reliability was assessed via paper and pencil tests on two of the OMLIT measures—the Snapshot and the QUILL. Written scenarios describing classroom events were prepared and coded in advance by the OMLIT developers (the “criterion” coding). The accuracy of observers’ coding of written scenarios was determined by comparing it to the criterion coding of the same scenarios. Although this type of paper-and-pencil test does not simulate the “live” action in a classroom, it does provide a measure of how well observers understand the coding definitions for the various activities and specialized literacy data.

On the Snapshot, observers coded 15 written scenarios, and their coding was compared to criterion coding of the same written scenarios done in advance by three of the OMLIT developers. A high level of agreement was achieved between the coding done by the observers and the criterion coding (Exhibit B-1). On average, the coding of the written scenarios by the observers agreed almost perfectly with the criterion coding by the trainers. Further, each of the individual observers scored 95% or higher on the agreement between their coding and the criterion coding.

On the QUILL, the agreement ranged from 69% to 84% when agreement was defined as an exact match in ratings (Exhibit B-1). The agreement was substantially higher when the definition of agreement was expanded to agreement within a point.
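
The two definitions of agreement used for the QUILL can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the function name `percent_agreement` and the ratings below are hypothetical and are not data or code from the study. The sketch compares one observer's ratings on a 5-point scale with criterion ratings, first requiring an exact match and then allowing a difference of one point.

```python
def percent_agreement(observer, criterion, tolerance=0):
    """Percent of ratings that fall within `tolerance` points of the criterion.

    tolerance=0 means an exact match; tolerance=1 means agreement within a point.
    """
    if len(observer) != len(criterion):
        raise ValueError("rating lists must be the same length")
    if not observer:
        raise ValueError("at least one rating is required")
    hits = sum(abs(o - c) <= tolerance for o, c in zip(observer, criterion))
    return 100.0 * hits / len(observer)


# Hypothetical ratings on a 5-point scale (illustrative only, not study data).
criterion = [3, 4, 2, 5, 1, 3, 4, 2, 3, 5]
observer = [3, 4, 3, 5, 1, 1, 4, 2, 4, 5]

exact = percent_agreement(observer, criterion)
within_one = percent_agreement(observer, criterion, tolerance=1)
print(f"exact: {exact:.0f}%, within one point: {within_one:.0f}%")
# prints "exact: 70%, within one point: 90%"
```

Agreement within a point can never be lower than exact agreement, because every exact match also counts as a match within one point.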

Exhibit B-1

Average Agreement on Coding Written Scenarios on the Snapshot and QUILL
(% Agreement between 14 Observers and Criterion Coding)
Codes/Variables                                     Average % Agreement with Criterion Coding

Snapshot
Environment (all codes)                             98
  • Total # children present                        93
  • Total # adults present                          98
  • Type of adults present: teachers and aides      99
  • Type of adults present: other adults            99
Activities (all codes)                              98
  • Type of activity                                98
  • Number of children in activity                  99
  • Number of teachers in activity                  99
  • Number of aides in activity                     98
  • Number of other adults in activity              99
  • Integration of literacy in other activities     96
  • Any language by children or adults              96
All categories on Snapshot                          98

QUILL                                               Exact Agreement (%)    +/- 1 Pt on Scale (%)
Overall average quality
  • Writing                                         79                     83
  • Letter/word knowledge                           70                     76
  • Oral language                                   69                     73
  • Functions/features of print                     71                     76
  • Print motivation                                82                     85
  • Sounds                                          84                     88

Coding Videotapes

Observers coded two videotape recordings of teachers reading aloud to a group of children using the RAP. The agreement between the observers’ coding and the criterion coding by the developers was assessed in four areas:

  • Instructional behavior in the pre-reading (set-up) period.
  • Instructional behavior while reading the book.
  • Post-reading instruction.
  • Quality ratings on (a) introduction of new vocabulary, (b) depth of story-related discussion, including use of open-ended questions that invite children to engage in prediction, imagination, and/or rich description, and (c) the depth of any post-reading book-related activities that the adult organizes (beyond oral discussion).

Agreement between the observers and the criterion coding was computed as the average agreement across the two videotapes. The average percent agreement was very high on coding the instructional strategies used by the teacher during the read-aloud (Exhibit B-2). Average percent agreement on the Quality Indicators also was high (88%; see footnote 29).

Exhibit B-2

Average Agreement on Coding Videotaped Read Alouds with the RAP
(% Agreement between 14 Observers and Criterion Coding)
Codes on the RAP                                    Average % Agreement with Criterion Coding

Instructional Behavior
  • Pre-reading strategies                          96
  • Reading strategies                              95
  • Post-reading strategies                         98
All pre-reading, reading, post-reading codes        96

Quality Indicators
  • Vocabulary links                                100
  • Adult use of open-ended questions               94
  • Depth of post-reading activity                  91
All Quality Indicators                              92

2. Inter-Rater Agreement from Live Observations

Inter-rater agreement on the OMLIT was assessed as part of the training process (14 paired observations), and, subsequent to training, as part of the actual data collection (17 paired observations). The calculation of inter-rater reliability used data from both of these sources.

Classroom Literacy Opportunities Checklist (CLOC)

Scores on the CLOC include an average score across all items and average scores on each of eight components of literacy resources. Inter-rater agreement on the CLOC was based only on data from the double-coding in 17 Even Start classrooms.

The average CLOC rating by the two observers agreed exactly in 80% of the pairs (Exhibit B-3). Nine of the ten sections on the CLOC had reliabilities above 70%; the ratings on “literacy materials in other centers” had a lower reliability of 59%. Discussions with observers suggest that the low reliability was attributable to the difficulty of noticing individual literacy resources (a book, pencils and paper) in other centers. We will strive to increase the reliability of this section through (a) improving the definition of the item to help observers understand what they are looking for, and (b) focusing training on these items to heighten observer awareness of isolated materials in different areas of the classroom.

Exhibit B-3

Inter-Rater Agreement on the CLOC
(17 Paired Observations of Early Childhood Classrooms)
CLOC Codes (a) (# items)                        Average % Agreement (b)    Range in % Agreement Across Observer Pairs
Total across all items (56)                     80                         57% - 100%
  • Physical layout of classrooms (5)           91                         20% - 100%
  • Print environment (8)                       77                         38% - 100%
  • Books/reading area/listening area (16)      78                         50% - 100%
  • Writing resources (5)                       81                         25% - 100%
  • Literacy toys and materials (7)             82                         25% - 100%
  • Cultural diversity (3)                      73                         19% - 100%
  • Literacy in other centers (3)               71                         20% - 100%
  • Curriculum theme (9)                        76                         10% - 100%

(a) Each item rated on a scale of 1-3.

(b) Based on exact agreement between the ratings assigned to CLOC items by paired observers.
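
As an illustration of how an average and a range of agreement across observer pairs could be derived, the Python sketch below computes exact agreement for each pair of observers and then summarizes across pairs. It assumes that the average is the mean of the per-pair percentages, which the text does not state explicitly; the function names and ratings are hypothetical and are not study data.

```python
def pair_agreement(ratings_a, ratings_b):
    """Percent of items on which two observers gave exactly the same rating."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both observers must rate the same non-empty item set")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * matches / len(ratings_a)


def summarize_pairs(pairs):
    """Mean, minimum, and maximum exact agreement across observer pairs."""
    scores = [pair_agreement(a, b) for a, b in pairs]
    return {"mean": sum(scores) / len(scores), "min": min(scores), "max": max(scores)}


# Hypothetical ratings on a 1-3 scale for three observer pairs, five items each.
pairs = [
    ([1, 2, 3, 3, 2], [1, 2, 3, 3, 2]),  # all five items match
    ([1, 1, 2, 3, 3], [1, 2, 2, 3, 1]),  # three of five items match
    ([2, 2, 3, 1, 1], [2, 2, 3, 1, 2]),  # four of five items match
]

summary = summarize_pairs(pairs)
print(summary)  # prints {'mean': 80.0, 'min': 60.0, 'max': 100.0}
```

With only a few items in a section, a single disagreement moves a pair's percentage a long way, which is one reason the range across pairs can be wide even when the average is high.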

Quality of Language and Literacy Instruction (QUILL)

Inter-rater reliability on the frequency of the different types of language/literacy activities was defined as two observers selecting the exact same rating (“none,” “one,” “a few,” or “many” instances of the literacy activity). On the quality ratings, agreement was defined as two observers selecting quality ratings that were within one point of each other (on the 5-point scale).

Inter-rater agreement on the frequency of literacy activities ranged from 67% to 83%, with average agreement of 76% (Exhibit B-4). Coders agreed least often on the frequency of activities that promoted oral language and that called children’s attention to the functions and features of print. On the quality ratings, agreement ranged from 68% to 94%.

Exhibit B-4

Inter-Rater Agreement on the QUILL
(31 Paired Observations of Early Childhood Classrooms)
QUILL Codes                                              Average % Agreement

Frequency of literacy activities                         Exact
All literacy/language activities                         82
Writing activities                                       88
Activities to promote letter/word knowledge              82
Activities to promote oral language                      67
Activities to promote functions/features of print        67
Activities to promote understanding of sounds            71

Quality of instruction in literacy                       +/- 1 Pt
All language and literacy activities                     94
Writing activities                                       85
Activities to promote letter/word knowledge              85
Activities to promote oral language                      87
Activities to promote functions/features of print        68
Activities to promote understanding of sounds            69

Read Aloud Profile: The RAP

The rate of agreement on the RAP when coding read-alouds in actual classrooms was quite high, even though most 3-hour paired observations involved only 1 or 2 read-alouds. Agreement on instructional behavior before, during, and after reading a book ranged from 85% to 97%, with an overall average of 90% (Exhibit B-5). (The inter-rater agreement on individual instructional codes during reading ranged from 53% to 93%.) The overall quality ratings also had high inter-rater agreement. The inter-rater agreement was around 85% when agreement was defined as within one point; the agreement dropped substantially when agreement required both coders to derive exactly the same quality rating.

Exhibit B-5

Inter-Rater Agreement on the RAP
(31 Paired Observations of Early Childhood Classrooms)
RAP Codes                                             Average % Agreement      Range in % Agreement Across Observers

Adult Behavior
Pre-reading strategies used by teacher                89                       73% - 100%
Reading strategies used by teacher                    85                       64% - 100%
Post-reading strategies used by teacher               97                       73% - 100%
Pre-reading, Reading, Post-reading codes combined     90                       76% - 98%

Quality Indicators                                    +/- 1 pt     Exact
Vocabulary links                                      83           76          NA (a)
Adult use of open-ended questions                     83           64          NA
Depth of post-reading activity                        85           76          NA

(a) An observer either agreed or not with the rating on the criterion coding, which means there is not a continuous range of agreement.

Classroom Literacy Instruction Profile: The CLIP

Inter-rater agreement on the CLIP was based only on data from the double-coding in 17 Even Start classrooms.

The CLIP measure involves a two-stage coding protocol. First, the observer determines if any of the classroom staff are involved in a literacy activity. If so, then the observer codes seven characteristics of the literacy activity. If no staff member is involved in a literacy activity, the observer records only the type of non-literacy activity that the classroom is involved in. The first aspect of inter-rater reliability that was computed for the CLIP was the extent to which the two coders agreed on whether or not a staff member was involved in a literacy activity during the CLIP coding period. For observation segments where the two raters agreed that the teacher was involved in a literacy activity, the percent agreement was computed on the seven characteristics of the literacy activity.
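
The two-stage agreement calculation described above can be sketched in Python. This is a minimal illustration, assuming each coding period is stored as a pair of records (one per coder) with a `literacy` flag and, when a literacy activity was coded, its characteristics. The record layout, the key `activity_type`, and the example values are hypothetical and do not come from the study.

```python
def clip_agreement(segments, characteristic):
    """Two-stage agreement for paired CLIP-style codings.

    Stage 1: percent of segments where both coders agree on whether a staff
             member was involved in a literacy activity.
    Stage 2: among segments that both coders flagged as literacy, percent
             where they agree on the given characteristic.
    Returns (occurrence_agreement, characteristic_agreement); the second value
    is None when no segment was flagged as literacy by both coders.
    """
    if not segments:
        raise ValueError("at least one paired segment is required")
    occurrence_hits = sum(a["literacy"] == b["literacy"] for a, b in segments)
    occurrence = 100.0 * occurrence_hits / len(segments)

    both = [(a, b) for a, b in segments if a["literacy"] and b["literacy"]]
    if not both:
        return occurrence, None
    char_hits = sum(a[characteristic] == b[characteristic] for a, b in both)
    return occurrence, 100.0 * char_hits / len(both)


# Hypothetical paired codings for four observation segments.
segments = [
    ({"literacy": True, "activity_type": "read aloud"},
     {"literacy": True, "activity_type": "read aloud"}),
    ({"literacy": True, "activity_type": "writing"},
     {"literacy": True, "activity_type": "letter game"}),
    ({"literacy": True, "activity_type": "singing"},
     {"literacy": False}),
    ({"literacy": False}, {"literacy": False}),
]

occurrence, type_agreement = clip_agreement(segments, "activity_type")
print(occurrence, type_agreement)  # prints 75.0 50.0
```

Restricting the second stage to segments that both coders flagged keeps disagreements about whether a literacy event occurred from being counted a second time as disagreements about its characteristics.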

On average, the inter-rater agreement on the occurrence of a literacy event was 85% (Exhibit B-6). When both observers identified a literacy activity, there was very high agreement on the characteristics of that activity. The two most critical categories are the type of literacy activity and the literacy knowledge afforded, and the inter-rater agreement on these codes was above 95%. The inter-rater agreement on the quality ratings was also very high.

Exhibit B-6

Inter-Rater Agreement on the CLIP
(17 Paired Observations of Early Childhood Classrooms)
CLIP Codes                                                         Average % Agreement    Range in % Agreement Across Pairs

Occurrence of literacy event
Staff involved in literacy event or not                            85                     50% - 100%
Rate of literacy activities (total # literacy events/# CLIPs)      94                     76% - 100%

Characteristics of literacy events
Type of literacy activity                                          98                     50% - 100%
Number of children involved                                        96                     0/1
Language spoken by teacher                                         97                     71% - 100%
Language spoken by children                                        97                     67% - 100%
Instructional style                                                97                     57% - 100%
Text support                                                       98                     61% - 100%
Literacy knowledge afforded                                        96                     56% - 100%

Quality ratings
Cognitive challenge                                                92                     NA
Depth of discussion                                                93                     NA

Snapshot of Classroom Activities: The Snapshot

High inter-rater agreement was not expected for many of the Snapshot codes, since the allocation of children to activities could vary depending on the direction of rotation of the observer’s scan of the classroom. For this reason, while we expected that observers might agree on the activities taking place in the classroom, they were much more likely to differ on the number of children they assigned to each activity. This also leads us to believe that the inter-rater reliability estimates for the Snapshot underestimate the true level of agreement across trained observers in how they would code an idealized “stationary” classroom.

The Environment section on the Snapshot includes a count of the numbers of children and adults present in the classroom. There was a high level of agreement (above 80%) on all codes on the Environment (Exhibit B-7). On the Activities section of the Snapshot, children and adults are allocated into activities. This is the part of the Snapshot where small differences in timing between observers could adversely affect their agreement. As predicted, the inter-rater agreement was lowest for the categories involving numbers of children in an activity. The level of agreement on the numbers of adults in each activity also was low. On the other hand, the types of activities that each observer coded had higher inter-rater agreement (82%), as did the integration of literacy in activities (88%). Although the level of agreement at the activity level on whether or not children or adults were talking was only 71%, agreement was very high (100%) on whether or not there were any adults or children talking in any of the activities coded on a Snapshot.

Exhibit B-7

Inter-Rater Agreement on the Snapshot
(31 Paired Observations of Early Childhood Classrooms)
Snapshot Codes                                      Average % Agreement    Range in % Agreement Across Pairs(b)

Environment
Total # children present                            88                     71% - 100%
Type of adults present: teachers/aides              81                     71% - 100%
Type of adults present: other                       87                     71% - 100%
All codes on Environment                            85                     65% - 100%

Activities on Snapshot
Type of activity                                    82                     79% - 100%
Number of children in activity                      57                     33% - 79%
Number of teachers in activity                      80                     33% - 78%
Number of aides in activity                         81                     55% - 92%
Number of other adults in activity                  91                     60% - 100%
Literacy in other activities                        89                     76% - 100%
Any language by child/adult in each activity        71                     51% - 84%

Snapshot-level Codes
Any adult talk in Snapshot                          100                    NA
Any child talk in Snapshot                          100                    NA
Any adult/child talk in Snapshot                    100                    NA

3. IRT Scaling

The QUILL ratings and CLOC constructs have undergone IRT scaling by Futoshi Yamoto, a psychometrician at Abt, which shows these constructs to have very high reliability. A separate technical report has been prepared on the IRT scaling, and this will be available soon.

Reliability of the TOPEL

Psychometric Properties

Exhibit B-8 shows the internal consistency reliabilities for each subtest of the TOPEL at three ages in the standardization sample.

Exhibit B-8

Internal Consistency Reliabilities on the TOPEL
Subtest                      3 years    4 years    5 years
Definitional Vocabulary      .91        .92        .91
Phonological Awareness       .85        .86        .88
Print Knowledge              .89        .94        .95

Note: Reliabilities computed from data from national standardization sample (n=700)
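
The exhibit does not say which internal consistency coefficient was used. One widely used coefficient is Cronbach's alpha, and the Python sketch below shows how it is computed, purely as an illustration of what an internal consistency reliability measures. The function name and the item scores are hypothetical; they are not TOPEL items or standardization data.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of per-child item-score lists.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    Sample variances are used for both the items and the totals.
    """
    if len(scores) < 2:
        raise ValueError("at least two children are required")
    k = len(scores[0])
    if k < 2 or any(len(child) != k for child in scores):
        raise ValueError("every child needs scores on the same two or more items")
    item_vars = [variance([child[i] for child in scores]) for i in range(k)]
    total_var = variance([sum(child) for child in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical right/wrong (1/0) scores for five children on four items.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(scores)
print(round(alpha, 2))  # prints 0.8
```

Values near 1 indicate that the items in a subtest tend to rank children in the same order, which is the property the reliabilities in Exhibit B-8 describe.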

Training on the TOPEL

Child assessors were trained on the three TOPEL subtests over a three-day period, including actual administration of the TOPEL on non-study children. During the training, assessors were trained to a standard of 95% agreement on coding of standardized test protocols and use of appropriate probes. Each trainee practiced test administration using practice scripts (designed to test administration rules) while trainers observed and provided feedback.

Trainees then practiced administering the test with children in volunteer sites. Trainers observed, coded the children’s responses, and monitored administration to ensure that standardized procedures were followed. Each trainee’s record booklets were then compared with trainers’ simultaneously coded booklets to check for variance immediately following each administration, and feedback was provided as necessary. Trainees continued practice administration until no variances occurred. Prior to working with study children, each trainee was “tested-out” by administering the complete battery to a trainer, who followed a script designed to test administration rules.

Finally, initial data collection was conducted under immediate supervision of trainers; that is, each assessor was observed (by a trainer) while conducting assessments with actual study children. Any deviation from standardized procedures or variance in recorded scoring was grounds for termination. Thus, all assessors who were invited to continue data collection had demonstrated mastery of standard administration.




29 Two quality indicators were dropped from the RAP, based on low agreement. “Level of child engagement” and “Depth of adult discussion” were eliminated, because the average agreement on the coding of videotapes was below 75% for each.

 
