
What metrics do you rely on to judge the accuracy and consistency of scoring constructed-response items in large-scale assessment?
Most credible large-scale assessment programs continuously monitor scoring accuracy and consistency during scoring sessions. Scoring accuracy is often measured through the use of "validity items": responses that experts score before the scoring session begins in order to establish their "true" scores. These validity items are then seeded into the flow of items that raters score throughout the session. Comparing the raters' scores with those assigned by the experts allows accuracy to be judged at the level of individual raters, groups or tables of raters, scoring rooms, and the scoring site overall. Validity targets, that is, the expected percentages of exact and adjacent agreement for rubrics of different sizes (e.g., 2-, 3-, 4-, 5-, and 6-point rubrics), can then be established for ongoing rater monitoring and remediation.
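As a concrete illustration, here is a minimal Python sketch of the exact and adjacent agreement calculation against validity items. The function name and the sample scores are illustrative assumptions, not drawn from any particular program.

```python
# Minimal sketch: exact and adjacent agreement against validity items.
# Assumes two parallel lists of integer rubric scores; data is illustrative.

def validity_agreement(rater_scores, true_scores):
    """Return (exact %, exact-plus-adjacent %) agreement with expert scores."""
    if len(rater_scores) != len(true_scores) or not rater_scores:
        raise ValueError("score lists must be non-empty and equal in length")
    exact = sum(r == t for r, t in zip(rater_scores, true_scores))
    adjacent = sum(abs(r - t) == 1 for r, t in zip(rater_scores, true_scores))
    n = len(true_scores)
    return 100 * exact / n, 100 * (exact + adjacent) / n

# Example: one rater's scores on ten seeded validity items (4-point rubric)
rater = [3, 2, 4, 1, 3, 2, 2, 4, 3, 1]
truth = [3, 2, 3, 1, 3, 1, 2, 4, 4, 1]
exact_pct, exact_adj_pct = validity_agreement(rater, truth)
print(f"exact: {exact_pct:.0f}%, exact+adjacent: {exact_adj_pct:.0f}%")
```

The same function can be run over any monitoring unit, from a single rater's seeded items up to all validity items scored at the site.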
Scoring consistency can be determined by calculating the exact and adjacent agreement between pairs of raters who independently score the same response (inter-rater reliability). These reliability figures can be reported at the same levels: individual raters, groups or tables of raters, scoring rooms, and the scoring site overall.
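The inter-rater calculation works the same way, applied to double-scored responses rather than expert-scored seeds. The sketch below assumes each response ID maps to the two independent scores it received; the data is illustrative.

```python
# Sketch of inter-rater agreement on double-scored responses.
# Each response ID maps to its two independent scores (illustrative data).

from collections import Counter

double_scores = {
    "resp-001": (3, 3),
    "resp-002": (2, 3),
    "resp-003": (4, 4),
    "resp-004": (1, 3),
    "resp-005": (2, 2),
}

tally = Counter()
for first, second in double_scores.values():
    diff = abs(first - second)
    tally["exact" if diff == 0 else "adjacent" if diff == 1 else "discrepant"] += 1

n = len(double_scores)
print(f"exact: {100 * tally['exact'] / n:.0f}%")
print(f"exact+adjacent: {100 * (tally['exact'] + tally['adjacent']) / n:.0f}%")
```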
A variety of additional scoring statistics (e.g., daily and cumulative mean scores and score-point distributions) can be used to monitor for potential scoring drift at each of these levels.
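A simple comparison of daily against cumulative statistics might look like the following sketch; the sample scores and the flagging threshold are assumptions chosen for illustration, not standards from any program.

```python
# Sketch of daily vs. cumulative statistics for one rater, used to flag
# possible scoring drift. Data and the 0.5-point threshold are illustrative.

from collections import Counter
from statistics import mean

cumulative_scores = [2, 3, 3, 4, 2, 3, 1, 3, 2, 4, 3, 2]  # all prior days
today_scores = [1, 2, 2, 1, 3, 2]                          # current day

print(f"cumulative mean: {mean(cumulative_scores):.2f}")
print(f"today's mean:    {mean(today_scores):.2f}")
print(f"today's score-point distribution: {dict(Counter(today_scores))}")

# A large gap between the daily and cumulative means may signal drift.
if abs(mean(today_scores) - mean(cumulative_scores)) > 0.5:
    print("Flag: daily mean deviates from cumulative mean; review this rater.")
```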
Large-scale assessments with which I have been associated in recent years have used the following validity and inter-rater reliability targets (summarized as a lookup table in the sketch after this list):
- Two-point rubrics: 80% exact and 95% exact-plus-adjacent agreement
- Three-point rubrics: 75% exact and 95% exact-plus-adjacent agreement
- Four-point rubrics: 70% exact and 95% exact-plus-adjacent agreement
- Five-point rubrics: 65% exact and 95% exact-plus-adjacent agreement
- Six-point rubrics: 60% exact and 95% exact-plus-adjacent agreement
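Expressed as a lookup keyed by rubric size, these targets can be checked programmatically. The helper below is a hypothetical sketch; only the target percentages come from the list above.

```python
# The targets above as a lookup: rubric points -> (min % exact, min % exact+adjacent).
TARGETS = {
    2: (80, 95),
    3: (75, 95),
    4: (70, 95),
    5: (65, 95),
    6: (60, 95),
}

def meets_targets(rubric_points, exact_pct, exact_adj_pct):
    """True if observed agreement meets the target for this rubric size."""
    min_exact, min_exact_adj = TARGETS[rubric_points]
    return exact_pct >= min_exact and exact_adj_pct >= min_exact_adj

# Example: the 4-point-rubric rater from the validity sketch above
print(meets_targets(4, 70, 100))  # True: 70% exact, 100% exact+adjacent
```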
Would any large-scale assessment colleagues be willing to share information about the types of statistics and target metrics they use to gauge the accuracy and consistency of scoring? If so, please connect with me at richard.jones@rmjassessment.com.
Thanks for sharing.