A recently published piece on high-stakes essay scoring paints a depressing picture of this corner of the testing industry. Grading essays is a process entirely different from feeding Scantron sheets through a machine. But the companies holding contracts to provide the student data by which we judge the performance of schools (and, increasingly, teachers) seem to be doing their damnedest to pretend that it isn’t.
Workers interviewed for the article, many of them temps, report receiving limited training from ill-informed supervisors (some of whom are also temps). There is pressure to move quickly, scoring up to 200 essays per day. Scorers are also pushed to make sure their evaluations match the scores given by coworkers reading the same essay. Now, interrater agreement is generally a good thing. It serves as a check on scorers’ judgments, boosting objectivity in what is ultimately a subjective exercise. In the context of at least one for-profit educational testing company, however, this becomes warped:
Supervisors were expected to turn the test scores into a nice bell curve. If his room did not agree at least 80 percent of the time, the tests would be taken back and re-graded, wasting time and money. The supervisor would be put on probation or demoted.
When Farley complained to a fellow supervisor about his problem, she smiled wryly and held up a pencil.
“I’ve got this eraser, see,” she told him. “I help them out.”
The other important element here, which gets limited attention in the story, is the larger question of how we assess writing. This is what I study, and I can say that there’s surprisingly little agreement on what good writing looks like across grade levels. There’s evidence of this in the high-stakes writing prompts themselves, and it’s even more glaring in the rubrics used to score the essays. The article presents an approximation of Questar’s (proprietary) rubric; by way of comparison, Michigan’s Department of Education publishes theirs, and it’s every bit as vague. Again, companies are employing best practices on the surface, but things quickly go awry.
As part of their training, Indovino and her co-workers read through pre-graded examples out loud, then discussed why each had been scored the way it was. The process quickly divided the room into two camps—the young, unemployed kids who were just there for a paycheck, and the retired teachers.
“The retired teachers would argue everything,” says Indovino.
After two days of going through example papers, each scorer had to pass a qualifying exam. Indovino scored three sets of ten pre-scored papers. In order to be approved to work on the project, she had to pass two of the sets with at least an 80 percent “agreement rate” with the rubric. She did so with relative ease; most of the rest of the room passed on their second try.
Her first project was from Arkansas, an essay written by eighth-graders on the topic, “A fun thing to do in my town.”
And that’s where the troubles began … How do you score a kid who rails that his town sucks? What about an exceptionally well-written essay on why the student was refusing to answer the question?
Experienced teachers do have a real sense of what they’re looking for, but even they sometimes disagree, which I’d argue is a great thing for a genuine scoring process, and a scary thing for a high-volume testing operation that a) is frequently carried out by people with little or no educational experience or training, b) depends on the use of rubrics that provide very little guidance for inexperienced raters with “fuzzy” cases, and c) has serious consequences for the students, teachers, and schools on the other end.