Rubric validity and reliability are two important factors in the quality of rubrics and assessments. Reliability describes the level of discrepancy in evaluation results when a rubric is used by multiple assessors – or by the same assessor multiple times – on the same item. The more varied the results, the less reliable a rubric is.
For example, if you evaluate a written paper using a rubric today and give its criteria certain scores, what is the likelihood of your colleagues giving the same paper the same scores using that rubric? What is the likelihood of you giving the same paper the same score a month or a year from now? A reliable rubric has descriptors clear enough to remove as much bias as possible, reducing the likelihood of varying results across multiple raters.
What issues make a rubric unreliable?
Rubric reliability issues arise when a rubric is not applied the same way by every rater, and that is generally a result of descriptors being vague, subjective, or unrelated to the criteria – or of there being no descriptors at all. So how do you fix rubric descriptors?
- No descriptors: A Likert scale is the typical star ranking that you see on social media and elsewhere. Likert scales are highly subjective and provide no reference for what the expectations are or how the body of work should be evaluated. A rubric without any descriptors is really a Likert scale, even if it looks like a rubric.
- Short, vague descriptors: Now that we have established that a rubric without descriptors is unreliable, the next question is what makes a rubric with descriptors unreliable. The answer is descriptors that are vague, short, subjective, or unrelated to the criterion. For example, if the descriptor for the “Fair” level is “Ok work, could be better”, the rater is left with a highly subjective choice, and each rater may have a different concept of what “Ok work” is. A descriptor that leaves very little ambiguity, on the other hand, is far more objective and reliable.
- Generic rubrics: Some organizations publish generic rubrics to be used as starting points for developing more specific, more reliable rubrics. Make sure to modify the descriptors to match your specific case.
The best way to detect the cause of rubric reliability issues is by analyzing the results of evaluations, which leads us to the next section.
How to analyze reliability?
As described above, insufficient descriptors are the main cause of rubric reliability issues. An easy way to detect reliability issues without any tools is simply to read the descriptors. If there are no descriptors, the rubric is not reliable. If the descriptors are vague and subjective, the level of reliability varies. You can also use computational methods, as well as manual eyeballing, to measure reliability – after the rubric has been used by multiple scorers on the same exact items.
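As an illustration of the "just read the descriptors" check, a short script can flag levels whose descriptors are missing or suspiciously short. The rubric structure and the length threshold below are made-up examples for the sketch, not an iRubric format:

```python
# Flag rubric levels whose descriptors are missing or too short to be specific.
# The rubric structure and the 25-character threshold are illustrative assumptions.
rubric = {
    "Organization": {
        "Poor": "",                                   # no descriptor at all
        "Fair": "Ok work, could be better",           # vague descriptor
        "Good": "Ideas are grouped into paragraphs, each with a clear topic sentence.",
        "Excellent": "Ideas are grouped logically, transitions connect paragraphs, "
                     "and the conclusion follows from the evidence presented.",
    },
}

MIN_DESCRIPTOR_LENGTH = 25  # arbitrary cutoff for "probably too short to be specific"

for criterion, levels in rubric.items():
    for level, descriptor in levels.items():
        if not descriptor.strip():
            print(f"{criterion} / {level}: missing descriptor (Likert-style rating)")
        elif len(descriptor) < MIN_DESCRIPTOR_LENGTH:
            print(f"{criterion} / {level}: short descriptor, check for vagueness")
```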
To manually detect rubric reliability, have multiple raters evaluate the same body of work, then compare the results. The closer the results, the more reliable the rubric is. More specifically, the more evaluators agree on a given criterion, the more reliable that criterion is. For example, suppose 10 raters used a rubric with 4 levels of performance (“Poor”, “Fair”, “Good” and “Excellent”) on an item and gave it the following evaluation:
| Criteria | Poor | Fair | Good | Excellent |
|----------|------|------|------|-----------|
| A        | 5    | 2    | 1    | 2         |
| B        | 0    | 6    | 3    | 1         |
| C        | 0    | 0    | 5    | 5         |
| D        | 0    | 10   | 0    | 0         |
On Criterion A the raters widely disagreed, on B there is less disagreement, on C there is more agreement, and on Criterion D the results are unanimous. So D is the most reliable and A is the least. You can easily detect reliability this way without much calculation beyond the counts.
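The same eyeballing can be turned into a simple number: the proportion of rater pairs that chose the same level for each criterion. The following is a minimal sketch of that calculation for the table above, not an iRubric feature:

```python
# Per-criterion agreement: the fraction of rater pairs that picked the same level.
# Counts are taken from the example table above (10 raters, 4 levels).
counts = {
    "A": [5, 2, 1, 2],
    "B": [0, 6, 3, 1],
    "C": [0, 0, 5, 5],
    "D": [0, 10, 0, 0],
}

for criterion, row in counts.items():
    n = sum(row)  # number of raters
    # Agreeing pairs within each level, divided by all possible rater pairs.
    agreeing_pairs = sum(c * (c - 1) for c in row)
    agreement = agreeing_pairs / (n * (n - 1))
    print(f"Criterion {criterion}: pairwise agreement = {agreement:.2f}")

# The output ranks D highest (1.00, unanimous) and A lowest (about 0.27),
# matching the visual reading of the table.
```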
There are a few statistical/computational methods for calculating inter-rater reliability (IRR) which produce a numeric value describing the degree of agreement among multiple assessors. While the details of these methods are beyond the scope of this writing, there are tools that can do the calculation for you. iRubric includes built-in functionality for IRR analysis based on assessments done in various areas of iRubric.
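For readers who want to see one such statistic worked out, here is a small sketch of Fleiss' kappa, a common chance-corrected IRR measure for multiple raters, applied to the example counts above by treating each criterion's ratings as one row of the agreement matrix. This only illustrates the general technique and does not show how iRubric computes its reports:

```python
# Fleiss' kappa: chance-corrected agreement among multiple raters.
# Each row holds the number of raters assigning that row to each category.
rows = [
    [5, 2, 1, 2],   # Criterion A
    [0, 6, 3, 1],   # Criterion B
    [0, 0, 5, 5],   # Criterion C
    [0, 10, 0, 0],  # Criterion D
]

n = sum(rows[0])   # raters per row (10)
N = len(rows)      # number of rows (4)
total = N * n      # total ratings (40)

# Observed agreement: average proportion of agreeing rater pairs per row.
P_bar = sum(
    (sum(c * c for c in row) - n) / (n * (n - 1)) for row in rows
) / N

# Expected agreement by chance, from the overall category proportions.
category_props = [sum(row[j] for row in rows) / total for j in range(len(rows[0]))]
P_e = sum(p * p for p in category_props)

kappa = (P_bar - P_e) / (1 - P_e)
print(f"Fleiss' kappa = {kappa:.2f}")  # values near 1 indicate strong agreement
```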
How to improve rubric reliability with iRubric:
A few features of iRubric can help you improve rubric reliability:
- iRubric can help you make better rubrics with higher-quality descriptors: Rubric descriptors are the hardest part of rubric development. Often, rubrics are not designed for the right audience or are designed by authors who are not part of the actual assessment. For example, certain nuances can be lost if a generic “Writing rubric” is used to evaluate “Fictional writing” pieces. A generic “Writing rubric” may have less specific – and therefore more generic and more subjective – descriptors and criteria, and may be missing some important criteria altogether. iRubric can help you build better rubrics by:
- Allowing re-use and re-purposing of existing rubrics
- Providing thousands of samples specific to your needs
- Providing “descriptor” suggestions during the development of a rubric
- iRubric Inter-Rater Reliability (IRR) reports: iRubric generates IRR reports to indicate which parts of a rubric have reliability issues.
- iRubric Summary and Detailed reports: iRubric generates easy-to-understand visualizations of each evaluation and allows comparison of multiple evaluations to detect (1) reliability issues and (2) issues where raters do not follow rubric descriptors correctly.
- iRubric Aggregate reports: Another easy way to identify rubric reliability issues is by using aggregate reports for each item where you can easily detect areas of low and high reliability.