Automated assessments are popular and cost-effective, but how do you ensure their quality? How do you know that an assessment measures the knowledge level it is supposed to measure? How do you know if your test distinguishes between competent and mediocre students?

Random selection of questions
Avoid a fixed selection of questions. If all students get the same questions, you make it easier to cheat. You risk that someone will publish correct answers on the internet. If it is possible to retake a standardized assessment, you probably measure the students' ability to memorize the correct answers rather than their knowledge.

We conducted an experiment at the Maritime Academy of Asia and the Pacific (MAAP) in the Philippines, where we compared fixed and random questions in assessments. The assessment with random questions had a database with three times as many questions as the students actually got when tested. Both assessments distinguished between competent and mediocre students the first time. The differences occurred when they did the tests for the second time. The scores of the assessments using fixed questions improved the second and third times. Mediocre students apparently evolved into experts. The assessment selecting questions randomly from a database, on the other hand, produced stable results. The students had to study to improve their scores.
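As a minimal sketch of the random-selection idea, the snippet below draws each assessment from a question bank three times the size of a single test. The bank contents, the sizes (90 and 30), and the function name `draw_assessment` are hypothetical and only serve as illustration; they are not taken from the MAAP experiment.

```python
import random

def draw_assessment(question_bank, n_questions, rng=None):
    """Return n_questions distinct items drawn at random from the bank."""
    rng = rng or random.Random()
    # random.sample draws without replacement, so no question repeats
    return rng.sample(question_bank, n_questions)

# Hypothetical bank: three times as many questions as one assessment uses
bank = [f"Q{i}" for i in range(1, 91)]

first_attempt = draw_assessment(bank, 30)
retake = draw_assessment(bank, 30)

# A retake usually shares only some questions with the first attempt
shared = set(first_attempt) & set(retake)
print(len(first_attempt), len(retake), len(shared))
```

Because every attempt is a fresh draw, memorizing the answers from one attempt covers only part of the next one.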

Item Analysis
Randomized questions help, but we need analysis to validate whether the assessments measure knowledge. A common approach is "Item Analysis," which is relatively straightforward to conduct. The method defines each question in an assessment as an item and calculates different values for each of them. A key value is the discrimination index, which reveals whether an item discriminates between high and low scorers.

The formula for the discrimination index used in item analysis is:

Ds = (Pu - Pl) /N

Ds = Discrimination index
Pu = The number of students in the upper 27 % that answered correctly
Pl = The number of students in the lower 27 % that answered correctly
N = The number of students in each of the upper and lower segments

The index compares the top and bottom 27 % of scorers and provides a value in the range between one and minus one. High values show that an item discriminates between high and low scorers in the sense that competent students have a higher average score on it than weaker students. A discrimination index close to zero indicates that the item does not discriminate between high and low scorers. A negative index value means that weak students have higher average scores than competent students on this particular item. By using the discrimination index, you can validate the performance of each question in an assessment and make adjustments accordingly.
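The calculation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the function name, the data layout (one total score and one right/wrong flag per student), and the rounding of the 27 % segment size are our own choices. The difference is divided by the number of students in one segment, which is what keeps the result between minus one and one.

```python
def discrimination_index(total_scores, item_correct, fraction=0.27):
    """Upper/lower-group discrimination index for a single item.

    total_scores: one total test score per student
    item_correct: True/False per student for this item, in the same order
    """
    n_students = len(total_scores)
    # Size of one segment (upper or lower); rounding is an assumption
    group_size = max(1, round(n_students * fraction))
    # Rank student positions by total score, best first
    ranked = sorted(range(n_students), key=lambda i: total_scores[i], reverse=True)
    upper = ranked[:group_size]
    lower = ranked[-group_size:]
    p_upper = sum(item_correct[i] for i in upper)  # correct answers, upper segment
    p_lower = sum(item_correct[i] for i in lower)  # correct answers, lower segment
    return (p_upper - p_lower) / group_size

# Made-up data for ten students
scores = [95, 90, 85, 80, 70, 65, 60, 50, 40, 30]
correct = [True, True, True, False, True, False, True, False, False, True]

d = discrimination_index(scores, correct)
print(round(d, 2))  # 0.67: three of three upper vs one of three lower
```

With ten students, each segment holds three students. All three top scorers answered correctly and only one of the three bottom scorers did, so the item discriminates well.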

However, item analysis is not compatible with random questions, as the formula presumes that all students answer the same questions. When using randomized questions, only some of the high and low scorers will get a particular item. Thus, the number of high and low scorers who answered it will be uneven in most cases, giving inaccurate index values, especially when data is limited.

We have identified two possible adjustments to make Item Analysis compatible with randomized questions:

1. Split the denominator

The simplest adjustment is to split the denominator of high and low scorers like this:

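Ds = Pu/N1 - Pl/N2

N1 = The number of students in the upper 27 % that received the item
N2 = The number of students in the lower 27 % that received the item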

2. Adjust the data set to each item

The other approach is to use only the data from the assessments that include the particular item.

There are pros and cons to both approaches, but the cons diminish with larger data sets. Both make it possible to use Item Analysis on assessments with random questions, and TERP provides both values in TERP Analytics.
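The two adjustments can be compared side by side in a short Python sketch. It reflects our reading of the descriptions above and is not TERP's actual implementation: all function names and the data layout are hypothetical. For the first adjustment, the segments are formed from all students and each segment is divided by the number of its members who received the item. For the second, we assume that only the students who received the item are ranked and split into segments.

```python
def split_by_rank(student_ids, total_scores, fraction=0.27):
    """Return (upper, lower) lists of student ids ranked by total score."""
    ranked = sorted(student_ids, key=lambda s: total_scores[s], reverse=True)
    k = max(1, round(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]

def proportion_correct(group, answers):
    """Share of correct answers among group members who received the item."""
    seen = [answers[s] for s in group if s in answers]
    return sum(seen) / len(seen) if seen else None

def split_denominator_index(total_scores, answers, fraction=0.27):
    """Adjustment 1: segments from all students, separate denominators."""
    upper, lower = split_by_rank(list(total_scores), total_scores, fraction)
    p_u = proportion_correct(upper, answers)
    p_l = proportion_correct(lower, answers)
    if p_u is None or p_l is None:
        return None  # nobody in one of the segments received the item
    return p_u - p_l

def item_subset_index(total_scores, answers, fraction=0.27):
    """Adjustment 2: rank only the students who received the item."""
    if not answers:
        return None
    upper, lower = split_by_rank(list(answers), total_scores, fraction)
    return proportion_correct(upper, answers) - proportion_correct(lower, answers)

# Made-up data: ten students, seven of whom received this item
total_scores = {"s1": 95, "s2": 90, "s3": 85, "s4": 80, "s5": 70,
                "s6": 65, "s7": 60, "s8": 50, "s9": 40, "s10": 30}
answers = {"s1": True, "s3": True, "s4": False, "s5": True,
           "s8": False, "s9": True, "s10": False}

split_value = split_denominator_index(total_scores, answers)
subset_value = item_subset_index(total_scores, answers)
print(round(split_value, 2), round(subset_value, 2))  # 0.67 0.5
```

The two values differ on this small data set because the segments contain different students. With more data, the segments overlap more and the values tend to move closer together.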

Item Response Theory (IRT)
Item Response Theory is a more sophisticated analysis to validate assessments. While Item Analysis examines the item scores, IRT provides an estimate of students' underlying distribution of ability. The basic idea is to estimate the probability that a student provides a correct response to items presented in an assessment.

Using IRT, you get item response functions that illustrate the properties of items (see the illustration below).

The range used for the difficulty and discrimination indexes usually lies between three and minus three. The more difficult an item is, the more the function curve moves to the right. The more an item discriminates between high and low scorers, the steeper the curve. The red lines show how students with different abilities are expected to score on the item.
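To make the shape of such a curve concrete, here is a minimal Python sketch of an item response function. It assumes the two-parameter logistic (2PL) model, which is one common IRT model; the text above does not say which model is used, so the formula and the parameter names `difficulty` and `discrimination` are assumptions for illustration.

```python
import math

def item_response(theta, difficulty, discrimination):
    """Probability of a correct answer under an assumed 2PL logistic model.

    theta: student ability
    difficulty: ability level at which the probability is 50 %
    discrimination: steepness of the curve around the difficulty point
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

# Print a few points of two hypothetical curves over abilities -3..3
for theta in range(-3, 4):
    easy = item_response(theta, difficulty=-1.0, discrimination=1.0)
    hard_steep = item_response(theta, difficulty=1.0, discrimination=2.0)
    print(theta, round(easy, 2), round(hard_steep, 2))
```

In this model, raising the difficulty shifts the curve to the right, and raising the discrimination makes it steeper around the difficulty point, which matches the behavior described above.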

Compared to Item Analysis, IRT provides more precise data. On the other hand, it is much more complicated to calculate. It also requires more skill from the professionals using the data.

Item Analysis or IRT?
Which tool to use when validating and improving assessments depends both on internal resources and external requirements. Some assessments are subject to audits, and the auditing authority may have preferences for a specific method.

TERP has decided to provide values using both Item Analysis and IRT. Subscribers to the TERP Web Service will have access to the relevant data on all operative assessments and can choose what they want to use when validating and making improvements.