References

Ilgen JS, Ma IW, Hatala R, Cook DA. A systematic review of validity evidence for checklists versus global rating scales in simulation-based assessment. Med Educ. 2015; 49:(2)161-173 https://doi.org/10.1111/medu.12621

Joyce CM, Wainer J, Piterman L, Wyatt A, Archer F. Trends in the paramedic workforce: a profession in transition. Aust Health Rev. 2009; 33:(4)533-540 https://doi.org/10.1071/AH090533

Maurin Söderholm H, Andersson H, Andersson Hagiwara M et al. Research challenges in prehospital care: the need for a simulation-based prehospital research laboratory. Adv Simul (Lond). 2019; 4 https://doi.org/10.1186/s41077-019-0090-0

McKenna KD, Carhart E, Bercher D, Spain A, Todaro J, Freel J. Simulation use in paramedic education research (SUPER): a descriptive study. Prehosp Emerg Care. 2015; 19:(3)432-440 https://doi.org/10.3109/10903127.2014.995845

McLaughlin K, Ainslie M, Coderre S, Wright B, Violato C. The effect of differential rater function over time (DRIFT) on objective structured clinical examination ratings. Med Educ. 2009; 43:(10)989-992 https://doi.org/10.1111/j.1365-2923.2009.03438.x

Tavares W, Boet S, Theriault R, Mallette T, Eva KW. Global rating scale for the assessment of paramedic clinical competence. Prehosp Emerg Care. 2013; 17:(1)57-67 https://doi.org/10.3109/10903127.2012.702194

Tavares W, LeBlanc VR, Mausz J, Sun V, Eva KW. Simulation-based assessment of paramedics and performance in real clinical contexts. Prehosp Emerg Care. 2014; 18:(1)116-122 https://doi.org/10.3109/10903127.2013.818178

Thompson, Houston. Improving simulation through student assessment partnerships. 2021. https://tinyurl.com/yck88dcs (accessed 12 June 2022)

Tsai BM, Sun JT, Hsieh MJ et al. Optimal paramedic numbers in resuscitation of patients with out-of-hospital cardiac arrest: a randomized controlled study in a simulation setting. PLoS One. 2020; 15:(7) https://doi.org/10.1371/journal.pone.0235315

Van Dillen CM, Tice MR, Patel AD et al. Trauma simulation training increases confidence levels in prehospital personnel performing life-saving interventions in trauma patients. Emerg Med Int. 2016; 2016 https://doi.org/10.1155/2016/5437490

Williams B, Abel C, Khasawneh E, Ross L, Levett-Jones T. Simulation experiences of paramedic students: a cross-cultural examination. Adv Med Educ Pract. 2016; 7:181-186 https://doi.org/10.2147/AMEP.S98462

Yeates P, Cope N, Hawarden A, Bradshaw H, McCray G, Homer M. Developing a video-based method to compare and adjust examiner effects in fully nested OSCEs. Med Educ. 2019; 53:(3)250-263 https://doi.org/10.1111/medu.13783

Differential rater function over time (DRIFT) during student simulations

02 July 2022
Volume 12 · Issue 2

Abstract

Background:

The field of paramedicine continues to advance in scope. Simulation training is frequently used to teach and evaluate students. Simulation examinations are often evaluated using a standardised global rating scale (GRS) that is reliable and valid. However, differential rater function over time (DRIFT) has not been evaluated when using the GRS during simulations.

Aims:

This study aimed to assess whether DRIFT arises when the GRS is applied.

Methods:

Data were collected at six simulation evaluations. Raters were randomly assigned to evaluate several students at the same station. Each station lasted 12 minutes and there was a total of 11 stations. A background model of the scores was created and tested against both a leniency and a perceptual DRIFT model.

Findings:

Of the models explored, one that included the student, the rater and the dimension had the greatest evidence (log-evidence −3151). This model was then tested against the leniency (K=−9.1 dHart) and perceptual (K=−7.1 dHart) DRIFT models. This constitutes substantial evidence against DRIFT; however, the tested models used wide parameter priors, so the possibility of a minor effect is not fully excluded.

Conclusion:

DRIFT was not found; however, further studies with multiple centres and longer evaluations should be conducted.

The field of paramedicine has evolved considerably over the last three decades. Originally, the role of a paramedic was restricted to patient transportation from the field to a nearby hospital with little medical intervention along the way. Today, paramedics are skilled clinicians who are capable of performing advanced medical and procedural interventions in a prehospital setting (Joyce et al, 2009; Williams et al, 2016).

To ensure that paramedic students are ready to enter the workforce, academic institutions have been adopting novel techniques (Thompson and Houston, 2021).

Simulation training is a hands-on, practical teaching method that is increasingly used within teaching institutions (McKenna et al, 2015; Van Dillen et al, 2016; Tsai et al, 2020; Thompson and Houston, 2021). It has been established that simulation training is a critical means of improving medical education. Students exposed to simulation acquire more competence and expertise in practical skills than those whose curriculum does not require simulation training (Van Dillen et al, 2016).

As simulation is required in paramedic education, its quality should be investigated and assessed on an ongoing basis to ensure that students are achieving the most from simulation training. It is also imperative that the methods used for evaluating simulations are deemed fair (Williams et al, 2016).

In the domain of paramedicine, students typically have their simulation scenarios evaluated using a validated assessment tool known as the global rating scale (GRS) (Tavares et al, 2013). The GRS is a seven-dimension scale used to assess the competency of paramedic students in situational awareness, history taking, patient assessment, decision making, resource use, communication and procedural skills (Tavares et al, 2013). Each of these seven dimensions is measured on a scale ranging from 1 (unsafe) to 7 (exceptional) (Tavares et al, 2014).

The GRS is valid and possesses high inter-rater reliability when it is used to evaluate paramedics in training. Additionally, there is evidence to suggest that the scores achieved on a GRS in simulation are transferable to abilities in a real-world clinical context (Tavares et al, 2013; 2014).

Although the GRS is considered the gold standard for paramedic student evaluation, to the best of the authors' knowledge, no one has explored the effect of differential rater function over time (DRIFT) on the grades awarded on the GRS (Ilgen et al, 2015). The concept of DRIFT has been demonstrated in other areas of medical education and typically arises because of increasing leniency as a result of rater fatigue (McLaughlin et al, 2009).

Fairness in assessment is crucial to education. Additionally, in a domain such as paramedicine, it is important that evaluation is standardised as public health and safety could otherwise become compromised (Yeates et al, 2019).

It is crucial for student success and for public safety to ensure that the evaluation of paramedic performance is accurate and fair. Therefore, the primary purpose of this study was to explore if a DRIFT phenomenon occurred during multiple GRS evaluations by paramedic raters.

Methods

Study design

This was a cohort study. A group of raters of paramedic students was selected by convenience sampling and followed over time across the GRS stations. No intervention was given to any participants.

Ethics

The research was approved by the Collège Boréal research ethics board. Following ethical approval, consent was acquired from each participant. To ensure that the participants were blind to the study, consent forms were signed following the examination period.

Setting

The data were collected at Collège Boréal, Sudbury, Ontario, Canada, in the 2020–2021 academic year during the six practical simulation evaluations for first-year paramedic students. A total of 12 raters took part in GRS evaluations during the academic year.

The study examined rater assessments during practical paramedic student examinations using the GRS, which was applied at 11 stations, each lasting 12 minutes. Student examinees were randomly assigned a starting station and moved through the GRS circuit in the same order. Each rater scored every student examinee at the same station and was given a single rest period during the practical examination, which was randomly assigned and staggered across raters.

Participants/population

The raters were employed paramedics who held the Advanced Emergency Medical Care Assistant (AEMCA) qualification and were legally allowed to work as paramedics in the province of Ontario, Canada. Additionally, the raters were staff at Collège Boréal.

Each rater was familiar with the GRS evaluation tool and had been involved in evaluations before. The examinees were first- and second-year paramedic students who were enrolled at the college and were participating in the practical assessments required by the curriculum that take place three times per semester.

Outcome measures

The primary outcome measure was whether DRIFT occurred during the evaluations.

Data analysis

Modelling the scores for testing for DRIFT

To model the scores for testing for DRIFT, the probability of a student achieving a given score on a particular task was approximated. The score was assumed to result from the cumulative completion of six independent tasks, each with the same probability of completion, so the number of completed tasks follows a binomial distribution with a fixed size (n=6) and an adjustable parameter P (the probability of completing each task).
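
As an illustration, the R sketch below computes the probability of each GRS score under this binomial model. The mapping of a score to one plus the number of completed tasks is an assumption, consistent with the worked figures reported in the Findings.

# Minimal sketch of the binomial score model described above.
# Assumption: GRS score = 1 + number of completed tasks (0-6), so scores run from 1 to 7.
score_probabilities <- function(P, n_tasks = 6) {
  successes <- 0:n_tasks
  setNames(dbinom(successes, size = n_tasks, prob = P),
           paste0("score_", successes + 1))
}

score_probabilities(0.5)   # a mid-range student: probabilities centred on a score of 4
score_probabilities(5/6)   # a strong student: probabilities centred on a score of 6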

To model the effect of other factors on P, the authors considered that it depends on several random factors:

  • The student: different students might score differently
  • The evaluator: different evaluators might evaluate differently
  • The dimension of the evaluation (which area of the GRS, e.g. communication): some dimensions might be more difficult than others
  • The category of the task (type of scenario, e.g. trauma): some tasks might be more difficult
  • The date: students might perform better or worse on successive trials.

In addition, interactions were considered: students may vary in terms of which dimension and category they perform best in. A sketch of one possible model structure is given below.
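
To make this structure concrete, the following R sketch shows one way such a hierarchical binomial model could be specified using the brms package. This is not the authors' code (which is available on request); the simulated data frame and the variable names (grs_data, successes, student, rater, dimension) are illustrative assumptions.

library(brms)

# Simulated illustrative data: one row per student x rater x dimension observation,
# with 'successes' = GRS score - 1 (number of completed tasks, 0-6).
set.seed(1)
grs_data <- expand.grid(student = factor(1:20), rater = factor(1:6), dimension = factor(1:7))
grs_data$successes <- rbinom(nrow(grs_data), size = 6, prob = 0.6)

# Hierarchical binomial model with random effects for student, rater and dimension
fit <- brm(
  successes | trials(6) ~ 1 + (1 | student) + (1 | rater) + (1 | dimension),
  family = binomial(link = "logit"),
  data = grs_data,
  save_pars = save_pars(all = TRUE)  # retain draws needed for bridge sampling
)

# Log marginal likelihood (model evidence), which can be compared across
# candidate models to compute Bayes factors
bridge_sampler(fit)
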
Modelling DRIFT

To model DRIFT, two mechanisms by which fatigue might affect the raters were considered.

The first was the leniency DRIFT model: the fatigue generated by successive evaluations translates into a systematic change in the leniency or strictness of the evaluations. In this model, the same evaluator would consistently drift towards either higher or lower values.
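
A simple way to express this mechanism is to let the logit of P shift linearly with the number of tests a rater has already scored, as in the R sketch below; the drift rate shown is an illustrative value, not an estimate from the study.

# Sketch of the leniency DRIFT mechanism: logit(P) shifts linearly with the
# number of tests already scored. 'drift_per_test' is an illustrative value.
expected_score_leniency <- function(P_true, test_number, drift_per_test) {
  P_drifted <- plogis(qlogis(P_true) + drift_per_test * test_number)
  1 + 6 * P_drifted   # expected GRS score under the drifted probability
}

# A rater drifting towards leniency (+0.05 logits per test) scoring a student
# who deserves a 4 (P_true = 0.5):
expected_score_leniency(0.5, test_number = 0:10, drift_per_test = 0.05)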

The second mechanism is a perceptual failure model, which assumes that fatigue increases the error rate of the evaluations.

Two cases were therefore distinguished and two probabilities defined:

  • Alpha (type I error): the rater considers that the student failed a particular objective when, in reality, the student's performance was successful
  • Beta (type II error): the rater considers that the student succeeded when, in reality, the student failed.

In this model, the authors consider that the probability of error increases linearly with the number of tests performed. In this case, the dispersion of scores would converge over time towards an equilibrium score, irrespective of the performance of the student; a sketch of this mechanism follows.
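
The R sketch below illustrates this mechanism with linearly growing error rates; the per-test rates used are illustrative assumptions, not estimates from the study.

# Sketch of the perceptual failure mechanism: alpha (type I) and beta (type II)
# error rates grow linearly with the number of tests, pulling observed scores
# towards the middle of the scale.
expected_score_perceptual <- function(P_true, test_number,
                                      alpha_per_test, beta_per_test) {
  alpha <- alpha_per_test * test_number   # P(marked failed | actually succeeded)
  beta  <- beta_per_test  * test_number   # P(marked succeeded | actually failed)
  P_observed <- P_true * (1 - alpha) + (1 - P_true) * beta
  1 + 6 * P_observed   # expected observed GRS score
}

# Strong (P = 5/6) and weak (P = 1/6) students both drift towards the centre:
expected_score_perceptual(5/6, 0:10, alpha_per_test = 0.013, beta_per_test = 0.019)
expected_score_perceptual(1/6, 0:10, alpha_per_test = 0.013, beta_per_test = 0.019)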

The models were designed to take these factors into account. The models and calculations were produced using RStudio software and are available upon request.

The Bayes factor K for each comparison was calculated from the difference in log-evidence between the two models being compared (i.e. the logarithm of the ratio of their evidences) and expressed in deciHartleys (dHart). Values of K >20 dHart were considered decisive, values >10 strong and values >5 substantial.
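
The conversion can be written as K = 10 log10(BF). Assuming the reported evidence values are natural-log evidences (an interpretation that reproduces the tabulated K values), the R sketch below recovers one of the model comparisons reported with the findings.

# Convert a difference in natural-log evidence into a Bayes factor in dHart:
# K = 10 * log10(BF) = 10 * (logZ_model - logZ_best) / log(10)
bayes_factor_dHart <- function(log_evidence_model, log_evidence_best) {
  10 * (log_evidence_model - log_evidence_best) / log(10)
}

bayes_factor_dHart(-3164, -3151)   # student + rater vs best model: about -56 dHart
                                   # (the table reports -55.8; the difference reflects rounding)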

Findings

Models tested

Among the models explored, the one consisting of the student, the rater and the dimension had the greatest evidence (log-evidence −3151). Removal of any one of those factors, or inclusion of the time of year, the category or an interaction between student and dimension, resulted in a decrease in the evidence. Therefore, the authors used this background model (effects of student, evaluator and dimension) when testing the leniency and perceptual failure models.

Leniency model

The posterior distribution of the rate of increase in score with successive tests (expressed as a logit) was analysed. In this model, all evaluators would systematically change their scores over time. The prior distribution allowed for either leniency (an increase in scores) or strictness (a decrease in scores).

The posterior distribution was centred on zero DRIFT and a large compression of the distribution is apparent, indicating that the study restricted the possible change in scores to a small range. After 10 tests, the 1–99% credibility interval (CI 1–99) for the logit of P was (−0.11, 0.18). In terms of score, a student deserving a 4 would, after 10 tests, receive a score within the CI 1–99 of 3.83–4.27. Therefore, the DRIFT would be <0.3 points.
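
As a check on the arithmetic, the reported logit interval maps back onto the score scale as follows, assuming (as above) that a deserved score of 4 corresponds to P = 0.5.

# Map the 1-99% credibility interval for the drift in logit(P) after 10 tests
# onto the expected score of a student who deserves a 4 (P = 0.5)
logit_drift_CI <- c(-0.11, 0.18)
1 + 6 * plogis(qlogis(0.5) + logit_drift_CI)   # approximately 3.83 to 4.27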

Perceptual failure model

This model uses two parameters: alpha for the increase in type I errors, and beta for the increase in type II errors.

In the model, the authors assume a perfect evaluator at time 0, with an increasing rate of mistakes on the following tests. At the 10th test, the posterior distributions of alpha and beta are reduced compared with the prior. The CI 1–99 for alpha (0.007–0.13) and beta (0.13–0.19) do not exclude the possibility of a small error rate in scoring. This error would lead to a decrease in the frequency of very high and very low scores.

To show this effect, the authors tested how individuals with either P=5/6 or P=1/6 would evolve with successive tests. In the first case, a score with a CI of 5.95–6.00 at the first test would fall to a score with a CI of 5.51–6.02. In the second case, a score with an interval of 2.04–2.10 would reach values of 2.04–2.84. Therefore, it is possible that some perceptual fatigue could lead to a rise of up to 0.84 points on the GRS. There is also the prospect of students being failed because they are wrongly given a low score; this possibility cannot be discarded.
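
As a rough illustration of these magnitudes, applying the upper ends of the reported intervals for alpha and beta to the expected score gives values close to those above; this is an expected-value calculation, not a reproduction of the full posterior analysis.

# Expected observed score = 1 + 6 * (P*(1 - alpha) + (1 - P)*beta), using the
# upper ends of the reported intervals (alpha = 0.13, beta = 0.19) at the 10th test
1 + 6 * (5/6 * (1 - 0.13) + 1/6 * 0.19)   # about 5.5 for a strong student (down from 6)
1 + 6 * (1/6 * (1 - 0.13) + 5/6 * 0.19)   # about 2.8 for a weak student (up from 2)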

Discussion

The primary objective of this study was to detect whether there was a DRIFT in the ability of the raters over successive tests during the day, which is essentially a fatigue effect.


Model tested                                   Log-evidence    K (dHart)
All students equal                             −3247           −419***
Student                                        −3228           −333***
Student + rater                                −3164           −55.8***
Student + rater + category                     −3171           −87***
Student + rater + dimension                    −3151           0
Student + rater + dimension + date of year     −3160           −39.5***
Student + rater + dimension × student          −3169           −77.9***

*** Indicates a decisive Bayes factor

This study did not demonstrate that a DRIFT exists, although the possibility of a small effect cannot be completely discarded. However, the results put clear limits on the likelihood of this effect: there is a 1% chance that the effect is ≥0.8 points after 10 tests. Therefore, the results suggest that a substantial DRIFT does not exist among paramedic raters when evaluating paramedic students' simulations using the GRS; substantial evidence against both the leniency (K=−9.1 dHart) and the perceptual (K=−7.1 dHart) models was established.

This study helps to validate previous research by providing further support that the GRS possesses good inter- and intra-rater reliability. The finding that minimal (if any) DRIFT exists adds another positive element to the already favoured GRS. This is important as it supports the increasingly popular GRS as a fair tool for simulation assessment. Additionally, there is an established push to conduct more research on prehospital simulation, and this study adds to the existing literature (Maurin Söderholm et al, 2019).

Although the primary purpose of the study was to evaluate DRIFT, it also revealed some other noteworthy findings. For example, five factors (student, evaluator, dimension of the evaluation, category of the task and time of year) were analysed, and the first three were found to have predictive power regarding the score whereas the other two did not.

In addition, the accumulation of experience gained on successive tests while training did not translate into improved scores. Possible explanations for this puzzling lack of effect include: the duration of the study was not long enough to see an increase in student performance; the students had already plateaued in their ability to perform the tasks; or the evaluators changed their ratings as the year progressed. Further research should be conducted to better understand this issue.

The findings of this study are at odds with other findings in medical simulation (McLaughlin et al, 2009; Maniar, 2016; Yeates et al, 2019). However, those studies examined different populations (medical residents), had a different type of rater (medical doctors) and used a different tool for assessment (the objective structured clinical examination), so it may be difficult to draw conclusions or parallels from this.

To the best of the authors' knowledge, this was the first study to investigate paramedic raters and the possibility of DRIFT using the GRS. The findings have implications as an increasing number of academic institutions and paramedic services are using the GRS to evaluate paramedic students or when hiring.

This study provides further evidence that using the GRS is fair and that DRIFT does not have a substantial effect on results.

Although no DRIFT was found or established, future studies should be conducted involving multiple academic centres and more GRS simulation stations or evaluations to confirm that this remains true beyond 11 stations.

Limitations

There are a few potential limitations to this study that must be taken into consideration when interpreting the findings.

First, this study is limited by the fact that data were collected at one institution (Collège Boréal) during one academic year (2020–2021) and from one cohort of raters.

Second, this study was limited to a maximum of 11 stations, so a DRIFT effect could exist beyond this.

Additionally, all raters received one randomised rest station during the course of the day, which may have reduced any fatigue effect.

Conclusion

As the profession of paramedicine continues to evolve and more simulation training is implemented, it is important to evaluate the curriculum, including the tools used to evaluate students (Joyce et al, 2009; Williams et al, 2016).

The primary purpose of this study was to explore whether a DRIFT phenomenon, essentially a fatigue effect, occurred in raters during successive testing over the day when rating with the GRS. This study did not demonstrate that a DRIFT exists, but the possibility of a small effect cannot be discarded. As a result, this study adds to the evidence that the GRS is an effective and valid means of evaluating paramedic simulations. However, further multicentre studies with greater numbers of simulation stations should be conducted.

Key Points

  • Simulation training is becoming an increasingly popular method of training paramedic students and recruits
  • The global rating scale (GRS) has become a gold standard in evaluating simulations
  • This study found no differential rater function over time (DRIFT) during the use of the GRS during simulation evaluations and provides further evidence to back up the GRS as the gold standard in paramedic simulation evaluation
  • As the possibility of a small effect cannot be discarded, further multicentre trials should be conducted in the future
CPD Reflection Questions

  • What role do you think simulation training and assessment will play in paramedicine moving forward?
  • If the global rating scale (GRS) were used in the UK to assess paramedic simulations, what would its benefits be?
  • This study demonstrated a lack of differential rater function over time in the GRS. What do you believe would be the outcome of a study using the tool currently used at your institution?