Detection of Patients at High Risk of Medication Errors: Development and Validation of an Algorithm
Abstract
Medication errors (MEs) are preventable and can result in patient harm and increased healthcare expenses through hospitalization, prolonged hospitalization and even death. We aimed to develop a screening tool to detect acutely admitted patients at low or high risk of MEs, comprising items found by literature search and combined by theoretical weighting. Predictive variables used for the development of the risk score were found by the literature search. Three retrospective patient populations and one prospective pilot population were used for modelling. The final risk score was evaluated for precision by the use of sensitivity, specificity and area under the ROC (receiver operating characteristic) curves. The variables used in the final risk score were reduced renal function, the total number of drugs and the risk of individual drugs to cause harm and drug–drug interactions. We found a risk score in the prospective population with an area under the ROC curve of 0.76. The final risk score was found to be quite robust as it showed an area under the ROC curve of 0.87 in a recent patient population, 0.74 in a population of internal medicine and 0.66 in an orthopaedic population. We developed a simple and robust score, MERIS, with the ability to detect patients and divide them according to low and high risk of MEs in a general population admitted to an acute admissions unit. The accuracy of the risk score was at least as good as other models reported using multiple regression analysis.
Medication errors (MEs) are preventable; however, they can result in patient harm and in increased expenses in the healthcare system in terms of hospitalization, prolonged hospitalization and even death 1-8. In a review from 2007, adverse reactions (ARs) were found to occur in 6.1% of all hospitalizations. Of these, 46% were preventable, that is, MEs 6. MEs are considered preventable ARs, in contrast to non-preventable ARs that develop despite correct use of drugs.
Screening interventions are designed to identify disease and thereby enable earlier intervention and management in an attempt to reduce mortality and morbidity 9. The idea of creating a screening tool, an algorithm or a clinical decision rule to capture patients with ARs in general is not new 10-15. Different approaches have been used to develop tools to assist risk assessment or to capture ARs. A review from 2012 evaluated electronic AR detection using electronic triggers 12. The authors found a wide variety of detection rules, with a large variation in sensitivity (40–94%) and specificity (1.4–89.8%), making it difficult to draw conclusions about the overall efficacy of electronic detection.
The object of creating algorithms is described in detail by Streiner and Norman 16. The items used for an algorithm can be identified by either basic research or review of the literature. In both cases, the algorithm comprises items which have been shown empirically to be the characteristics of the specific group of people in question. Once the items have been located, the scale can be constructed in different ways. One method is to ‘weight’ each variable differently in its contribution to the total score by the use of either theoretical or empirical weighting 16.
To our knowledge, all prior studies have used empirical weighting, in terms of multiple regressions to determine the correlation between risk factors and patients' risk of ARs, in order to establish relevant risk factors for a clinical decision rule or algorithm 10, 11, 13-15. In addition, no previous studies have to our knowledge examined the prediction of MEs in patients based on items found by review of the literature or developed an algorithm by the use of theoretical weighting.
We aimed to design a simple Medicine Risk Score, MERIS, to identify patients who are at increased risk of MEs. The risk score is supposed to correctly allocate acutely admitted patients into low risk and high risk by predefined detection limits.
Materials and Methods
Five consecutive steps (literature search, Delphi process, construction of algorithm, calibration of algorithm, prospective pilot study) were used to develop the algorithm. Figure 1 illustrates the overall study design.

The outcome measure was the number of MEs. Many definitions of MEs are used in previous studies 17. We used the definition by Lisby et al. who define MEs as ‘errors in the stages of the medication process – ordering, dispensing, administering and monitoring the effect – causing harm or implying a risk of harming the patient’ 18. This definition requires actual harm or a risk of harming the patient. For the purpose of this work, harm was defined according to the WHO (World Health Organization) seriousness criteria, which is ‘A serious adverse reaction corresponds to any untoward medical occurrence that at any dose results in death, is life-threatening, requires inpatient hospitalization or prolongation of existing hospitalization, results in persistent or significant disability or incapacity, is a congenital anomaly/birth defect’ 19, 20. There is no true gold standard available to confirm the presence of a serious ME. The approach used in this work was to seek consensus between two different evaluators, who were blinded to the patients' risk scores. The question to be answered by the evaluators when evaluating a ME was: ‘is the patient likely to suffer from death, life-threatening events, hospitalization or prolongation of existing hospitalization, persistent or significant disability or incapacity sometime in the future if this error is not corrected?’
Two literature searches and a Delphi process were performed to determine the predictive variables to be included in the algorithm. The complete descriptions of these procedures can be found elsewhere 21-23. The final variables were divided into drug-related risk factors consisting of individual drugs with a high risk of harming a patient, and the patient-related risk factors. The drug-related risk factors included six lists of drugs with three levels of harm and three levels of drug–drug interactions that are presented in Table S1 21, 22. The patient-related risk factors to be tested were determined as follows:
- Number of drugs prescribed: 1–2, 3–7, ≥8
- Age in years distributed in the following age groups: 18–59, 60–69, 70–79, 80–89, 90–99 or continuous age 1–100
- Co-morbidity: ‘yes’, ‘no’ or Charlson Comorbidity Index
- Renal function: eGFR > 60, 60 ≥ eGFR > 30 or eGFR ≤ 30 (mL/min/1.73 m²).
The simple ‘yes’ or ‘no’ scoring for co-morbidity was investigated to account for an often incomplete registration of co-morbidity in the electronic medical records.
Two historic patient populations, a recent patient population and a prospective patient population were used for modelling. Demography of the populations is shown in table 1. The two historic populations came from two different clinical trials 24, 25. One population was acutely admitted orthopaedic (ORTH) patients and the other was acutely admitted patients from a department of internal medicine (INTM). The original purpose of those studies was to investigate the effect of medication review in a randomized and blinded trial design. The original inclusion criteria for the two populations were non-elective patients, aged 65 years or older (ORTH), treated with at least four drugs at the time of admission, or 70 years or older (INTM), and an expected in-hospital length of stay of minimum 24 hr. The orthopaedic population was included because its medication profile might differ from that of the INTM patients, which could broaden the potential use of the score.
Population | ORTH | INTM | Recent | Prospective |
---|---|---|---|---|
Patients (n) | 53 | 50 | 146 | 53 |
Age > 80 years/n | 30 (57%) | 26 (52%) | 49 (34%) | 28 (53%) |
No. of drugs ≥ 8/n | 49 (92%) | 39 (78%) | 59 (40%) | 29 (55%) |
MEs/patient/mean (range) | 0.36 (0–2) | 0.2 (0–2) | 0.35 (0–5) | 1.1 (0–6) |
Patients with MEs/n | 15 (28%) | 9 (18%) | 35 (24%) | 33 (62%) |
- ORTH, orthopaedic population; INTM, internal medicine population; RECENT, patient population from April 2012; PROSPECTIVE, population from a prospective pilot study; n, numbers; MEs, medication errors.
Suicidal patients, terminal patients and patients unable to give written consent were excluded. Data from the ORTH population were complete, while there was no information concerning renal function in the INTM population. MEs in the two populations were determined by two experts, a specialist of clinical pharmacology (LPN) and a specialist in INTM and endocrinology (JR). In cases of disagreement between the two specialists, consensus was reached at face-to-face meetings.
The recent patient population was all patients hospitalized at the Medical Admissions Unit at Aarhus University Hospital during April 2012 (table 1). The unit was a 16-bed ward with a multidisciplinary intake of patients (endocrinology, respiratory medicine, gastroenterology, hepatology and cardiology). Patients eligible for inclusion were patients aged 18 years or older who received at least one drug prior to admission. Patients who were considered suicidal, terminal or intoxicated were excluded. The medical records, lists of medication and laboratory data were used for the evaluation of MEs. MEs in this population at the time of admission were determined by two experts, a specialist of clinical pharmacology from the research group (LPN) and a specialist of INTM and endocrinology (JR). In cases of disagreement between the two specialists, consensus was reached at face-to-face meetings.
The prospective pilot study was performed at the same Acute Admissions Unit at Aarhus University Hospital in January 2013. Inclusion and exclusion criteria were identical to the criteria used in the retrospective study (table 1). The study details have been published elsewhere 26. Table 1 shows that the number of medication errors per patient was higher in the prospective population because medication errors were accounted for during the entire hospitalization.
In brief, patients were assigned a risk score by the use of the algorithm. The score was based on information obtained from the patients' electronic medical records. After discharge, all patients had their medical records, medication lists and laboratory test results reviewed to determine the ability of the score to capture patients with MEs. The assessment consisted of two steps: firstly, a clinical pharmacologist (LVA) identified all potential MEs for the entire hospitalization. Secondly, two other clinical pharmacologists (EAS, LPN) independently assessed whether the errors could be classified as MEs according to the definition by Lisby et al. 18. In case of disagreement between the clinical pharmacologists, consensus had to be reached. The three reviewers were blinded to the patients' risk scores. The resulting list of MEs was used as the gold standard for the evaluation of the sensitivity and specificity of the algorithm.
The predictive variables were theoretically weighted in a relatively simple manner. Overall, drugs causing harm were weighted more heavily than drugs causing interactions, by 2:1, based on the assumption that interactions are not necessarily harmful, while the four patient-related risk factors were given equal weight (1:1) relative to each other. The resulting three basic models are shown in fig. 2 (top).


Three basic models were evaluated in which patient-related risk factors and drug-related risk factors were weighted by 1/3, 1/2 and 2/3, respectively (fig. 2, top). Within the three basic models, different scores were assigned to the different levels of cut-off. A higher score was assigned to the following:
- drugs with higher risk of interaction or harm according to the three risk levels of low, medium and high risk
- increasing numbers of prescribed drugs
- increasing age
- increasing number of concomitant diseases
- decreasing kidney function.
Figure 2 (bottom) shows the steps of modelling. The sensitivity and specificity for different detection limits dividing the patients into high-risk and low-risk groups, respectively, were calculated for each model to find the model and detection limit with the highest sensitivity and specificity. The best models found in the ORTH population, which included information on renal function, were tested in the INTM population.
The three algorithms with the highest sensitivity and specificity in the historic populations were evaluated retrospectively in the recent patient population hospitalized at the Acute Admissions Unit at Aarhus University Hospital with the purpose of adjusting the risk score further. Once again, the sensitivity and specificity at different detection limits were calculated to evaluate the risk score with the highest precision.
The final risk score was validated in the prospective pilot study. The sensitivity and specificity of the score were calculated, and the result was considered for precision allowing for further calibration. To evaluate the predictive ability of the final risk score, receiver operating characteristic (ROC) curves were constructed, and areas under the curve (AUC) were calculated.
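The evaluation described above can be illustrated with a minimal sketch. It assumes that a patient is classed as high risk when the score is at or above the detection limit (the text does not state whether the limit is inclusive), and it computes the AUC as the probability that a randomly chosen patient with an ME scores higher than one without, which is equivalent to the area under the empirical ROC curve. The function names and the toy data are hypothetical and are not study data.

```python
from typing import Sequence, Tuple


def sens_spec(scores: Sequence[float], has_me: Sequence[bool],
              limit: float) -> Tuple[float, float]:
    """Sensitivity and specificity when 'score >= limit' flags high risk.

    ASSUMPTION: the detection limit is inclusive.
    """
    pairs = list(zip(scores, has_me))
    tp = sum(1 for s, y in pairs if y and s >= limit)
    fn = sum(1 for s, y in pairs if y and s < limit)
    tn = sum(1 for s, y in pairs if not y and s < limit)
    fp = sum(1 for s, y in pairs if not y and s >= limit)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(scores: Sequence[float], has_me: Sequence[bool]) -> float:
    """AUC as the fraction of (ME, no-ME) patient pairs ranked correctly.

    Ties between a patient with and without an ME count as one half.
    """
    pos = [s for s, y in zip(scores, has_me) if y]
    neg = [s for s, y in zip(scores, has_me) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, for illustration only.
toy_scores = [2.0, 5.5, 7.0, 11.0, 13.5, 15.0, 18.0, 21.0]
toy_has_me = [False, False, True, False, False, True, True, True]

for limit in (7, 13, 15):
    sens, spec = sens_spec(toy_scores, toy_has_me, limit)
    print(f"limit {limit}: sensitivity {sens:.2f}, specificity {spec:.2f}")
print(f"AUC {roc_auc(toy_scores, toy_has_me):.3f}")
```

Scanning several detection limits in this way shows the trade-off the modelling steps had to balance: a lower limit raises sensitivity at the cost of specificity and of more medication reviews.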
The evaluation of MEs is based on medical professionalism, and like other diagnostic tests, it relies to some degree on a subjective interpretation by the observers. Because observers sometimes agree or disagree by chance, Cohen's kappa coefficient, a widely used method for assessing inter-rater reliability, was used to assess the inter-rater reliability between evaluators of MEs in the prospective trial. Kappa values less than 0.20 were considered as poor agreement, between 0.21 and 0.40 as fair agreement, between 0.41 and 0.60 as moderate agreement, between 0.61 and 0.80 as good agreement and between 0.81 and 1.00 as very good agreement 27. As Cohen's kappa coefficient is criticized by many authors for having well-documented statistical problems, an alternative inter-rater reliability statistic was used as well, namely the ‘first-order agreement coefficient’, also called the AC1 statistic 28.
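The two agreement statistics can be sketched for the simplest case of two raters and a yes/no judgement. The sketch below uses the standard definitions: both coefficients have the form (observed agreement minus chance agreement) divided by (1 minus chance agreement), and differ only in how chance agreement is estimated. The function name and the 2×2 counts are hypothetical; the study's own rating table is not reported in the text.

```python
from typing import Tuple


def agreement_stats(both_yes: int, only_r1_yes: int,
                    only_r2_yes: int, both_no: int) -> Tuple[float, float, float]:
    """Observed agreement, Cohen's kappa and Gwet's AC1 for a 2x2 table."""
    n = both_yes + only_r1_yes + only_r2_yes + both_no
    observed = (both_yes + both_no) / n

    r1_yes = (both_yes + only_r1_yes) / n   # rater 1 marginal for 'yes'
    r2_yes = (both_yes + only_r2_yes) / n   # rater 2 marginal for 'yes'

    # Kappa: chance agreement from the product of each rater's marginals.
    chance_kappa = r1_yes * r2_yes + (1 - r1_yes) * (1 - r2_yes)
    kappa = (observed - chance_kappa) / (1 - chance_kappa)

    # AC1 (two categories): chance agreement from the mean 'yes' proportion.
    mean_yes = (r1_yes + r2_yes) / 2
    chance_ac1 = 2 * mean_yes * (1 - mean_yes)
    ac1 = (observed - chance_ac1) / (1 - chance_ac1)

    return observed, kappa, ac1


# Hypothetical counts with strongly imbalanced marginals (not study data).
observed, kappa, ac1 = agreement_stats(80, 10, 5, 5)
print(f"observed {observed:.2f}, kappa {kappa:.2f}, AC1 {ac1:.2f}")
```

With imbalanced marginals such as these, kappa's chance term becomes large and the coefficient drops well below the observed agreement, while AC1 stays closer to it.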
Results
A total of 120 different risk scores were investigated in the ORTH population. Of these, 24 risk scores were repeated in the INTM population. The scores with highest sensitivity and specificity were chosen for further evaluation. From the ORTH and INTM populations, risk scores 1, 2 and 3 were tested in the recent patient population. Table S2 shows the risk scores chosen for further evaluation.
According to Table S2, the risk score with the highest precision in the recent patient population (risk score 2) showed a sensitivity of 0.78 and a specificity of 0.75. The number of medication reviews that would have been performed as a result of patients being scored as high-risk patients was 55 (38%). This risk score was used in the prospective pilot study.
The sensitivity and the specificity in the prospective pilot study were 0.67 and 0.65, respectively. The observed agreement between evaluators was 70%. When accounting for chance, Cohen's kappa coefficient reduced agreement to 0.33 or 33% prior to consensus. This corresponds to fair agreement. The AC1 statistic showed a chance-corrected coefficient of 62%.
Based on the results from the prospective pilot study, some final changes were made to the algorithm. (i) Almost all patients scored ‘yes’ for co-morbidity (51/53 patients), and therefore, it did not help the algorithm to distinguish between patients; thus, it was removed. (ii) Many patients were in treatment with eight or more drugs (high risk), leading to an increased number of high-risk patients and consequently a high number of medication reviews (55% of patients). Thus, limits for the number of drugs were changed to low risk 1–5, medium risk 6–11 and high risk ≥12.
The detection limit with the highest precision after revising the algorithm was 13, resulting in a specificity of 0.75 and a sensitivity of 0.64. The number of medication reviews was 49%. The area under the ROC curve was 0.76, 95% CI (0.62; 0.89) (fig. 3). The final score is presented in table 2.

Risk factor | Intervals | Points | Max. no. of drugs that count |
---|---|---|---|
Reduced renal function | eGFR > 60 | 0 | |
Reduced renal function | 60 > eGFR > 30 | 5 | |
Reduced renal function | eGFR < 30 | 10.6 | |
No. of drugs | 0–5 | 0 | |
No. of drugs | 6–11 | 5 | |
No. of drugs | >12 | 10.6 | |
No. of drugs with | Low risk of harm | 0.25 | 3 |
No. of drugs with | Medium risk of harm | 0.5 | 8 |
No. of drugs with | High risk of harm | 1 | 7 |
No. of drugs with | Low + medium risk of interaction | 0.25 | 12 |
No. of drugs with | High risk of interaction | 0.5 | 2 |
- eGFR, estimated glomerular filtration rate.
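As an illustration of how the points in table 2 could be combined, the following sketch sums the renal points, the points for the total number of drugs and the per-drug points. Several details are assumptions rather than statements from the paper: the last column is read as a cap on how many drugs in each category contribute, the boundary values follow the Methods and Results text (eGFR ≤ 30 and ≥12 drugs score highest), and the classification of each drug into a harm or interaction level (Table S1) is supplied by the caller. All function and key names are hypothetical.

```python
from typing import Dict

# Points per drug and maximum number of drugs counted, taken from table 2.
# ASSUMPTION: the last table column caps the number of contributing drugs.
DRUG_WEIGHTS = {
    "low_harm": (0.25, 3),
    "medium_harm": (0.5, 8),
    "high_harm": (1.0, 7),
    "low_medium_interaction": (0.25, 12),
    "high_interaction": (0.5, 2),
}


def renal_points(egfr: float) -> float:
    """Points for renal function (eGFR in mL/min/1.73 m²)."""
    if egfr > 60:
        return 0.0
    if egfr > 30:   # ASSUMPTION: eGFR of exactly 60 falls in the middle band
        return 5.0
    return 10.6     # ASSUMPTION: eGFR of exactly 30 falls in the lowest band


def drug_count_points(n_drugs: int) -> float:
    """Points for the total number of prescribed drugs."""
    if n_drugs <= 5:
        return 0.0
    if n_drugs <= 11:
        return 5.0
    return 10.6     # ASSUMPTION: 12 drugs or more, as stated in the Results


def meris_score(egfr: float, n_drugs: int, drug_counts: Dict[str, int]) -> float:
    """Sum of renal, drug-count and capped per-drug points."""
    total = renal_points(egfr) + drug_count_points(n_drugs)
    for category, (points, cap) in DRUG_WEIGHTS.items():
        total += points * min(drug_counts.get(category, 0), cap)
    return total


# Hypothetical patient: eGFR 45, nine drugs, two high-harm drugs,
# three medium-harm drugs and one high-risk interaction.
example = meris_score(45, 9, {"high_harm": 2, "medium_harm": 3,
                              "high_interaction": 1})
print(example)
```

A score computed this way would then be compared with the detection limit to assign the patient to the low-risk or high-risk group.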
The final version of MERIS identified 44/60 (73%) MEs in 33 patients corresponding to 21/33 (64%) patients and missed 16/60 (27%) MEs. The final score was evaluated in the recent patient population and in the ORTH and INTM populations. In the recent patient population, the area under the ROC curve was 0.87, 95% CI (0.80; 0.94). In the INTM population, the area under the ROC curve was 0.74, 95% CI (0.50; 0.98). In the ORTH population, the area under the ROC curve was 0.66, 95% CI (0.5; 0.82).
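The reported patient-level figures for the prospective population can be cross-checked with a short calculation. The counts of patients with and without MEs come from table 1 and the number of detected patients from the text; the number of true negatives is not reported and is inferred here as the only integer that yields the reported specificity of 0.75 among 20 patients without MEs. This is a consistency check under that assumption, not a reproduction of the authors' analysis.

```python
n_patients = 53        # prospective pilot population (table 1)
with_me = 33           # patients with at least one ME (table 1)
true_pos = 21          # patients with MEs scored as high risk (reported)

without_me = n_patients - with_me   # 20 patients without MEs
false_neg = with_me - true_pos      # 12 patients with MEs scored as low risk

# ASSUMPTION: inferred from the reported specificity of 0.75 (15/20).
true_neg = 15
false_pos = without_me - true_neg   # 5 patients without MEs scored as high risk

sensitivity = true_pos / with_me
specificity = true_neg / without_me
review_fraction = (true_pos + false_pos) / n_patients

print(f"sensitivity {sensitivity:.2f}")          # compare with reported 0.64
print(f"specificity {specificity:.2f}")          # compare with reported 0.75
print(f"medication reviews {review_fraction:.0%}")  # compare with reported 49%
```

Under this reconstruction, the reported sensitivity, specificity and proportion of medication reviews are mutually consistent.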
Discussion
We developed a simple risk score, MERIS, based on variables found in systematic literature search and by expert opinion. The variables used in the final risk score were reduced renal function, the total number of drugs and the risk of individual drugs to cause harm and interactions. We used theoretical weighting in three acutely admitted, retrospective study populations (an orthopaedic population, a population from INTM and a recent population from an acute admissions unit) and one prospective population for the development. The final risk score had a sensitivity of 0.64 and a specificity of 0.75, and the area under the ROC curve was 0.76. The final risk score was considered robust with an area under the ROC curve of 0.87 in the recent patient population, 0.74 in the population of INTM (no data on renal function) and 0.66 in the orthopaedic population.
MERIS was developed from theoretical weighting as opposed to other algorithms presently available using empirical weighting. According to Streiner and Norman, differential weighting of variables rarely is worth the trouble 16. However, with fewer than 40 items, weighting may have some effect, and only if an index consists of unrelated items may it be worthwhile to run a multiple regression analysis and determine empirically whether this improves the predictive ability of the scale. Patient populations are heterogeneous and multiple factors interact, some completely at random. Statistical methods that aim to capture heterogeneity in one reproducible number might overestimate the precision of the result. This is supported by comparing algorithms presented by other authors, which show considerable dissimilarity in the items uncovered by multiple regressions, suggesting underlying factors that are not elucidated by multiple regressions.
The resulting area under the ROC curve and the sensitivity and specificity of MERIS are surprisingly similar to those of scales using empirical weighting. The risk score was developed in historic patient populations and validated in a prospective patient population. The patient populations used in this algorithm were both acute medical and orthopaedic patients, and the area under the ROC curve of the final risk score showed a high level of precision when evaluated in the historic populations and in the recent patient population. The final area under the ROC curve was 0.66 in the ORTH population after adjustments in other populations; however, the ORTH population was a selected patient population, and table 1 displays that the patients were older and treated with more drugs compared with the other populations.
To our knowledge, no algorithms exist that evaluate the risk of MEs by the use of risk scores. Other studies developed risk scores by the use of multiple regressions 10, 11, 13-15. Hohl et al. created clinical decision rules that were sensitive for the detection of ARs causing hospitalization. They found a sensitivity and specificity of 90.8% and 59.1%; however, this result was not validated prospectively 10. Onder et al. 11 developed an algorithm to detect ARs in elderly patients aged over 65 years. Almost 6000 patients were used for the statistical analysis, and the resulting risk score was validated in a prospective study including approximately 500 patients. The area under the ROC curve in this prospective study was 0.70, and the best sensitivity and specificity calculated from the paper were 68% and 65%. Trivalle et al. constructed a risk score very similar to that of Onder et al., targeting patients aged over 65 years 15. They did not present a detection limit for patients at high risk, and the sensitivity and specificity could not be calculated from the paper, but the resulting ROC curve was similar to the one by Onder et al. The risk score was not validated in a prospective study.
A recent study used a standardized pharmacy drug-related problem alert system, which automatically generated alerts when a new prescription was entered 14. The alerts were based on main sources of pharmacological information. The seriousness of the alerts was, however, not evaluated, and multivariate regressions were performed on correlations to patients with at least one drug-related problem alert. The area under the ROC curve was 0.77, and this was confirmed in a prospective validation study. No chart review was performed as confirmation or ‘gold standard’, and as the MEs were generated automatically by an electronic system, a confirmation of their clinical relevance by medical experts would have been relevant.
The reference standard used for evaluating potential MEs with a risk of harm is to some extent subjective and depends on a personal point of view. Potential MEs are by definition not directly associated with serious outcome, and it may therefore be difficult to identify them. This study used consensus as gold standard, which is a method often used to increase the accuracy of the reference 10, 11, 29, 30. The observed agreement between evaluators was 70%. We found kappa to reduce agreement to 0.33 or 33%, corresponding to fair agreement. However, the AC1 statistic showed a reasonable chance-corrected coefficient of 62%, which is in line with the observed level of agreement. This supports the concerns expressed by some investigators, referred to as the ‘kappa paradox’, where a high value of observed agreement can be drastically lowered by an imbalance in the table's marginal totals 31.
An example of good agreement was demonstrated by Hohl et al. 10. The study used three clinical pharmacists to assess whether a patient's hospitalization was due to an AR, and when the diagnosis was in agreement with the physician's working diagnosis, it was considered the reference standard 10. The kappa inter-rater reliability was 0.75, corresponding to good agreement. The evaluation in Hohl et al. was based on actual hospitalizations due to ARs, whereas the evaluations performed in our study were based on potential MEs. Previous studies found that facts are easier to agree upon than problems requiring an element of judgement 32-35.
The reference standard is an important determinant of the diagnostic accuracy of a test. The lack of a true reference standard might bias the results of our study when calculating sensitivity and specificity. Bearing this in mind, the resulting sensitivity and specificity appear to be good and as good as those of studies using multiple regressions. A perhaps more accurate ‘gold standard’ could have been clinical follow-up of adverse outcome in patients after discharge due to the perceived MEs. This was, however, not feasible as information required in the retrospective populations was not available. In total, 302 patients were used in this study for modelling, and in 50 patients, no data on renal function were available. Compared to the other studies mentioned, the number of patients in our study was lower. Onder et al. used almost 6000 patients for the multiple regression analysis 11. The number of patients in our study was too low to support meaningful statistical analysis of this kind. However, the patients came from different hospitals and from different medical specialties, contrary to the other studies mentioned. This could be an advantage when applying the algorithm in the daily clinic.
The MERIS algorithm is a very simple score using available information from medical records; thus, it could be incorporated into a computerized physician order entry (CPOE) system and thereby continuously generate automatic scores in patients' medical records during hospitalization. Above a pre-specified detection limit, the score would then advise the healthcare professional to review the patient's medication or to perform other relevant interventions.
When the MERIS algorithm is used for detecting high-risk patients, corresponding interventions are required to lower the patients' risk. Interventions, for example medication reviews, in high-risk patients should preferably be tested in a randomized, controlled design to establish their effectiveness in this particular group of patients.
Bearing in mind the risk of bias from using potential MEs as a reference standard, we developed a simple and robust score, MERIS, with the ability to detect patients and divide them into low and high risk of MEs in a general population admitted to an acute admissions unit. The accuracy of the risk score was at least as good as other models reported using multiple regression analysis. With a score and detection limit of >13, the sensitivity was 0.64 and specificity 0.75. The area under the ROC curve was 0.76.