Abstract
Objective To develop a framework for good clinical decision-making using machine learning (ML) models for interventional, patient-level decisions.
Design Grounded theory qualitative interview study.
Setting Primarily single-site at a major urban academic paediatric hospital, with external sampling.
Participants Sixteen participants representing physicians (n=10), nurses (n=3), respiratory therapists (n=2) and an ML specialist (n=1) with experience working in acute care environments were identified through purposive sampling. Individuals were recruited to represent a spectrum of ML knowledge (three expert, four knowledgeable and nine non-expert) and years of experience (median=12.9 years postgraduation). Recruitment proceeded through snowball sampling, with individuals approached to represent a diversity of fields, levels of experience and attitudes towards artificial intelligence (AI)/ML. A member check step and a consultation with patients were undertaken to vet the framework, which resulted in some minor revisions to the wording and framing.
Interventions A semi-structured virtual interview simulating an intensive care unit handover for a hypothetical patient case, using a simulated ML model and seven visualisations based on established methods addressing the interpretability of models in healthcare. Participants were asked to make an initial care plan for the patient and were then presented with a model prediction followed by the seven visualisations, to explore their judgement, the visualisations’ potential influence and their understanding of the visualisations. Two visualisations contained contradictory information to probe participants’ process for resolving the contrasting information. The ethical justifiability and the clinical reasoning process were explored.
Main outcome A comprehensive framework was developed that is grounded in established medicolegal and ethical standards and accounts for the incorporation of inference from ML models.
Results We found that, in making good decisions, participants reflected across six main categories: evidence, facts and medical knowledge relevant to the patient’s condition; how that knowledge may be applied to this particular patient; patient-level, family-specific and local factors; facts about the model, its development and testing; whether the patient-level knowledge is sufficiently represented by the model; and whether the model incorporates relevant contextual factors. This judgement was centred on, and anchored most heavily in, the overall balance of benefits and risks to the patient, framed by the goals of care. We found evidence of automation bias, with many participants assuming that if the model’s explanation conflicted with their prior knowledge, their own judgement was incorrect; others concluded the exact opposite, drawing from their medical knowledge base to reject the incorrect information provided in the explanation. Regarding knowledge about the model, participants most consistently wanted to know about the model’s historical performance in the cohort of patients in the local unit where the hypothetical patient was situated.
Conclusion Good decisions using AI tools require reflection across multiple domains. We provide an actionable framework and question guide to support clinical decision-making with AI.
- ethics
- clinical decision-making
- evidence-based practice
- policy
- critical care
Data availability statement
Data are available on reasonable request. We are open to requests for the original data. If requested, we will return to the original participants to obtain consent prior to releasing the de-identified transcripts from interviews.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
WHAT IS ALREADY KNOWN ON THIS TOPIC
There is strong interest in using artificial intelligence (AI) systems to improve patient care.
A key concern surrounds ethical and responsible clinical decision-making for patients, which accounts for AI outputs while respecting patient context and values.
Current proposed solutions (eg, explainability, transparency, etc) are solely model-focused and fail to address the patient context.
Medicolegal approaches focus narrowly on whether clinician decisions and model outputs are ‘correct’ or ‘incorrect’, which does not provide constructive guidance for situations of medical uncertainty as are frequently encountered in acute care environments.
WHAT THIS STUDY ADDS
We offer an ethical framework for incorporating AI outputs in the broader clinical decision-making context in high-stakes environments such as the intensive care unit.
This framework does not require perfect model performance, nor are models required to be ‘explainable’ in order to make reasonable clinical decisions.
HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY
By identifying how clinicians can and should make decisions when using AI, we hope that this work can provide a structure for clinicians at the point of care to consider, question and make good judgements on how to ethically use AI tools in practice.
We anticipate that medicolegal work may draw from these results in determining reasonableness should a clinical decision be called into question.
Furthermore, professional colleges may consider these findings when providing advice to their members in navigating the future of medicine with AI.
Introduction
Despite popularised perceptions of superior performance, there is increasing recognition that even high-performing artificial intelligence (AI)/machine learning (ML) tools are most appropriate as decisional aids rather than replacements.1 A perennial barrier to their responsible use is the uncertainty around how clinicians ought to use individualised predictions, balancing clinical judgement against algorithmic outputs.
Explainable AI (XAI; that is, providing point-of-care information purporting to account for a given prediction) has been proposed as a mechanism for mediating clinical decision-making, with some going so far as to suggest it is essential for ethical practice.2–6 Empirical work, however, highlights challenges, including that explainability methods (at this time) remain somewhat unreliable,7 are unnecessary for adoption,8 posit a standard not applied elsewhere in healthcare,9 10 may introduce additional errors7 11 12 and are focused narrowly on the model instead of the patient.13 XAI has been driven primarily by methodological developments,14 15 often overlooking the needs of users. To date, there has been no comprehensive evaluation of which explanation methods are useful in the clinical context.16–19 Early attempts to determine which explanation methods improve decision-making and trust in clinical contexts are presented in recent studies.20–22 Additionally, the field would benefit from a corresponding exploration of the ethical considerations of XAI and their impact on end-user decision-making.13 Some challenges for translation may relate to the failure of many technically driven approaches to capture the full complexity of the context in which clinical decisions are made.
Medicolegal frameworks often focus on whether a given AI output is right or wrong, whether the clinical decision that follows is right or wrong, and whether harm occurs.23 Yet medical decisions are characterised to varying extents by uncertainty; this is perhaps more true in acute care settings than elsewhere. The ‘rightness’ or ‘wrongness’ of a model prediction may not be immediately verifiable, nor can a clinician’s decision be judged ‘right’ or ‘wrong’ on the basis of the outcome alone. Clinical judgement across jurisdictions generally must meet a standard of ‘reasonableness’: what would a similarly situated clinician reasonably understand about the risks and benefits?24 25 This situation is compounded by the positioning of AI products as ‘clinical decision support’ (meaning that the burden of responsibility rests on the clinician user) and the paucity of professional guidance available to support clinician decision-making. Where the predictions made by ML tools can inform clinical judgement, there is a clear need to better understand how ML outputs ought to be considered in the clinician’s judgement.
We conducted a grounded theory study to develop a theory of ‘good’ decisions with ML tools in an acute care setting (the intensive care unit (ICU)) by interviewing clinicians engaging with a simulated ML model and patient case. In specifying ‘good’, we invoke decisions that meet notions of reasonableness (medicolegal) and justifiability based on values (ethical). The resultant normative framework can be applied prospectively to support clinical decision-making using ML tools.
Materials and methods
Research design
This study was initiated in January 2021 and completed recruitment in March 2022. Our protocol was preregistered on Open Science Framework (https://osf.io/wvrz8). Participants were invited from multiple institutions across the greater Toronto area from both paediatric and adult hospitals. We used an interactive virtual interview design (due to COVID-19) exploring a hypothetical patient case using a simulated ML model aided by seven model visualisations to support interpretability. The full study protocol is included in online supplemental appendix A, which describes the interview design and development including clinical details (online supplemental box A.1). We also describe the development and justification for the choice of visualisations (online supplemental appendix B, table B.1).
Participants were recruited by a research assistant (previously unknown to them) using purposive snowball sampling, initially focused on the application domain (ICU) and then moving outwards to assess generalisability of the developing themes and framework. Inclusion criteria were: (1) practising clinician; (2) working in an inpatient care environment; (3) prior experience extubating patients, managing patients postextubation or participating in decision-making pertaining to patient extubation; (4) English fluency; (5) access to technology for the purpose of the interview. We balanced sampling across multiple domains, including level of experience, technical knowledge of ML and specific roles. We included one ‘outsider’ perspective to bring in a more technical lens.
The interview centred on the clinical decision for 4-month-old Siri, a patient intubated in the ICU. Siri is medically complex (trisomy 18 (T18), requiring continued respiratory support since birth) and recently underwent surgical repair of their heart. The decision (and unit of analysis26) is whether or not to extubate Siri, with an ML model predicting a 60% chance of successful extubation (defined as not needing re-intubation within 48 hours). Interviews proceeded as follows: (1) presentation of the case and establishment of an initial plan; (2) presentation of the model prediction and exploration of initial reactions; (3) presentation of each visualisation and exploration of perspectives; (4) exploration of the rationale, justification and ethics of the decision-making process and outcome. Interviews typically lasted 2 hours over one or two sessions.
The grounded theory analysis used raw transcripts and field notes.26 27 Grounded theory is an insightful, systematic approach to characterising emerging or novel phenomena.27 In the initial open coding step, team members MDMcC and KT read and coded transcripts independently before meeting to synthesise codes and subcodes and resolve discrepancies. Once the initial set of codes was established, the whole team performed axial coding to identify and analyse inter-relationships between codes and subcodes in generating the developing theory. We brought in a legal scholar (IS) to triangulate with medicolegal standards. Using the constant comparative method,26 28 29 we continually returned to transcripts in the context of larger research themes that describe the overall conceptual understanding of the phenomenon. As we found that no new codes were emerging after around 12 interviews, we selectively sampled for participants whose backgrounds and perspectives had not yet been represented28 (respiratory therapists, senior staff). The final interviews focused on testing the theory’s fit, work, relevance and modifiability as indicative of its quality, rigour and value.30 We ended recruitment at 16 participants, in line with recognised standards in qualitative research.31 32
As a final step for assessing our theory’s rigour, we undertook member checks as external validity testing with both participants and non-participants in the ICU. This step included patient and public involvement as consultation: we presented the work and findings to two patient advisor groups (youth and adults) at SickKids to obtain their feedback. They advocated strongly for ensuring that we align with the shared decision-making paradigm that is broadly endorsed in many medical locales worldwide and was indeed discussed by nearly all of our participants.
Our study is reported in alignment with the Consolidated criteria for Reporting Qualitative research (COREQ) checklist.
Results
Sixteen participants and their demographic information are presented in table 1.
Table 1 Participant demographics
Figure 1 depicts the theoretical framework describing how clinicians should make good decisions for patients using ML tools. The framework (figure 1) draws from three main categories: the medical knowledge system (denoted in yellow), context (blue) and the ML tool (red). Three further categories arise from the overlaps between the primary categories: integrating knowledge within context (green), the model’s ‘knowledge’ (orange) and the model’s ability to reflect context (purple). These categories are organised around a central question (black) that is the decision space.
The circular figure represents the framework for making good decisions using AI. The outer circle represents the domains the clinician is required to reflect on in order to make a responsible decision for an individual patient. A resolution is achieved through reflective equilibrium across categories, using the patient’s best interests as the guidepost. We chose to represent the framework as a circle to indicate that categories overlap and that there is no specific hierarchy among them; rather, the clinician’s judgement, reflected across all these categories, may weigh one more heavily than another, anchored by the individual patient.
Table 2 describes the elements within each category. Tables 3 and 4 elaborate on each element along with representative quotations.
Practical questions to guide reflection across the framework
Categories pertaining to the machine learning model
Medical categories
The medical knowledge system
Participants all drew from scientific knowledge pertaining to medical facts about the patient’s condition and specific comorbidities. For example, the decision about extubation is affected by the presence of T18, which involves central nervous system disturbances (including disorganised breathing). Participants referred to knowledge regarding how the relevant features of Siri’s case could potentially have interaction effects (eg, comorbidities) and how these interactions may influence their risk assessment. Several participants referred to looking into the academic literature (eg, clinical guidelines, evidence from clinical trials, reviews) to inform their judgement as well as their own experiential knowledge working with similar cases. Participants also noted the limitations of knowledge, such as the uncertainty about T18’s relative contribution to extubation risk, the recognition that complexity and uncertainty are related and how sometimes patients surprise clinicians (eg, having an unexpected positive or negative clinical response).
Context
The bedside examination was essential for all participants, who cited observation of a previous extubation attempt, observation of patient-ventilator interactions and frequent patient checks as valued information contributing to their decision. Participants articulated that the bedside examination was the most reliable form of knowledge; as measurements and metrics become less proximal and more distal to the patient, they were perceived as less reliable.
The goals of care for the patient were the guideposts for medical judgement, as decided upon by Siri’s parents (we refer to family factors here in recognition of Siri’s age: as an infant, Siri would not have decisional capacity. This category of considerations, however, can readily be abstracted to include the goals of care as decided on by a capable patient themselves, or by a surrogate decision-maker in conjunction with a non-capable patient. Notably, we describe a shared decision-making process, where the decision itself is reached through a reciprocal process between clinicians and families, recognising that a legal surrogate is the actual decision-maker). Participants noted that families in the same situation might legitimately make different choices (eg, a non-interventional or a curative plan of care). Participants in paediatrics remarked that such decisions are subject to the best interests standard (the legal and ethical obligation of both substitute decision-makers, typically parents, and physicians), requiring an overall balance of more benefits than harms. These goals of care dictated how they would approach the decision about extubation.
Many participants cited how clinician-level factors such as personal risk assessment and tolerance, familiarity with difficult airway management and knowledge of complex patients modified the decision. For example, some stated they might prefer that a more senior clinician perform the extubation given Siri’s particular risk level. Local context also contributed, particularly regarding the timing of the decision. For example, several participants expressed the desire for the ear, nose and throat team (specialists who support airway management issues) to be on hand. Some also cited Child Life Specialists as a resource to manage patient stress, which would affect their timing. These considerations highlight how a decision’s ‘goodness’ can be modified by context and that clinicians should aim to maximise all potentially contributing factors for patient benefit.
All participants spoke about communication with families and the need to prepare them for the intervention, modifying how they communicate based on that particular family’s knowledge, preferences and context, as well as considering how the family can support optimisation of the extubation (eg, bedside presence). Several participants remarked that a good decision can be rendered less ‘good’ by poor communication, harming the relationship with the family and compromising trust.
The machine learning tool
All participants drew from visualisation 1, which provides an overview of the model’s training, validation, prospective clinical evaluation and regulatory approvals (online supplemental appendix B, table B.1).8 Nonetheless, participants felt this regulatory and testing information did not directly tell them how to interpret the output for Siri. Most participants pointed to the model having received regulatory approval as a source of some comfort, to varying degrees. Some were reassured but noted that, in paediatrics, other interventions have been approved without having been tested on paediatric patients, which gave them pause. Others felt regulatory approval was an unacceptably low bar in and of itself.
The phrase ‘my patients’ came up repeatedly, referring to whether the model had been tested prospectively for its performance on the patients in their specific unit. Participants were most reassured by the prospective testing and good performance of the model among the patients at the hospital and in the unit in which it was now being used.
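To make concrete what this kind of local, prospective evidence could look like, the following is a minimal sketch in Python (with hypothetical data, values and thresholds; this is not the study’s model or evaluation pipeline) of a unit-level audit answering the question ‘how has the model performed on the patients in my unit?’

```python
# Minimal sketch of a local, prospective performance audit.
# All numbers are hypothetical; in practice these would come from a
# prospective log of model predictions and observed outcomes in one unit.
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

# Hypothetical local ICU log: predicted probability of successful extubation
# and observed outcome (1 = success, no re-intubation within 48 hours).
local_predictions = np.array([0.72, 0.55, 0.81, 0.40, 0.66, 0.35, 0.90, 0.58])
local_outcomes    = np.array([1,    1,    1,    0,    1,    0,    1,    0])

print("Local AUROC:", round(roc_auc_score(local_outcomes, local_predictions), 2))
print("Local Brier score:", round(brier_score_loss(local_outcomes, local_predictions), 3))

# Simple calibration check in two risk bands, mirroring the bedside question
# "when the model says ~60% or more, how often does extubation succeed here?"
high = local_predictions >= 0.6
print("Observed success when predicted >= 60%:", local_outcomes[high].mean())
print("Observed success when predicted < 60%:", local_outcomes[~high].mean())
```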
Participants appreciated knowing the model’s intended use, notably the specific task it performed (risk prediction for extubation success, defined as no need for reintubation within 48 hours). The link between the prediction and the clinical decision, however, must be clearly specified, as one participant remarked in response to inadvertent phrasing by one of the interviewers: “Is the model actually telling me a recommendation or is it just presenting me the data?”
All participants wanted to know the inputs (ie, the specific data sources the model drew from). However, participants also raised that knowing the inputs did not necessarily convey sufficient information about how those inputs were transformed into computable data. For example, the genetic diagnosis and the chart notes were raised frequently as sources of uncertainty as to how the variables were precisely computed, which had a bearing on how the participant would interpret the prediction.
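As a purely illustrative sketch of this concern (hypothetical feature names and logic; not the study’s model), the same chart information can be turned into computable inputs in quite different ways, which the input list alone does not reveal:

```python
# Sketch: two hypothetical ways of encoding the same chart information.
from dataclasses import dataclass

@dataclass
class ChartSnapshot:
    genetic_diagnoses: list   # e.g., ["trisomy 18"]
    note_text: str            # free-text clinical note

def encode_option_a(chart: ChartSnapshot) -> dict:
    # Option A: a single binary flag for the genetic diagnosis.
    return {"has_trisomy_18": int("trisomy 18" in chart.genetic_diagnoses)}

def encode_option_b(chart: ChartSnapshot) -> dict:
    # Option B: crude severity cues mined from the note text; the same patient
    # can look very different to the model depending on documentation style.
    text = chart.note_text.lower()
    return {
        "has_trisomy_18": int("trisomy 18" in chart.genetic_diagnoses),
        "note_mentions_difficult_airway": int("difficult intubation" in text),
        "note_mentions_apnoea": int("apnoea" in text or "apnea" in text),
    }

chart = ChartSnapshot(
    genetic_diagnoses=["trisomy 18"],
    note_text="Hx of difficult intubation; intermittent apnoea overnight.",
)
print(encode_option_a(chart))  # {'has_trisomy_18': 1}
print(encode_option_b(chart))  # richer, but dependent on how the note was written
```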
Several participants remarked that the performance metrics were less convincing to them than their own experience with the model. Some remarked that the outcomes would ‘speak for themselves’, indicating that ultimately performance (eg, compared with explanations) would calibrate the scope of their reliance on the model. Performance did not have to be perfect, but participants noted the importance of learning through experience where the model tended to ‘get it right’ and developing a schema of the types of errors that the model was prone to. Most asked about representation: whether patients ‘like Siri’ were part of the training data.
As explainability was directly tested in the interview, we observed a range of perspectives. Nearly all participants stated that they would like to know how both the predictions and the explanations were computed for Siri, but for many this appeared to be a curiosity rather than something critical to correctly interpreting the model output. For example, one experienced physician remarked: “I’m not hugely convinced that explanations are going to be necessary for the most efficacious models. And you know what I mean by that is that we have a long track record in medicine of using things that work even if we don’t understand them” [P004, male]. Some, however, felt strongly that they would not accept a ‘black box’ model.
All participants did, however, want to get a general sense of the method by which the algorithm computes the task. Some referred to the formula, others the weights, and others a ‘skeleton’ of the model’s computational processes. When queried, the primary purpose of requesting the above information seemed to be to compare the model’s general computational process to their own clinical schema.
Knowledge in context
Participants all spoke about applying knowledge towards goals of care, considering how knowledge about medical evidence and context is interpreted in light of the specific patient at hand in order to engage in shared decision-making with the family.
Participants reflected on their interpretation of information, noting that individual sources of information (eg, bedside signals) need to be interpreted relative to the patient’s individual situation. For example, knowing the patient’s baseline relative to the general baseline for patients of their age or condition modified how they used bedside information. Participants also wanted to know what had been done previously to optimise the patient and, where one source of information was less reliable, to triangulate information sources to justify decisions. At a higher level, this way of personalising medical care drew from established knowledge about general patterns and applied it to the individual patient.
Several participants raised questions about T18’s social, political and historical context. Many stated that knowledge established from clinical research was prone to bias because of pessimistic prognostications in the context of disability. The knowledge foundation in this case (i.e., with respect to the presence of T18) was perceived as less reliable than for other medical issues.
The model’s ‘knowledge’
All participants wanted not just to know the inputs but whether they were the right inputs, that is, whether the model was processing the information deemed clinically relevant to the task. A few participants expressed a belief that more complicated models (meaning more inputs) could take more patient features into account and assumed improved accuracy on this basis. Others remarked that doing so could simultaneously increase the uncertainty around the knowledge produced.
Several participants considered data quality, noting that the data sources were more or less reliable as model inputs depending on the context in which they were collected. For example, some remarked that there is variability in clinical note documentation within and between specialities and between individual providers, with trustworthiness of those sources also being variable.
Participants wanted the model to convey which elements of the patient’s prediction were actionable versus non-actionable. Specifically, nearly all participants decided not to extubate Siri, despite what they viewed as a generally legitimate prediction, because they felt more could be done to optimise Siri for extubation.
Several participants wondered whether the limitations of medical knowledge about T18 had become incorporated into the model predictions. Bias as a knowledge problem could happen when consistent patterns of care were encoded by the model, but were not actually representative of the true situation. Multiple participants cited previous medical research where decisions about withdrawal of life-sustaining therapies were made on the basis of prognostic factors that themselves became determinants of outcomes (the ‘self-fulfilling prophecy’). Participants worried about being overly pessimistic about Siri’s prognosis due to stigma against disability.
Participants all wanted to know how the explanation related to the prediction. Given that the prediction was a risk score, participants were unclear about whether a particular highlighted feature or signal was indicative of the favourable or unfavourable direction of the prediction; for example, did the variables presented in the feature importance visualisation (online supplemental appendix B, table B.1, visualisation 2) contribute to the 60% likelihood of success or to the 40% possibility of failure? They often supplied their own interpretation of the explanation in line with their clinical knowledge.
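As an illustration of this directionality issue, the following minimal sketch (Python; hypothetical baseline, feature names and values, not the study’s model or its visualisations) shows how signed feature contributions in log-odds space could combine into a single probability, so that each contribution explicitly points towards success or towards failure:

```python
# Sketch: signed, log-odds feature contributions behind a single risk score.
# All values are hypothetical and chosen only so the total lands near 60%.
import math

baseline_log_odds = 0.9            # hypothetical population baseline (~71% success)
contributions = {                  # hypothetical signed contributions (log-odds)
    "ventilator_settings": +0.45,  # pushes towards successful extubation
    "trisomy_18": -0.60,           # pushes towards failure
    "age_months": -0.20,
    "recent_blood_gas": -0.15,
}

log_odds = baseline_log_odds + sum(contributions.values())
probability = 1 / (1 + math.exp(-log_odds))   # sigmoid: log-odds -> probability

print(f"Predicted probability of successful extubation: {probability:.2f}")  # ~0.60
for feature, value in contributions.items():
    direction = "towards success" if value > 0 else "towards failure"
    print(f"  {feature}: {value:+.2f} ({direction})")
```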
Model’s ability to reflect the context
There was a consistent notion that the model outputs needed to be contextualised with direct knowledge of the patient, positioning the patient as the anchor for the ultimate verification of a prediction’s reasonableness. A common example from ICU-trained participants was that physiological signals can look quite ominous on observation alone, yet on setting eyes on the patient it becomes clear that something benign has happened. Participants who were aware of these artefacts wanted to know if the model had been designed to exclude these events from the overall prediction task (artefact recognition) to help them appraise the relevance of the prediction.
Nearly all participants remarked that the model prediction, as well as the task the model was supporting, was but part of a larger judgement they needed to make that centred on the big picture. The big picture considered the goals of care for Siri and how the task itself related to their ability to achieve those goals. The risk was relative to those goals, such that 60% for Siri was not the same as a 60% risk for another child with less medical complexity, and 60% for a child who was not to be re-intubated was not the same as 60% for a child on an interventional care pathway.
Reasoning on discrepancies
To challenge their thinking, we constructed two visualisations with directly contrasting information (online supplemental appendix B, table B.1, visualisations 5 and 7). The contrast related to the extent to which T18 was influencing Siri’s risk of extubation failure: either (1) Siri’s T18 phenotype is relatively mild and thus does not substantially modify their risk, in which case visualisation 7 is wrong; or (2) T18 is a significant predictor despite Siri’s apparent robustness, in which case visualisation 5 is wrong. In the clinical presentation, participants were told that Siri had a history of difficult intubations (online supplemental appendix A, box A.1), which was a clue that T18 was in fact a significant contributing factor.
Participants responded in two ways. Some assumed the model had knowledge they did not have, with many participants stating there must be ‘a reason’ for this explanation. Others stated that visualisation 5 and the mere presence of a discrepancy made them question the model’s reliability.
Those who detected the discrepancy drew from their knowledge about T18 generally (“medical knowledge” category) and considered Siri’s ‘robustness’ based on their bedside observations (“context” and “integrating knowledge within context” categories). Several also appropriately recognised that T18 is a condition for which data patterns may be biased by virtue of under-representation in the training set (“machine learning model” category) and incomplete knowledge about the prognostic implications (“model’s ‘knowledge’” category). There were no clear patterns in terms of level of experience, role or other participant-level factors, other than that nearly all participants from cardiac critical care observed that visualisation 5 struck them as unusual. While the sample size is small, it is noteworthy that multiple participants without ML knowledge or cardiac intensive care experience picked up on the discrepancy while others with domain expertise did not.
Decision space
Participants generally recognised that the model’s output represented a subset of the considerations they needed to reflect on, taking the information provided by the model as informing their calculus of the overall risks and benefits, and their respective likelihoods, for this individual patient, mediated by contextual factors. We asked how they would explain their thinking to a querying parent:
I would explain all of the different factors that I was clinically taking into account and say that we also have a general model that can give us a score and taking that score into consideration with what I know about Siri and [their] medical history, this is how we’ve come to our clinical decision. I’m […] giving a higher level of all the different aspects that are contributing to my decision and to let me come to this conclusion. [P015, female, RN]
Participants reported that the extubation prediction formed part of a larger care plan and that ultimately the accuracy of the prediction would come secondary to many contextual factors, such as the goals of care and the status of the unit. Overall, decisions were guided primarily by acting towards the best interests of patients (“I want to make the good decision and not just the ‘algorithm-correct’ decision” [P015, female, RN]).
Discussion
Our study presents a comprehensive decision-making framework for medical decisions with ML tools. We describe six categories to reflect on in making good decisions, offering a means by which model predictions can inform clinical decision-making in which the patient’s best interest remains central, consistent with shared decision-making and with medicolegal standards of reasonableness.24 25 This framework can provide a structure for reasoning about decision-making in situations where there are conflicting sources of information; for example, when a model prediction may be incorrect or misaligned with goals of care. Given the paucity of professional guidance and the burden of responsibility on clinicians, we intend for this framework to support clinical decision-making with ML tools.
Our findings contribute to the larger literature surrounding considerations for explainability and responsible decision-making. While many participants expressed a desire to ‘understand’ the model’s prediction, ultimately explanations were not central to the medical decision. This finding is consistent with other work describing how model predictions must be contextualised within a standard of care that centres on patients’ interests.1 2 8 21
We found some support for concerns raised regarding automation bias and explainability’s potential exacerbatory effect. Many participants did not notice the conflict between the visualisations; when it was highlighted, some responses showed evidence of automation bias (eg, “the model must know something I don’t”), often expressing a belief that the model processes information the same way the clinician does.33 Consistent with Gaube et al,11 we did not observe clear associations between years of experience or technical knowledge of ML and response patterns.
These findings are particularly important given that explainability can actually worsen decision-making, with respect to both clinical accuracy and the ability to detect systematic biases.11 12 34 35 It is thus particularly problematic that explainability often serves as a proxy for ‘ethical AI’, a positioning which entirely bypasses whether explainability is an effective mechanism for improving clinical judgement, supporting the detection of incorrect predictions and improving patient care decisions. We propose that research urgently explore clinicians’ epistemic and philosophical beliefs and values concerning AI as potential contributors to over-reliance.
Recently, overriding clinical judgement in favour of algorithmic outputs has resulted in a legal case being pursued by patients and their families who have suffered harm.36 Other AI tools, such as those assisting with translation, have also been found to result in patient harm.37 38 Medicine cannot afford to compromise patient trust; as we have seen elsewhere, trust is difficult to gain but easy to lose. Ensuring that medical decisions are not automated but remain focused on the best interests of patients is paramount to securing and retaining trust. Our framework pushes away from both algorithmic paternalism and dichotomous ‘AI versus physician’ modes to demonstrate how to take in evidence about an AI model’s performance while aligning with the shared decision-making standard.39
Limitations of the study
This is a first attempt to develop a framework guiding clinical decision-making with ML tools, and it will be important to establish its transferability across contexts. Despite the heterogeneity of our participants, it is possible that institutional culture may drive some results; for example, the ICU at SickKids has a strong embedded ethics culture, which may shape some responses. SickKids also has a highly visible AI strategy, with many staff members (and some participants) actively involved. We used a hypothetical ML model which, though realistic, may not precisely mimic actual model performance. Further work is needed to determine the extent to which our framework might apply to other domains such as imaging or detection. The framework is comprehensive and thus long; we anticipate that many aspects become internalised over extended experience with a model, but whether the framework is clinically useful should certainly be tested empirically.
Conclusion
Our work presents a novel framework for integrating ML outputs into shared clinical decision-making standards. We intend for this work to be informative to professional practice guidelines and individual clinicians looking to responsibly use ML tools in their practice. Professional organisations may draw from this work to provide supportive guidance for members in order to satisfy their ethical duties towards patients while responsibly engaging with innovative tools. Our work indicates that although clinicians often prefer explanations, their decisions might be better supported by providing more nuanced information about the model’s performance in its particular context, coupled with support for clinicians exercising their judgement in the patient’s best interests.
Data availability statement
Data are available on reasonable request. We are open to requests for the original data. If requested, we will return to the original participants to obtain consent prior to releasing the de-identified transcripts from interviews.
Ethics statements
Patient consent for publication
Ethics approval
This study was approved by the SickKids Research Ethics Board (#1000064003). For all participants, consent for audio recording was obtained prior to scheduling the virtual interview. Audio recordings and transcripts were de-identified and maintained behind institutional firewalls, with access limited to study investigators only. Participation was voluntary, and recruitment was conducted via email by the study project coordinator (KT) to avoid undue influence. Similarly, consent will be obtained by the study project coordinator, as AA and MDMcC are connected with potential study participants in their clinical roles. Findings from this study will be disseminated to clinical teams, presented at various conferences and submitted for publication. Data from the study will be destroyed in accordance with local protocols. As this is an exploratory study, we plan to expand the study and encourage replication by our colleagues to enhance knowledge in this emerging area of scholarship.
Acknowledgments
The authors are immensely grateful for the support of this study from the Department of Critical Care Medicine at the Hospital for Sick Children in Toronto, Ontario, Canada. We express our gratitude to the SickKids Youth AI Council and the Children's Council for allowing us to discuss this work and for providing their feedback.
References
Footnotes
X @Mmccradden
Contributors ST, SJ and AG conceptualised this study and conducted early exploratory work with MDMcC. AA and MDMcC contributed to the development of the overall protocol and methods, with support from FC, MZ and KT. Data analysis was led by MDMcC with KT, with support from IS, AA, SJ and AG. All authors contributed intellectually to the final protocol and manuscript. The final protocol reflects the combined, equal intellectual work of ST, AA and MDMcC. AG was the principal investigator for this project. MDMcC takes accountability as guarantor of the overall work.
Funding ST was also supported by the CIHR Health System Impact Fellowship. MDMcC receives salary support from the SickKids Foundation.
Competing interests None declared.
Patient and public involvement Patients and/or the public were involved in the design, or conduct, or reporting, or dissemination plans of this research. Refer to the 'Methods' section for further details.
Provenance and peer review Commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.