Peer reviewed
ERIC Number: ED658699
Record Type: Non-Journal
Publication Date: 2022-Sep-23
Pages: N/A
Abstractor: As Provided
ISBN: N/A
ISSN: N/A
EISSN: N/A
Threats to Validity in the Application of Machine Learning in Education
Kylie Anglin
Society for Research on Educational Effectiveness
Background: For decades, education researchers have relied on the work of Campbell, Cook, and Shadish to guide their thinking about valid impact estimates in the social sciences (Campbell & Stanley, 1963; Shadish et al., 2002). The foundation of this work is the "validity typology" and its associated "threats to validity." In this framework, researchers consider the validity of inferences regarding the constructs represented by operationalized variables (construct validity), the strength of association between two variables (statistical validity), the causal relationship between those variables (internal validity), and the generalizability of that relationship (external validity). For each of these validity types, Shadish, Cook, and Campbell outline key threats to validity so that researchers may make design choices that improve their inferences. The framework has had a meaningful influence on the rigor of education research, resulting in a methodological transformation over the past fifteen years towards randomized trials and quasi-experimental designs (Reardon & Stuart, 2019). Today, education research is in the midst of a second transformation, as new data sources such as natural language and text data have required new methodological approaches (Reardon & Stuart, 2019). Key among these approaches is the application of machine learning to educational data. In supervised machine learning approaches, researchers typically: (1) sample a subset of the data for manual analysis, labelling the data according to the construct of interest; (2) split the labelled data into a training set and a validation set; (3) use the training set to train a model to learn the features that are predictive of the labels; (4) calculate performance statistics on the validation set; (5) apply the model to unlabelled data; and (6) use the model's labels to make inferences regarding educational processes (a minimal sketch of this workflow appears below). There are key threats to validity at each of these stages. This paper argues that, because the importance of valid inferences is not diminished by new data sources and techniques, the validity types framework can continue to be useful in the application of machine learning to educational impact analyses.

Purpose: This paper builds on the validity types framework by considering the key threats to validity in inferences drawn from machine learning. While the majority of these threats have been discussed outside of education, they are rarely discussed within the validity types framework. By bringing each of these threats into a single framework well known to education researchers, we hope to encourage researchers using machine learning to systematically consider plausible threats to validity and the design choices they can make to rule those threats out.

Methods: We draw on the writings of Shadish, Cook, and Campbell (2002), methodological work from machine learning scholars (Hastie et al., 2009; Jurafsky et al., 2018), and recent applications of machine learning in education to categorize threats within the validity types framework, to demonstrate how these threats commonly operate in educational contexts, and to show how they may be ruled out. While the full paper includes definitions, examples, and design solutions, here we simply list a few key threats in each category.
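The following is a minimal sketch of the six-step supervised labelling workflow described in the Background, under stated assumptions: the tiny corpus, the construct (observed collaboration), and the TF-IDF plus logistic regression model are all hypothetical illustrations, not choices specified in the paper.

```python
# Illustrative sketch of the supervised labelling workflow. All data and
# model choices below are hypothetical assumptions for demonstration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# (1) Manually labelled subset: 1 = collaboration observed, 0 = not observed.
texts = [
    "students worked together in small groups",
    "the teacher lectured for the full period",
    "pairs of students compared their answers",
    "students copied notes from the board silently",
    "groups debated which solution was correct",
    "each student completed the worksheet alone",
    "teams built a shared model of the water cycle",
    "the class watched a video without discussion",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# (2) Split the labelled data into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# (3) Train a model to learn features predictive of the labels.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# (4) Calculate performance statistics on the held-out validation set.
print(classification_report(y_val, model.predict(X_val)))

# (5)-(6) Apply the model to unlabelled data and use its labels for inference.
unlabelled = ["students argued over the best design in teams"]
print(model.predict(unlabelled))
```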
Results: Construct Validity: Researchers need to consider construct validity when labelling data manually and when considering how the meaning of those labels changes once a machine learning model applies them. Key threats include: (1) researchers calculate reliability only between the machine learning model and a single human labeller, without considering the validity of the training and validation data; (2) model features are not (or cannot be) examined for construct validity; (3) model features are not examined in context (and are thus misinterpreted); (4) model performance is related to unknown characteristics, including potentially important linguistic subgroups; (5) participants learn how to game the model; (6) unsupervised machine learning results are interpreted by a single researcher (mono-interpreter bias); and (7) unsupervised machine learning results are the sole outcome or predictor included in a study (mono-operation bias).

Statistical Validity: Researchers need to consider statistical validity at two points: when measuring the performance of the machine learning model and when using the output of the model in correlational, quasi-experimental, and experimental analyses. Key threats include: (1) researchers do not calculate the most policy-relevant performance statistics; (2) researchers "peek" at the validation dataset; (3) researchers do not examine the sensitivity of results to hyperparameters; (4) researchers do not acknowledge uncertainty surrounding performance estimates (see the sketch at the end of this abstract); and (5) null conclusions regarding a causal estimand are drawn from a noisy measure labelled using a machine learning model.

Internal Validity: Researchers need to consider internal validity when inferring a causal relationship between the text features identified by a model and the outcome of interest. In all but a few cases, causal inferences are unlikely to be warranted. Most text data, whether collected for the purpose of understanding variations in units, treatments, outcomes, or settings (UTOS), are non-experimental in nature. The most prevalent threat in this case is selection bias; a predictive relationship between a text feature and an outcome may very well be due to other differences between groups.

External Validity: Beyond generalizing inferences regarding a causal estimand, researchers relying on machine learning also need to consider external validity when they generalize performance statistics from a validation set to other data. Threats to external validity thus arise whenever there is a difference between the validation dataset and the data to which the researcher wishes to apply the model. This occurs when: (1) performance statistics are calculated on a convenience sample rather than a representative sample; (2) performance statistics are calculated at a single point in time and generalized to new time points; (3) new performance statistics are not calculated when the model is applied to a new setting; and (4) there is dependence between the training and validation datasets.

Conclusions: Given the exciting and complicated nature of machine learning, researchers can too often focus on the details of the algorithm while overlooking the validity of the resulting inferences (Geiger et al., 2020; Hagen, 2018). Here, we unify machine learning validity concerns under the validity types framework in order to encourage researchers to systematically consider these threats and to improve research designs that protect against them.
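As one illustration of statistical validity threat (4) above, the sketch below attaches a confidence interval to a validation-set accuracy estimate using a nonparametric bootstrap over validation cases. The labels and predictions are hypothetical placeholders, and the bootstrap is one common approach rather than a method prescribed by the paper.

```python
# Hedged sketch: bootstrap uncertainty for a validation-set accuracy estimate.
# y_val and y_pred below are hypothetical placeholders, not real study data.
import numpy as np

rng = np.random.default_rng(0)
y_val = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])   # hypothetical true labels
y_pred = np.array([1, 0, 1, 0, 0, 0, 1, 1, 1, 0])  # hypothetical model labels

# Resample validation cases with replacement and recompute accuracy each time.
boot = []
for _ in range(10_000):
    idx = rng.integers(0, len(y_val), size=len(y_val))
    boot.append((y_val[idx] == y_pred[idx]).mean())

point = (y_val == y_pred).mean()
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"accuracy = {point:.2f}, 95% bootstrap CI = [{lo:.2f}, {hi:.2f}]")
```

On a small validation set such as this one, the interval is wide, which is exactly the uncertainty that threat (4) warns researchers not to ignore when reporting model performance.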
Society for Research on Educational Effectiveness. 2040 Sheridan Road, Evanston, IL 60208. Tel: 202-495-0920; e-mail: contact@sree.org; Web site: https://www.sree.org/
Publication Type: Reports - Research
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: Society for Research on Educational Effectiveness (SREE)
Grant or Contract Numbers: N/A