Advancing High School Dropout Predictions Using Machine Learning.

Anika Alam; A. Brooks Bowden

Background: The importance of high school completion for jobs and postsecondary opportunities is well-documented. Combined with federal laws that make the high school graduation rate a core performance indicator, this puts school systems and states under pressure to actively monitor and assess high school completion. This proposal employs machine learning techniques to identify students at risk of exiting high school in either 9th or 10th grade. We argue that, compared to students who may exit in later years of schooling, students who withdraw in the first two years are a vulnerable population and could benefit from earlier intervention, support, and services. This proposal advances the current state of knowledge in the field by (1) predicting early withdrawal of students who exit in 9th or 10th grade without eventual completion, and (2) assessing algorithmic bias for sensitive groups.

Research Questions: This project investigates the following research questions to predict early exit from high school:
1: How does the prediction accuracy of supervised learning algorithms for high school withdrawal compare to that of traditional models (i.e., logistic regression)?
2: What are the most salient predictors of exiting high school in 9th or 10th grade?
3: To what extent does each model provide fair predictions across sensitive attributes such as gender, race/ethnicity, disability status, financial hardship, and English proficiency?
Based on prior literature, we hypothesize that machine learning models will provide more predictive accuracy than a traditional logistic regression. While there are clear hypotheses about the importance of attendance, behavior, and coursework trajectories for high school completion, little is known about which of these aspects of middle school engagement are most predictive of exiting high school early.

Setting: This project relies on existing administrative data housed at the North Carolina Education Research Data Center (NCERDC).
We examine student educational records from 6th to 8th grade to predict the probability that a student will exit high school in 9th or 10th grade. The predictors include End-of-Grade test scores, attendance rates, chronic absenteeism, disciplinary infractions, school mobility, and urbanicity.

Population: We examine first-time sixth-grade public school students during the 2011-2012 school year. We limit the sample to students in districts that follow a legal dropout age of 16 and to those with complete graduation or exit records. We retain students who have some or all attendance and state test score data in the middle grades and impute missing values with that student's own middle school median. In this cohort, 94% of students persist in the school system beyond 10th grade, compared to 6% who exit in 9th or 10th grade.

Research Design: Following earlier empirical work, we compare the prediction accuracy of a traditional logistic regression model to that of more advanced machine learning algorithms: lasso regression, ridge regression, random forests, and extreme gradient boosting (XGBoost) (Mduma et al., 2019; Hung, 2017; Sansone, 2019; Coleman, 2021; Sorenson, 2018). We create binary classification models with an outcome of "1" for students who exited in 9th or 10th grade, and "0" otherwise. To evaluate model performance, we examine the area under the ROC curve (AUC), accuracy rate, sensitivity (true positive rate), specificity (true negative rate), and F1 score. The results focus on sensitivity, that is, the share of students who exit early that a model correctly identifies. We follow standard machine learning practice and validate models out of sample, using sixth-grade students in Fall 2010 as a training sample and sixth-grade students in Fall 2011 as a testing sample. We fit each model to maximize accuracy, with parameters and hyperparameters that are standard for each algorithm. We rely on receiver operating characteristic (ROC) curve analyses and apply a decision threshold that achieves a true positive rate of 80%.
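The thresholding step can be illustrated with a minimal sketch. It assumes each model outputs a score per student where higher means more likely to exit early, and that a student is flagged when the score is at or above the cutoff. The function names `threshold_for_tpr` and `classification_metrics` and the toy data are hypothetical, chosen for illustration only; this is not the authors' code.

```python
import math


def threshold_for_tpr(y_true, scores, target_tpr=0.80):
    """Return the highest cutoff whose true positive rate is >= target_tpr.

    Assumption: a student is flagged when score >= cutoff.
    """
    pos_scores = sorted((s for y, s in zip(y_true, scores) if y == 1), reverse=True)
    if not pos_scores:
        raise ValueError("need at least one positive (early exit) example")
    # Smallest number of positives that must be flagged to reach the target rate.
    k = max(1, math.ceil(target_tpr * len(pos_scores)))
    return pos_scores[k - 1]


def classification_metrics(y_true, scores, threshold):
    """Compute sensitivity, specificity, accuracy, and F1 at a given cutoff."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        flagged = s >= threshold
        if y == 1 and flagged:
            tp += 1
        elif y == 1:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0,
    }


# Toy data (hypothetical): 1 = exited in 9th or 10th grade, 0 = otherwise.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.65, 0.5, 0.4, 0.3, 0.1]

cutoff = threshold_for_tpr(y_true, scores, target_tpr=0.80)
metrics = classification_metrics(y_true, scores, cutoff)
```

Fixing sensitivity in this way means the models are compared on what they give up elsewhere: at the same true positive rate, a model with higher specificity flags fewer students who would have persisted.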
To prevent the models from being biased towards the majority class, we address class imbalance by oversampling the minority class. Specifically, we apply the Synthetic Minority Oversampling Technique (SMOTE) to generate synthetic minority class observations based on the k nearest neighbors within the minority class (Chawla et al., 2002; Fernandes et al., 2018; Anis & Ali, 2017). To examine whether the algorithms discriminate against certain student groups, we "blind" the algorithms to gender and race/ethnicity. We conduct an Absolute Between-ROC Area (ABROCA) slicing analysis to evaluate algorithmic fairness between the majority and minority group for each student grouping. We focus on the following protected attributes: gender, race/ethnicity, English learner status, disability status, and economic disadvantage.

Findings: We find that when models are trained with highly imbalanced data, both ensemble methods (XGBoost and random forest) provide the highest sensitivity rates. However, we see substantial improvements in the sensitivity rates of all models that were trained with synthetic data (SMOTE). The improved accuracy in identifying the minority class comes at the cost of a lower overall accuracy rate and lower specificity. Although other models have higher sensitivity, we argue that the strongest model is XGBoost trained with SMOTE observations because it provides higher specificity. In examining model features, we find that age in 8th grade and chronic absenteeism in middle school are most predictive of early exit, followed by 8th grade and 7th grade absences. We detect no signs of algorithmic bias for students stratified by gender, disability status, and economic disadvantage. Conversely, we detect bias in regression-based models for students stratified by English learner status and race/ethnicity.

Conclusion: Please see the Abstract PDF for the full updated abstract.
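The ABROCA slicing analysis named under Research Design can be sketched as follows. ABROCA is the area between the ROC curves of two groups, that is, the integral over false positive rates from 0 to 1 of the absolute difference in true positive rates. The sketch approximates that integral on an evenly spaced grid with the trapezoid rule; the function names, the grid size, and the toy data are assumptions made for illustration, and the numerical details may differ from the implementation used in the study.

```python
def roc_points(y_true, scores):
    """Return ROC points (fpr, tpr) from (0, 0) to (1, 1), grouping tied scores."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("each group needs both outcome classes")
    pairs = sorted(zip(scores, y_true), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        current = pairs[i][0]
        # Consume every observation sharing this score before adding a point.
        while i < len(pairs) and pairs[i][0] == current:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def tpr_at(points, x):
    """Linearly interpolate the true positive rate at false positive rate x."""
    # On a vertical segment, use its top point (the last point with fpr <= x).
    lo = max(i for i, (f, _) in enumerate(points) if f <= x)
    if lo == len(points) - 1:
        return points[lo][1]
    (f0, t0), (f1, t1) = points[lo], points[lo + 1]
    return t0 + (t1 - t0) * (x - f0) / (f1 - f0)


def abroca(y_true, scores, group, baseline, comparison, grid_size=1000):
    """Approximate the area between the ROC curves of two groups."""
    def curve(label):
        ys = [y for y, g in zip(y_true, group) if g == label]
        ss = [s for s, g in zip(scores, group) if g == label]
        return roc_points(ys, ss)

    base, comp = curve(baseline), curve(comparison)
    xs = [i / grid_size for i in range(grid_size + 1)]
    gaps = [abs(tpr_at(base, x) - tpr_at(comp, x)) for x in xs]
    step = 1 / grid_size
    return sum((gaps[i] + gaps[i + 1]) / 2 * step for i in range(grid_size))


# Toy data (hypothetical): the model ranks students identically in groups A and B.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6, 0.9, 0.2, 0.4, 0.6]
group = ["A", "A", "A", "A", "B", "B", "B", "B"]
same = abroca(y_true, scores, group, baseline="A", comparison="B")

# Toy data (hypothetical): perfect ranking in group A, fully reversed in group B.
gap = abroca([1, 0, 0, 1], [0.9, 0.1, 0.9, 0.1], ["A", "A", "B", "B"], "A", "B")
```

A value near 0 indicates that the model ranks students about equally well in both groups, and larger values indicate a wider gap in performance. Because the difference is taken in absolute value, the statistic does not by itself say which group is served worse.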