ERIC Number: ED661273
Record Type: Non-Journal
Publication Date: 2024
Pages: 137
Abstractor: As Provided
ISBN: 979-8-3840-3198-7
ISSN: N/A
EISSN: N/A
Using Data Preprocessing Techniques and Machine Learning Algorithms to Explore Predictors of Word Difficulty in English Language Assessment
Mingying Zheng
ProQuest LLC, Ph.D. Dissertation, The University of Iowa
The digital transformation in educational assessment has led to the proliferation of large-scale data, offering unprecedented opportunities to enhance language learning and testing through machine learning (ML) techniques. Drawing on the extensive data generated by online English language assessments, this dissertation investigates the efficacy of data preprocessing techniques and their impact on the performance of ten machine learning classifiers. Two preprocessing sequences were examined, with the aim of improving data quality for the application of supervised machine learning algorithms in English language assessments: Form A (data cleaning, data transformation, then data reduction) and Form B (data cleaning, data reduction, then data transformation). The study evaluated the accuracy, precision, recall, F1-score, and AUC of ten machine learning classifiers on their ability to predict word difficulty in a dataset from large-scale English language assessments involving 3,918 test takers and 6,599 words characterized by 38 different lexical and form-related features. Particular attention was given to eXtreme Gradient Boosting (XGB), Decision Tree, and Random Forest, and to their capacity to generalize to new, unseen structured data. The results indicate that both data preprocessing sequences enhance supervised machine learning classifier performance comparably, suggesting that the choice between the two preprocessing sequences may depend on other factors such as computational resources and desired interpretability. Among all ten machine learning classifiers, the XGB classifier consistently outperformed the others, indicating its robustness and suitability for processing large-scale educational data.
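The difference between the two preprocessing orders can be illustrated with a minimal sketch. The abstract does not specify the concrete steps used in the dissertation, so every function, threshold, and data value below is a hypothetical stand-in: cleaning drops rows with missing values, transformation is min-max scaling, and reduction is a simple variance filter. The sketch only shows that the two orders need not yield the same feature set, even where downstream classifier performance is comparable as reported.

```python
# Hypothetical illustration only: the abstract does not state which cleaning,
# transformation, or reduction methods the dissertation actually used.
from statistics import pvariance


def clean(rows):
    """Data cleaning (assumed): drop rows that contain a missing value."""
    return [r for r in rows if all(v is not None for v in r)]


def transform(rows):
    """Data transformation (assumed): min-max scale each column to [0, 1]."""
    scaled_cols = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = hi - lo
        scaled_cols.append([(v - lo) / span if span else 0.0 for v in col])
    return [list(r) for r in zip(*scaled_cols)]


def reduce_features(rows, threshold=0.05):
    """Data reduction (assumed): keep columns whose variance exceeds threshold."""
    kept = [col for col in zip(*rows) if pvariance(col) > threshold]
    return [list(r) for r in zip(*kept)]


def form_a(rows):
    """Form A: cleaning, then transformation, then reduction."""
    return reduce_features(transform(clean(rows)))


def form_b(rows):
    """Form B: cleaning, then reduction, then transformation."""
    return transform(reduce_features(clean(rows)))


# Toy data: three made-up features per word; one row has a missing value.
rows = [
    [1.0, 100.0, 0.50],
    [2.0, 300.0, 0.51],
    [None, 200.0, 0.52],
    [3.0, 500.0, 0.49],
]

features_a = form_a(rows)
features_b = form_b(rows)
print(len(features_a[0]), len(features_b[0]))
```

In this toy case the third feature has a very small raw variance. Form B filters it out before scaling, while Form A scales it first and therefore keeps it, so the variance threshold means something different in each order.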
A significant contribution of this research study lies in identifying key lexical features--such as word frequency, average lexical decision accuracy of all participants for a given word, standardized lexical decision accuracy reaction time across all participants for a given word, reported age of acquisition score, neighbors determined using phonological Levenshtein distance, raw corpus frequency, and dispersion for a given word--that are predictive of word difficulty. These findings are critical for English as a second language (ESL) educational contexts, where they can inform the development of more effective teaching materials and assessments. This study not only advances the field of educational data analytics by exploring the intersection of data preprocessing and machine learning but also lays the groundwork for future research to further refine these approaches in the context of language assessment. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone: 800-521-0600. Web page: http://bibliotheek.ehb.be:2222/en-US/products/dissertations/individuals.shtml.]
Descriptors: Artificial Intelligence, Computational Linguistics, Language Tests, English (Second Language), Second Language Learning, Accuracy, Prediction, Difficulty Level, Computer Assisted Testing, Algorithms, Vocabulary, Recall (Psychology), Scores, Computer Software, Generalization, Second Language Instruction, Language Processing, Learning Analytics, Reaction Time, Test Items
ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site: http://bibliotheek.ehb.be:2222/en-US/products/dissertations/individuals.shtml
Publication Type: Dissertations/Theses - Doctoral Dissertations
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A
Grant or Contract Numbers: N/A