Accurate Measurement of Lexical Sophistication with Reference to ESL Learner Data.

Naismith, Ben; Han, Na-Rae; Juffs, Alan; Hill, Brianna; Zheng, Daniel

Peer reviewed

ERIC Number: ED593225

Record Type: Non-Journal

Publication Date: 2018-Jul

Pages: 7

Abstractor: As Provided

ISBN: N/A

ISSN: N/A

EISSN: N/A

Available Date: N/A

International Educational Data Mining Society, Paper presented at the International Conference on Educational Data Mining (EDM) (11th, Raleigh, NC, Jul 16-20, 2018)

One commonly used measure of lexical sophistication is the Advanced Guiraud (AG; [9]), whose formula requires frequency band counts (e.g., COCA; [13]). However, the accuracy of this measure is affected by the particular 2000-word frequency list selected as the basis for its calculations [27]. For example, issues can arise when frequency lists based solely on native speaker corpora are used as a target for second language (L2) learners (e.g., [8]), because the exposure frequencies for L2 learners may differ from those of native speakers. Such differences may be due to first language (L1) culture, home country teaching materials, or the text types that L2 learners commonly encounter. This paper addresses the problem by validating an English as a Second Language (ESL) frequency list. The validation draws on two sources: (1) the New General Service List (NGSL; [4]), which is based on the Cambridge English Corpus (CEC), and (2) written data from the 4.2 million-word Pitt English Language Institute Corpus (PELIC). Using open-source data science tools and natural language processing technologies, the paper demonstrates that differences in lexical sophistication across proficiency levels are more clearly measurable when learner-oriented frequency lists, rather than general corpus frequency lists, are used as part of a lexical measure such as AG. The results will be useful in teaching contexts where lexical proficiency is measured or assessed, and for materials and test developers who rely on such lists as being representative of known vocabulary at different levels of proficiency. This research applies data-driven exploration of learner corpora to vocabulary acquisition and pedagogy, thus closing a loop between educational data mining and classroom applications. [For the full proceedings, see ED593090.]
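As an illustration of why the choice of frequency list matters, AG is commonly defined as the number of advanced word types divided by the square root of the total number of tokens, where "advanced" means not on the chosen list of roughly 2000 most frequent words. The following is a minimal sketch under that definition, not the authors' implementation: the function name `advanced_guiraud`, the regex tokenizer, and the tiny `basic_words` sets are assumptions made for the example, and a real analysis would load a full list such as the NGSL and would likely lemmatize the text first.

```python
import math
import re


def advanced_guiraud(text, basic_words):
    """Advanced Guiraud: advanced types / sqrt(total tokens).

    `basic_words` is a set standing in for a ~2000-word frequency list
    (hypothetical here; the abstract mentions lists such as the NGSL).
    """
    # Naive tokenizer (an assumption): lowercase runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    # Advanced types are the distinct words NOT on the frequency list.
    advanced_types = {t for t in set(tokens) if t not in basic_words}
    return len(advanced_types) / math.sqrt(len(tokens))


sample = "the cat sat on the mat"

# Two toy frequency lists: the same text gets a different AG score
# depending on which words the list treats as "basic".
list_a = {"the", "on", "sat"}
list_b = {"the", "on", "sat", "cat"}

ag_a = advanced_guiraud(sample, list_a)  # 2 advanced types over sqrt(6) tokens
ag_b = advanced_guiraud(sample, list_b)  # 1 advanced type over sqrt(6) tokens
print(ag_a, ag_b)
```

Because the denominator depends only on text length, any change in the score between the two calls comes entirely from which words the list counts as basic, which is the sensitivity the abstract describes.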

Descriptors: English (Second Language), Second Language Learning, Computational Linguistics, Native Speakers, Accuracy, Word Frequency, Word Lists, Measurement Techniques, Language Skills, Second Language Instruction, Language Variation, Native Language, Cultural Background, Validity, Language Proficiency, Comparative Analysis

International Educational Data Mining Society. e-mail: admin@educationaldatamining.org; Web site: http://www.educationaldatamining.org

Publication Type: Speeches/Meeting Papers; Reports - Research

Education Level: N/A

Audience: N/A

Language: English

Sponsor: National Science Foundation (NSF)

Authoring Institution: N/A

Grant or Contract Numbers: SBE0836012

Author Affiliations: N/A