Impact of Corpus Size and Dimensionality of LSA Spaces from Wikipedia Articles on AutoTutor Answer Evaluation.

Cai, Zhiqiang; Graesser, Arthur C.; Windsor, Leah C.; Cheng, Qinyu; Shaffer, David W.; Hu, Xiangen

Peer reviewed

ERIC Number: ED593098

Record Type: Non-Journal

Publication Date: 2018-Jul

Pages: 10

Abstractor: As Provided

ISBN: N/A

ISSN: N/A

EISSN: N/A

Available Date: N/A

International Educational Data Mining Society, Paper presented at the International Conference on Educational Data Mining (EDM) (11th, Raleigh, NC, Jul 16-20, 2018)

Latent Semantic Analysis (LSA) plays an important role in analyzing text data from educational settings. LSA represents the meaning of words and sets of words as vectors in a k-dimensional space generated from a selected corpus. While the impact of the value of k has been investigated by many researchers, the impact of document selection and corpus size has never been systematically investigated. This paper addresses that gap by measuring the performance of LSA in evaluating learners' answers to AutoTutor, a conversational intelligent tutoring system. We report the impact of document source (Wikipedia vs. TASA), selection algorithm (keyword-based vs. random), corpus size (from 2,000 to 30,000 documents), and number of dimensions (from 2 to 1000). Two AutoTutor tasks are used to evaluate the performance of the different LSA spaces: a phrase-level answer assessment (responses to focal prompt questions) and a sentence-level answer assessment (responses to hints). We show that a sufficiently large (e.g., 20,000 to 30,000 documents), randomly selected Wikipedia corpus with high enough dimensionality (about 300) can provide a reasonably good space. A specifically selected domain corpus can perform significantly better with a smaller corpus (about 8,000 documents) and much lower dimensionality (around 17). The widely used TASA corpus (37,651 documents, scientifically sampled) performs as well as a large, randomly selected Wikipedia corpus (20,000 to 30,000 documents) with sufficiently high dimensionality (e.g., k >= 300). [For the full proceedings, see ED593090.]
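
The general technique the abstract describes (building a k-dimensional LSA space from a corpus, then scoring a learner's answer against an expected answer) can be illustrated with a minimal Python sketch. This is not the authors' pipeline, which this record does not describe. The toy corpus, the whitespace tokenizer, the plain log weighting (LSA spaces commonly use log-entropy weighting instead), and the function names build_lsa_space, text_vector, and cosine are all assumptions made for illustration.

```python
import numpy as np

def build_lsa_space(docs, k):
    """Build a toy LSA space: returns (word -> row index, word vectors)."""
    # Assumption: naive lowercase whitespace tokenization.
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    # Term-by-document count matrix.
    counts = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            counts[index[w], j] += 1
    # Assumption: simple log(1 + count) weighting for brevity.
    weighted = np.log1p(counts)
    # Singular value decomposition; keep only the top k dimensions.
    U, s, _ = np.linalg.svd(weighted, full_matrices=False)
    k = min(k, len(s))
    word_vectors = U[:, :k] * s[:k]
    return index, word_vectors

def text_vector(text, index, word_vectors):
    """Represent a set of words as the sum of its known word vectors."""
    vec = np.zeros(word_vectors.shape[1])
    for w in text.lower().split():
        if w in index:
            vec += word_vectors[index[w]]
    return vec

def cosine(a, b):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical toy corpus; the paper uses thousands of Wikipedia or TASA documents.
docs = [
    "force equals mass times acceleration",
    "net force changes the velocity of an object",
    "acceleration is the rate of change of velocity",
    "the recipe needs flour sugar and butter",
    "bake the cake with flour and sugar",
]
index, vectors = build_lsa_space(docs, k=2)

expected = text_vector("net force changes velocity", index, vectors)
answer = text_vector("force changes the velocity", index, vectors)
print("similarity to expected answer:", cosine(expected, answer))
```

In this sketch, k and the choice of documents are the two knobs the paper varies: a different corpus changes the term-by-document matrix, and a different k changes how many singular dimensions the word vectors keep.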

Descriptors: Semantics, Discourse Analysis, Computational Linguistics, Intelligent Tutoring Systems, Collaborative Writing, Web Sites, Information Retrieval, Comparative Analysis, Phrase Structure, Cues, Role, Educational Environment, Physics, Science Instruction

International Educational Data Mining Society. e-mail: admin@educationaldatamining.org; Web site: http://www.educationaldatamining.org

Publication Type: Speeches/Meeting Papers; Reports - Research

Education Level: N/A

Audience: N/A

Language: English

Sponsor: N/A

Authoring Institution: N/A

Grant or Contract Numbers: N/A

Author Affiliations: N/A