ERIC Number: ED618418
Record Type: Non-Journal
Publication Date: 2021
Pages: 14
Abstractor: As Provided
ISBN: N/A
ISSN: N/A
EISSN: N/A
Using Topic Modeling for Code Discovery in Large Scale Text Data
Cai, Zhiqiang; Siebert-Evenstone, Amanda; Eagan, Brendan; Shaffer, David Williamson
Grantee Submission, Paper presented at the International Conference on Quantitative Ethnography (ICQE) (2021)
When text datasets are very large, manually coding line by line becomes impractical. As a result, researchers sometimes try to use machine learning algorithms to automatically code text data. One of the most popular algorithms is topic modeling. For a given text dataset, a topic model provides probability distributions of words for a set of "topics" in the data, which researchers then use to interpret the meaning of the topics. A topic model also gives each document in the dataset a score for each topic, which can be used as a non-binary code for the proportion of a topic present in the document. Unfortunately, it is often difficult to interpret what the topics mean in a defensible way, or to validate document topic proportion scores as meaningful codes. In this study, we examine how keywords from codes developed by human experts were distributed across topics generated by topic modeling. The results show that (1) the top keywords of a single topic often contain words from multiple human-generated codes; and conversely, (2) words from a single human-generated code appear as high-probability keywords in multiple topics. These results explain why directly using topics from topic models as codes is problematic. However, they also imply that topic modeling makes it possible for researchers to discover codes from short word lists. [This paper was published in: "ICQE 2021," edited by A. R. Ruis and S. B. Lee, Springer Nature Switzerland AG, 2021, pp. 18-31.]
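To make the two outputs named in the abstract concrete, the following is a minimal sketch (not the authors' implementation) using scikit-learn's latent Dirichlet allocation: it prints per-topic word distributions ("top keywords") and per-document topic proportions (the non-binary scores). The toy corpus, the choice of two topics, and all variable names are illustrative assumptions.

    # Minimal sketch of topic-model outputs; toy corpus and n_components=2
    # are assumptions for illustration, not the study's data or settings.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "students collaborate to solve the design problem",
        "the model predicts word probabilities per topic",
        "engineers iterate on the design and test the solution",
        "each document receives a topic proportion score",
    ]

    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topic = lda.fit_transform(X)   # per-document topic proportions

    # Topic-word distributions: normalize pseudo-counts to probabilities.
    topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    vocab = vectorizer.get_feature_names_out()

    for t, dist in enumerate(topic_word):
        top = vocab[np.argsort(dist)[::-1][:5]]   # top keywords for topic t
        print(f"topic {t}: {', '.join(top)}")

    print(doc_topic.round(2))   # rows sum to 1: non-binary "codes" per doc

Inspecting the printed keyword lists against human-generated code words is, in spirit, the comparison the study performs: words from one code may surface in several topics, and one topic's keywords may mix several codes.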
Publication Type: Speeches/Meeting Papers; Reports - Research
Education Level: N/A
Audience: N/A
Language: English
Sponsor: National Science Foundation (NSF)
Authoring Institution: N/A
Grant or Contract Numbers: DRL1661036; 1713110; LDI1934745