Context Matters: Recovering Human Semantic Structure from Machine Learning Analysis of Large-Scale Text Corpora.

Iordan, Marius Catalin; Giallanza, Tyler; Ellis, Cameron T.; Beckage, Nicole M.; Cohen, Jonathan D.

Peer reviewed

ERIC Number: EJ1330085

Record Type: Journal

Publication Date: 2022-Feb

Pages: 32

Abstractor: As Provided

ISBN: N/A

ISSN: EISSN-1551-6709

EISSN: N/A

Cognitive Science, v46 n2 e13085 Feb 2022

Applying machine learning algorithms to automatically infer relationships between concepts from large-scale collections of documents presents a unique opportunity to investigate at scale how human semantic knowledge is organized, how people use it to make fundamental judgments ("How similar are cats and bears?"), and how these judgments depend on the features that describe concepts (e.g., size, furriness). However, efforts to date have exhibited a substantial discrepancy between algorithm predictions and human empirical judgments. Here, we introduce a novel approach to generating embeddings for this purpose motivated by the idea that semantic context plays a critical role in human judgment. We leverage this idea by constraining the topic or domain from which documents used for generating embeddings are drawn (e.g., referring to the natural world vs. transportation apparatus). Specifically, we trained state-of-the-art machine learning algorithms using contextually-constrained text corpora (domain-specific subsets of Wikipedia articles, 50+ million words each) and showed that this procedure greatly improved predictions of empirical similarity judgments and feature ratings of contextually relevant concepts. Furthermore, we describe a novel, computationally tractable method for improving predictions of contextually-unconstrained embedding models based on dimensionality reduction of their internal representation to a small number of contextually relevant semantic features. By improving the correspondence between predictions derived automatically by machine learning methods using vast amounts of data and more limited, but direct empirical measurements of human judgments, our approach may help leverage the availability of online corpora to better understand the structure of human semantic representations and how people make judgments based on those.
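The comparison the abstract describes, between embedding-derived similarity and human similarity judgments, can be illustrated with a minimal sketch. It is not the authors' pipeline: the three-dimensional vectors, the human ratings, and the names `toy_embeddings` and `toy_human_ratings` are made-up placeholders, and cosine similarity with Pearson correlation is assumed here as one common way to score agreement.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical toy embeddings (assumption: real models use hundreds of
# dimensions learned from a text corpus, not hand-written values).
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "bear": [0.7, 0.9, 0.3],
    "car": [0.1, 0.0, 0.9],
}

# Hypothetical human similarity ratings on a 0-1 scale (illustrative only).
toy_human_ratings = {
    ("cat", "bear"): 0.80,
    ("cat", "car"): 0.10,
    ("bear", "car"): 0.15,
}

pairs = list(toy_human_ratings)
model_scores = [
    cosine_similarity(toy_embeddings[a], toy_embeddings[b]) for a, b in pairs
]
human_scores = [toy_human_ratings[pair] for pair in pairs]

# Higher correlation means the embedding space orders concept pairs
# more like the human raters do.
agreement = pearson(model_scores, human_scores)
print(f"model-human agreement (Pearson r): {agreement:.3f}")
```

Under this framing, a contextually constrained model would be evaluated by training embeddings on a domain-specific corpus and checking whether the agreement score rises for concepts from that domain.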

Descriptors: Artificial Intelligence, Mathematics, Learning Analytics, Semantics, Context Effect, Computational Linguistics, Encyclopedias, Electronic Publishing

Wiley. Available from: John Wiley & Sons, Inc. 111 River Street, Hoboken, NJ 07030. Tel: 800-835-6770; e-mail: cs-journals@wiley.com; Web site: https://bibliotheek.ehb.be:2191/en-us

Publication Type: Journal Articles; Reports - Research

Education Level: N/A

Audience: N/A

Language: English

Sponsor: National Science Foundation (NSF)

Authoring Institution: N/A

Grant or Contract Numbers: 1757554

Data File: URL: https://osf.io/v8qge/?view_only=c745eddd500a40fb8ef8285d701c20e2