Publication Date
  In 2025: 0
  Since 2024: 2
  Since 2021 (last 5 years): 8
  Since 2016 (last 10 years): 31
  Since 2006 (last 20 years): 80
Descriptor
  Comparative Analysis: 117
  Models: 117
  Reliability: 63
  Test Reliability: 42
  Foreign Countries: 26
  Scores: 20
  Validity: 20
  Correlation: 18
  Evaluation Methods: 17
  Statistical Analysis: 17
  Interrater Reliability: 16
Location
  Australia: 4
  United Kingdom (England): 4
  China: 3
  Connecticut: 3
  New York: 3
  Philippines: 3
  Turkey: 3
  Canada: 2
  Egypt: 2
  Germany: 2
  Malaysia: 2
Laws, Policies, & Programs
  Every Student Succeeds Act…: 2
  No Child Left Behind Act 2001: 1
Dadi Ramesh; Suresh Kumar Sanampudi – European Journal of Education, 2024
Automatic essay scoring (AES) is an essential educational application of natural language processing. Automating the process alleviates the grading burden while increasing the reliability and consistency of assessment. With advances in text-embedding libraries and neural network models, AES systems have achieved good results in terms of accuracy.…
Descriptors: Scoring, Essays, Writing Evaluation, Memory
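As a minimal sketch of the general embedding-based AES approach the abstract above describes (not the authors' system), the hypothetical Python example below scores essays from text features: TF-IDF vectors stand in for learned embeddings, ridge regression stands in for a neural scorer, and quadratic weighted kappa, a common AES agreement statistic, checks agreement with human scores. All data and modeling choices here are illustrative assumptions.

    # Minimal AES sketch: TF-IDF features stand in for learned embeddings,
    # ridge regression stands in for a neural scoring model. Toy data only.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import Ridge
    from sklearn.metrics import cohen_kappa_score

    essays = [
        "The experiment shows a clear causal link between the variables.",
        "i think it is good because it is good",
        "A well-structured argument supported by two cited sources.",
        "stuff happens and then more stuff happens",
    ]
    human_scores = np.array([5, 2, 5, 1])  # hypothetical rubric scores (1-6)

    # Vectorize essays and fit a simple scoring model.
    X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(essays)
    model = Ridge(alpha=1.0).fit(X, human_scores)

    # Round predictions back onto the rubric scale; a real evaluation would
    # use held-out essays rather than scoring in-sample as done here.
    pred = np.clip(np.round(model.predict(X)), 1, 6).astype(int)
    print(cohen_kappa_score(human_scores, pred, weights="quadratic"))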
The AI Teacher Test: Measuring the Pedagogical Ability of Blender and GPT-3 in Educational Dialogues
Tack, Anaïs; Piech, Chris – International Educational Data Mining Society, 2022
How can we test whether state-of-the-art generative models, such as Blender and GPT-3, are good AI teachers, capable of replying to a student in an educational dialogue? Designing an AI teacher test is challenging: although evaluation methods are much-needed, there is no off-the-shelf solution to measuring pedagogical ability. This paper reports…
Descriptors: Artificial Intelligence, Dialogs (Language), Bayesian Statistics, Decision Making
Jönsson, Anders; Balan, Andreia – Practical Assessment, Research & Evaluation, 2018
Research on teachers' grading has shown that there is great variability among teachers regarding both the process and product of grading, resulting in low comparability and issues of inequality when using grades for selection purposes. Despite this situation, not much is known about the merits or disadvantages of different models for grading. In…
Descriptors: Grading, Models, Reliability, Validity
Yun Long; Haifeng Luo; Yu Zhang – npj Science of Learning, 2024
This study explores the use of Large Language Models (LLMs), specifically GPT-4, in analysing classroom dialogue, a key task for teaching diagnosis and quality improvement. Traditional qualitative methods are both knowledge- and labour-intensive. This research investigates the potential of LLMs to streamline and enhance this process. Using…
Descriptors: Classroom Communication, Computational Linguistics, Chinese, Mathematics Instruction
Shin, Jinnie; Gierl, Mark J. – Language Testing, 2021
Automated essay scoring (AES) has emerged as a secondary or sole marker for many high-stakes educational assessments, in both native and non-native testing, owing to remarkable advances in feature engineering using natural language processing, machine learning, and deep neural algorithms. The purpose of this study is to compare the effectiveness…
Descriptors: Scoring, Essays, Writing Evaluation, Computer Software
Verhavert, San; Bouwer, Renske; Donche, Vincent; De Maeyer, Sven – Assessment in Education: Principles, Policy & Practice, 2019
Comparative Judgement (CJ) aims to improve the quality of performance-based assessments by letting multiple assessors judge pairs of performances. CJ is generally associated with high levels of reliability, but there is also a large variation in reliability between assessments. This study investigates which assessment characteristics influence the…
Descriptors: Meta Analysis, Reliability, Comparative Analysis, Value Judgment
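Comparative Judgement pipelines commonly fit a Bradley-Terry-type model to the pairwise judgements to place performances on a single quality scale. The sketch below, with hypothetical judgement data, estimates Bradley-Terry strengths via the classic minorization-maximization (MM) update; it is a generic illustration, not the tooling of the assessments meta-analysed above.

    # Minimal Bradley-Terry fit for comparative judgement (CJ) data.
    # Each judgement is (winner_index, loser_index); data is hypothetical.
    import numpy as np

    judgements = [(0, 1), (0, 2), (1, 2), (0, 1), (2, 1), (0, 2)]
    n = 3  # number of performances

    wins = np.zeros(n)
    pair_counts = np.zeros((n, n))
    for w, l in judgements:
        wins[w] += 1
        pair_counts[w, l] += 1
        pair_counts[l, w] += 1

    strengths = np.ones(n)
    for _ in range(200):  # MM iterations: p_i = W_i / sum_j n_ij/(p_i+p_j)
        denom = np.zeros(n)
        for i in range(n):
            for j in range(n):
                if i != j and pair_counts[i, j] > 0:
                    denom[i] += pair_counts[i, j] / (strengths[i] + strengths[j])
        strengths = wins / denom
        strengths /= strengths.sum()  # fix the scale

    print(strengths)  # estimated quality of each performance

In practice, CJ reliability is then typically summarized by a scale-separation statistic computed from these strength estimates and their standard errors.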
Pedder, Hugo; Boucher, Martin; Dias, Sofia; Bennetts, Margherita; Welton, Nicky J. – Research Synthesis Methods, 2020
Time-course model-based network meta-analysis (MBNMA) has been proposed as a framework for combining treatment comparisons from a network of randomized controlled trials reporting outcomes at multiple time points. This can explain heterogeneity/inconsistency that arises from pooling studies with different follow-up times and allow inclusion of studies…
Descriptors: Simulation, Randomized Controlled Trials, Meta Analysis, Comparative Analysis
Madison, Matthew J. – Educational Measurement: Issues and Practice, 2019
Recent advances have enabled diagnostic classification models (DCMs) to accommodate longitudinal data. These longitudinal DCMs were developed to study how examinees change, or transition, between different attribute mastery statuses over time. This study examines using longitudinal DCMs as an approach to assessing growth and serves three purposes:…
Descriptors: Longitudinal Studies, Item Response Theory, Psychometrics, Criterion Referenced Tests
Bosch, Nigel; Paquette, Luc – Journal of Learning Analytics, 2018
Metrics including Cohen's kappa, precision, recall, and F1 are common measures of performance for models of discrete student states, such as a student's affect or behaviour. This study examined discrete-model metrics for previously published student model examples to identify situations where metrics provided differing perspectives on…
Descriptors: Models, Comparative Analysis, Prediction, Probability
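The divergence the abstract above points to is easy to reproduce. In the hypothetical, imbalanced example below (standard sklearn metrics, made-up labels for a rare "off-task" state), recall looks perfect and F1 respectable, while chance-corrected kappa is notably lower.

    # Illustration of how discrete-model metrics can disagree (toy labels).
    from sklearn.metrics import (cohen_kappa_score, precision_score,
                                 recall_score, f1_score)

    y_true = [0] * 90 + [1] * 10  # 1 = rare "off-task" state (hypothetical)
    y_pred = [0] * 80 + [1] * 20  # catches all 1s but adds 10 false alarms

    print("precision:", precision_score(y_true, y_pred))   # 0.50
    print("recall:   ", recall_score(y_true, y_pred))      # 1.00
    print("F1:       ", f1_score(y_true, y_pred))          # ~0.67
    print("kappa:    ", cohen_kappa_score(y_true, y_pred)) # ~0.62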
Moeller, Julia; Viljaranta, Jaana; Kracke, Bärbel; Dietrich, Julia – Frontline Learning Research, 2020
This article proposes a study design developed to disentangle the objective characteristics of a learning situation from individuals' subjective perceptions of that situation. The term "objective characteristics" refers to agreement across students, whereas "subjective perceptions" refers to inter-individual heterogeneity. We describe a novel…
Descriptors: Student Attitudes, College Students, Lecture Method, Student Interests
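The distinction in the abstract above can be sketched as a simple decomposition, with hypothetical ratings: treat the situation mean as the "objective" characteristic students agree on, and each student's deviation from it as the "subjective" perception. This is a generic mean/deviation split for illustration, not the authors' exact statistical model.

    # Separating "objective" situation characteristics from "subjective"
    # individual perceptions (hypothetical ratings).
    import numpy as np

    # rows = students, columns = learning situations (e.g., lectures)
    ratings = np.array([
        [4.0, 2.0, 5.0],
        [5.0, 1.0, 4.0],
        [3.0, 2.0, 5.0],
    ])

    # "Objective" component: the consensus, i.e. the situation mean.
    situation_means = ratings.mean(axis=0)

    # "Subjective" component: each student's deviation from that consensus.
    subjective_deviations = ratings - situation_means

    print("situation means:", situation_means)
    # Variance of the deviations per situation indexes inter-individual
    # heterogeneity in perceptions of the same situation.
    print("heterogeneity:", subjective_deviations.var(axis=0, ddof=1))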
van Valkenhoef, Gert; Dias, Sofia; Ades, A. E.; Welton, Nicky J. – Research Synthesis Methods, 2016
Network meta-analysis enables the simultaneous synthesis of a network of clinical trials comparing any number of treatments. Potential inconsistencies between estimates of relative treatment effects are an important concern, and several methods to detect inconsistency have been proposed. This paper is concerned with the node-splitting approach,…
Descriptors: Networks, Meta Analysis, Automation, Models
Botarleanu, Robert-Mihai; Dascalu, Mihai; Watanabe, Micah; Crossley, Scott Andrew; McNamara, Danielle S. – Grantee Submission, 2022
Age of acquisition (AoA) is a measure of word complexity which refers to the age at which a word is typically learned. AoA measures have shown strong correlations with reading comprehension, lexical decision times, and writing quality. AoA scores based on both adult and child data have limitations that allow for error in measurement, and increase…
Descriptors: Age Differences, Vocabulary Development, Correlation, Reading Comprehension
Storme, Martin; Myszkowski, Nils; Baron, Simon; Bernard, David – Journal of Intelligence, 2019
Assessing job applicants' general mental ability online poses psychometric challenges due to the necessity of having brief but accurate tests. Recent research (Myszkowski & Storme, 2018) suggests that recovering distractor information through Nested Logit Models (NLM; Suh & Bolt, 2010) increases the reliability of ability estimates in…
Descriptors: Intelligence Tests, Item Response Theory, Comparative Analysis, Test Reliability
Babette Moeller; Teresa Duncan; Jason Schoeneberger; John Hitchcock; Matthew McLeod – Grantee Submission, 2022
Teacher quality, school context, classroom practice, and student mathematics achievement are the key components in Clarke and Hollingsworth's (2002) model that depicts teachers' professional growth as an ongoing, dynamic, and interactive enterprise. Comfort and preparedness are dispositions that, along with teacher knowledge, are…
Descriptors: Equal Education, Access to Education, Mathematics Instruction, Faculty Development
Zaidi, Nikki L.; Swoboda, Christopher M.; Kelcey, Benjamin M.; Manuel, R. Stephen – Advances in Health Sciences Education, 2017
The extant literature has largely ignored a potentially significant source of variance in multiple mini-interview (MMI) scores by "hiding" the variance attributable to the sample of attributes used on an evaluation form. This potential source of hidden variance can be defined as rating items, which typically comprise an MMI evaluation…
Descriptors: Interviews, Scores, Generalizability Theory, Monte Carlo Methods
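The item-attributable variance the abstract above describes can be sketched in generalizability-theory terms with a one-facet crossed design (applicants x rating items). The hypothetical example below estimates person, item, and residual variance components from the standard expected-mean-squares equations, plus a generalizability coefficient; it is a simplified stand-in for the study's fuller MMI design.

    # Minimal one-facet crossed G-study (applicants x rating items) with
    # hypothetical MMI scores; not the authors' full design.
    import numpy as np

    # rows = applicants (p), columns = rating items (i)
    scores = np.array([
        [7.0, 6.0, 8.0],
        [5.0, 4.0, 5.0],
        [9.0, 8.0, 9.0],
        [6.0, 6.0, 7.0],
    ])
    n_p, n_i = scores.shape
    grand = scores.mean()

    # Mean squares for persons, items, and residual (p x i interaction + error)
    ms_p = n_i * ((scores.mean(axis=1) - grand) ** 2).sum() / (n_p - 1)
    ms_i = n_p * ((scores.mean(axis=0) - grand) ** 2).sum() / (n_i - 1)
    resid = (scores - scores.mean(axis=1, keepdims=True)
                    - scores.mean(axis=0, keepdims=True) + grand)
    ms_res = (resid ** 2).sum() / ((n_p - 1) * (n_i - 1))

    # Variance components via expected mean squares
    var_res = ms_res
    var_p = (ms_p - ms_res) / n_i
    var_i = (ms_i - ms_res) / n_p  # the "hidden" item-attributable variance

    # Generalizability coefficient for relative decisions over n_i items
    g_coef = var_p / (var_p + var_res / n_i)
    print(var_p, var_i, var_res, g_coef)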