Publication Date
In 2025 | 2 |
Since 2024 | 33 |
Since 2021 (last 5 years) | 134 |
Since 2016 (last 10 years) | 455 |
Since 2006 (last 20 years) | 1164 |
Descriptor
Comparative Analysis | 1930 |
Reliability | 873 |
Test Reliability | 787 |
Foreign Countries | 547 |
Test Validity | 442 |
Correlation | 345 |
Validity | 330 |
Interrater Reliability | 325 |
Statistical Analysis | 321 |
Scores | 274 |
Measures (Individuals) | 236 |
Author
Reckase, Mark D. | 6 |
Attali, Yigal | 5 |
Coniam, David | 5 |
Brennan, Robert L. | 4 |
Crehan, Kevin D. | 4 |
Feldt, Leonard S. | 4 |
Hakstian, A. Ralph | 4 |
Jones, Ian | 4 |
Kolen, Michael J. | 4 |
Lunz, Mary E. | 4 |
August, Diane | 3 |
Audience
Researchers | 35 |
Practitioners | 29 |
Teachers | 15 |
Administrators | 9 |
Policymakers | 6 |
Counselors | 2 |
Media Staff | 2 |
Parents | 1 |
Support Staff | 1 |
Location
Turkey | 59 |
United States | 47 |
Australia | 36 |
Canada | 32 |
United Kingdom (England) | 32 |
China | 31 |
United Kingdom | 28 |
Germany | 25 |
Netherlands | 24 |
Taiwan | 22 |
Hong Kong | 20 |
What Works Clearinghouse Rating
Meets WWC Standards with or without Reservations | 1 |
Does not meet standards | 1 |
The AI Teacher Test: Measuring the Pedagogical Ability of Blender and GPT-3 in Educational Dialogues
Tack, Anaïs; Piech, Chris – International Educational Data Mining Society, 2022
How can we test whether state-of-the-art generative models, such as Blender and GPT-3, are good AI teachers, capable of replying to a student in an educational dialogue? Designing an AI teacher test is challenging: although evaluation methods are much-needed, there is no off-the-shelf solution to measuring pedagogical ability. This paper reports…
Descriptors: Artificial Intelligence, Dialogs (Language), Bayesian Statistics, Decision Making
Rebernik, Teja; Jacobi, Jidde; Tiede, Mark; Wieling, Martijn – Journal of Speech, Language, and Hearing Research, 2021
Purpose: This study compares two electromagnetic articulographs manufactured by Northern Digital, Inc.: the NDI Wave System (from 2008) and the NDI Vox-EMA System (from 2020). Method: Four experiments were completed: (1) comparison of statically positioned sensors; (2) tracking dynamic movements of sensors manipulated using a motor-driven LEGO…
Descriptors: Measurement Equipment, Articulation (Speech), Accuracy, Reliability
Dae Woong Ham; Luke Miratrix – Grantee Submission, 2024
The consequence of a change in school leadership (e.g., principal turnover) on student achievement has important implications for education policy. The impact of such an event can be estimated via the popular Difference-in-Differences (DiD) estimator, where those schools with a turnover event are compared to a selected set of schools that did not…
Descriptors: Trend Analysis, Faculty Mobility, Academic Achievement, Principals
Kapsner-Smith, Mara R.; Opuszynski, Amanda; Stepp, Cara E.; Eadie, Tanya L. – Journal of Speech, Language, and Hearing Research, 2021
Purpose: The reliability of auditory-perceptual judgments between listeners is a long-standing problem in the assessment of voice disorders. The purpose of this study was to determine whether a relatively novel experimental scaling method, called visual sort and rate (VSR), yielded stronger reliability than the more frequently used method of…
Descriptors: Voice Disorders, Interrater Reliability, Rating Scales, Severity (of Disability)
Marine Simon; Alexandra Budke – Journal of Geography in Higher Education, 2024
Comparison is an important geographic method and a common task in geography education. Mastering comparison is a complex competency, and written comparisons are challenging tasks for both students and assessors. As yet, however, there is no established test for evaluating comparison competency and no tool for enhancing it. Moreover, little is known about…
Descriptors: Geography Instruction, Student Evaluation, Comparative Analysis, Reliability
Seyda Aydin-Karaca; Mustafa Serdar Köksal; Bilkay Bi – Journal of Psychoeducational Assessment, 2024
This study aimed to develop a parent rating scale (PRSG) for screening children for further identification of giftedness. The participants of the study were 255 parents of gifted and non-gifted students. The PRSG, consisting of 30 items, was created by consulting parents and reviewing existing instruments in the literature. As…
Descriptors: Rating Scales, Parent Attitudes, Scores, Comparative Analysis
Lisa Frances; Frances Quinn; Sue Elliott; Jo Bird – Australian Educational Researcher, 2024
In this article, we explore inconsistencies in the implementation of outdoor learning across Australian early years' education. The benefits of outdoor learning justify regular employment of this pedagogical approach in both early childhood education and primary school settings. Early childhood education services provide daily outdoor learning…
Descriptors: Foreign Countries, Outdoor Education, Program Implementation, Elementary Education
Taylor, Tessa; Lanovaz, Marc J. – Journal of Applied Behavior Analysis, 2022
Behavior analysts typically rely on visual inspection of single-case experimental designs to make treatment decisions. However, visual inspection is subjective, which has led to the development of supplemental objective methods such as the conservative dual-criteria method. To replicate and extend a study conducted by Wolfe et al. (2018) on the…
Descriptors: Visual Perception, Artificial Intelligence, Decision Making, Evaluators
Pinot de Moira, Anne; Wheadon, Christopher; Christodoulou, Daisy – Research in Education, 2022
Writing is generally assessed internationally using rubric-based approaches, but there is a growing body of evidence to suggest that the reliability of such approaches is poor. In contrast, comparative judgement studies suggest that it is possible to assess open-ended tasks such as writing with greater reliability. Many previous studies, however,…
Descriptors: Writing Evaluation, Classification, Accuracy, Scoring Rubrics
Kate E. Walton; Cristina Anguiano-Carrasco – ACT, Inc., 2024
Large language models (LLMs), such as ChatGPT, are becoming increasingly prominent. They are increasingly used to assist with simple tasks, such as summarizing documents, translating languages, rephrasing sentences, or answering questions. Reports like McKinsey's (Chui & Yee, 2023) estimate that by implementing LLMs,…
Descriptors: Artificial Intelligence, Man Machine Systems, Natural Language Processing, Test Construction
Karel Kok; Sophia Chroszczinsky; Burkhard Priemer – Physical Review Physics Education Research, 2024
Data comparison problems are used in teaching and science education research that focuses on students' ability to compare datasets and their conceptual understanding of measurement uncertainties. However, the evaluation of students' decisions in these problems can pose a problem: e.g., students making a correct decision for the wrong reasons.…
Descriptors: Secondary School Students, Undergraduate Students, Comparative Analysis, Evaluation Methods
Jönsson, Anders; Balan, Andreia – Practical Assessment, Research & Evaluation, 2018
Research on teachers' grading has shown that there is great variability among teachers regarding both the process and product of grading, resulting in low comparability and issues of inequality when using grades for selection purposes. Despite this situation, not much is known about the merits or disadvantages of different models for grading. In…
Descriptors: Grading, Models, Reliability, Validity
Maïano, Christophe; Morin, Alexandre J. S.; Tietjens, Maike; Bastos, Tânia; Luiggi, Maxime; Corredeira, Rui; Griffet, Jean; Sánchez-Oliva, David – Measurement in Physical Education and Exercise Science, 2023
The present study sought to examine the psychometric properties of new German, Portuguese, and Spanish versions of the Revised Short Form of the Physical Self-Inventory (PSI-S-R), and to contrast these properties against those from the original French version of this instrument. Participants (n = 1802) were 288 French youth, 177 German…
Descriptors: German, Portuguese, Spanish, Test Construction
Igor Esnaola; Albert Sesé; Lorea Azpiazu; Yina Wang – British Journal of Educational Psychology, 2024
Background: Modelling academic self-concept through second-order factors or bifactor structures is an important issue with substantive and practical implications; moreover, the bifactor model has not been analysed with a Chinese sample, and cross-cultural studies of academic self-concept are scarce. Likewise, latent structure validity evidence…
Descriptors: Academic Achievement, Self Concept, Psychometrics, Validity
Jordan M. Wheeler; Allan S. Cohen; Shiyu Wang – Journal of Educational and Behavioral Statistics, 2024
Topic models are mathematical and statistical models used to analyze textual data. The objective of topic models is to gain information about the latent semantic space of a set of related textual data. The semantic space of a set of textual data contains the relationship between documents and words and how they are used. Topic models are becoming…
Descriptors: Semantics, Educational Assessment, Evaluators, Reliability