Publication Date
In 2025 | 0
Since 2024 | 2
Since 2021 (last 5 years) | 7
Since 2016 (last 10 years) | 23
Since 2006 (last 20 years) | 53
Descriptor
Reliability | 50
Test Reliability | 38
Interrater Reliability | 28
Scores | 28
Test Items | 25
Scoring | 21
Test Construction | 21
Item Response Theory | 20
Validity | 20
Error of Measurement | 15
Correlation | 14
Source
Applied Measurement in… | 111
Publication Type
Journal Articles | 111
Reports - Research | 68
Reports - Evaluative | 39
Reports - Descriptive | 5
Speeches/Meeting Papers | 5
Information Analyses | 1
Education Level
Higher Education | 7
Grade 8 | 6
Elementary Education | 5
Elementary Secondary Education | 5
Grade 5 | 5
Grade 4 | 4
High Schools | 4
Middle Schools | 4
Postsecondary Education | 4
Secondary Education | 4
Grade 3 | 3
Location
California | 3
Canada | 2
Arizona | 1
Australia | 1
California (Los Angeles) | 1
Germany | 1
Hawaii | 1
Idaho | 1
Indiana | 1
Israel | 1
Louisiana | 1
Laws, Policies, & Programs
No Child Left Behind Act 2001 | 1
John R. Donoghue; Carol Eckerly – Applied Measurement in Education, 2024
Trend scoring of constructed-response items (i.e., rescoring Time A responses at Time B) gives rise to two-way data that follow a product multinomial distribution rather than the multinomial distribution that is usually assumed. Recent work has shown that the difference in sampling model can have profound negative effects on statistics usually used to…
Descriptors: Scoring, Error of Measurement, Reliability, Scoring Rubrics
Tenko Raykov; George A. Marcoulides; Natalja Menold – Applied Measurement in Education, 2024
We discuss an application of Bayesian factor analysis for estimation of the optimal linear combination and associated maximal reliability of a multi-component measuring instrument. The described procedure yields point and credibility interval estimates of this reliability coefficient, which are readily obtained in educational and behavioral…
Descriptors: Bayesian Statistics, Test Reliability, Error of Measurement, Measurement Equipment
Hogan, Thomas; DeStefano, Marissa; Gilby, Caitlin; Kosman, Dana; Peri, Joshua – Applied Measurement in Education, 2021
Buros' "Mental Measurements Yearbook (MMY)" has provided professional reviews of commercially published psychological and educational tests for over 80 years, serving as a kind of conscience for the testing industry. For a random sample of 50 entries in the "19th MMY" (a total of 100 separate reviews), this study determined…
Descriptors: Test Reviews, Interrater Reliability, Psychological Testing, Educational Testing
Xiao, Leifeng; Hau, Kit-Tai – Applied Measurement in Education, 2023
We compared coefficient alpha with five alternatives (omega total, omega RT, omega h, GLB, and coefficient H) in two simulation studies. Results showed that for unidimensional scales, (a) all indices except omega h performed similarly well under most conditions; (b) alpha is still good; (c) GLB and coefficient H overestimated reliability with small…
Descriptors: Test Theory, Test Reliability, Factor Analysis, Test Length
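The entry above compares coefficient alpha with model-based alternatives. As a point of reference, here is a minimal sketch of the classical Cronbach's alpha computation from a persons-by-items score matrix; the function name and example data are illustrative, not taken from the paper:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_persons, n_items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)          # per-item sample variances
    total_var = scores.sum(axis=1).var(ddof=1)      # variance of summed scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Two perfectly correlated items -> alpha = 1.0
print(cronbach_alpha([[1, 1], [2, 2], [3, 3]]))
```

Note that `ddof=1` gives the sample (not population) variance throughout; using a mix of the two is a common implementation slip.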
van Alphen, Thijmen; Jak, Suzanne; Jansen in de Wal, Joost; Schuitema, Jaap; Peetsma, Thea – Applied Measurement in Education, 2022
Intensive longitudinal data are increasingly used to study state-like processes such as changes in daily stress. Measures aimed at collecting such data require the same level of scrutiny regarding scale reliability as traditional questionnaires. The most prevalent methods used to assess reliability of intensive longitudinal measures are based on…
Descriptors: Test Reliability, Measures (Individuals), Anxiety, Data Collection
DeMars, Christine E. – Applied Measurement in Education, 2021
Estimation of parameters for the many-facets Rasch model requires that conditional on the values of the facets, such as person ability, item difficulty, and rater severity, the observed responses within each facet are independent. This requirement has often been discussed for the Rasch models and 2PL and 3PL models, but it becomes more complex…
Descriptors: Item Response Theory, Test Items, Ability, Scores
Bimpeh, Yaw; Pointer, William; Smith, Ben Alexander; Harrison, Liz – Applied Measurement in Education, 2020
Many high-stakes examinations in the United Kingdom (UK) use both constructed-response items and selected-response items. We need to evaluate the inter-rater reliability for constructed-response items that are scored by humans. While there are a variety of methods for evaluating rater consistency across ratings in the psychometric literature, we…
Descriptors: Scoring, Generalizability Theory, Interrater Reliability, Foreign Countries
Wise, Steven L. – Applied Measurement in Education, 2019
The identification of rapid guessing is important to promote the validity of achievement test scores, particularly with low-stakes tests. Effective methods for identifying rapid guesses require reliable threshold methods that are also aligned with test taker behavior. Although several common threshold methods are based on rapid guessing response…
Descriptors: Guessing (Tests), Identification, Reaction Time, Reliability
Almehrizi, Rashid S. – Applied Measurement in Education, 2021
KR-21 reliability and its extension (coefficient alpha) give the reliability estimate of test scores under the assumption of tau-equivalent forms. KR-21 reliability gives the reliability estimate for summed scores on dichotomous items when items are randomly sampled from an infinite pool of similar items (randomly parallel forms). The article…
Descriptors: Test Reliability, Scores, Scoring, Computation
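For readers unfamiliar with KR-21: it estimates reliability from only the number of items, the mean, and the variance of summed scores. A minimal sketch of the standard formula (the function name and example scores are illustrative):

```python
import statistics

def kr21(total_scores, k):
    """KR-21 reliability for summed scores on k dichotomous items:
    KR-21 = k/(k-1) * (1 - M*(k-M) / (k * s^2)),
    where M is the mean and s^2 the sample variance of total scores."""
    m = statistics.mean(total_scores)
    s2 = statistics.variance(total_scores)  # sample variance
    return (k / (k - 1)) * (1 - m * (k - m) / (k * s2))

# Example: 10-item test, total scores 4, 6, 8 -> KR-21 = 4/9
print(kr21([4, 6, 8], k=10))
```

Because it uses only summary statistics, KR-21 assumes all items have equal difficulty; when that fails it underestimates reliability relative to coefficient alpha.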
Myers, Aaron J.; Ames, Allison J.; Leventhal, Brian C.; Holzman, Madison A. – Applied Measurement in Education, 2020
When rating performance assessments, raters may ascribe different scores for the same performance when rubric application does not align with the intended application of the scoring criteria. Given performance assessment score interpretation assumes raters apply rubrics as rubric developers intended, misalignment between raters' scoring processes…
Descriptors: Scoring Rubrics, Validity, Item Response Theory, Interrater Reliability
Thompson, W. Jake; Clark, Amy K.; Nash, Brooke – Applied Measurement in Education, 2019
As the use of diagnostic assessment systems transitions from research applications to large-scale assessments for accountability purposes, reliability methods that provide evidence at each level of reporting are needed. The purpose of this paper is to summarize one simulation-based method for estimating and reporting reliability for an…
Descriptors: Test Reliability, Diagnostic Tests, Classification, Computation
Wells, Craig S.; Sireci, Stephen G. – Applied Measurement in Education, 2020
Student growth percentiles (SGPs) are currently used by several states and school districts to provide information about individual students as well as to evaluate teachers, schools, and school districts. For SGPs to be defensible for these purposes, they should be reliable. In this study, we examine the amount of systematic and random error in…
Descriptors: Growth Models, Reliability, Scores, Error Patterns
Kim, Stella Y.; Lee, Won-Chan – Applied Measurement in Education, 2019
This study explores classification consistency and accuracy for mixed-format tests using real and simulated data. In particular, the current study compares six methods of estimating classification consistency and accuracy for seven mixed-format tests. The relative performance of the estimation methods is evaluated using simulated data. Study…
Descriptors: Classification, Reliability, Accuracy, Test Format
Slepkov, Aaron D.; Godfrey, Alan T. K. – Applied Measurement in Education, 2019
The answer-until-correct (AUC) method of multiple-choice (MC) testing involves test respondents making selections until the keyed answer is identified. Despite attendant benefits that include improved learning, broad student adoption, and facile administration of partial credit, the use of AUC methods for classroom testing has been extremely…
Descriptors: Multiple Choice Tests, Test Items, Test Reliability, Scores
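The entry above notes that answer-until-correct testing makes partial credit easy to administer. One common linear partial-credit scheme (an illustrative assumption here, not necessarily the rule used in the paper) awards full credit on a first-try correct selection and scales down to zero when every option had to be tried:

```python
def auc_partial_credit(attempts, n_options):
    """Linear partial credit for answer-until-correct items:
    1.0 on the first attempt, 0.0 on the last possible attempt."""
    if not 1 <= attempts <= n_options:
        raise ValueError("attempts must be between 1 and n_options")
    return (n_options - attempts) / (n_options - 1)

# 4-option item: 1st try -> 1.0, 2nd -> 2/3, 4th -> 0.0
print(auc_partial_credit(1, 4), auc_partial_credit(2, 4), auc_partial_credit(4, 4))
```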
Kieftenbeld, Vincent; Boyer, Michelle – Applied Measurement in Education, 2017
Automated scoring systems are typically evaluated by comparing the performance of a single automated rater item-by-item to human raters. This presents a challenge when the performance of multiple raters needs to be compared across multiple items. Rankings could depend on specifics of the ranking procedure; observed differences could be due to…
Descriptors: Automation, Scoring, Comparative Analysis, Test Items