Publication Date
  In 2025: 0
  Since 2024: 2
  Since 2021 (last 5 years): 7
  Since 2016 (last 10 years): 23
  Since 2006 (last 20 years): 53

Descriptor
  Reliability: 50
  Test Reliability: 38
  Interrater Reliability: 28
  Scores: 28
  Test Items: 25
  Scoring: 21
  Test Construction: 21
  Item Response Theory: 20
  Validity: 20
  Error of Measurement: 15
  Correlation: 14

Source
  Applied Measurement in Education: 111

Publication Type
  Journal Articles: 111
  Reports - Research: 68
  Reports - Evaluative: 39
  Reports - Descriptive: 5
  Speeches/Meeting Papers: 5
  Information Analyses: 1

Education Level
  Higher Education: 7
  Grade 8: 6
  Elementary Education: 5
  Elementary Secondary Education: 5
  Grade 5: 5
  Grade 4: 4
  High Schools: 4
  Middle Schools: 4
  Postsecondary Education: 4
  Secondary Education: 4
  Grade 3: 3

Location
  California: 3
  Canada: 2
  Arizona: 1
  Australia: 1
  California (Los Angeles): 1
  Germany: 1
  Hawaii: 1
  Idaho: 1
  Indiana: 1
  Israel: 1
  Louisiana: 1

Laws, Policies, & Programs
  No Child Left Behind Act 2001: 1

Kettler, Ryan J.; Rodriguez, Michael C.; Bolt, Daniel M.; Elliott, Stephen N.; Beddow, Peter A.; Kurz, Alexander – Applied Measurement in Education, 2011
Federal policy on alternate assessment based on modified academic achievement standards (AA-MAS) inspired this research. Specifically, an experimental study was conducted to determine whether tests composed of modified items would have the same level of reliability as tests composed of original items, and whether these modified items helped reduce…
Descriptors: Multiple Choice Tests, Test Items, Alternative Assessment, Test Reliability
Osborn Popp, Sharon E.; Ryan, Joseph M.; Thompson, Marilyn S. – Applied Measurement in Education, 2009
Scoring rubrics are routinely used to evaluate the quality of writing samples produced for writing performance assessments, with anchor papers chosen to represent score points defined in the rubric. Although the careful selection of anchor papers is associated with best practices for scoring, little research has been conducted on the role of…
Descriptors: Writing Evaluation, Scoring Rubrics, Selection, Scoring
Taylor, Catherine S.; Lee, Yoonsun – Applied Measurement in Education, 2010
Item response theory (IRT) methods are generally used to create score scales for large-scale tests. Research has shown that IRT scales are stable across groups and over time. Most studies have focused on items that are dichotomously scored. Now Rasch and other IRT models are used to create scales for tests that include polytomously scored items.…
Descriptors: Measures (Individuals), Item Response Theory, Robustness (Statistics), Item Analysis
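
As a point of reference for the IRT models discussed above, the dichotomous Rasch model can be sketched in a few lines. The function name and example values below are illustrative only, not taken from the study.

```python
import math

# Dichotomous Rasch model: the probability of a correct response depends
# only on the gap between person ability (theta) and item difficulty (b).
def rasch_p(theta: float, b: float) -> float:
    return 1 / (1 + math.exp(-(theta - b)))

print(round(rasch_p(0.0, 0.0), 2))  # 0.5 when ability equals difficulty
```

Polytomous extensions (e.g., the partial credit model) generalize this by chaining such step probabilities across adjacent score categories.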
Shumate, Steven R.; Surles, James; Johnson, Robert L.; Penny, Jim – Applied Measurement in Education, 2007
Increasingly, assessment practitioners use generalizability coefficients to estimate the reliability of scores from performance tasks. Little research, however, examines the relation between the estimation of generalizability coefficients and the number of rubric scale points and score distributions. The purpose of the present research is to…
Descriptors: Generalizability Theory, Monte Carlo Methods, Measures (Individuals), Program Effectiveness
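
For readers unfamiliar with the coefficient being estimated, the sketch below computes the relative generalizability coefficient for a fully crossed persons-by-raters design from mean squares. The function name and toy ratings are invented for illustration, not the study's data.

```python
# Hypothetical sketch: relative G coefficient (E-rho^2) for a p x r design.
def g_coefficient(scores):
    """scores: one list per person, one rating per rater (fully crossed)."""
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    person_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(row[j] for row in scores) / n_p for j in range(n_r)]

    ss_p = n_r * sum((m - grand) ** 2 for m in person_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_tot = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_tot - ss_p - ss_r  # person-by-rater interaction + error

    ms_p = ss_p / (n_p - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    var_p = max((ms_p - ms_res) / n_r, 0.0)  # universe-score variance
    # Relative G coefficient for a mean over n_r raters
    return var_p / (var_p + ms_res / n_r)

ratings = [[4, 5, 4], [2, 2, 3], [5, 5, 5], [3, 2, 2], [4, 4, 5]]
print(round(g_coefficient(ratings), 3))
```

The simulation questions in the abstract (rubric scale points, score distributions) amount to varying the inputs to exactly this kind of estimator and watching how stable the coefficient remains.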
Johnson, Robert L.; Penny, Jim; Fisher, Steve; Kuhs, Therese – Applied Measurement in Education, 2003
When raters assign different scores to a performance task, a method for resolving rating differences is required to report a single score to the examinee. Recent studies indicate that decisions about examinees, such as pass/fail decisions, differ across resolution methods. Previous studies also investigated the interrater reliability of…
Descriptors: Test Reliability, Test Validity, Scores, Interrater Reliability
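
The resolution methods compared in this literature can be sketched concretely. The method names below (average, expert, parity) are common labels in rater-resolution studies, but the function and its defaults are an illustrative assumption, not the article's procedure.

```python
# Illustrative sketch: three ways to resolve two discrepant ratings
# into the single score reported to the examinee.
def resolve(r1, r2, expert=None, method="average", tolerance=1):
    """r1, r2: operational raters' scores; expert: adjudicator's score.

    average: always report the mean of the two operational ratings
    expert:  defer to the adjudicator whenever the raters disagree
    parity:  keep the mean if the raters are within tolerance, else defer
    """
    if method == "average":
        return (r1 + r2) / 2
    if method == "expert":
        return (r1 + r2) / 2 if r1 == r2 else expert
    if method == "parity":
        return (r1 + r2) / 2 if abs(r1 - r2) <= tolerance else expert
    raise ValueError(f"unknown method: {method}")

print(resolve(3, 5, expert=4, method="average"))  # 4.0
print(resolve(3, 4, expert=5, method="parity"))   # 3.5 (within tolerance)
```

Because the three methods report different scores for the same response pattern, pass/fail decisions near a cut score can flip with the choice of method, which is the issue the abstract raises.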

Myford, Carol M. – Applied Measurement in Education, 2002
Studied the use of descriptive graphic rating scales by 11 raters to evaluate students' work, exploring different design features. Used a Rasch-model based rating scale analysis to determine that all the continuous scales could be considered to have at least five points, and that defined midpoints did not result in higher student separation…
Descriptors: Evaluators, Rating Scales, Reliability, Test Construction
Hogan, Thomas P.; Murphy, Gavin – Applied Measurement in Education, 2007
We determined the recommendations for preparing and scoring constructed-response (CR) test items in 25 sources (textbooks and chapters) on educational and psychological measurement. The project was similar to Haladyna's (2004) analysis for multiple-choice items. We identified 12 recommendations for preparing CR items given by multiple sources,…
Descriptors: Test Items, Scoring, Test Construction, Educational Indicators
Webb, Norman L. – Applied Measurement in Education, 2007
A process for judging the alignment between curriculum standards and assessments developed by the author is presented. This process produces information on the relationship of standards and assessments on four alignment criteria: Categorical Concurrence, Depth of Knowledge Consistency, Range of Knowledge Correspondence, and Balance of…
Descriptors: Educational Assessment, Academic Standards, Item Analysis, Interrater Reliability

Krus, David J.; Blackman, Harold S. – Applied Measurement in Education, 1988
Test homogeneity and internal consistency reliability indices were developed on the basis of theoretical considerations of properties of hierarchical structures of data matrices. This reconceptualization, in terms of ordinal test theory, has potential for explication of the mutual relationship of test reliability and homogeneity. (TJH)
Descriptors: Equations (Mathematics), Statistics, Test Reliability, Test Theory

Feldt, Leonard S. – Applied Measurement in Education, 1990
Sampling theory for the intraclass reliability coefficient, a Spearman-Brown extrapolation of alpha to a single measurement for each examinee, is less recognized and less cited than that of coefficient alpha. Techniques for constructing confidence intervals and testing hypotheses for the intraclass coefficient are presented. (SLD)
Descriptors: Hypothesis Testing, Measurement Techniques, Reliability, Sampling
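
The alpha-to-single-measurement relationship the abstract describes is the Spearman-Brown formula run in reverse. A minimal numeric sketch, with invented item data:

```python
def coefficient_alpha(items):
    """items: list of k item-score lists, each of length n (examinees)."""
    k, n = len(items), len(items[0])

    def var(xs):  # sample variance, n-1 denominator
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    totals = [sum(item[i] for item in items) for i in range(n)]
    return (k / (k - 1)) * (1 - sum(var(it) for it in items) / var(totals))

def intraclass_single(alpha, k):
    """Spearman-Brown step-down: rho_1 = alpha / (k - (k - 1) * alpha)."""
    return alpha / (k - (k - 1) * alpha)

items = [[1, 2, 3, 4], [2, 2, 4, 5], [1, 3, 3, 5]]  # 3 items, 4 examinees
a = coefficient_alpha(items)
r1 = intraclass_single(a, k=3)
print(round(a, 3), round(r1, 3))  # alpha for 3 items, then for 1 measurement
```

Stepping r1 back up with the usual Spearman-Brown prophecy, k*r1 / (1 + (k-1)*r1), recovers alpha exactly, which is the sense in which the intraclass coefficient is "alpha for a single measurement."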

Nichols, Paul; Kuehl, Barbara Jean – Applied Measurement in Education, 1999
An approach is presented that can predict the internal consistency of cognitively complex assessments along two dimensions: adding tasks with similar or different solution strategies, and adding test takers with different solution strategies. Data from the 1992 National Assessment of Educational Progress mathematics assessment are used to…
Descriptors: Cognitive Tests, Mathematics Tests, Prediction, Test Reliability

Bandalos, Deborah L.; Enders, Craig K. – Applied Measurement in Education, 1996
Computer simulation indicated that reliability increased with the degree of similarity between underlying and observed distributions when the observed categorical distribution was deliberately constructed to match the shape of the underlying distribution of the trait being measured. Reliability also increased with correlation among variables and…
Descriptors: Computer Simulation, Correlation, Likert Scales, Reliability
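
A toy Monte Carlo in the spirit of this abstract (the loading, cutpoints, sample size, and seed are all assumptions, not the study's design): continuous item responses are cut into five ordered categories, and coefficient alpha is compared when the cutpoints do versus do not match the latent normal distribution.

```python
import bisect
import random

def alpha(items):
    """Coefficient alpha for a list of item-score lists."""
    k, n = len(items), len(items[0])

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    totals = [sum(it[i] for it in items) for i in range(n)]
    return (k / (k - 1)) * (1 - sum(var(it) for it in items) / var(totals))

def simulate(cuts, n=2000, k=10, loading=0.7, seed=1):
    """Categorize continuous one-factor responses at the given cutpoints."""
    rng = random.Random(seed)
    items = [[] for _ in range(k)]
    for _ in range(n):
        theta = rng.gauss(0, 1)  # latent trait
        for j in range(k):
            y = loading * theta + (1 - loading ** 2) ** 0.5 * rng.gauss(0, 1)
            items[j].append(bisect.bisect(cuts, y))  # category 0..4
    return alpha(items)

matched = simulate(cuts=[-1.5, -0.5, 0.5, 1.5])   # roughly matches N(0, 1)
mismatched = simulate(cuts=[0.5, 1.0, 1.5, 2.0])  # floors most responses
print(round(matched, 3), round(mismatched, 3))
```

Under these invented settings the matched cutpoints yield the higher alpha, which is the qualitative pattern the abstract reports.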

Klein, Stephen P.; Stecher, Brian M.; Shavelson, Richard J.; McCaffrey, Daniel; Ormseth, Tor; Bell, Robert M.; Comfort, Kathy; Othman, Abdul R. – Applied Measurement in Education, 1998
Two studies involving 368 elementary and high school students and 29 readers were conducted to investigate reader consistency, score reliability, and reader time requirements of three hands-on science performance tasks. Holistic scores were as reliable as analytic scores, and there was a high correlation between them after they were disattenuated…
Descriptors: Elementary School Students, Elementary Secondary Education, Hands on Science, High School Students
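
The disattenuation mentioned at the end of the abstract is the classical correction for attenuation: divide the observed correlation by the geometric mean of the two score reliabilities. A one-line sketch with illustrative numbers:

```python
import math

def disattenuate(r_xy, rel_x, rel_y):
    """Correct an observed correlation for measurement error in both scores."""
    return r_xy / math.sqrt(rel_x * rel_y)

# e.g., observed r = .70 between holistic and analytic scores with
# reliabilities .85 and .80 (illustrative values, not the study's)
print(round(disattenuate(0.70, 0.85, 0.80), 3))
```

A high disattenuated correlation, as reported here, suggests the two scoring approaches rank examinees similarly once rater and task error are set aside.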

Chinn, Roberta N.; Hertz, Norman R. – Applied Measurement in Education, 2002
Compared two Angoff standard-setting methods (percentage and yes-no) in the work of four groups of judges (n=24) given behavioral descriptors or incidents to use in making ratings. Results indicate that passing scores based on percentage estimates were stable from initial to final ratings, but those based on dichotomous (yes-no) ratings had…
Descriptors: Certification, Judges, Licensing Examinations (Professions), Reliability
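
The two Angoff variants differ only in the granularity of the judges' item ratings; the cut score is the same aggregation either way. The matrices below are invented illustrations, not the study's data.

```python
# Angoff cut score: for each item, average the judges' estimates of how a
# minimally competent candidate would perform, then sum across items.
def angoff_cut_score(ratings):
    """ratings: judges x items matrix.

    Percentage method: entries are probability estimates in [0, 1].
    Yes-no method: entries are 0/1 judgments; same formula, coarser input.
    """
    n_items = len(ratings[0])
    return sum(
        sum(judge[i] for judge in ratings) / len(ratings)
        for i in range(n_items)
    )

pct = [[0.7, 0.4, 0.9], [0.6, 0.5, 0.8]]  # two judges, three items
yn = [[1, 0, 1], [1, 1, 1]]
print(angoff_cut_score(pct))  # cut score out of 3 items
print(angoff_cut_score(yn))
```

Because yes-no ratings round each judgment to 0 or 1, small shifts in judges' opinions can move whole items across the threshold, one plausible reason the dichotomous variant was less stable.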
Kane, Michael; Case, Susan M. – Applied Measurement in Education, 2004
The scores on 2 distinct tests (e.g., essay and objective) are often combined to create a composite score, which is used to make decisions. The validity of the observed composite can sometimes be evaluated relative to an external criterion. However, in cases where no criterion is available, the observed composite has generally been evaluated in…
Descriptors: Validity, Weighted Scores, Reliability, Student Evaluation
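
The reliability of an observed weighted composite can be written in closed form from the component reliabilities, standard deviations, weights, and intercorrelation (a Mosier-type formula, assuming uncorrelated errors). The two-test sketch below uses invented numbers, not values from the article.

```python
# Classical reliability of a two-component composite C = w1*X1 + w2*X2,
# assuming the components' measurement errors are uncorrelated.
def composite_reliability(w, sd, rel, r12):
    w1, w2 = w
    s1, s2 = sd
    cov = w1 * w2 * s1 * s2 * r12  # weighted observed covariance
    true_var = (w1 * s1) ** 2 * rel[0] + (w2 * s2) ** 2 * rel[1] + 2 * cov
    total_var = (w1 * s1) ** 2 + (w2 * s2) ** 2 + 2 * cov
    return true_var / total_var

# e.g., an essay section (reliability .75) and an objective section
# (reliability .90), equally weighted, SDs of 10, correlated .60
print(round(composite_reliability((0.5, 0.5), (10, 10), (0.75, 0.90), 0.6), 3))
```

Note the composite's reliability lands between the two components' values; when no external criterion exists, this internal quantity is often what gets evaluated in place of validity evidence, which is the tension the abstract points to.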