Showing 76 to 90 of 111 results
Peer reviewed
Fitzpatrick, Anne R.; Yen, Wendy M. – Applied Measurement in Education, 2001
Examined the effects of test length and sample size on the alternate forms reliability and equating of simulated mathematics tests composed of constructed response items scaled using the two-parameter partial credit model. Results suggest that, to obtain acceptable reliabilities and accurate equated scores, tests should have at least 8 6-point…
Descriptors: Constructed Response, Equated Scores, Mathematics Tests, Reliability
Peer reviewed
Kane, Michael – Applied Measurement in Education, 1996
This overview of the role of error and tolerance for error in measurement asserts that the generic precision associated with a measurement procedure is defined as the root mean square error, or standard error, in some relevant population. This view of precision is explored in several applications of measurement. (SLD)
Descriptors: Error of Measurement, Error Patterns, Generalizability Theory, Measurement Techniques
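Kane treats a procedure's generic precision as the root mean square error in a relevant population; the idea can be sketched in a few lines (the scores below are invented for illustration, not taken from the article):

```python
import math

def rmse(observed, target):
    """Root mean square error of observed scores around target values."""
    sq_errors = [(o - t) ** 2 for o, t in zip(observed, target)]
    return math.sqrt(sum(sq_errors) / len(sq_errors))

# Hypothetical observed scores vs. the values they estimate
observed = [52, 48, 61, 55, 50]
target = [50, 50, 60, 57, 49]
print(round(rmse(observed, target), 3))  # -> 1.673
```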
Peer reviewed
Sicoly, Fiore – Applied Measurement in Education, 2002
Calculated year-1 to year-2 stability of assessment data from 21 states and 2 Canadian provinces. The median stability coefficient was 0.78 in mathematics and reading, and lower in writing. A stability coefficient of 0.80 is recommended as the standard for large-scale assessments of student performance. (SLD)
Descriptors: Educational Testing, Elementary Secondary Education, Foreign Countries, Mathematics
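The stability coefficient Sicoly reports is, in essence, a year-over-year correlation of school-level scores; a minimal sketch, with invented data for five schools:

```python
def stability_coefficient(year1, year2):
    """Pearson correlation between year-1 and year-2 school-level scores."""
    n = len(year1)
    m1, m2 = sum(year1) / n, sum(year2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(year1, year2))
    s1 = sum((a - m1) ** 2 for a in year1) ** 0.5
    s2 = sum((b - m2) ** 2 for b in year2) ** 0.5
    return cov / (s1 * s2)

# Hypothetical mean scores for five schools in consecutive years
r = stability_coefficient([60, 65, 70, 75, 80], [62, 66, 69, 77, 79])
print(round(r, 2))  # comfortably above the recommended 0.80 standard
```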
Peer reviewed
Yen, Wendy M.; Candell, Gregory L. – Applied Measurement in Education, 1991
Empirical reliabilities of scores based on item-pattern scoring, using 3-parameter item-response theory and number-correct scoring, were compared within each of 5 score metrics for at least 900 elementary school students for 5 content areas. Average increases in reliability were produced by item-pattern scoring. (SLD)
Descriptors: Elementary Education, Elementary School Students, Grade Equivalent Scores, Item Response Theory
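The 3-parameter item response model underlying item-pattern scoring assigns each item a probability-of-correct curve; one conventional parameterization (an illustration, not the study's own code) is:

```python
import math

def p_3pl(theta, a, b, c):
    """Three-parameter logistic IRT model: probability of a correct
    response for ability theta, given discrimination a, difficulty b,
    and pseudo-guessing c (with the conventional D = 1.7 scaling)."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

# At theta = b the curve sits halfway between c and 1
print(round(p_3pl(0.0, a=1.0, b=0.0, c=0.2), 3))  # -> 0.6
```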
Peer reviewed
Lunz, Mary E.; And Others – Applied Measurement in Education, 1990
An extension of the Rasch model is used to obtain objective measurements for examinations graded by judges. The model calibrates elements of each facet of the examination on a common log-linear scale. Real examination data illustrate the way correcting for judge severity improves fairness of examinee measures. (SLD)
Descriptors: Certification, Difficulty Level, Interrater Reliability, Judges
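In the many-facet Rasch extension Lunz et al. describe, examinee ability, task difficulty, and judge severity all sit on a common logit scale. A dichotomous simplification (hypothetical parameters; the study itself uses a rating-scale formulation) shows why severity must be corrected for:

```python
import math

def p_success(ability, difficulty, severity):
    """Probability of success when ability, task difficulty, and judge
    severity are all expressed in logits on a common scale."""
    return 1 / (1 + math.exp(-(ability - difficulty - severity)))

# The same examinee fares worse under a more severe judge
lenient = p_success(1.0, 0.0, -0.5)
severe = p_success(1.0, 0.0, 0.5)
print(round(lenient, 2), round(severe, 2))
```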
Peer reviewed
Johnson, Robert L.; Penny, James; Gordon, Belita – Applied Measurement in Education, 2000
Studied four forms of score resolution used by testing agencies and investigated the effect that each has on the interrater reliability associated with the resulting operational scores. Results, based on 120 essays from the Georgia High School Writing Test, show some forms of resolution to be associated with higher reliability and some associated…
Descriptors: Essay Tests, High School Students, High Schools, Interrater Reliability
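Johnson, Penny, and Gordon compare several resolution rules; the abstract does not enumerate them, but one common pair — averaging close ratings and deferring discrepant ones to an adjudicator — can be sketched as follows (names and the threshold are illustrative assumptions):

```python
def resolve(rating1, rating2, adjudicator=None, max_gap=1):
    """Resolve a pair of essay ratings: average them when they are
    within max_gap points, otherwise defer to an adjudicator's score."""
    if abs(rating1 - rating2) <= max_gap:
        return (rating1 + rating2) / 2
    if adjudicator is None:
        raise ValueError("discrepant ratings require an adjudicator")
    return adjudicator

print(resolve(4, 5))                 # adjacent ratings: -> 4.5
print(resolve(2, 5, adjudicator=4))  # discrepant ratings: -> 4
```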
Peer reviewed
Haladyna, Thomas M.; Downing, Steven M. – Applied Measurement in Education, 1989
A taxonomy of 43 rules for writing multiple-choice test items is presented, based on a consensus of 46 textbooks. These guidelines are presented as complete and authoritative, with solid consensus apparent for 33 of the rules. Four rules lack consensus, and 5 rules were cited fewer than 10 times. (SLD)
Descriptors: Classification, Interrater Reliability, Multiple Choice Tests, Objective Tests
Peer reviewed
Fitzpatrick, Anne R.; Ercikan, Kadriye; Yen, Wendy M.; Ferrara, Steven – Applied Measurement in Education, 1998
The consistency between raters over three years of a high-stakes performance assessment was examined in two studies involving a total of approximately 3,000 students in grades three, five, and eight. Results show that raters in different years differ in severity, with raters in mathematics most consistent, and those in language arts least…
Descriptors: Elementary Education, Elementary School Students, High Stakes Tests, Interrater Reliability
Peer reviewed
Direct link
Wise, Steven L.; Kong, Xiaojing – Applied Measurement in Education, 2005
When low-stakes assessments are administered, the degree to which examinees give their best effort is often unclear, complicating the validity and interpretation of the resulting test scores. This study introduces a new method, based on item response time, for measuring examinee test-taking effort on computer-based test items. This measure, termed…
Descriptors: Psychometrics, Validity, Reaction Time, Test Items
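The abstract is truncated before naming Wise and Kong's measure, but the core idea — flag items answered too quickly to reflect genuine effort — can be sketched with an assumed per-item time threshold:

```python
def effort_proportion(response_times, thresholds):
    """Proportion of items whose response time meets or exceeds the
    item's threshold, i.e., solution behavior rather than rapid guessing."""
    flags = [t >= cutoff for t, cutoff in zip(response_times, thresholds)]
    return sum(flags) / len(flags)

# Hypothetical response times (seconds) and a uniform 3-second cutoff
times = [12.0, 1.5, 20.3, 0.9, 15.2]
print(effort_proportion(times, [3.0] * 5))  # -> 0.6
```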
Peer reviewed
Direct link
Wise, Steven L. – Applied Measurement in Education, 2006
In low-stakes testing, the motivation levels of examinees are often a matter of concern to test givers because a lack of examinee effort represents a direct threat to the validity of the test data. This study investigated the use of response time to assess the amount of examinee effort received by individual test items. In 2 studies, it was found…
Descriptors: Computer Assisted Testing, Motivation, Test Validity, Item Response Theory
Peer reviewed
Crone, Linda J.; And Others – Applied Measurement in Education, 1994
Scores from 324 Louisiana schools on the Louisiana Graduation Exit Examination and a within-school split sample of 255 schools indicate that a single subject or grade provides a less consistent and narrower perspective on school effectiveness than a subcomposite made up of 2 subject areas. (SLD)
Descriptors: Classification, Effective Schools Research, Elementary Secondary Education, Exit Examinations
Peer reviewed
Shapley, Kelly S.; Bush, M. Joan – Applied Measurement in Education, 1999
Examined the validity and reliability of the 1995-96 reading/language arts portfolio assessment developed in the Dallas (Texas) public schools for prekindergarten through second grade. Ratings by 42 teachers show that portfolio contents do not provide a valid sample of student work and the assessment reliability is low. (SLD)
Descriptors: Language Arts, Portfolio Assessment, Portfolios (Background Materials), Primary Education
Peer reviewed
Hambleton, Ronald K.; Plake, Barbara S. – Applied Measurement in Education, 1995
Several extensions to the Angoff method of standard setting are described that can accommodate characteristics of performance-based assessment. A study involving 12 panelists supported the effectiveness of the new approach but suggested that panelists preferred an approach that was at least partially conjunctive. (SLD)
Descriptors: Educational Assessment, Evaluation Methods, Evaluators, Interrater Reliability
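In the classic Angoff method these extensions build on, each panelist estimates the probability that a minimally competent examinee answers each item correctly, and the cut score is the sum of the item-level mean estimates. A minimal sketch with invented ratings:

```python
def angoff_cut_score(ratings):
    """ratings[p][i] is panelist p's estimated probability that a
    minimally competent examinee answers item i correctly; the cut
    score is the sum over items of the mean estimate."""
    n_panelists, n_items = len(ratings), len(ratings[0])
    item_means = [
        sum(panelist[i] for panelist in ratings) / n_panelists
        for i in range(n_items)
    ]
    return sum(item_means)

# Three hypothetical panelists rating a four-item test
ratings = [
    [0.8, 0.6, 0.7, 0.9],
    [0.7, 0.5, 0.6, 0.8],
    [0.9, 0.7, 0.8, 1.0],
]
print(angoff_cut_score(ratings))  # cut score out of 4 items
```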
Peer reviewed
Norcini, John; Shea, Judy – Applied Measurement in Education, 1992
Two studies involving a total of 99 experts examined the reproducibility of standards for 2 medical certifying examinations set under different conditions. Together, results of both studies provide evidence that a modified version of the Angoff method is quite reliable and produces stable results under varying conditions. (SLD)
Descriptors: Academic Standards, Evaluators, Groups, Higher Education
Peer reviewed
Frisbie, David A.; Becker, Douglas F. – Applied Measurement in Education, 1990
Seventeen educational measurement textbooks were reviewed to analyze current perceptions regarding true-false achievement testing. A synthesis of the rules for item writing is presented, and the purported advantages and disadvantages of the true-false format derived from those texts are reviewed. (TJH)
Descriptors: Achievement Tests, Higher Education, Methods Courses, Objective Tests