ERIC Number: ED599166
Record Type: Non-Journal
Publication Date: 2019
Pages: 159
Abstractor: As Provided
ISBN: 978-1-3921-6359-7
ISSN: N/A
EISSN: N/A
Examining the Effects of Item Difficulty and Rating Method on Rating Reliability and Construct Validity of Constructed-Response and Essay Items on English Examinations
Yao, Yuan
ProQuest LLC, Ph.D. Dissertation, Niagara University
Under the framework of item response theory (IRT) and generalizability (G-) theory, this study examined the effects of item difficulty and rating method on rating reliability and construct validity for both constructed-response (CR) items and essay items on English examinations. The data were students' scores and responses on the two item types, together with teachers' ratings, collected at a Chinese university. For the CR items, the ratings were dubious when the two items were investigated simultaneously, owing to their differing difficulty levels. Although analytic rating yielded higher rating reliability on the CR items than holistic rating, the easy item (translation 2) consistently showed higher rating reliability than the difficult item (translation 1), regardless of rating method. In terms of construct validity, analytic rating increased both convergent and discriminant validity. Under analytic rating, the four teachers showed greater consensus on the meaning sub-dimension for translation 1 and on the grammar sub-dimension for translation 2. The situation differed somewhat for the essay items. The ratings were questionable when the two essay items were investigated simultaneously, echoing the findings for the CR items. Although analytic rating yielded higher rating reliability on the essay items than holistic rating, the hard item (essay 1) consistently showed higher rating reliability than the easy item (essay 2), the opposite of the pattern found for the CR items. As to construct validity, contrary to the CR-item results, holistic rating increased both convergent and discriminant validity. Under analytic rating, the four teachers showed greater consensus on the structure and meaning sub-dimensions of the two items; the ratings were problematic on the vocabulary sub-dimension for essay 1 and on the grammar sub-dimension for essay 2.
In sum, the study had three major findings: 1) ratings can be problematic when items vary in difficulty level; 2) the CR items showed higher construct validity than the essay items; and 3) the analytic rating method is effective in decreasing rating variability and increasing rating reliability. Based on these findings, the dissertation concludes with suggestions for practice and directions for future research. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone: 800-521-0600. Web page: http://bibliotheek.ehb.be:2222/en-US/products/dissertations/individuals.shtml.]
Descriptors: Foreign Countries, College Students, Second Language Learning, English (Second Language), Language Tests, Essay Tests, Test Items, Difficulty Level, Item Response Theory, Construct Validity, Test Reliability, Scores, Translation, Vocabulary, Grammar
ProQuest LLC. 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106. Tel: 800-521-0600; Web site: http://bibliotheek.ehb.be:2222/en-US/products/dissertations/individuals.shtml
Publication Type: Dissertations/Theses - Doctoral Dissertations
Education Level: Higher Education; Postsecondary Education
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A
Identifiers - Location: China
Grant or Contract Numbers: N/A