ERIC Number: ED663533
Record Type: Non-Journal
Publication Date: 2024-Sep-19
Pages: N/A
Abstractor: As Provided
ISBN: N/A
ISSN: N/A
EISSN: N/A
Quantile Reliability: Beyond Global Estimates of Internal Consistency
Jeffrey Shero; Jessica Logan
Society for Research on Educational Effectiveness
Background/Context: Previous research in educational assessment has consistently emphasized the importance of reliability as a cornerstone of test quality. Traditional measures of reliability, such as test-retest and split-half reliability, offer a broad view of how internally consistent a measure is but overlook the variability in this internal consistency across different levels of performance. This oversight can have significant consequences in special education, where accurately identifying students at risk for, or already having, learning disabilities depends on the precision of assessment tools at the specific portions of the distribution of student outcomes where these students fall. The current literature lacks exploration of how reliability and internal consistency may vary across the distribution of assessment scores, particularly for measures with characteristics such as ceiling/floor effects or for those designed to provide greater insight into one portion of the distribution over others.
Purpose/Objective/Research Question: This study introduces a novel approach to evaluating the reliability of educational assessments, quantile reliability, which examines the consistency of a measure at different quantiles throughout the distribution of scores. Our research aims to uncover whether assessments exhibit differential reliability for students at varying performance levels, particularly for those at the lower end of the achievement spectrum. We hypothesize that assessments will show varied reliability across quantiles, with potential implications for the identification and support of students with learning disabilities. Further, we hypothesize that for measures with ceiling or floor effects, reliability will be substantially reduced where those effects occur, due to the reduced variance around scores at that point of the distribution.
Setting: The analysis was conducted using publicly available data from the Western Reserve Reading and Math Project (WRRMP), a longitudinal study with over 15 years of data collection on reading and math achievement, accessed via the LDBase.org data repository. Select waves were used that provided the measures needed for the proposed analyses.
Population/Participants/Subjects: The sample comprised 794 students, primarily White (91%) and from higher socioeconomic backgrounds (94% with some college education or more). Although the sample is largely homogeneous in its demographics, this provides a basis for assessing the generalizability of the reliabilities within similar demographic contexts, which we believe is an unintended strength of the sample. That is, given the lack of variance in demographic and contextual predictors, we can be more confident that any differences observed in reliability throughout the distribution are due to the portion of the distribution assessed and not to confounding effects of covariates.
Research Design: This study assessed the quantile reliability of multiple measures from the broader WRRMP study, using two approaches that align with how reliability is typically estimated.
First, measures were prepared for either split-half reliability (dividing measures into even- and odd-numbered items and correlating these halves) or test-retest reliability (assessing a student with the same measure multiple times in a row and correlating these repeated assessments). Next, conditional quantile regression (CQR) was adapted to provide quantile-specific correlation values at any quantile of interest. Split-half and test-retest reliability were then assessed at each quantile, using the adapted CQR approach to estimate quantile-specific correlations among the split halves and the repeated assessments at every 10th quantile of the distributions. The results of these analyses therefore provide separate split-half or test-retest reliability estimates at every 10th quantile of each measure's distribution.
Data Collection and Analysis: Data were drawn from multiple waves of the WRRMP, selecting the waves that provided the largest sample size and the data needed for the planned analyses. We focus on three norm-referenced standardized assessments: DIBELS Oral Reading Fluency (ORF), Stanford-Binet Routing and Vocabulary, and the Gates-MacGinitie Reading Test (GMRT-4). We computed test-retest reliability for DIBELS ORF, which is typically assessed by averaging scores across three passages; this measure was selected specifically for its known floor effects. Split-half reliability was assessed for Stanford-Binet Vocabulary and the GMRT-4, which are scored as items correct.
Findings/Results: Our findings differ depending on the assessment examined. For DIBELS ORF, reliability was notably lower at the lower end of the score distribution, with values of around 0.80-0.85, and higher in the upper end, with values of around 0.95-1.00. This measure is administered with three repeated passages, and we found consistent results across all pairs of passages. This pattern suggests that ORF may be less reliable for students with weaker reading skills, raising concerns about its use in identifying reading disabilities, and it aligned with our hypothesis that reliability would be weaker where ceiling or floor effects are present. Conversely, Stanford-Binet Vocabulary exhibited fairly consistent reliability across all quantiles, aligning with expectations for a measure designed to route students to appropriate assessment levels. The GMRT-4 showed the starkest differences in reliability throughout the distribution, ranging from 0.95 in the lowest quantiles to 0.65 in the highest quantiles.
Conclusions: The study introduces quantile reliability as a critical concept in educational assessment, highlighting the need to consider differential reliability across score distributions when evaluating assessment tools. The findings underscore the importance of examining reliability at specific points in the distribution to ensure accurate identification and support for students with learning disabilities, and show how ceiling/floor effects may dilute reliability in the portions of the distribution where it matters most. While the study is limited by its demographic homogeneity and the specific assessments analyzed, we believe it provides a sufficient proof of concept for quantile reliability, making a strong case for its use in the evaluation and development of educational assessments and screeners.
Recommendations include using quantile reliability estimation to validate assessment processes and further investigating the mechanisms that underlie differential reliability across the distribution. Finally, connections to psychometrics and item response theory (IRT) are discussed, highlighting their joint exploration as a future direction for this research.
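To make the estimation procedure concrete, the following is a minimal sketch of quantile-specific split-half reliability in Python. The abstract does not specify the authors' exact CQR adaptation, so this illustrates one plausible reading: standardize the odd- and even-half scores, fit a quantile regression of one half on the other at each decile, treat the resulting slope as a quantile-specific correlation, and apply the Spearman-Brown correction. The function and variable names (quantile_reliability, items) are illustrative, not taken from the study.

    # A sketch under the assumptions stated above; requires pandas and statsmodels.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    def quantile_reliability(items: pd.DataFrame, taus=np.arange(0.1, 1.0, 0.1)):
        """Estimate split-half reliability at each quantile in `taus`.

        `items` holds one row per student and one column per scored item.
        """
        # Form odd- and even-numbered half scores, then standardize each so the
        # quantile-regression slope falls on a correlation-like scale.
        odd = items.iloc[:, 0::2].sum(axis=1)
        even = items.iloc[:, 1::2].sum(axis=1)
        df = pd.DataFrame({
            "odd": (odd - odd.mean()) / odd.std(ddof=1),
            "even": (even - even.mean()) / even.std(ddof=1),
        })
        out = {}
        for tau in taus:
            # Quantile regression of one half on the other at quantile tau;
            # the slope serves as the quantile-specific half-test correlation.
            slope = smf.quantreg("even ~ odd", df).fit(q=tau).params["odd"]
            # Spearman-Brown correction from half-length to full-length test.
            out[round(float(tau), 1)] = 2 * slope / (1 + slope)
        return pd.Series(out, name="split_half_reliability")

A test-retest analogue would substitute standardized scores from repeated administrations (e.g., pairs of the three ORF passages) for the two halves, estimating the correlation between administrations at each decile.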
Descriptors: Educational Assessment, Special Education, Students with Disabilities, Learning Disabilities, Test Reliability, Generalizability Theory, Elementary School Students, Emergent Literacy, Reading Fluency, Reading Tests, Norm Referenced Tests, Intelligence Tests, Cognitive Ability, Psychometrics
Society for Research on Educational Effectiveness. 2040 Sheridan Road, Evanston, IL 60208. Tel: 202-495-0920; e-mail: contact@sree.org; Web site: https://www.sree.org/
Publication Type: Reports - Research
Education Level: Elementary Education
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: Society for Research on Educational Effectiveness (SREE)
Identifiers - Assessments and Surveys: Dynamic Indicators of Basic Early Literacy Skills (DIBELS); Stanford Binet Intelligence Scale
Grant or Contract Numbers: N/A