Peer reviewed
ERIC Number: ED656799
Record Type: Non-Journal
Publication Date: 2021-Sep-28
Pages: N/A
Abstractor: As Provided
ISBN: N/A
ISSN: N/A
EISSN: N/A
Do We Know What and How to Test? Improving Measures of Student Achievement in Development Economics
Masha Bertling
Society for Research on Educational Effectiveness
Background/Context: Impact evaluations of schooling reforms in developing countries typically focus on tests of student achievement that are designed and implemented by researchers. Are these tests any good? What practical and principled guidance should researchers in the field follow? We aim to answer these questions. Test scores have, of course, been used in economics research for decades, both as outcomes and as explanatory variables. Yet there is an important difference between their use in the US and Europe and the relatively recent wave of impact evaluations in development economics. In the US or Europe, test scores nearly always come from secondary data sources such as the NLSY, the PISA assessments, or administrative data collected by schooling systems. These assessments were designed and administered independently, frequently involving large teams of psychometricians and testing experts. While economists using these data must make important choices about how to analyze them (see Jacob and Rothstein 2017), they do not typically control what the tests assess, how they are administered, or how they are scored. In contrast, development economists often field their own assessments and thus must make these consequential decisions themselves. While this provides substantial opportunities for tailoring assessments to the relevant population and research question, it also poses significant challenges: as with all quantitative research, any conclusions drawn are circumscribed by data quality, and even the most rigorous experimental design can be compromised by bad measurement. Unfortunately, unlike many other aspects of survey design, few resources exist to guide researchers fielding student assessments in developing countries through these choices.

Purpose/Objective/Research Question: With this paper, we aim to understand how the tests used in international RCTs hold up to psychometric standards. We then provide concrete examples from our published studies to highlight principles and practicalities of appropriate test design in developing countries and present a template of recommendations for researchers to consider. We begin by articulating the uses of test score data in impact evaluations. These uses later guide our choices about what to measure and how, and the trade-offs involved in those decisions. Following this overview, we focus first on considerations of what to measure: the domains of student achievement, the level of achievement that is targeted, and the extent to which test design needs to interact with both the nature of the sample and the nature of the intervention. Then we turn to issues of aggregation and comparability: specifically, (a) how to combine students' responses to multiple test questions into a single aggregate test score, (b) how this score may be made comparable to other samples, and (c) how to facilitate easier interpretation of magnitudes. We pay special attention to the types of comparability that we may target in a typical impact evaluation and to the use of Item Response Theory (IRT) methods for generating these estimates.
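To make the aggregation point concrete, the following is a minimal illustrative sketch in Python, not the authors' code, contrasting a classical percent-correct score with a Rasch (one-parameter IRT) ability estimate. The response matrix and item difficulties are made-up placeholders; in practice, item parameters would be estimated jointly or drawn from a calibrated item bank.

```python
"""Minimal sketch (not the authors' code): two ways of aggregating binary
item responses into a single test score."""
import numpy as np
from scipy.optimize import minimize_scalar

def rasch_ability(responses, difficulties):
    """Maximum-likelihood ability estimate for one student under the Rasch
    model, P(correct on item j) = 1 / (1 + exp(-(theta - b_j)))."""
    def neg_loglik(theta):
        p = 1.0 / (1.0 + np.exp(-(theta - difficulties)))
        p = np.clip(p, 1e-9, 1 - 1e-9)  # guard against log(0) at extreme theta
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    return minimize_scalar(neg_loglik, bounds=(-6, 6), method="bounded").x

# Toy data: 5 students x 6 binary items (1 = correct); purely illustrative.
X = np.array([[1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0],
              [1, 0, 0, 0, 0, 0],
              [1, 1, 1, 1, 1, 1],
              [0, 1, 1, 1, 0, 0]])
b = np.array([-1.5, -0.5, 0.0, 0.5, 1.0, 2.0])  # assumed item difficulties

percent_correct = X.mean(axis=1)  # classical percent-correct score
# Note: all-correct or all-wrong rows have no finite MLE; the bounded
# search simply returns a value near the boundary for them.
irt_theta = np.array([rasch_ability(row, b) for row in X])
print(percent_correct)
print(irt_theta)
```

Under the Rasch model the raw score remains a sufficient statistic for ability on a given test form, but the IRT scale places students who took different (linked) forms on a common metric, which is what makes the cross-sample comparisons discussed above possible.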
Methodology: We conduct a systematic review of 164 experimental studies to characterize current practice in development economics. We code each study on 75 characteristics reflective of current psychometric standards. For a subset of studies (n=41) for which we were able to obtain raw item-level data and question wordings, we conduct more rigorous item-level analysis using both Classical Test Theory and Item Response Theory (IRT) approaches.

Findings/Results/Conclusion: First, based on the systematic review, only a surprisingly small minority of the tests used in international impact evaluations meet modern standards for transparency and quality. While the focus is predominantly on literacy and mathematics, designs vary widely in scope, content, administration, and analysis. For example, only 10% of studies make sample items available, only 7% report reliability, and only 35% of those that do report reliability have estimates above 0.7. The distribution of reliabilities based on raw data is not much better, with 73% of test forms having estimates above 0.7 (Fig 1). For comparison, reliability estimates for all standardized tests in the US (e.g., NAEP, MCAS, SAT, ACT) are above 0.98. Consequently, magnitudes of treatment effects are not currently comparable across studies, and this problem is not fixed by expressing scores in standard deviations. Second, researchers rarely engage with the appropriateness of their test design to their estimands in reported analyses. Yet the interpretation of any estimate is necessarily sensitive to the measurement of the core variables, even where treatments are randomly assigned. To take one concrete example, the metric of "standard deviations per 100 USD," used in the literature to compare the cost-effectiveness of potential interventions, is likely to be uninformative when the studies measure different constructs, follow different procedures to generate aggregate test scores, and use tests targeted at different levels of difficulty and of different lengths. Further, these features differ across populations, often in systematic ways: for instance, dispersion in PISA test scores is lower in countries with low average achievement (Fig 2) and is often lower in younger grades within the same contexts. Finally, for concreteness, we rely on data from Muralidharan, Singh, and Ganimian (2019) to highlight principles and practicalities of appropriate test design in developing countries and to present a template of recommendations for researchers to consider. For example, we discuss how to recover smooth distributions of student achievement without ceiling or floor effects, how to capture treatment effects across the ability distribution, and how to make baseline tests informative and predictive, which, in a small trial, helped us substantially with statistical power (see Fig 3).
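For reference on the reliability figures above: reliability in this setting is commonly reported as Cronbach's alpha (an assumption here, since the review does not pin down the coefficient), the standard Classical Test Theory measure behind the conventional 0.7 threshold. A minimal sketch, again with made-up item-level responses:

```python
"""Minimal sketch (not the authors' code): Cronbach's alpha computed from
item-level scores."""
import numpy as np

def cronbach_alpha(item_scores):
    """item_scores: (n_students, n_items) array of item-level scores."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1)       # variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Toy item-level data: 5 students x 4 binary items; purely illustrative.
X = np.array([[1, 1, 0, 1],
              [1, 1, 1, 1],
              [0, 0, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 0, 0]])
print(round(cronbach_alpha(X), 2))
```

On these toy responses the estimate comes out around 0.52, below the 0.7 threshold, which illustrates how short or noisy tests depress measured reliability and, in turn, the comparability of standardized effect sizes across studies.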
Society for Research on Educational Effectiveness. 2040 Sheridan Road, Evanston, IL 60208. Tel: 202-495-0920; e-mail: contact@sree.org; Web site: https://www.sree.org/
Publication Type: Reports - Research
Education Level: Higher Education; Postsecondary Education; Secondary Education
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: Society for Research on Educational Effectiveness (SREE)
Identifiers - Assessments and Surveys: National Assessment of Educational Progress; Massachusetts Comprehensive Assessment System; Stanford Achievement Tests; ACT Assessment; Program for International Student Assessment
Grant or Contract Numbers: N/A