Peer reviewed
ERIC Number: ED659658
Record Type: Non-Journal
Publication Date: 2023-Sep-29
Pages: N/A
Abstractor: As Provided
ISBN: N/A
ISSN: N/A
EISSN: N/A
Available Date: N/A
What Works for Whom: Evidence Gap Maps of Study Data in the What Works Clearinghouse
Betsy Wolf
Society for Research on Educational Effectiveness
Introduction: The What Works Clearinghouse (WWC) reviews rigorous research on educational interventions with the goal of identifying "what works" and making that information accessible to educators and policymakers. In rating the quality of causal research, the WWC has historically prioritized internal validity over external validity. One critique of the WWC, and of quantitative research in general, is that too little attention has been paid to external validity (Hedges, 2018; Joyce, 2019; Ming & Goldenberg, 2021). The larger concern is that paying too little attention to external validity could lead to inaccurate conclusions about the potential effectiveness of interventions in different contexts (Briggs, 2008; William, 2019). This white paper uses publicly available WWC study data to explore the settings, student populations, and outcome domains included in the WWC evidence base. The purpose of this paper is to map the available evidence to highlight where evidence is abundant as well as where more high-quality research is needed. The paper addresses two questions: (1) For which study settings, outcome domains, and student populations is there high-quality research (or a lack thereof) according to the WWC? (2) Which study settings, outcome domains, and student populations are represented in the WWC's evidence base about "what works" in education? In other words, for which settings, outcome domains, and populations did studies find positive and statistically significant effects according to the WWC?
Data: The advantage of using WWC study data is that studies must demonstrate internal validity, and outcome measures must demonstrate reliability, to be included (WWC, 2022d). WWC study data are therefore one source of high-quality research on educational effectiveness. The final study sample includes 4,048 findings from 1,064 studies; each study may include more than one finding.
Method: The evidence base is presented visually in the form of evidence gap maps (EGMs). EGMs are designed to show where relatively little research exists and more research is needed, as well as where research may be ripe for new syntheses of evidence (Saran & White, 2018). The target audience of this paper is funders of research, researchers, and educators seeking to understand what evidence is available.
Findings: The WWC database offers the most evidence of educational intervention effectiveness on student outcomes in math- and literacy-related outcome domains. In contrast, the WWC offers relatively little evidence in other academic subjects, such as science and social studies, and on educator outcomes, which limits the WWC's capacity to inform evidence-based practices in those domains. The available evidence generally appears to align with the WWC's systematic review efforts. Therefore, to increase the number of findings on outcomes beyond general math and literacy, the WWC may need to commission systematic reviews in other topic areas, and funders of research might also consider funding additional research in these areas. Educators hoping to identify interventions that are effective for specific student populations, or to better understand the types of students who participated in the studies reviewed by the WWC, may encounter difficulties doing so. Students with individualized education plans (IEPs) appear to be underrepresented in the WWC evidence base.
Excluding studies with missing data, 61% of studies had 10% or fewer students with IEPs, while 15% of public school students nationwide received special education services in 2020-21 (NCES, 2022a). Moreover, only 6% of studies included more than 20% students with IEPs, and of these studies, very few or none included outcomes on college readiness, postsecondary education, science, social studies, functional skills (i.e., life skills), or English language development. Similarly, only 14% of studies included more than 20% English learners, and of these studies, very few or none examined outcomes relating to college readiness, postsecondary education, school progress, behavior, school climate, science, social studies, or educators. However, missing data in the WWC database complicate our understanding of its true representativeness: IEP data were missing in 67% of studies, and English learner data were missing in 50% of studies. Given the importance of understanding "what works for whom," it is critical that researchers adequately report--and WWC reviewers consistently record--the characteristics of student samples in impact studies.
Evidence maps depicting the extent to which the WWC assigned evidence tiers to studies--indicating that the studies found positive and statistically significant effects--showed considerable variation across outcome domains. Some outcome domains were assigned evidence tiers at higher rates than others; for example, the K-12 school progress, postsecondary progress, and school climate domains were assigned evidence tiers at higher rates than other domains. Domains that are narrower in scope, such as numbers and operations and writing, were also assigned evidence tiers at higher rates. One potential explanation is that effect sizes are typically larger--and the studies are therefore more likely to receive an evidence tier--when the outcome domain is narrow in scope, when there is lower variation in scores, and when the measures themselves are narrow and tightly aligned with the intervention, as compared to when the outcome is a broad measure of interest, such as a state test (Boyd et al., 2008; Kraft, 2020; Kraft, 2023; Ruiz-Primo, Shavelson, Hamilton, & Klein, 2002; Somers et al., 2022; Wolf & Harbatkin, 2022). Consumers of research should carefully review the outcome measures and domains in studies to ensure overlap and generalizability with their constructs of interest. Moreover, comparing effect sizes across studies is problematic because it may lead to greater implementation of interventions whose impacts have been studied only on narrow measures and whose impacts on standardized measures of student achievement are unknown. While prior work has focused on why effect sizes might differ across outcome domains and measures (Kraft, 2020; Wolf & Harbatkin, 2022), there has been less research on how to make sense of new research findings given differences in effect size distributions (Newcomer, Hall, Pandey, Reginal, & White, 2023). More methodological work is needed. In conclusion, researchers should contextualize the findings of program evaluations and provide appropriate nuance and caveats. Avoiding overly simplistic conclusions about complex educational interventions can only help build trust in research findings.
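For readers unfamiliar with evidence gap maps, the construction described in the Method section is essentially a cross-tabulation of eligible studies along two dimensions. The short Python sketch below illustrates the idea using entirely hypothetical study records; the column names, categories, and cutoffs are illustrative assumptions, not the WWC's actual data schema.

import pandas as pd

# Hypothetical study-level records. All column names, values, and cutoffs
# are illustrative assumptions, not the WWC's actual data schema.
studies = pd.DataFrame([
    {"study_id": 1, "outcome_domain": "Literacy",    "pct_iep": 5,    "positive_significant": True},
    {"study_id": 2, "outcome_domain": "Mathematics", "pct_iep": 25,   "positive_significant": False},
    {"study_id": 3, "outcome_domain": "Science",     "pct_iep": None, "positive_significant": True},
    {"study_id": 4, "outcome_domain": "Literacy",    "pct_iep": 12,   "positive_significant": True},
])

# Bucket one dimension of interest: the share of students with IEPs.
studies["iep_band"] = pd.cut(
    studies["pct_iep"],
    bins=[-1, 10, 20, 100],
    labels=["<=10% IEP", "10-20% IEP", ">20% IEP"],
)

# An evidence gap map is, at its core, a cross-tabulation: rows are one
# dimension (outcome domain), columns are another (student population band),
# and each cell counts the eligible studies. Sparse or empty cells mark
# evidence gaps; studies with missing population data drop out of the map.
print(pd.crosstab(studies["outcome_domain"], studies["iep_band"], dropna=False))

# A second map restricted to positive, statistically significant findings
# shows where the "what works" evidence is concentrated.
positive = studies[studies["positive_significant"]]
print(pd.crosstab(positive["outcome_domain"], positive["iep_band"], dropna=False))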
Society for Research on Educational Effectiveness. 2040 Sheridan Road, Evanston, IL 60208. Tel: 202-495-0920; e-mail: contact@sree.org; Web site: https://www.sree.org/
Publication Type: Information Analyses; Reports - Research
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: Society for Research on Educational Effectiveness (SREE)
Grant or Contract Numbers: N/A
Author Affiliations: N/A