Peer reviewed
ERIC Number: ED656901
Record Type: Non-Journal
Publication Date: 2021-Sep-28
Pages: N/A
Abstractor: As Provided
ISBN: N/A
ISSN: N/A
EISSN: N/A
Available Date: N/A
Effects of Varying Inclusion Criteria in Published Meta-Analyses: Two Case Studies
Amanda J. Neitzel; Qiyang Zhang; Robert E. Slavin
Society for Research on Educational Effectiveness
Background: Over the years, the quantity and quality of educational research have been rapidly improving. This can be attributed to the growing call for policymakers and practitioners to use evidence of effectiveness in decision-making. In fact, the evidence needed to establish programs as "strong", "moderate", or "promising" has been defined, and its use is recommended and in some cases required by the Every Student Succeeds Act of 2015. In addition, the American Rescue Plan Act of 2021 calls for addressing pandemic-related learning loss with "evidence-based interventions". This combination of more (and better) research and its use by non-academics has created a demand for research syntheses. There have been frequent efforts to improve the quality of these reviews and syntheses through training supported by IES and NSF, as well as through published guidance for these approaches (Alexander, 2020; Pigott & Polanin, 2020). This is important, because the ability of evidence-based reform to meaningfully improve educational outcomes depends on the use of valid and reliable conclusions from research. However, the quality of meta-analyses and research syntheses continues to vary considerably. One of the most variable factors is the set of inclusion criteria employed by each review. It is widely known and accepted that effect sizes are associated with numerous methodological factors such as sample size (Cheung & Slavin, 2016; Kraft, 2020), outcome type (Cheung & Slavin, 2016; de Boer et al., 2014; Kraft, 2020; Wolf, 2021), publication status (Cheung & Slavin, 2016; Polanin et al., 2016), and researcher independence (Borman et al., 2003; Wolf et al., 2020). Because these factors are related to the magnitude of impacts, decisions about inclusion criteria for them can be expected to have an important effect on the conclusions a meta-analysis reaches. For example, it is well established that independent measures tend to produce smaller effect sizes than researcher- or developer-made measures (Cheung & Slavin, 2016; de Boer et al., 2014; Kraft, 2020; Wolf, 2021). Cheung and Slavin's (2016) meta-analysis of 646 otherwise acceptable studies of reading, mathematics, science, and other topics found a mean effect of +0.40 for researcher-made measures, but only +0.20 for independent measures. Therefore, a review that allowed researcher- and developer-made measures would likely claim higher average effect sizes than a review limited to independent measures. These differences in inclusion criteria could even produce contradictory conclusions about the same interventions.
Purpose: The purpose of the present study is to examine how changes to the inclusion criteria of published reviews shift the conclusions of those reviews.
Research Design: The present study uses a case-study approach: it identifies two example reviews and uses meta-analytic methods to re-analyze their data under varying inclusion criteria.
Data Collection: To locate studies for the case studies, the tables of contents of issues of the "Review of Educational Research" published in the past twenty years were read to identify meta-analyses of instructional interventions in K-12 populations with reading or math outcomes. Two publications were identified: a meta-analysis of intelligent tutoring systems (Kulik & Fletcher, 2016) and a review of reading studies of Direct Instruction (Stockard et al., 2018).
Data for each study included in each of the reviews were taken from the published papers and the supplemental materials available online. For both reviews, the data included each study's effect sizes and sample sizes. For the intelligent tutoring systems review, the coded data also included intervention durations and outcome types (as coded in the review). For the Direct Instruction review, the coded data also included research design (as coded in the review), as well as whether each study met other general quality standards (i.e., intent-to-treat [ITT] analysis, no confounding factors, no substantial attrition). In practice, studies that did not meet these standards either limited analyses to completers or to those who had received a certain dose of the treatment (not an ITT analysis), used over-involved research team members as implementers, or had confounding factors (e.g., one cluster unit compared to one cluster unit). Data for both case studies are available on GitHub.
Data Analysis: Each review was synthesized multiple times. First, the original review was repeated without changes, simply to reproduce (as closely as possible) the overall effect size. This process was then repeated while excluding some studies through more rigorous standards (such as excluding researcher- and developer-made measures). We used the "metafor" package (Viechtbauer, 2010) to estimate all random-effects models in the R statistical software (R Core Team, 2020), as sketched after this abstract.
Findings: Results of the re-analyses of Kulik and Fletcher (2016) are shown in Table 1. The original inclusion standards were that a study used either an RCT or QED design and had a duration of at least 30 minutes. The mean effect size of those 50 studies, as analyzed in the paper (without weighting by inverse variance), was +0.65. However, as inverse-variance weights are applied, and then as various additional exclusion criteria are imposed (on sample size, duration, and outcome type), the number of included studies decreases, as does the mean effect size. Under the most selective set of standards, the mean effect size of the 9 most rigorous studies is a nonsignificant +0.09. Results of the re-analyses of Stockard et al. (2018) are shown in Table 2. The mean effect size of the 275 included studies, calculated without weighting by inverse variance (as in the original paper), is +0.52. The same pattern holds: as weights and more stringent inclusion criteria are applied, both the number of included studies and the mean effect size decrease. Under the most selective set of standards (only RCTs and QEDs, no small studies, and meeting general quality standards), the mean effect size across six studies is a nonsignificant -0.02.
Conclusions: The choice of inclusion criteria is vitally important. Research syntheses and meta-analyses are relied upon to provide trustworthy conclusions about what works in education, yet those conclusions are heavily influenced by the stringency of the inclusion criteria. As this paper demonstrates, not all evidence is equally credible. Evidence used for decision-making should come from high-quality meta-analyses that employ rigorous standards.
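A minimal sketch of the estimation described under Data Analysis, in R with "metafor". This is not the authors' code: the column names (yi, vi, design, n, measure), the toy values, and the sample-size cutoff are illustrative assumptions; the actual case-study data are on GitHub as noted above.

library(metafor)  # Viechtbauer (2010)

# Placeholder data: one row per study, with effect size (yi), sampling
# variance (vi), design, total sample size (n), and measure type.
# Values are illustrative only, not drawn from either case study.
dat <- data.frame(
  yi      = c(0.62, 0.41, 0.15, 0.08, 0.55),
  vi      = c(0.090, 0.060, 0.010, 0.008, 0.120),
  design  = c("QED", "RCT", "RCT", "QED", "QED"),
  n       = c(45, 80, 420, 610, 38),
  measure = c("developer", "independent", "independent", "independent", "developer")
)

# 1. Unweighted mean effect size, as computed in the original reviews.
mean(dat$yi)

# 2. Inverse-variance weighted random-effects model (metafor's default REML estimator).
summary(rma(yi, vi, data = dat))

# 3. The same model after stricter inclusion criteria: RCTs only, no small
#    samples (the n >= 60 cutoff is an illustrative assumption), and
#    independent measures only.
strict <- subset(dat, design == "RCT" & n >= 60 & measure == "independent")
summary(rma(yi, vi, data = strict))

Comparing the pooled estimates from steps 1-3 mirrors the comparison reported in Tables 1 and 2: weighting and each added criterion change both the number of included studies and the mean effect size.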
Society for Research on Educational Effectiveness. 2040 Sheridan Road, Evanston, IL 60208. Tel: 202-495-0920; e-mail: contact@sree.org; Web site: https://www.sree.org/
Publication Type: Reports - Research
Education Level: Elementary Secondary Education; Early Childhood Education; Elementary Education; Kindergarten; Primary Education
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: Society for Research on Educational Effectiveness (SREE)
Identifiers - Laws, Policies, & Programs: Every Student Succeeds Act 2015; American Rescue Plan Act 2021
Grant or Contract Numbers: N/A
Author Affiliations: N/A