Peer reviewed
ERIC Number: ED663013
Record Type: Non-Journal
Publication Date: 2024-Sep-19
Pages: N/A
Abstractor: As Provided
ISBN: N/A
ISSN: N/A
EISSN: N/A
Do Reported Treatment Effects Generalize to Other Measures of the Same Construct: A Specification Test
Peter F. Halpin
Society for Research on Educational Effectiveness
Background: Meta-analyses of educational interventions have consistently documented the importance of methodological factors related to the choice of outcome measures. In particular, when interventions are evaluated using measures developed by researchers involved with the intervention or its evaluation, the effect sizes tend to be larger than when using independently developed measures of the same or similar constructs (Cheung and Slavin, 2016; de Boer et al., 2014; Lipsey et al., 2012; Lynch et al., 2019; Ruiz-Primo et al., 2002; Wolf and Harbatkin, 2023). The type of outcome measure is often more strongly related to reported effect sizes than any other methodological factor, including whether or not the intervention was randomized. The differences associated with the type of outcome are often an order of magnitude larger than the focal comparisons among educational interventions.

Purpose: In this project, I conceptualize the choice of outcome measure as a pervasive but understudied source of treatment effect heterogeneity and consider how modern psychometric theory can inform current understandings of this heterogeneity. I focus on describing how treatment effects can vary over the individual items that make up educational and psychological assessments (e.g., achievement tests, self-report surveys). Recent research has documented extensive item-level treatment effect heterogeneity in math and literacy education, showing that item-level effects can be "masked" by aggregate null effects (Gilbert et al., 2023), and that aggregate effects can vary substantially due to item-level heterogeneity (Ahmed et al., 2023). More generally, when treatment effects vary over items, this is a clear indication that different assessments of the same construct (i.e., assessments with different items) will lead to different research findings. In this study, I build on this intuitive idea by developing a Hausman-like specification test for evaluating the extent to which observed treatment effects are dependent upon item-level treatment effect heterogeneity that would not be expected to generalize to other assessments of the same construct.

Significance: The conceptual model underlying this work is presented in Figure 1. As recently recognized in the literature on causal inference, the model in panel (a) is tacitly assumed when taking a univariate summary of the assessment items (e.g., the unweighted mean) and using it to study the treatment effect on the target construct (VanderWeele and Vansteelandt, 2022). In the literature on item response theory (IRT), the treatment effect on the target construct (i.e., the blue path) is called impact, and the item-specific treatment effects (i.e., the red paths) are referred to as differential item functioning (DIF; Holland and Wainer, 1993). In the present context, the focal issue is whether the observed treatment effect is attributable to impact or may also reflect DIF. I argue that distinguishing these two cases is important for establishing internal validity. By definition, impact generalizes across assessments of the same construct. For this reason, impact is a valid causal mechanism through which observed treatment effects can arise. On the other hand, DIF with respect to treatment is dependent on the specific items that appear on an assessment, and therefore does not generalize to assessments that do not use those same items.
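To make the impact/DIF distinction concrete, the following is a minimal illustrative sketch; the notation is not taken from the paper and assumes a two-parameter logistic IRT model for binary items. Let Y_ij denote person i's response to item j, theta_i the target construct, and Z_i a treatment indicator:

\[
\operatorname{logit}\, P(Y_{ij} = 1 \mid \theta_i, Z_i) = a_j \theta_i + b_j + d_j Z_i,
\qquad
\theta_i = \beta Z_i + \varepsilon_i .
\]

In this notation, beta corresponds to impact (the blue path) and the d_j to DIF with respect to treatment (the red paths). A treatment effect estimated from the unweighted mean of the items reflects both beta and the d_j, so only the beta component would be expected to carry over to an assessment built from different items.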
The extent to which observed treatment effects are due to impact or to DIF is an empirical question that can be answered using the methodology developed in this study.

Methodology: Halpin (2022a, b) developed a novel IRT-based estimator of impact that is highly robust to DIF. The proposed specification test is constructed by differencing the robust estimator and a more efficient estimator that naively aggregates treatment effects over all items. The latter corresponds to the usual practice of estimating treatment effects using the unweighted mean over assessment items. The null hypothesis that both are consistent estimators of the true treatment impact leads to a "Hausman-like" specification test (Hausman, 1978) of whether the naive estimator is biased by item-level treatment effect heterogeneity. This specification test directly addresses the question of whether observed treatment effects are due to impact on the target construct rather than to item-level heterogeneity that would not be expected to generalize beyond the specific outcome measure used. Figure 2 summarizes a simulation study comparing the two estimators in a situation where progressively more biased items are added, each biased in the same direction (e.g., favoring the treatment group). The robust estimator achieves the theoretical maximum breakdown point for any translation-equivariant estimator, which is 1/2 (Huber and Ronchetti, 2009). This means the proposed estimator is guaranteed not to break down if fewer than 50% of the items exhibit DIF, and that no other estimator of impact can exceed this level of robustness (although other estimators may perform differently on the way to breakdown). With a robust estimate in hand, the next step is to obtain the distribution of its difference from the naive estimator, d. Using Theorem 1 of Halpin (2022a), the asymptotic distribution of d can be derived in closed form. The derivation does not require Hausman's lemma regarding the covariance of an efficient estimator and its difference with an inefficient estimator, which is why I refer to the proposed test as "Hausman-like." Table 1 describes the performance of the specification test using the same simulation study reported in Figure 2. When N.DIF = 0, the Type I error rate is controlled at the nominal level, corroborating the analytical results on the null distribution of the test. As the number of items with DIF increases, the true bias (delta, reported in SD units of the latent trait), the estimated bias, and the statistical power all increase.

Summary: This presentation will explain the specification test in more detail and illustrate its application using pre-post gains on a measure of college-level STEM identity. I am especially excited to hear critical feedback from the SREE methodology community about the potential of this specification test to address concerns about the use of researcher-developed assessments in impact evaluations.
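As a rough illustration of the mechanics described in the methodology, the following is a minimal sketch of a Wald-type version of the test. It assumes one already has the naive and robust impact estimates together with consistent estimates of their variances and covariance; the closed-form expressions from Theorem 1 of Halpin (2022a) are not reproduced here, and the function name and interface are hypothetical.

import numpy as np
from scipy import stats

def hausman_like_test(beta_naive, beta_robust, var_naive, var_robust, cov):
    """Wald-type test of H0: the naive and robust estimators are both
    consistent for the same treatment impact.

    The variance of the difference is computed from the joint asymptotic
    covariance of the two estimators, not from Hausman's lemma, because
    the robust estimator is not assumed to be efficient.
    """
    d = beta_naive - beta_robust                # estimated bias of the naive estimator
    var_d = var_naive + var_robust - 2.0 * cov  # Var(d) from the joint covariance
    z = d / np.sqrt(var_d)                      # approximately standard normal under H0
    p = 2.0 * stats.norm.sf(abs(z))             # two-sided p-value
    return {"d": d, "z": z, "p": p}

# Hypothetical numbers, for illustration only.
print(hausman_like_test(0.32, 0.18, 0.004, 0.006, 0.003))

Under this sketch, a large |z| (small p) suggests the naive, unweighted-mean treatment effect is being pulled away from the robust estimate by item-level heterogeneity (DIF), whereas a non-significant difference is consistent with the observed effect reflecting impact on the target construct.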
Society for Research on Educational Effectiveness. 2040 Sheridan Road, Evanston, IL 60208. Tel: 202-495-0920; e-mail: contact@sree.org; Web site: https://www.sree.org/
Publication Type: Reports - Research
Education Level: Higher Education; Postsecondary Education
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: Society for Research on Educational Effectiveness (SREE)
Grant or Contract Numbers: N/A