Peer reviewed
ERIC Number: ED657159
Record Type: Non-Journal
Publication Date: 2021-Sep-28
Pages: N/A
Abstractor: As Provided
ISBN: N/A
ISSN: N/A
EISSN: N/A
Statistical Power When Adjusting for Multiple Hypothesis Tests: Methodology Expansions and Software Tools
Kristin Porter; Luke Miratrix; Kristen Hunter
Society for Research on Educational Effectiveness
Background: Researchers are often interested in testing the effectiveness of an intervention on multiple outcomes, for multiple subgroups, at multiple points in time, or across multiple treatment groups. The resulting multiplicity of statistical hypothesis tests can lead to spurious findings of effects. Multiple testing procedures (MTPs) counteract this problem by adjusting p-values. Without an MTP, the probability of false positive findings increases, sometimes dramatically, with the number of tests. With an MTP, this probability is controlled. However, an important consequence of applying MTPs is a change in statistical power that can be substantial. Unfortunately, while researchers are increasingly using MTPs, they frequently ignore the power implications of their use when designing studies. Consequently, studies can easily be underpowered to detect effects. In other circumstances, studies may be overpowered, with unnecessarily large sample sizes.

Purpose: Our current research builds on methods developed by one of the authors, which estimated statistical power for a multisite, randomized controlled trial (RCT) when applying any of five common MTPs: Bonferroni, Holm, single-step and step-down versions of Westfall-Young, and Benjamini-Hochberg. While the earlier work focused only on RCTs with blocked randomization of individuals in which effects are estimated assuming constant effects across blocks, our current paper extends the core method to a broader range of RCT designs (individual, blocked, and cluster randomized designs of one-, two-, and three-level data) and estimation strategies (regression specifications using fixed and/or random effects). This work complements the existing literature on statistical power in education studies, which does not take multiplicity into account (Dong & Maynard, 2013; Hedges & Rhoads, 2010; Raudenbush et al., 2011; Spybrook et al., 2011). Our methods apply to the multiple definitions of statistical power that exist in studies with multiplicity (Chen, Luo, Liu, & Mehrotra, 2011; Dudoit, Shaffer, & Boldrick, 2003; Senn & Bretz, 2007; Westfall, Tobias, & Wolfinger, 2011). Power is diminished if one focuses on individual power, which is the probability of detecting an effect of a particular size for a particular hypothesis test. This reduction in power may not hold for alternative definitions of power. For example, when testing for effects on multiple outcomes, one might consider 1-minimal power: the probability of detecting effects of at least a particular size on at least one outcome. Similarly, one might consider 1/2-minimal power: the probability of detecting effects of at least a particular size on at least half of the outcomes. At the other extreme is complete power: the power to detect effects of at least a particular size on all outcomes. The choice of definition of power depends on the objectives of the study and on how the success of the intervention is defined. Along with our methodological approach, we are producing an R package and a user-friendly web application that will guide researchers.

Methods: Our methods build from the following insights for estimating power when adjusting for multiple hypothesis tests: (1) When one assumes a correlational structure for the M test statistics, the joint null distribution of the test statistics (t0) is known. (2) For a given RCT design, we can derive the standard error of the treatment effect estimates in effect size units; then, when one specifies a minimum detectable effect size (MDES) for each outcome, the joint alternative distribution of the M test statistics (t1) is also known. (3) Therefore, the test statistics t0 and t1 can be generated (i.e., simulated) with statistical software. That is, one can generate a large number of test statistics under H0 and H1, as if the study had been repeated many times. For example, one may simulate test statistics corresponding to 10,000 draws from the assumed population. Doing so results in a matrix of 10,000 rows and M columns for both t0 and t1, which can be converted to 10,000 x M matrices of p-values, p0 and p1. Any of the MTPs can then be applied to obtain a 10,000 x M matrix of adjusted p-values, which is scored for significance. The result is a 10,000 x M matrix of hypothesis rejection indicators from which all definitions of power can be computed (see the sketch following this abstract). In effect, we rely on simulation, but rather than simulating a large number of datasets, carrying out impact analyses on each simulated dataset, and adjusting the resulting p-values, the approach skips directly to the final step of adjusting p-values, saving complexity and computing time. We validate our power estimates with two strategies: (1) for estimates of individual power for a single hypothesis test, we replicated results with PowerUp! (Dong & Maynard, 2013), and (2) for all definitions of power, we compared our results with power estimates obtained from Monte Carlo full-data simulations in which we repeatedly generate the entire dataset and analyze it according to the different MTP strategies to obtain benchmark power values. Our approach of simulating test statistics builds on work by Bang, Jung, and George (2005), who use simulated test statistics to identify critical values based on the distribution of the maximum test statistic. Their approach coincides with the approach described here for the single-step Westfall-Young MTP. Chen et al. (2011) derived explicit formulas for d-minimal powers of stepwise procedures and for complete power of single-step procedures, but only for up to three tests. The approach presented here is more generally applicable: it can be used for all MTPs, for any number of tests, and for all definitions of power discussed.

Conclusions: Our methods and software provide crucial improvements to current practice in analyses that adjust for multiplicity. They apply to the RCT designs and multiple testing procedures commonly used in education research. The project's impact on future research lies in the potential for more accurate estimates of power (or of MDESs, or of sample size for a given power requirement) and for more appropriate definitions of power than those currently used.
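The definitions of power discussed in the Purpose section can be stated compactly. The notation below is assumed here for illustration rather than taken from the paper: let R_j indicate whether hypothesis test j (of M) is rejected after the MTP adjustment, when the true effects equal the specified MDES values.

```latex
% Power definitions for M adjusted tests, with R_j = 1 if test j is rejected
% (notation assumed for illustration; true effects fixed at the specified MDES values).
\begin{aligned}
\text{individual power of test } j &= \Pr(R_j = 1) \\
d\text{-minimal power} &= \Pr\!\Big(\textstyle\sum_{j=1}^{M} R_j \ge d\Big)
  \quad\text{(1-minimal: } d = 1;\ \tfrac{1}{2}\text{-minimal: } d = M/2\text{)} \\
\text{complete power} &= \Pr\!\Big(\textstyle\sum_{j=1}^{M} R_j = M\Big)
\end{aligned}
```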
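The test-statistic simulation described in the Methods section can be sketched in a few lines. The sketch below is illustrative, not the authors' implementation (their software is an R package): it assumes a normal approximation to the test statistics, a common correlation rho among them, a known standard error in effect-size units, and uses the Holm adjustment from statsmodels as one example MTP. All parameter values are hypothetical.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(12345)

# Illustrative study parameters (assumed values, not from the paper)
M = 5                      # number of hypothesis tests (e.g., outcomes)
rho = 0.4                  # assumed correlation among the M test statistics
mdes = np.full(M, 0.125)   # minimum detectable effect size per outcome
se = 0.05                  # standard error of each effect estimate, in effect-size units
alpha = 0.05
n_draws = 10_000           # simulated "replications" of the study

# Joint alternative distribution of the M test statistics (t1): centered at MDES / SE,
# with the assumed correlation structure. (Null draws, t0, would only be needed for
# resampling-based adjustments such as Westfall-Young; they are omitted here.)
Sigma = rho * np.ones((M, M)) + (1.0 - rho) * np.eye(M)
t1 = rng.multivariate_normal(mdes / se, Sigma, size=n_draws)

# Convert to one-sided p-values (normal approximation): an n_draws x M matrix p1
p1 = stats.norm.sf(t1)

# Apply an MTP (Holm, as one example) to each simulated replication and
# record which hypotheses are rejected
reject = np.array([multipletests(row, alpha=alpha, method="holm")[0] for row in p1])

# Each definition of power is a summary of the rejection-indicator matrix
n_rejected = reject.sum(axis=1)
print("individual power per test:", reject.mean(axis=0))
print("1-minimal power:          ", (n_rejected >= 1).mean())
print("1/2-minimal power:        ", (n_rejected >= np.ceil(M / 2)).mean())
print("complete power:           ", (n_rejected == M).mean())
```

Swapping method="holm" for "bonferroni" or "fdr_bh" (Benjamini-Hochberg) covers the other closed-form adjustments named in the abstract; the Westfall-Young variants would additionally require the simulated null draws t0.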
Society for Research on Educational Effectiveness. 2040 Sheridan Road, Evanston, IL 60208. Tel: 202-495-0920; e-mail: contact@sree.org; Web site: https://www.sree.org/
Publication Type: Reports - Research
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: Society for Research on Educational Effectiveness (SREE)
Grant or Contract Numbers: N/A