ERIC Number: ED658635
Record Type: Non-Journal
Publication Date: 2022-Sep-22
Pages: N/A
Abstractor: As Provided
ISBN: N/A
ISSN: N/A
EISSN: N/A
Determining Power of Correspondence Measures for Assessing Replication Success
Patrick Sheehan; Peter M. Steiner
Society for Research on Educational Effectiveness
Background: Research reproducibility and effect replication have become topics of major concern throughout the social sciences. During the last decade, low replication rates of published research findings became a major issue across disciplines in the social and behavioral sciences, leading to the public proclamation of a "replication crisis" (e.g., Ioannidis, 2005; Klein et al., 2014; Makel & Plucker, 2014; Open Science Collaboration, 2015; Valentine et al., 2011). As this topic grows in prominence, it is important to consider how replication success is assessed, because replication success or failure crucially depends on the chosen metric for assessing correspondence in effect estimates. Unfortunately, there is not yet widespread consensus on which measures are appropriate to use in replication research (Hung & Fithian, 2020; Schauer & Hedges, 2021; Steiner & Wong, 2018; Valentine et al., 2011). Additionally, the properties of the methods used to assess replication are not necessarily clear, especially with regard to the power requirements for conducting a pairwise replication study. This paper derives power formulas for two methods of assessing replication success across two studies, discusses how to use these formulas to determine the necessary sample sizes for a pairwise replication study, and demonstrates the approach with an application.
Purpose: We focus on two methods for assessing correspondence in a pairwise replication study. The first, correspondence in significance pattern (CSP), is the most common method (e.g., Open Science Collaboration, 2015). It involves examining whether both effects are significant and have the same sign. If they are (or if neither is significant), the effect is considered to have been replicated. However, this method is suboptimal: the probability of showing replication depends on the magnitude of the true effect, and the method does not consider both studies jointly. The correspondence test (CT; Steiner & Wong, 2018; Tryon & Lewis, 2008) instead conducts two significance tests on the difference between effects, considering both studies jointly. It consists of a difference test, which checks whether the effects are significantly different from each other, and an equivalence test, which checks whether the effects differ beyond a pre-specified equivalence threshold, δ_E. Depending on which of these tests is significant, four results are possible: equivalence, difference, trivial difference, or indeterminacy. Because replication generally focuses on showing that two effects are the same, we treat the equivalence result as the goal when planning studies. For both methods, we derive the formula for the minimum detectable effect size (MDES) at a given replication probability, which can be used for sample size calculations. The MDES is a transformation of the standard error of an effect estimate (Equation 1): [SEE PDF]. To properly use these methods, studies must be planned such that the selected method has a reasonable chance of showing replication. Thus, we derive the formulas needed to compute the target sample size for both studies comprising a pairwise replication effort. We assume that the replication effort consists of two perfectly implemented RCTs estimating treatment effects τ_k for studies k = 1, 2. The necessary sample sizes have implications for the best use of these methods and for the feasibility of pairwise replication efforts as a whole.
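The [SEE PDF] placeholder above elides Equation 1, which is available only in the full paper. A plausible reconstruction, based on the standard MDES definition in the power-analysis literature and the surrounding text (a sketch, not the paper's verbatim equation), is:

```latex
% Standard MDES--standard-error relation from the power-analysis
% literature; the paper's exact Equation 1 appears only in the full PDF.
\[
  \mathrm{MDES}_k \;=\; \bigl(z_{1-\alpha/2} + z_{1-\beta}\bigr)\,
                        \mathrm{SE}(\hat{\tau}_k)
\]
% e.g., for alpha = 0.05 and power 1 - beta = 0.80, the multiplier is
%   z_{0.975} + z_{0.80} approx 1.96 + 0.84 = 2.80.
```

Inverting this relation yields the standard error, and hence the sample size, implied by a target MDES.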
Results: Figures 1 and 2 show the replication probabilities for each method. Each plot's axes are expressed as the ratio between the effect (or effect difference) and a study's MDES. For the CT, the ratio between δ_E and the MDES also influences the replication probability. Additionally, while CSP requires that each study be represented separately on the graph, the correspondence test can be represented in terms of an "average" MDES. CSP is most likely to show replication when both studies' τ_k are large relative to their MDESs. The CT is most likely to show equivalence when the difference between effects is close to 0 and δ_E is large relative to the studies' average MDES. From each method's replication probability formula, the formula for the required MDES can be derived. Equation 2 gives the MDES for CSP for study k (see Appendix for proofs): [SEE PDF]. Equation 3 gives the required average MDES across both studies for the CT (for an outcome of equivalence): [SEE PDF]. Once the MDES is computed, the necessary sample size is obtained by first converting the MDES to a standard error using Equation 1 and then applying the standard error estimator corresponding to the effect estimator (e.g., a regression estimator). For instance, assume the effect is estimated as a simple treatment-control contrast tested with a z-test, and that the Type I and Type II error rates for all tests are 0.05 and 0.2, respectively. Tables 1 and 2 show the required MDES and n for each study for CSP and the CT, respectively; Table 1 reports these values at different levels of τ_k, while Table 2 reports them at different levels of δ_E.
Conclusions: Based on the power formulas, the sample sizes required for pairwise replication studies are large, especially for the CT. However, CSP is a poor method for determining replication success: it does not consider both effects jointly, and it defines replication success such that vastly different effect estimates can still count as a successful replication. Thus, a pairwise replication study using a high-quality measure of replication success will require a large sample size to have a reasonable chance of showing replication. Given the required sample sizes, pairwise replication studies demand considerable effort and planning to credibly show replication success. Methods that can combine multiple smaller studies, such as response surface modeling or meta-analytic methods, may be more practical for judging whether an effect has replicated (Hedges & Schauer, 2018; Rubin, 1992). However, pairwise measures, specifically the CT, can still be useful for specific study designs such as within-study comparisons. These designs give researchers some degree of control over sources of effect heterogeneity, and the overlapping sample between the two arms induces a dependence between the two effect estimates that can reduce the overall sampling variability and thus the required sample size.
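To make the sample-size logic described in the Results section concrete, the following is a minimal sketch (not the authors' code; the function names, the unit-variance outcome, and the treatment of CSP probability as a product of the two studies' independent powers are our assumptions) of the pipeline: invert the MDES–standard-error relation, convert the required standard error into a per-arm n for a simple treatment-control contrast tested with a z-test, and evaluate the CSP replication probability on the ratio scale used in Figures 1 and 2.

```python
# Sketch of the sample-size pipeline the abstract describes (assumed
# implementation, not the paper's own code).
from scipy.stats import norm

ALPHA = 0.05  # Type I error rate for all tests (as in the abstract)
BETA = 0.20   # Type II error rate, i.e., target power of 0.80

def required_se(mdes: float, alpha: float = ALPHA, beta: float = BETA) -> float:
    """Invert the MDES--SE relation: the standard error implied by a target MDES."""
    multiplier = norm.ppf(1 - alpha / 2) + norm.ppf(1 - beta)  # ~2.80
    return mdes / multiplier

def n_per_arm(mdes: float) -> float:
    """Per-arm n for a two-arm RCT with a standardized (unit-variance)
    outcome, where SE(tau_hat) = sqrt(2 / n) under equal arm sizes."""
    se = required_se(mdes)
    return 2.0 / se ** 2

def csp_replication_prob(tau_over_mdes_1: float, tau_over_mdes_2: float) -> float:
    """Approximate probability that both independent studies reject H0 in
    the same direction (the CSP criterion for two truly nonzero, same-sign
    effects). Inputs are each study's true effect expressed as a ratio to
    its MDES, matching the axes of Figures 1 and 2."""
    multiplier = norm.ppf(1 - ALPHA / 2) + norm.ppf(1 - BETA)

    def power(ratio: float) -> float:
        # Power of a two-sided z-test, ignoring the negligible chance of
        # a significant result with the wrong sign; ratio * multiplier
        # equals tau / SE.
        return norm.cdf(ratio * multiplier - norm.ppf(1 - ALPHA / 2))

    return power(tau_over_mdes_1) * power(tau_over_mdes_2)

if __name__ == "__main__":
    print(round(n_per_arm(0.25)))              # ~251 per arm to detect d = 0.25
    print(csp_replication_prob(1.0, 1.0))      # ~0.64: 0.80 power in each study
```

The last line illustrates the abstract's central point: even when each study is powered at the conventional 0.80 for its own effect, the joint probability that CSP shows replication is only about 0.64, so pairwise replication efforts need larger samples than single-study power conventions suggest.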
Descriptors: Social Science Research, Behavioral Science Research, Replication (Evaluation), Statistical Significance, Generalization, Comparative Testing
Society for Research on Educational Effectiveness. 2040 Sheridan Road, Evanston, IL 60208. Tel: 202-495-0920; e-mail: contact@sree.org; Web site: https://www.sree.org/
Publication Type: Reports - Research
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: Society for Research on Educational Effectiveness (SREE)
Grant or Contract Numbers: N/A