Methods to calculate uncertainty in the estimated overall effect size from a random-effects meta-analysis
Abstract
Meta-analyses are an important tool within systematic reviews to estimate the overall effect size and its confidence interval for an outcome of interest. If heterogeneity between the results of the relevant studies is anticipated, then a random-effects model is often preferred for analysis. In this model, a prediction interval for the true effect in a new study also provides additional useful information. However, the DerSimonian and Laird method—frequently used as the default method for meta-analyses with random effects—has long been challenged due to its unfavorable statistical properties. Several alternative methods have been proposed that may have better statistical properties in specific scenarios. In this paper, we aim to provide a comprehensive overview of available methods for calculating point estimates, confidence intervals, and prediction intervals for the overall effect size under the random-effects model. We indicate whether some methods are preferable to others by considering the results of comparative simulation and real-life data studies.
1 INTRODUCTION
Systematic reviews and meta-analyses provide a method for collecting and synthesizing research and are often used to inform decision making. The number of these publications has increased substantially since the 1990s.1 Meta-analysis is a valuable technique to summarize study-specific results and often reduces bias and uncertainty from individual studies. Guidelines and Health Technology Assessment panels, as well as international organizations, including the World Health Organization,2 the European Medicines Agency,3 and governmental agencies worldwide, such as the Canadian Agency for Drugs and Technologies in Health,4 the Institute for Quality and Efficiency in Health Care (IQWiG),5 and the National Institute for Health and Clinical Excellence,6 recognize the need to ensure that the best available evidence informs clinical practice and health care decision making. This typically involves conducting a high-quality knowledge synthesis and meta-analysis. Quantitative results of meta-analyses of relevant studies, in the form of a point estimate and a confidence interval (CI) for the effect size parameter of interest, are invariably considered together with judgements about the quality of the evidence to produce recommendations for practice.7
Quantification of uncertainty in the estimated overall effect size is important in the process of drawing conclusions from a meta-analysis. This uncertainty should ideally account for between-study heterogeneity in the intervention effects across study settings and populations.8, 9 For this reason, the random-effects model is often employed, which includes a between-study variance parameter. The uncertainty of the estimate of the overall effect size can be described by the corresponding CI under the random-effects model, and its width depends on the magnitude of the between-study variance, the number of studies, the precision of the study-specific effect sizes, and the significance level.10
The estimation of the CI for the overall effect size is often conducted with the Wald-type (WT) method using a normal distribution, with variance equal to the inverse of the sum of the study weights, and the DerSimonian and Laird11 estimator for the between-study variance, and this has been used routinely in many meta-analyses.12 However, numerous shortcomings of this approach have been raised, including that the CI for the overall effect size generally does not retain its nominal coverage probability (ie, the proportion of times that the interval includes the true value), and hence, it underestimates the statistical error, producing overconfident results.12-18 This is mainly because the WT CI is based on a large-sample approximation (in terms of the number of studies), and the number of studies is usually small. Typically, the number of studies synthesized in a meta-analysis in medical research is less than 20, suggesting that any large-sample approximation is likely to be inaccurate.19-23 Several attempts to improve the standard WT CI approach have been suggested, each of which has different statistical properties.
Another important aim in decision making is the prediction of the true effect size in an individual (future) study and setting. Higgins et al24 suggested the use of prediction intervals under the random-effects model for this purpose. The use of prediction intervals has been promoted, and although they have not often been employed in practice, they provide useful additional information.25, 26
In this paper, we aim to provide a comprehensive overview of available methods in the methodological literature for calculating a CI for the overall effect size under the random-effects model and to indicate whether some methods are preferable to others by considering the results of comparative simulation and real-life data studies. We also examine potential issues surrounding the computation of prediction intervals under the random-effects model.
The article is structured as follows. In Section 2, we present the conventional meta-analytic models and set up our notation (section 2.1) and describe our review methodology (section 2.2). In Section 3, we describe the statistical methods found in our literature review. In section 3.1, we describe 15 identified methods to calculate a CI for the overall effect size. In section 3.2, we discuss the comparative performances of different methods for computing a CI for the overall effect size, as described in previous studies, and summarize recommendations made by their respective authors. In section 3.3, we discuss methods for computing prediction intervals. We conclude with a discussion of all intervals (confidence and prediction intervals) in Section 4.
2 METHODS
2.1 Meta-analysis models and notation
The conventional fixed- and random-effects models are the two main meta-analysis models to synthesize the study results.27 The random-effects model accounts for two sources of variation, quantified by the within-study variance (vi) and the between-study variance (τ2). When the between-study variance estimate is zero (τ2 = 0), the fitted random-effects model collapses to the fixed-effect model (also known as common-effect model),28, 29 and therefore, the random-effects model can be considered a generalization of the fixed-effect model (ie, the fixed-effect model is a special case of the random-effects model). Confidence intervals under the fixed-effect model can have poor properties even for low but nonzero heterogeneity.13, 30
Both the conventional fixed- and random-effects models require an estimated effect size yi (such as log-odds ratio) and an estimated (within-study) variance vi (var(yi) = vi) from every included study i, i = 1, …, k. The choice between the two models has been widely discussed in the literature31-33 and summarized in the Cochrane Handbook.9 In this paper, we focus on the random-effects meta-analysis model using inverse-variance weighting. Other techniques to combine study information to calculate the overall effect size are also available, such as weighting by sample size34, 35 or using confidence distributions.36, 37 Also, dichotomous outcomes do not require inverse-variance methods, as they can be modeled directly using one-step models (eg, generalized linear mixed models).38 Alternative methods accounting for heterogeneity by the use of a multiplicative parameter, where study weights are independent of observed heterogeneity, are also available.10, 39, 40 The description of these methods is beyond the scope of this review.
Under the random-effects model, the observed effect size in study i is modeled as

$y_i = \mu + \delta_i + \varepsilon_i, \quad \delta_i \sim N(0, \tau^2), \quad \varepsilon_i \sim N(0, v_i),$

where μ is the overall effect size, δi is the deviation of the true effect in study i from μ, and εi is the sampling error. Using inverse-variance weighting, the overall effect size and its variance are estimated as

$\hat{\mu} = \frac{\sum_{i=1}^{k} \hat{w}_i y_i}{\sum_{i=1}^{k} \hat{w}_i}, \quad \hat{w}_i = \frac{1}{v_i + \hat{\tau}^2}, \quad \widehat{\operatorname{Var}}(\hat{\mu}) = \frac{1}{\sum_{i=1}^{k} \hat{w}_i},$

where $\hat{\tau}^2$ is an estimate of the between-study variance. Setting $\hat{\tau}^2 = 0$ gives the fixed-effect estimate and its variance.
The uncertainty in an estimated effect size for a given study, in relation to its study-specific true effect size, is expressed via the within-study variance vi. The standard approach described above assumes that the estimated within-study variances vi are fixed and known although they have to be estimated from the data. This assumption is justifiable when each study size is sufficiently large. The vi estimation is sensitive not only to the study size but also to the data type and effect size used. For example, the vi estimator for continuous outcomes when using the standardized mean difference depends on the estimated effect size. Hence, although vi is assumed fixed and known, vi is in fact estimated with some uncertainty. Several authors point out that this assumption could affect the estimation of the overall effect size, its variance, and related inferences.13, 41-45 Therefore, calculating study weights through within-study variances and assuming that they are known constants may have less desirable properties. This issue has previously been discussed, and some attempts have been made to account for the uncertainty in the weights.46-49 Hence, among other factors, the performance of the CI methods depends on how well we estimate the study weights.
Similarly, the estimation of τ2 is performed with some uncertainty, and this uncertainty depends on the size and number of studies in the meta-analysis, as well as the magnitude of the between-study variability. Factors such as these have implications for the accuracy of the standard statistical methods described in this paper, which has motivated many of the attempts to improve upon them. There are also many methods to estimate τ2, any of which can be used in some of the CI methods described below, and we refer to a previous publication on this topic.50 Also, for a review of the simulation studies evaluating the comparative performance of τ2 estimators, we direct the reader elsewhere.51
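To make the quantities above concrete, the following minimal base-R sketch computes the DerSimonian and Laird estimate of τ2 and the random-effects estimate of μ. The data values are hypothetical and purely illustrative; later sketches in this review reuse these objects.

```r
# Hypothetical data: log-odds ratios (yi) with within-study variances (vi)
yi <- c(-0.40, -0.11, -0.63, -0.25, 0.03, -0.51)
vi <- c(0.04, 0.10, 0.07, 0.02, 0.12, 0.09)
k  <- length(yi)

# DerSimonian and Laird (DL) estimator of the between-study variance tau^2
w_f  <- 1 / vi                                   # fixed-effect weights
mu_f <- sum(w_f * yi) / sum(w_f)                 # fixed-effect estimate
Q    <- sum(w_f * (yi - mu_f)^2)                 # Cochran's Q statistic
tau2 <- max(0, (Q - (k - 1)) / (sum(w_f) - sum(w_f^2) / sum(w_f)))

# Random-effects weights, overall effect size, and its variance
w    <- 1 / (vi + tau2)
mu   <- sum(w * yi) / sum(w)
v_mu <- 1 / sum(w)
```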
2.2 Review methods
We searched PubMed from inception until April 29, 2016, to identify full-text research articles that describe or compare methods for calculating a CI for the overall effect size μ in simulations or in real data sets. We scanned the references of the selected articles for additional relevant articles, and we conducted general internet searches using the web search engine Google. We also used our networks of professional collaborations to identify potentially relevant articles. We included all studies that report the development or comparison of methods to calculate a CI for the overall effect size under the random-effects model. We also included studies reporting on prediction interval methods identified from our internet searches and networks of collaborations. We excluded commentaries, abstracts, and studies written in languages other than English, and studies relating to the hypothesis tests for the overall effect size. We restricted our investigation to CI methods developed under the random-effects meta-analysis model that assume the true study-specific effects are normally distributed, while we excluded CI methods developed for network meta-analysis, one-stage individual patient data meta-analysis, meta-analysis of diagnostic test accuracy studies, meta-analysis of multiple outcomes, and meta-regression analysis. One reviewer (AAV) summarized methods and studies' conclusions from each included article and recorded any conclusions from comparative articles (studies that compare at least two methods). The information extracted refers to the performance of the various methods and the judgements deduced about their relative advantages, and this information was checked by all coauthors. Disagreements were resolved by discussion. The PubMed search strategy is included in Appendix S1.
We describe known properties of the methods in terms of coverage probability and CI width in section 3.2. The closer the coverage probability is to the nominal level (usually 0.95), the better the CI is considered to be. A CI is exact when the actual coverage equals the nominal coverage. The coverage probability is closely related to the type I error of the hypothesis test on the overall effect size: Assuming the null hypothesis is true, one minus the type I error rate is the coverage probability. A further criterion for comparing methods is that methods that provide narrower CIs, whilst retaining the correct coverage probability, are preferable because they increase precision and hence are more informative. All statistics presented in this paper refer to two-tailed tests.
3 RESULTS
The database search returned 5628 matches in PubMed and 20 records identified through other sources and searching reference lists. In total, 69 publications met the eligibility criteria, which are listed in Appendix S2. We identified 15 methods to compute a CI for the overall effect size. The properties of those methods have been evaluated in 31 research papers, which together report 30 simulation studies and 32 real-life data evaluations of two or more methods (some papers report both). Below, we present the 15 identified approaches in seven broad categories, and in a separate section, we present the comparative results of the identified simulations and studies using real data sets (see Appendices S3-S4 for simulation scenarios and study characteristics and Appendix S5 for a summary of performance measures in simulation studies). In Table 1, we summarize the methods available in several software options.112
Software | License type | WTz (Method 1) | WTt (Method 2) | HKSJ^c (Method 4) | Modified HKSJ (Method 5) | PL (Method 6) | Skovgaard (Method 8) | HC^b (Method 9) | BT^d (Method 10) | Bootstrap (Methods 12, 13) | FP (Method 14) | Bayes (Method 15)
---|---|---|---|---|---|---|---|---|---|---|---|---
Comprehensive Meta-Analysis | Commercial | Yes | - | Yes | - | - | - | - | - | - | - | -
Excel—MetaEasy AddIn | Freeware | Yes | Yes | - | - | Yes | - | - | - | - | Yes | -
Excel—MetaXL AddIn | Freeware | Yes | - | - | - | - | - | - | - | - | - | -
HLM | Commercial | - | - | - | - | Yes | - | - | - | - | - | -
Meta-DiSc | Freeware | Yes | - | - | - | Yes | - | - | - | - | - | -
Metawin | Commercial | Yes | - | - | - | - | - | - | - | Yes | - | -
MIX | Commercial | Yes | - | - | - | - | - | - | - | - | - | -
MLwin | Freeware | Yes | - | - | Yes | - | - | - | Yes | - | Yes | -
Open Meta Analyst | Freeware | Yes | - | - | - | - | - | - | - | - | - | -
RevMan | Freeware | Yes | - | - | - | - | - | - | - | - | - | -
R | Freeware | Yes (MAd, meta, metafor, metagen, metaLik, metamisc, metaSEM, metatest, metaplus, mvmeta, mvtmeta, netmeta, rmeta) | Yes (metaplus) | Yes (MAd, meta, metafor, metamisc, metatest) | - | Yes (metaLik, metaplus) | Yes (metaLik, metatest) | Yes (metafor) | Yes (metaxa^e) | Yes (metaplus, boot) | Yes (metafor) | Yes (bayesmeta^f, blme, BRugs, gemtc, metamisc, R2WinBUGS, rjags)
SAS | Commercial | Yes (marandom.sas, PROCs GLM and MIXED) | Yes (PROCs GLM and MIXED) | - | - | Yes (marandom.sas, PROC NLP) | - | - | - | - | - | Yes (PROC MCMC, SASBUGS)
Stata | Commercial | Yes (metaan, metan, metareg, mvmeta, xtreg) | - | - | Yes (metareg) | Yes (gllamm, metaan) | - | - | - | Yes (bootstrap) | Yes (metaan) | -
SPSS | Commercial | Yes (meanes.sps, metaf.sps, metareg.sps) | - | - | - | - | - | - | - | - | - | -
BUGS, OpenBUGS, WinBUGS | Freeware | - | - | - | - | - | - | - | - | - | - | Yes
- Abbreviations: BT, Biggerstaff and Tweedie; FP, Follmann and Proschan; HC, Henmi and Copas; HKSJ, Hartung-Knapp/Sidik-Jonkman; PL, Profile likelihood; WTz, Wald-type with a normal distribution; WTt, Wald-type with a t distribution; WTqa, Quantile approximation; ZL, Zeng and Lin.
- ‘-’ denotes that the underlying method is not available in the specific software.
- a A resampling test is available in the R package metatest.52
- b Henmi and Copas30 provide an R code to implement the HC method.
- c IntHout et al16 provide an approach to easily convert WTz CIs to HKSJ CIs.
- d Biggerstaff and Tweedie53 provide a SAS code to implement the BT method.
- e This package uses the exact random-effects weights in the Biggerstaff and Tweedie approach.54
- f Bayesian approaches can be implemented using Markov chain Monte Carlo (MCMC) techniques in several software packages, such as OpenBUGS55 or WinBUGS,56 or without MCMC, as described by Turner et al57 and implemented in the R package bayesmeta.58
- Comprehensive Meta-Analysis59 www.meta-analysis.com/
- Excel using the MetaEasy AddIn60 https://www.jstatsoft.org/article/view/v030i07 or MetaXL AddIn http://www.epigear.com/
- HLM61 http://www.ssicentral.com/hlm/
- Meta-DiSc62 ftp://ftp.hrc.es/pub/programas/metadisc/
- Metawin63 http://www.metawinsoft.com/
- MIX64 www.mix-for-meta-analysis.info/
- MLwin65 http://www.bristol.ac.uk/cmm/software/mlwin/
- Open Meta Analyst66 http://www.cebm.brown.edu/openmeta/
- RevMan67 www.cochrane.org/
- R68 http://www.r-project.org/ Packages: bayesmeta,58 blme,69 boot,70, 71 BRugs,72 MAd,73 mada,74 meta,75 gemtc,76 metafor,77 metagen,78 metaLik,79, 80 metamisc,81 metaplus,82 metaSEM,83 metatest,52, 84 metaxa,54 mvmeta,85 mvtmeta,86 netmeta,87 R2WinBUGS,88 rjags,72 rmeta.89
- SAS90 http://www.sas.com/technologies/analytics/statistics/stat/ Macros: marandom.sas,91 PROC IML,92 PROC MIXED,93, 94 PROC GLIMMIX,95 SASBUGS,96 RASmacro,97 PROC MCMC.98
- Stata99 www.stata.com/ Routines: bootstrap,100 gllamm,101 metaan,102 metareg,103 metan,104 mvmeta,105 xtreg.106
- SPSS107 http://www.spss.co.in/ Macros: meanes.sps,108 metaf.sps,109 metareg.sps.110
- BUGS,111 OpenBUGS,55 WinBUGS56 www.mrc-bsu.cam.ac.uk/bugs/.
3.1 CI for the overall effect size
3.1.1 WT methods
- Wald-type normal distribution (WTz) CIs (method 1)

The WTz method assumes that the standardized overall effect size, $(\hat{\mu} - \mu) / \sqrt{\widehat{\operatorname{Var}}(\hat{\mu})}$, follows a standard normal distribution, giving the 100(1 − α)% CI

$\hat{\mu} \pm z_{1-\alpha/2} \sqrt{1 \Big/ \sum_{i=1}^{k} \hat{w}_i},$

where $z_{1-\alpha/2}$ is the $1 - \alpha/2$ quantile of the standard normal distribution. The DerSimonian and Laird11 estimator of τ2 is typically used in the weights, although any other estimator may be employed.
- Wald-type t-distribution (WTt) CIs (method 2)

The WTt method replaces the standard normal quantile with the corresponding quantile of the t distribution with k − 1 degrees of freedom, giving the wider CI $\hat{\mu} \pm t_{k-1, 1-\alpha/2} \sqrt{1 / \sum_{i=1}^{k} \hat{w}_i}$.
- Quantile approximation (WTqa) CIs (method 3)

Brockwell and Gordon122 proposed replacing the standard normal quantile with an empirically derived critical value bk that depends on the number of studies k, obtained by simulating the distribution of the standardized overall effect size, giving the CI $\hat{\mu} \pm b_k \sqrt{1 / \sum_{i=1}^{k} \hat{w}_i}$.
However, both the number of studies k and the magnitude of τ2 may impact on the performance of the WTqa method,122 and changes in the distribution of the within-study variances can importantly impact on bk.127 Although the WTqa approach has been criticized on the grounds that it is, at best, very difficult to obtain suitable critical values bk that apply to all meta-analyses,127 we include it in this paper for completeness. As a conservative approach, Jackson and Bowden127 suggested the use of the standard normal quantile instead, together with an assessment of the robustness of the findings via a sensitivity analysis using alternative quantiles. Brockwell and Gordon122 developed the WTqa method using the DerSimonian and Laird11 estimator of τ2, but WTqa could, in principle, be implemented with any alternative τ2 estimator. However, further development of the WTqa CI is not advised.127
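As a sketch of methods 1 and 2, continuing the base-R example from section 2.1 (WTqa is indicated only schematically, since the critical values bk must be obtained by simulation):

```r
alpha <- 0.05
se    <- sqrt(v_mu)

wtz <- mu + c(-1, 1) * qnorm(1 - alpha / 2) * se           # method 1: normal quantile
wtt <- mu + c(-1, 1) * qt(1 - alpha / 2, df = k - 1) * se  # method 2: t quantile, k - 1 df
# Method 3 (WTqa) replaces qnorm(1 - alpha / 2) with a simulated critical value b_k
```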
3.1.2 Hartung-Knapp/Sidik-Jonkman CIs (methods 4 and 5)

The HKSJ method (method 4) scales the variance of the overall effect size by a small-sample adjustment q and uses the t distribution with k − 1 degrees of freedom:

$q = \frac{1}{k-1} \sum_{i=1}^{k} \hat{w}_i (y_i - \hat{\mu})^2, \qquad \hat{\mu} \pm t_{k-1, 1-\alpha/2} \sqrt{\frac{q}{\sum_{i=1}^{k} \hat{w}_i}}.$

The adjustment q is the generalized heterogeneity statistic Qgen, computed with the random-effects weights $\hat{w}_i$, divided by its degrees of freedom k − 1.
When all variance components, including the between-study variances, are fixed and known, the expected value of Qgen is k − 1, which equals the degrees of freedom of the associated χ2 distribution.129-131 Hence, the small-sample adjustment q will tend to be close to 1. However, q may in fact turn out to be much smaller than 1, such as in cases where the effect sizes are very homogeneous or when the number and/or size of studies is small. This leads to a narrower CI than the WTt approach and can also lead to a narrower CI compared with the WTz method.132, 133 Although the t quantile is always larger than the z quantile associated with the WTz method, explaining in part why the HKSJ CI performs better than the WTz CI, in the case of $\sqrt{q}\, t_{k-1, 1-\alpha/2} < z_{1-\alpha/2}$, the HKSJ CI will be narrower than the WTz CI.

To avoid such overly narrow intervals, Knapp and Hartung125 suggested a modified HKSJ method (method 5) in which the adjustment is truncated at 1, that is, q* = max(1, q) is used in place of q, so that the resulting CI is never narrower than the WTt CI.

Sidik and Jonkman120 also considered an alternative estimator of the variance of the overall effect size,

$\widehat{\operatorname{Var}}_r(\hat{\mu}) = \frac{\sum_{i=1}^{k} \hat{w}_i^2 (y_i - \hat{\mu})^2}{\left(\sum_{i=1}^{k} \hat{w}_i\right)^2}.$

This is a robust estimator of $\operatorname{Var}(\hat{\mu})$, where the inverse of the study weights ($v_i + \tau^2$) is estimated through the squared sample residuals ($(y_i - \hat{\mu})^2$) from the data, rather than assuming $\operatorname{var}(y_i) = v_i + \tau^2$.
However, Sidik and Jonkman120 state that $\widehat{\operatorname{Var}}_r(\hat{\mu})$ is biased when k is small, and hence, they suggest a bias-corrected estimator of $\operatorname{Var}(\hat{\mu})$ (see Sidik and Jonkman120 for details). An alternative approach based on the expected information and on appropriately modified degrees of freedom of the t distribution was suggested by Kenward and Roger.136 These alternative expressions for $\operatorname{Var}(\hat{\mu})$ could also be used in WT CIs but have not been adopted in practice, so we do not explore their use further here.
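Continuing the base-R sketch, the HKSJ and modified HKSJ CIs follow directly from the quantities already computed; the truncation q* = max(1, q) is what guarantees the method-5 interval is never narrower than WTt:

```r
# HKSJ (method 4): scale the variance by the small-sample adjustment q
q       <- sum(w * (yi - mu)^2) / (k - 1)
se_hksj <- sqrt(q / sum(w))
hksj    <- mu + c(-1, 1) * qt(1 - alpha / 2, df = k - 1) * se_hksj

# Modified HKSJ (method 5): truncate q at 1
q_star <- max(1, q)
mhksj  <- mu + c(-1, 1) * qt(1 - alpha / 2, df = k - 1) * sqrt(q_star / sum(w))
```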
3.1.3 Likelihood-based methods
- Profile likelihood CIs (method 6)

The profile likelihood (PL) method constructs a CI by inverting the likelihood ratio statistic. Under the random-effects model, the log-likelihood is

$\ell(\mu, \tau^2) = -\frac{1}{2} \sum_{i=1}^{k} \left[ \ln\!\left(2\pi(v_i + \tau^2)\right) + \frac{(y_i - \mu)^2}{v_i + \tau^2} \right],$

and the profile log-likelihood for μ is obtained by maximizing $\ell(\mu, \tau^2)$ over τ2 at each fixed value of μ. The 100(1 − α)% PL CI consists of all values μ0 satisfying

$2\left[\ell(\hat{\mu}, \hat{\tau}^2) - \ell(\mu_0, \hat{\tau}^2(\mu_0))\right] \le \chi^2_{1, 1-\alpha},$

where $\hat{\mu}$ and $\hat{\tau}^2$ are the maximum likelihood estimates and $\hat{\tau}^2(\mu_0)$ maximizes the likelihood at μ = μ0. In contrast to the WT methods, the resulting interval need not be symmetric around $\hat{\mu}$, and the estimation of τ2 is taken into account directly.
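A minimal base-R sketch of the PL computation, profiling τ2 numerically and inverting the likelihood ratio statistic. The search bounds (τ2 at most 10, μ within ±5 standard errors) are pragmatic assumptions for this illustration and may need widening for other data:

```r
# Profile log-likelihood for mu (up to an additive constant)
prof_ll <- function(mu0) {
  nll <- function(t2) 0.5 * sum(log(vi + t2) + (yi - mu0)^2 / (vi + t2))
  -optimize(nll, interval = c(0, 10))$objective  # maximize over tau^2
}

opt    <- optimize(prof_ll, interval = mu + c(-5, 5) * sqrt(v_mu), maximum = TRUE)
mu_ml  <- opt$maximum                    # maximum likelihood estimate of mu
ll_max <- opt$objective
drop   <- qchisq(1 - alpha, df = 1) / 2  # allowed drop in the log-likelihood

# CI bounds: where the profile log-likelihood crosses ll_max - drop
lo <- uniroot(function(m) prof_ll(m) - (ll_max - drop),
              c(mu_ml - 5 * sqrt(v_mu), mu_ml))$root
hi <- uniroot(function(m) prof_ll(m) - (ll_max - drop),
              c(mu_ml, mu_ml + 5 * sqrt(v_mu)))$root
c(lo, hi)
```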
- Higher order likelihood inference methods (methods 7 and 8)
As Reid137 explains, the main asymptotic properties of likelihood-based inference include (a) consistent, asymptotically normal, and efficient maximum likelihood estimators; (b) an asymptotically normally distributed score statistic with mean zero; and (c) an asymptotic chi-squared distributed likelihood ratio statistic. For example, the PL CI in the previous section relies upon the third of these standard results. As Reid137 also explains, higher order asymptotic results for likelihood-based inference are also available. Some higher order likelihood inference methods have recently been applied to meta-analysis, which is a situation where they may be thought to be especially valuable. This is because the number of studies is often small, so that the commonly used “lower-order” asymptotic approximations to the likelihood function will be inadequate. Higher order likelihood–based methods therefore have the potential to produce more accurate results in meta-analysis, and several proposals for this have been made. We briefly summarize the methods here, but the details are technical, and so we refer the reader to the articles cited below for more information.
The Bartlett-type correction of the likelihood ratio statistic was first introduced by Bartlett (method 7).138 Noma17 explains how to apply this to random-effects meta-analysis and so use a higher-order approximation than the PL method above. Noma17 also explains how to use the score statistic to compute CIs and subsequently derives a higher order Bartlett-type adjustment to this score. Skovgaard139 proposed an alternative higher order approximation to the profile log-likelihood (method 8), and Guolo79 explains how to apply this to random-effects meta-analysis. For details on the method, we direct the reader elsewhere.79, 140 The higher order asymptotic methods have a higher degree of accuracy, but in some cases (eg, when the between-study variance is close to zero), they may produce numerically unstable maximum likelihood estimates due to the discontinuity of the statistic.79, 140-142 In such cases, a bias reduction approach is suggested.143 Hence, the Bartlett-type correction (method 7) and the Skovgaard statistic (method 8) are the two main proposals for higher order approximations when using methods based on the PL.79
3.1.4 Henmi and Copas CIs (method 9)

Henmi and Copas30 (HC) base inference on the fixed-effect point estimate, $\hat{\mu}_F = \sum_{i=1}^{k} (y_i / v_i) \big/ \sum_{i=1}^{k} (1 / v_i)$, which gives relatively less weight to small studies and is therefore less vulnerable to publication bias than the random-effects estimate. The variance of $\hat{\mu}_F$ is, however, evaluated under the random-effects model, and the CI is derived from an improved approximation to the distribution of the resulting standardized statistic; for details, see Henmi and Copas.30
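As noted in Table 1, the R package metafor implements the HC method; a minimal sketch, reusing the hypothetical data from section 2.1 (consult the package documentation for the full interface):

```r
library(metafor)

res <- rma(yi = yi, vi = vi, method = "DL")  # standard DL random-effects fit
hc(res)                                      # Henmi-Copas estimate and 95% CI
```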
3.1.5 Biggerstaff and Tweedie CIs (method 10)

Biggerstaff and Tweedie53 (BT) account for the uncertainty in the DerSimonian and Laird11 estimator of τ2. They approximate the distribution of the Q statistic, and hence of $\hat{\tau}^2_{DL}$, and use it to compute adjusted study weights $\tilde{w}_i = E\left[1 / (v_i + \hat{\tau}^2_{DL})\right]$ in place of $\hat{w}_i$. The overall effect size and its variance are then estimated using these adjusted weights, and a WT-type CI is constructed. A refinement using exact, rather than approximate, random-effects weights is also available.54
3.1.6 Resampling methods
- Zeng and Lin CIs (method 11)
Zeng and Lin (ZL)146 examine the distribution of the estimated overall effect size under the random-effects model and find that it is not asymptotically normally distributed for a finite number of studies k. This makes intuitive sense, because the textbook result that a linear combination of normal random variables is normally distributed requires that the coefficients in this linear combination are constants. When estimating the overall effect size however, these coefficients are proportional to the weights and so are functions of the estimated between-study variance. We require a large number of studies in order to estimate this variance accurately enough to treat the weights as fixed constants. Zeng and Lin146 therefore propose a resampling procedure that approximates the finite-sample distribution of the standardized overall effect size and uses quantiles of this resampling distribution, instead of standard normal quantiles, to construct the CI.
- Bootstrap CIs (methods 12 and 13)
Nonparametric bootstrapping is a way to approximate the sampling distribution of a statistic by resampling, from the sample itself, with replacement. Parametric bootstrapping instead samples from a fitted model. Both forms of bootstrapping can be used to make a variety of inferences but are most usually used to quantify the uncertainty in point estimates through the computation of standard errors and CIs. Briefly, bootstrap datasets are sampled (either nonparametrically, method 12,147, 148 or parametrically, method 13),149 from which the bootstrap statistics (the statistic of interest calculated using the bootstrap datasets) are calculated. Then, the empirical distribution of the bootstrap statistics is taken to approximate the distribution of the statistic of interest. Hence, measures of the uncertainty in the statistic, such as standard errors and CIs, can be calculated from this empirical distribution. In our context, this statistic is the estimated overall effect size.
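A minimal nonparametric bootstrap (method 12) in base R, resampling whole studies with replacement and re-estimating τ2 in every replicate; the percentile interval shown is one of several possible bootstrap CIs:

```r
set.seed(1)
B <- 5000
boot_mu <- replicate(B, {
  idx <- sample(k, replace = TRUE)  # resample studies with replacement
  yb <- yi[idx]; vb <- vi[idx]
  wf <- 1 / vb
  mf <- sum(wf * yb) / sum(wf)
  Qb <- sum(wf * (yb - mf)^2)
  t2 <- max(0, (Qb - (k - 1)) / (sum(wf) - sum(wf^2) / sum(wf)))
  wb <- 1 / (vb + t2)
  sum(wb * yb) / sum(wb)            # bootstrap overall effect size
})
quantile(boot_mu, c(alpha / 2, 1 - alpha / 2))  # percentile CI
```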
- Follmann and Proschan CIs (method 14)
Permutation tests have been suggested primarily to assess the true statistical significance of an observed finding under the null hypothesis of the absence of effect, especially in meta-analyses with a small number of studies.152 This method can be extended and used for calculating CIs for the effect size. These tests are especially appropriate when the included studies in a meta-analysis may not be considered randomly sampled from a larger population of studies. Confidence intervals can be constructed by inverting hypothesis tests, where parameter values that are not rejected by the hypothesis test lie within the corresponding CI.

Follmann and Proschan117 (FP) proposed a permutation test of H0 : μ = 0 that exploits the symmetry of the effect sizes under the null hypothesis: if there is no overall effect, each observed yi is as likely to have been observed as −yi. The test statistic (the standardized overall effect size) is recomputed for all 2k possible sign assignments ±yi, re-estimating τ2 each time, and H0 is rejected at level α if the observed statistic lies among the proportion α of most extreme values of the resulting permutation distribution.
The procedure described above can be extended so that we instead test the hypothesis H0 : μ = c, and invert this hypothesis test to give the bounds of CIs. For further details on the FP method, we refer the reader to Follmann and Proschan.117 This method has been suggested as an alternative approach to the WT and likelihood-based approaches that assume normality of the observed effects, but it can be computationally demanding. The discrete nature of the permutation distribution will ensure that the CI maintains the desired coverage probability, but in general, this coverage probability will be larger than the nominal level.
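A base-R sketch of the FP sign-flip test of H0 : μ = 0, enumerating all 2^k sign assignments (feasible for small k); inverting the test over a grid of values c, applied to the shifted effects yi − c, yields the CI:

```r
re_stat <- function(y) {  # standardized overall effect, re-estimating tau^2
  wf <- 1 / vi
  mf <- sum(wf * y) / sum(wf)
  Qp <- sum(wf * (y - mf)^2)
  t2 <- max(0, (Qp - (k - 1)) / (sum(wf) - sum(wf^2) / sum(wf)))
  wr <- 1 / (vi + t2)
  (sum(wr * y) / sum(wr)) * sqrt(sum(wr))
}

signs  <- as.matrix(expand.grid(rep(list(c(-1, 1)), k)))  # all 2^k sign patterns
t_perm <- apply(signs, 1, function(s) re_stat(s * yi))
p_val  <- mean(abs(t_perm) >= abs(re_stat(yi)))  # two-sided permutation P value
# For H0: mu = c, apply the same procedure to the shifted effects yi - c
```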
3.1.7 Bayesian credible intervals (method 15)
Bayesian credible intervals (CrIs) for the overall effect size can be obtained within a Bayesian framework using specialized software and Markov chain Monte Carlo (MCMC) techniques, such as WinBUGS153 or SAS PROC MCMC. Some advantages of the Bayesian approach include (1) incorporation of uncertainty in model parameters (μ, τ2); (2) derivation of CrIs from the posterior distribution; and (3) use of informative prior distributions on the model parameters. However, the use of informative priors for the effect size parameters has been discouraged by some researchers due to the potential to introduce bias.154 The use of vague priors allows the analysis to remain data driven. In contrast, the use of informative priors for the between-study variance has been suggested to increase confidence in the overall effect size, especially when few studies are included in a meta-analysis.20, 21 Informative priors for τ2 under several treatment comparison types and outcome settings are available for dichotomous data20 and for continuous data.21 Friede et al155 suggest that Bayesian CrIs perform well even in rare diseases with a small number of studies when an appropriate prior for τ2 is applied. In rare diseases and small populations, the use of half-normal priors, with expectation 0 and variance 0.25 or 1 for τ2, has been recommended when log-odds ratios are used to measure the effect size.155 Vague priors can also be applied for τ2, but caution is needed, as results are sensitive to the prior specification, especially when the number of studies is small.156 This is because the choice of prior may impact on the estimation of the between-study variance and consequently on the estimated overall effect size and the width of its CI. Other difficulties that have been associated with the derivation of Bayesian CrIs include the difficulty of determining whether convergence has been achieved, the need for a burn-in period when using MCMC, and the impact of Monte Carlo error. Alternative methods to implement a Bayesian meta-analysis are available by using numerical integration, importance sampling, and data augmentation, as described by Turner et al57 and Rhodes et al.157 For practical application, the R package bayesmeta is available.58
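A sketch using the bayesmeta package mentioned above, with a vague normal prior for μ and the half-normal prior (scale 0.5, ie, variance 0.25) for τ recommended for log-odds ratios in small meta-analyses;155 argument names follow the package interface as we understand it, and the prior for μ is an illustrative assumption:

```r
library(bayesmeta)

fit <- bayesmeta(y = yi, sigma = sqrt(vi),
                 mu.prior.mean = 0, mu.prior.sd = 4,                   # vague prior for mu
                 tau.prior = function(t) dhalfnormal(t, scale = 0.5))  # half-normal prior for tau
fit$summary[, "mu"]  # posterior summaries for mu, including the 95% CrI
```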
3.2 Comparative evaluation of the methods
The properties of the 15 CI approaches have been evaluated in 31 research papers, which together report 30 simulation studies and 32 real data evaluations of the methods (for simulation scenarios and study characteristics, see Appendices S3-S4). Published articles suggested that the different approaches can provide noticeably different or even conflicting results and that their performance can vary regarding coverage and CI width. Below, we discuss the comparative results as presented in the identified studies. However, it is hard to compare all 15 CI approaches simultaneously, as they have never all been compared under the same conditions and simulation scenarios. The presentation of results follows the same order as the presentation of the CIs in section 3.1.
3.2.1 WT methods (methods 1, 2, and 3)
The performance of the popular WTz method has been assessed in several studies, and it is poor when compared with other methods. Simulations suggest that the WTz performs worse in terms of coverage for small numbers of studies (k < 16) compared with the PL and the WTt methods, whereas for large k, all three methods perform well.158 The performance of the WTz method, though, depends not only on the number of studies but also on the τ2 estimator employed and its magnitude.45 The WTz coverage has been found to differ by up to 0.05 between different τ2 estimators, up to 0.30 between meta-analysis samples, and up to 0.20 across between-study variance values ranging from small to large τ2. Coverage has been found to be as low as 0.65 (at 0.95 nominal level) when I2 (defined as the percentage of the total variability in a set of effect sizes that is due to between-study variability beyond what is expected by within-study random error) is 90% and two or three studies are included in a meta-analysis, but it tends towards the nominal level as the number of studies increases.113
To increase coverage, the t distribution can be used, which produces wider CIs than those obtained by the standard normal distribution, especially when τ2 and k are small.15, 117 The coverage probability is therefore higher with the WTt approach, but it depends on the estimator and the magnitude of τ2, as well as on the number of studies.45 Simulations showed that the WTt CI is less affected by the number of studies than the WTz CI.52 Although WTt coverage may be more robust to changes in the τ2 magnitude compared with WTz when few studies are included in a meta-analysis, it has been found to differ by up to 0.05 depending on the τ2 estimator used and the number of studies.113 For large meta-analysis samples (eg, k ≥ 20), the coverage of the 95% WTt CI may be below the nominal level, but it becomes conservative (close to 1) when k is small.113, 122, 158
Alternatively, the WTqa method is easy to implement and produces intervals with better coverage in comparison with the WTz method.122 A simulation study45 showed that different estimators of the between-study variance may impact on coverage and that the WTqa method is associated with higher coverage than WTz and HKSJ CIs, but the HKSJ method produced values closer to the nominal level. The same study showed that the WTqa method has similar coverage to the WTt method. For small k, coverage of the WTt method is well above the nominal level and higher than that for the WTqa method, but as k increases, the differences in coverage are not so important.122 Simulations have also shown that WTqa outperforms WTz, PL, and ZL approaches but it is very conservative.146
3.2.2 HKSJ methods (method 4 and 5)
The HKSJ approach (method 4) is often preferred, as in the case of small k, it is conservative and on average produces wider CIs with more adequate type I error compared with the WTz method.16, 119, 141, 152, 159 The HKSJ method provides exact inference when all study sizes are equal and the random-effects model is true, resulting in better inference than WTz,160 but also provides more accurate inference in small meta-analyses with different study sizes.14, 15, 116 Several studies suggested that the HKSJ method has coverage close to the nominal level and that it is not influenced by the magnitude or estimator of τ2.16, 45, 113, 115, 119, 121, 125, 132, 135, 155 Nevertheless, Knapp and Hartung125 recommend using the PM161 estimator for the between-study variance along with the HKSJ method to obtain CIs for μ so as to get a cohesive approach based on Qgen.161, 162 Sanchez-Meca and Marin Martinez45 recommend using the HKSJ method, as it is additionally insensitive to the number of trials. Simulation studies suggest that HKSJ has good coverage when the effect measure is the log-odds ratio,15, 119 the standardized mean difference,121 the mean difference, and the risk difference.118 The coverage of the 95% HKSJ CI is generally better than the WTz and WTt coverages, but it is suboptimal in meta-analyses with binary outcomes and rare events, as shown in simulated meta-analyses where the odds-ratio was used as the measure of effect.113
A real-life data study of 920 Cochrane meta-analyses with k ≥ 3 showed that the WTz method yielded statistically significant results more often than the HKSJ method (45% vs 35% of meta-analyses).163 IntHout et al16 found similar results in their real-life data study with 434 Cochrane meta-analyses with dichotomous data (43% vs 34%) and 255 Cochrane meta-analyses with continuous data (51% vs 40%). It is recommended that caution is needed when fewer than five studies of unequal sizes are included in the meta-analysis.16 Wiksten et al132 in their empirical evaluation including 157 meta-analyses with dichotomous data and k ≥ 4 found that in the presence of heterogeneity (using the DerSimonian and Laird11 estimator [estimated τ2 > 0] or the Cochran Q statistic [P < 0.10]),11, 164 the P value for the overall effect size was typically greater when using the HKSJ than the WTz method. However, they comment that the HKSJ method is not always more conservative when the estimated τ2 is zero.
It has been shown that in the absence of heterogeneity, the coverage of HKSJ may be smaller than the WTz coverage, providing narrower CIs.15, 115, 118, 121, 132, 133, 165 This was more prevalent in cases with rare events.132 Jackson et al133 raise a variety of concerns about the use of the HKSJ method, including (1) the implications of the modification for any given meta-analysis are hard to predict, (2) HKSJ can result in shorter CIs for the overall effect size than the WTz method, and (3) the coverage of the HKSJ CI might be anticipated to be low when τ2 = 0. However, in simulation studies conducted by Röver et al,135 Viechtbauer et al,134 and Sanchez-Meca and Marin Martinez,45 HKSJ worked well even in the absence of heterogeneity. This is in line with the simulations by Gonnermann et al,166 but in the presence of only two studies for τ2 = 0, HKSJ is associated with very low power compared with WTz (15% vs 60%), which may be due to the wider CI, whereas for mild to moderate τ2, both methods have poor control of type I error. A simulation study compared HKSJ with the small sample modification suggested by Knapp and Hartung125 and indicated that the use of the modified HKSJ (method 5) is preferable when few studies of varying size and precision are available.135 Another simulation study suggested the use of the modified HKSJ approach instead of the common HKSJ and WTz approaches when dichotomous data are considered.167 However, for few studies (and particularly for k = 2) and as the between-study variance decreases, the modified HKSJ tends to be overconservative, and selection between the methods is a matter of power vs type I error.133-135
3.2.3 Likelihood-based methods (methods 6, 7, and 8)
The PL method is often preferred to the WTz method, as it is associated with a higher coverage closer to the nominal level, even when k is relatively small.122, 143 Jackson et al158 showed that the PL method performed well and better than the WTz and WTt methods in meta-analyses with few studies (k ≤ 8) with coverage close to the nominal level. However, coverage decreases as τ2 increases and/or k decreases.43 Simulations suggest that the PL CI is less affected by the number of studies in a meta-analysis compared with the WTz CI, but both WTz and PL have poor coverage control, as they yield values below the nominal level.52
Simulations found that the Bartlett-type correction CI (method 7) improves the coverage properties of the WTz, WTt, and PL methods, whose coverage deviates from the nominal level as τ2 increases and/or k decreases.17, 52 Although the Bartlett-type correction CI has satisfactory power compared with the WTz, WTt, and PL CIs52 and performs well when τ2 = 0,142 caution is needed for k ≤ 5, as it tends to be overconservative.17 The Skovgaard statistic CI (method 8) is associated with coverage closer to the nominal level compared with the WTz and PL CIs, which is particularly notable for small k.17, 79, 141 Both the Skovgaard statistic CI and the Bartlett-type correction CI perform very satisfactorily regarding coverage and yield similar results.141
3.2.4 HC and BT (methods 9 and 10)
Simulations showed that in the absence of publication bias and for k > 10, the HC method yields better coverage than WTz, HKSJ, PL, and BT methods, whereas for k < 10, the HKSJ and PL methods perform best.30 The same study showed that when publication bias is present and for k > 10, HC improved coverage compared with WTz, HKSJ, PL, and BT methods and showed less bias than the fixed-effect model. Also, the WTz and BT methods have comparable coverage probabilities with coverage below the nominal level,54, 122 but coverage is increased for the exact weights.54
3.2.5 Resampling methods (methods 11, 12, 13, and 14)
Zeng and Lin146 showed that the ZL CI outperforms both WTz and PL CIs for small k in terms of coverage. Another simulation study showed that the FP CI controls coverage better than WTz, WTt, and PL and is closely followed by the Bartlett-type correction CI, but the latter is slightly more powerful especially for small k.52 The same study showed that the FP CI and the Bartlett-type correction CI were less affected by the number of studies than WTz, PL, and WTt methods.
3.2.6 Bayesian CrIs (method 15)
Simulation studies showed that Bayesian methods produce intervals with coverage closer to the nominal level compared with the HKSJ, modified HKSJ, and PL CIs,155, 168 and these intervals tend to be narrower than the HKSJ CI even in situations with similar or larger coverage.155 However, the performance of the Bayesian CrIs may vary depending on the prior assigned to the between-study variance.156
3.3 Prediction intervals

Higgins et al24 proposed an approximate 100(1 − α)% prediction interval for the true effect in a new study:

$\hat{\mu} \pm t_{k-2, 1-\alpha/2} \sqrt{\hat{\tau}^2 + \widehat{\operatorname{Var}}(\hat{\mu})},$

where $t_{k-2, 1-\alpha/2}$ is the $1 - \alpha/2$ quantile of the t distribution with k − 2 degrees of freedom. The t distribution, rather than the normal distribution, is used to acknowledge the uncertainty in the estimate of τ2.
It is worth noting that the prediction interval does not inform the statistical significance of the overall effect size; it instead describes the region within which the true effects of new studies are expected to be found. A prediction interval can help in understanding the uncertainty about whether an intervention is expected to work and reflects the potential effect in future study participants.169 Prediction intervals are particularly helpful when excess between-study heterogeneity exists, and the combination of individual studies into an overall effect size would not be advisable. IntHout et al170 found that in more than 70% of the statistically significant meta-analyses in the Cochrane Database of Systematic Reviews with estimated heterogeneity (I2 > 0), the 95% prediction interval suggested that the effect size in a new study could be null or even in the opposite direction from the overall result in some patient populations. The prediction interval can also be used to calculate the probability that a new trial will have a negative result and to improve the calculations of the power of a new trial. Conclusions drawn from a prediction interval rely on the assumption that the true study effects are normally distributed. The prediction interval will be imprecise if the estimates of the overall effect size and τ2 are far from the true parameter values. Partlett and Riley171 assessed the performance of the ad hoc 95% prediction interval in a simulation study, and they concluded that the method is only accurate when heterogeneity is large (I2 > 30%) and the study sizes are similar. However, for small heterogeneity and different study sizes, the coverage of the prediction interval can be as low as 78%, depending on the between-study variance estimator.171 Lee and Thompson172 highlight the importance, when calculating a prediction interval, of allowing for potential skewness and heavy tails in the random-effects distribution. Prediction intervals can be computed in several software packages, such as R (using, for example, the metafor77 and meta75 packages) and Stata (using, for example, the metan104 and mvmeta105 commands).
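Continuing the base-R sketch, the prediction interval with the k − 2 degrees of freedom of Higgins et al24 is shown alongside the k − 1 variant proposed in Section 4 (note that k ≥ 3 is required for the k − 2 version):

```r
se_pred <- sqrt(tau2 + v_mu)  # predictive standard error: between-study plus estimation variance

pi_k2 <- mu + c(-1, 1) * qt(1 - alpha / 2, df = k - 2) * se_pred  # Higgins et al (k - 2 df)
pi_k1 <- mu + c(-1, 1) * qt(1 - alpha / 2, df = k - 1) * se_pred  # k - 1 df variant (Section 4)
```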
4 DISCUSSION
The estimation of the overall effect size is one of the primary aims in meta-analysis. Therefore, the computation of a CI/CrIs is crucial in order to interpret the uncertainty in the estimated overall effect size. Wald-type methods, and in particular the WTz CI using the DerSimonian and Laird11 estimator for the between-study variance, are commonly used and are the default option in several meta-analysis software (eg, RevMan).67 However, the accuracy of these standard CI methods is not optimal, as the coverage probability associated with these CIs can deviate considerably from the nominal coverage probability in small meta-analyses.12-18 This is not surprising as the WT methods rely upon large-sample approximations requiring many studies to be included in a meta-analysis. However, meta-analyses often include a small number of studies, and large-sample approximations can be inaccurate.19-23 Perhaps because of this property, several other CI methods have been proposed to improve the standard WT CIs, including likelihood-based and resampling methods and, more recently, higher order likelihood inference methods. In the present study, we provide a comprehensive review of the CIs for the overall effect size under the random-effects model.
Our review identified 15 methods for calculating a CI for the overall effect size, each of which has different statistical properties. The selection of a method for computing a CI should be based on its statistical performance according to the corresponding meta-analysis' characteristics, as well as on the method's computational and conceptual complexity. Usually, one of these comes at the price of another. For example, the likelihood-based methods are associated with coverage closer to the nominal level compared with the commonly used WTz method but are computationally more demanding than the WTz CI. Also, the use of some methods (eg, ZL) is limited in meta-analyses, due to complex calculations with standard software or their unavailability in statistical software. Simulations have assessed the performance of various methods and showed that it mostly depends on the magnitude of the between-study variance and number of studies in a meta-analysis. However, additional items should be considered when selecting a CI. These may include the type of outcome data and the study size.
The selection of the most preferable methods to calculate a CI for the overall effect size can be mostly based on coverage, as this measure was the only one consistently reported across the identified studies. The 15 methods identified in this review have never all been compared in one simulation study under the same conditions, and hence, making any clear recommendations about these methods would be difficult. Also, none of the methods had an optimal coverage across all settings. Therefore, we can only offer tentative recommendations based on the available evidence, but these depend on the study findings, their simulation scenarios, and the CIs examined. It would require an extensive simulation study to assess the performance of all of these methods, under the same, realistic settings. Future studies should evaluate the CIs for all relevant properties, including coverage, precision, complexity, and power of the corresponding tests. In addition, further research is necessary to make judgements on the performance of CIs. In particular, a comprehensive simulation study informed by real-life data included in a meta-analysis would help determine the factors that impact the performance of the CI methods. Factors to consider in this analysis may include number and size of studies, baseline risk variability, magnitude and estimator of the between-study variance, frequency of events in dichotomous outcome data, type of outcome data, choice of effect size, distribution of effect sizes, sensitivity to small-study effects or publication bias, and different meta-analytical approaches (eg, Mantel–Haenszel, Peto, or one-step methods).
To date, limited evidence exists to inform which method performs best, especially when studies are few in number (<5), and given that the Bayesian intervals have not been assessed extensively in comparative studies. Overall, studies suggest that the HKSJ method has one of the best performance profiles. It performs well even in meta-analyses with fewer than 10 studies28 and is robust to the use of different estimators for the between-study variance and to changes in the magnitude of the between-study variance.45, 113 However, it should be considered that HKSJ is not always conservative compared with a fixed-effect meta-analysis.132 If the estimated between-study variance is zero, the variance of the estimated overall effect size can be inaccurately small, and hence, the HKSJ CI will be too narrow.132 In such cases, it has been suggested to use the modified HKSJ to avoid inaccurately narrow CIs.125 Also, caution is needed in meta-analyses with rare events, where the HKSJ coverage has been found to be as low as 85%,113 and in meta-analyses with fewer than five studies.28 In the case of few studies, the modified HKSJ has been suggested,135 but in the case of k = 2, the modified HKSJ tends to be overly conservative.135, 166 The likelihood-based methods, and in particular the higher order methods (the Skovgaard statistic CI and the Bartlett-type correction CI), are also associated with good coverage properties.141 However, the higher order likelihood methods have never been compared directly with HKSJ, which would help make informed decisions on CI selection. Alternatively, Bayesian intervals may be considered preferable to frequentist intervals in situations where prior information is available and can be considered suitable for use in the meta-analysis. Bender et al28 recommend the use of the HKSJ method as a standard approach, but in cases where studies have considerably different precisions, the modified HKSJ should be preferred. The same authors also suggest that when reliable prior information on the between-study variance is available, Bayesian intervals with (weakly) informative prior distributions for the heterogeneity should be preferred.
The computation of prediction intervals in meta-analysis is also valuable, as they provide additional information beyond the overall effect size, and we believe that they should be used more frequently. We propose to use k − 1 degrees of freedom rather than k − 2 to calculate prediction intervals, so that the CIs using a t distribution (eg, the WTt and HKSJ CIs) and prediction intervals are identical when the estimated between-study variance is zero. Although some concerns have been raised about prediction intervals, including their actual coverage probability of the true effect in a new study and their sensitivity to distributional assumptions,171, 172 their advantages outweigh their disadvantages, as they are a simple and accessible way to interpret the implications of the between-study heterogeneity implied by the fitted model. A comprehensive simulation study assessing the different types of prediction intervals under a variety of meta-analytical scenarios and different between-study variance estimators would help critically examine the issues associated with the calculation and interpretation of prediction intervals.
In conclusion, there are multiple methods to compute a CI for the overall effect size, and none of the methods clearly performs best across all meta-analytical settings. We hope that bringing them all together in one place will facilitate investigators in forming their own judgements about the most appropriate method for their needs. Overall, based on the existing literature and consensus among the coauthors of this paper, we tentatively suggest the application of the HKSJ method as a standard approach, at least in a meta-analysis with five or more studies. We recommend conducting a sensitivity analysis using a variety of methods (with at least two to three methods) to assess the robustness of findings and conclusions, especially in a meta-analysis with fewer than 10 studies. It should be highlighted that these results refer to normally distributed true study-specific effects, and simulation studies are necessary to compare the performance of the 15 methods described in this review. For example, Kontopantelis and Reeves173 used various nonnormal distributions for the effect sizes and compared the WTz, WTqa, HKSJ, PL, BT, and FP methods. The authors showed that the simulation results were broadly consistent across different effect size distributions (normally distributed, skew-normal, and extreme nonnormal study effects), with PL providing the best coverage, albeit with wide CIs. Also, the FP method provided coverage close to the nominal level, regardless of the number of included studies, at the expense of very wide CIs. The HKSJ method had a consistent 94% coverage for nonnormal study effects and small heterogeneity, but with larger heterogeneity, the FP method performed better than the HKSJ CI. For a small number of studies (≤5), the WTz and PL methods performed best, with WTz outperforming PL only when the between-study variance was small. However, more simulation studies with nonnormal true study-specific effects are required to draw robust conclusions for the 15 CIs across different meta-analytical scenarios.
We also recommend the calculation of prediction intervals as a supplement to a CI to illustrate the degree of heterogeneity, particularly when large between-study heterogeneity is present. However, caution is needed for small between-study heterogeneity and unequal study sizes. In this case, it is advisable to prefer prediction intervals derived in a Bayesian framework using, for example, informative prior distributions.20, 21 Should any new methods become available, we recommend that these are compared with most, or ideally all, of the methods described in this review under the same circumstances, using both real-life data and simulation studies; we hope that the codes presented in Appendices S8A-C and S9 can help to make this possible. This will help obtain a clearer picture of the performance of these methods when they are compared with each other. In Appendices S6 to S10, we present a selection of the identified methods for computing a CI applied to four illustrative, and contrasting, real data examples.174
List of Abbreviations

- BT: Biggerstaff and Tweedie
- CI: confidence interval
- CrIs: credible intervals
- FP: Follmann and Proschan
- HC: Henmi and Copas
- HKSJ: Hartung-Knapp/Sidik-Jonkman
- IQWiG: Institute for Quality and Efficiency in Health Care
- PL: profile likelihood
- WT: Wald-type
- WTqa: quantile approximation
- WTt: Wald-type with a t distribution
- WTz: Wald-type with a normal distribution
- ZL: Zeng and Lin
AVAILABILITY OF DATA AND MATERIALS
All data are from a published study and are available in the supporting file.
DETAILS OF CONTRIBUTORS
Areti Angeliki Veroniki, Dan Jackson, Ralf Bender, Oliver Kuss, Dean Langan, Julian P.T. Higgins, Guido Knapp, and Georgia Salanti contributed to the conception and design of the study and helped to draft the manuscript. Areti Angeliki Veroniki searched PubMed, screened identified articles for eligibility, contacted original study authors when needed, and conducted the statistical analysis. All authors read and approved the final manuscript.
FUNDING INFORMATION
This work did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. Ralf Bender is employed by the Institute for Quality and Efficiency in Health Care, Cologne, Germany.
ACKNOWLEDGMENTS
We thank Dr. Jack Bowden and Dr. Wolfgang Viechtbauer for providing their feedback on a previous draft of the paper. We thank Annamaria Guolo and Michael Preuss for responding to our questions regarding their approaches. We also thank Shazia Siddiqui and Myanca Rodrigues for formatting the manuscript.
CONFLICT OF INTEREST
The authors declare that there is no conflict of interest.