Volume 10, Issue 1 p. 23-43
RESEARCH ARTICLE
Full Access

Methods to calculate uncertainty in the estimated overall effect size from a random-effects meta-analysis

Areti Angeliki Veroniki

Corresponding Author

Areti Angeliki Veroniki

Li Ka Shing Knowledge Institute, St. Michael's Hospital, Toronto, Canada

Department of Primary Education, School of Education, University of Ioannina, Ioannina, Greece

Correspondence

Areti Angeliki Veroniki, PhD, MSc, Department of Primary Education, School of Education, University of Ioannina, Ioannina 455 00, Greece.

Email: [email protected]

Search for more papers by this author
Dan Jackson

Dan Jackson

MRC Biostatistics Unit, Statistical Innovation Group, AstraZeneca, Cambridge, UK

Search for more papers by this author
Ralf Bender

Ralf Bender

Department of Medical Biometry, Institute for Quality and Efficiency in Health Care, Cologne, Germany

Search for more papers by this author
Oliver Kuss

Oliver Kuss

Institute for Biometrics and Epidemiology, German Diabetes Center, Leibniz Institute for Diabetes Research at Heinrich Heine University, Düsseldorf, Germany

Institute of Medical Statistics, Heinrich Heine University Düsseldorf, Düsseldorf, Germany

Search for more papers by this author
Dean Langan

Dean Langan

Institute of Child Health, University College London, London, UK

Search for more papers by this author
Julian P.T. Higgins

Julian P.T. Higgins

Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, UK

Search for more papers by this author
Guido Knapp

Guido Knapp

Department of Statistics, TU Dortmund University, Dortmund, Germany

Search for more papers by this author
Georgia Salanti

Georgia Salanti

Institute of Social and Preventive Medicine, University of Bern, Bern, Switzerland

Search for more papers by this author
First published: 21 August 2018
Citations: 129

Abstract

Meta-analyses are an important tool within systematic reviews to estimate the overall effect size and its confidence interval for an outcome of interest. If heterogeneity between the results of the relevant studies is anticipated, then a random-effects model is often preferred for analysis. In this model, a prediction interval for the true effect in a new study also provides additional useful information. However, the DerSimonian and Laird method—frequently used as the default method for meta-analyses with random effects—has been long challenged due to its unfavorable statistical properties. Several alternative methods have been proposed that may have better statistical properties in specific scenarios. In this paper, we aim to provide a comprehensive overview of available methods for calculating point estimates, confidence intervals, and prediction intervals for the overall effect size under the random-effects model. We indicate whether some methods are preferable than others by considering the results of comparative simulation and real-life data studies.

1 INTRODUCTION

Systematic reviews and meta-analyses provide a method for collecting and synthesizing research and are often used to inform decision making. The number of these publications has increased substantially since the 1990s.1 Meta-analysis is a valuable technique to summarize study-specific results and often reduces bias and uncertainty from individual studies. Guidelines and Health Technology Assessment panels, as well as international organizations, including the World Health Organization,2 the European Medicines Agency,3 and governmental agencies worldwide, such as the Canadian Agency for Drugs and Technologies in Health,4 the Institute for Quality and Efficiency in Health Care (IQWiG),5 and the National Institute for Health and Clinical Excellence,6 recognize the need to ensure that the best available evidence informs clinical practice and health care decision making. This typically involves conducting a high-quality knowledge synthesis and meta-analysis. Quantitative results of meta-analyses of relevant studies, in the form of a point estimate and a confidence interval (CI) for the effect size parameter of interest, are invariably considered together with judgements about the quality of the evidence to produce recommendations for practice.7

Quantification of uncertainty in the estimated overall effect size is important in the process of drawing conclusions from a meta-analysis. This uncertainty should ideally account for between-study heterogeneity in the intervention effects across study settings and populations.8, 9 For this reason, the random-effects model is often employed, which includes a between-study variance parameter. The uncertainty of the estimate of the overall effect size can be described by the corresponding CI under the random-effects model, and its width depends on the magnitude of the between-study variance, the number of studies, the precision of the study-specific effect sizes, and the significance level.10

The estimation of the CI for overall effect size is often conducted with the Wald-type (WT) method using a normal distribution, with variance equal to the inverse of the sum of the study weights, and the DerSimonian and Laird11 estimator for the between-study variance, and this has been used routinely in many meta-analyses.12 However, numerous shortcomings of this approach have been raised, such that the CI for the overall effect size generally does not retain its coverage probability (ie, the proportion of times that the interval includes the true value), and hence, it underestimates the statistical error, producing overconfident results.12-18 This is mainly because the WT CI is based on a large-sample approximation (in terms of the number of studies), and the number of studies is usually small. Typically, the number of studies synthesized in a meta-analysis in medical research is less than 20, suggesting that any large-sample approximation is likely to be inaccurate.19, 23 Several attempts to improve the standard WT CI approach have been suggested, each of which has different statistical properties.

Another important aim in decision making is the prediction of the true effect size in an individual (future) study and setting. Higgins et al24 suggested the use of prediction intervals under the random-effects model for this purpose. The use of prediction intervals has been promoted, and although they have not often been employed in practice, they provide useful additional information.25, 26

In this paper, we aim to provide a comprehensive overview of available methods in the methodological literature for calculating a CI for the overall effect size under the random-effects model and to indicate whether some methods are preferable to others by considering the results of comparative simulation and real-life data studies. We also examine potential issues surrounding the computation of prediction intervals under the random-effects model.

The article is structured as follows. In Section 2, we present the conventional meta-analytic models and set up our notation (section 2.1) and describe our review methodology (section 2.2). In Section 3, we describe the statistical methods found in our literature review. In section 3.1, we describe 15 identified methods to calculate a CI for the overall effect size. In section 3.2, we discuss the comparative performances of different methods for computing a CI for the overall effect size, as described in previous studies, and summarize recommendations made by their respective authors. In section 3.3, we discuss methods for computing prediction intervals. We conclude with a discussion of all intervals (confidence and prediction intervals) in Section 4.

2 METHODS

2.1 Meta-analysis models and notation

The conventional fixed- and random-effects models are the two main meta-analysis models to synthesize the study results.27 The random-effects model accounts for two sources of variation, quantified by the within-study variance (vi) and the between-study variance (τ2). When urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0001, the fitted random-effects model collapses to the fixed-effect model (also known as common-effect model),28, 29 and therefore, the random-effects model can be considered a generalization of the fixed-effect model (ie, that the fixed-effect model is a special case of the random-effects model). Confidence intervals under the fixed-effect model can have poor properties even for low but nonzero heterogeneity.13, 30

Both the conventional fixed- and random-effects models require an estimated effect size yi (such as log-odds ratio) and an estimated (within-study) variance vi (var(yi) = vi) from every included study i, i = 1, …, k. The choice between the two models has been widely discussed in the literature31-33 and summarized in the Cochrane Handbook.9 In this paper, we focus on the random-effects meta-analysis model using inverse-variance weighting. Other techniques to combine study information to calculate the overall effect size are also available, such as weighting by sample size34, 35 or using confidence distributions.36, 37 Also, dichotomous outcomes do not require inverse-variance methods, as they can be modeled directly using one-step models (eg, generalized linear mixed models).38 Alternative methods accounting for heterogeneity by the use of a multiplicative parameter, where study weights are independent of observed heterogeneity, are also available.10, 39, 40 The description of these methods is beyond the scope of this review.

The conventional random-effects model assumes that the estimated effect size from the ith study yi is
urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0002
where the study-specific random error (εi) and the underlying true effect sizes in the individual studies (θi) are normally distributed as
urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0003
The random-effects estimated overall effect size urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0004, and its variance can be estimated as
urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0005
with weights urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0006, where it can be seen that these weights are the inverse of the estimated total study variances. Similarly, the fixed-effect weights can be calculated as wi,FE = 1/vi, and the estimated overall effect size urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0007, and its variance are given by
urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0008

The uncertainty in an estimated effect size for a given study, in relation to its study-specific true effect size, is expressed via the within-study variance vi. The standard approach described above assumes that the estimated within-study variances vi are fixed and known although they have to be estimated from the data. This assumption is justifiable when each study size is sufficiently large. The vi estimation is sensitive not only to the study size but also to the data type and effect size used. For example, the vi estimator for continuous outcomes when using the standardized mean difference depends on the estimated effect size. Hence, although vi is assumed fixed and known, vi is in fact estimated with some uncertainty. Several authors point out that this assumption could affect the estimation of the overall effect size, its variance, and related inferences.13, 41-45 Therefore, calculating study weights through within-study variances and assuming that they are known constants may have less desirable properties. This issue has previously been discussed, and some attempts have been made to account for the uncertainty in the weights.46-49 Hence, among other factors, the performance of the CI methods depends on how well we estimate the study weights.

Similarly, the estimation of τ2 is performed with some uncertainty, and this uncertainty depends on the size and number of studies in the meta-analysis, as well as the size of the between-study variability. Factors such as these have implications for the accuracy of the standard statistical methods described in this paper, which has motivated many of the attempts to improve this. There are also many methods to estimate τ2, any of which can be used in some of the CI methods described below, and we refer to a previous publication on this topic.50 Also, for a review of the simulation studies evaluating the comparative performance of τ2 estimators, we direct the reader elsewhere.51

2.2 Review methods

We searched PubMed from inception until April 29, 2016, to identify full-text research articles that describe or compare methods for calculating CI for μ in simulations or in real data sets. We scanned the references of the selected articles for additional relevant articles, and we conducted general internet searches using the web search engine Google. We also used our networks of professional collaborations to identify potentially relevant articles. We included all studies that report the development or comparison of methods to calculate a CI for the overall effect size under the random-effects model. We also included studies reporting on prediction interval methods identified from our internet searches and networks of collaborations. We excluded commentaries, abstracts, and studies written in languages other than English, and studies relating to the hypothesis tests for the overall effect size. We restricted our investigation to CI methods developed under the random-effects meta-analysis model that assume the true study-specific effects are normally distributed, while we excluded CI methods developed for network meta-analysis, one-stage individual patient data meta-analysis, meta-analysis of diagnostic test accuracy studies, meta-analysis of multiple outcomes, and meta-regression analysis. One reviewer (AAV) summarized methods and studies' conclusions from each included article and recorded any conclusions from comparative articles (studies that compare at least two methods). The information extracted refers to the performance of the various methods and the judgements deducted about their related advantages, and this information was checked by all coauthors. Disagreements were resolved by discussion. The PubMed search strategy is included in Appendix S1.

We describe known properties of the methods in terms of coverage probability and CI width in section 3.2. The closer the coverage probability is to the nominal level (usually 0.95), the better the CI is considered to be. A CI is exact when the actual coverage equals the nominal coverage. The coverage probability is closely related to the type I error of the hypothesis test on the overall effect size: Assuming the null hypothesis is true, one minus the type I error rate is the coverage probability. A further criterion for comparing methods is that methods that provide narrower CIs, whilst retaining the correct coverage probability, are preferable because they increase precision and hence are more informative. All statistics presented in this paper refer to two-tailed tests.

3 RESULTS

The database search returned 5628 matches in PubMed and 20 records identified through other sources and searching reference lists. In total, 69 publications met the eligibility criteria, which are listed in Appendix S2. We identified 15 methods to compute a CI for the overall effect size. The properties of those methods have been evaluated in 31 research papers, including 30 simulation studies and 32 real-life data evaluations of two or more methods. Below, we present the 15 identified approaches in seven broad categories, and as a separate section, we present the comparative results of the identified simulations and studies using real data sets (see Appendices S3-S4 for simulation scenarios and study characteristics and Appendix S5 for a summary of performance measures in simulation studies). In Table 1, we summarize the methods available in several software options.112

Table 1. Software options (with packages or macros) for each CI method. Το our knowledge, routines for WTqa (method 3), Bartlett-type correction (method 7), and ZLa (method 11) CIs are not available in any of the software options listed below
Software License type Confidence/Credible Interval Methods
WTz (Method 1) WTt (Method 2) HKSJc (Method 4) Modified HKSJ (Method 5) PL (Method 6)

Skovgaard

(Method 8)

HCb (Method 9) BTd (Method 10) Bootstrap (Methods 12, 13) FP (Method 14) Bayes (Method 15)
Comprehensive Meta-analysis Commercial Yes - Yes - - - - - - -
Excel—MetaEasy AddIn Freeware Yes Yes - - Yes - - - - Yes -
Excel—MetaXL AddIn Freeware Yes - - - - - - - - - -
HLM Commercial - - - - Yes - - - - - -
Meta-DiSc Freeware Yes - - - Yes - - - - - -
Metawin Commercial Yes - - - - - - - Yes - -
MIX Commercial Yes - - - - - - - - -
MLwin Freeware Yes - - Yes - - - Yes - Yes
Open Meta Analyst Freeware Yes - - - - - - - - -
RevMan Freeware Yes - - - - - - - - - -
R Freeware

Yes

MAd,

meta, metafor,

metagen,

metalik, metamisc,

metaSEM, metatest,

metaplus, mvmeta,

mvtmeta,

netmeta,

rmeta

Yes

metaplus

Yes

MAd,

meta,

metafor,

metamisc,

metatest

-

Yes

metaLik,

metaplus

Yes

metaLik,

metatest

Yes

metafor

Yes

metaxae

Yes metaplus,

boot

Yes metafor

 

Yes

bayesmetaf, blme,

BRugs, gemtc,

metamisc R2WinBUGS, SASBUGS rjugs

SAS Commercial Yes marandom.sas, PROCs GLM and MIXED

Yes

PROCs GLM and MIXED

- -

Yes

marandom.sas,

PROC NLP

- - - - -

Yes

PROC MCMC

Stata Commercial

Yes metaan, metan, metareg, mvmeta, xtreg

- -

Yes

metareg

Yes

gllamm, metaan

- - - Yes bootstrap

Yes

metaan

-
SPSS Commercial

Yes

meanes.sps, metaf.sps, metareg.sps

- - - - - - - - - -
BUGS, OpenBUGS, WinBUGS Freeware - - - - - - - Yes
  • Abbreviations: BT, Biggerstaff and Tweedie; FP, Follmann and Proschan; HC, Henmi and Copas; HKSJ, Hartung-Knapp/Sidik-Jonkman; PL, Profile likelihood; WTz, Wald-type with a normal distribution; WTt, Wald-type with a t distribution; WTqa, Quantile approximation; ZL, Zeng and Lin.
  • ‘-’ denote that the underlying method is not available in the specific software.
  • a A resampling test is available in the R package metatest.52
  • b Henmi and Copas30 provide an R code to implement the HC method.
  • c IntHout et al16 provide an approach to easily convert WTz CIs to HKSJ CIs.
  • d Biggerstaff and Tweedie53 provide a SAS code to implement the BT method.
  • e This package uses the exact random-effects weights in the Biggerstaff and Tweedie approach.54
  • f Bayesian approaches can be implemented using the Markov Chain Monte Carlo (MCMC) techniques in several software, such as OpenBUGS,55 WinBUGS56 or without MCMC as described by Turner et al...,57 in the R package bayesmeta.58
  • Comprehensive Meta-Analysis59 www.meta-analysis.com/
  • Excel using the MetaEasy AddIn60 https://www.jstatsoft.org/article/view/v030i07 or MetaXL AddIn http://www.epigear.com/
  • HLM61 http://www.ssicentral.com/hlm/
  • Meta-DiSc62 ftp://ftp.hrc.es/pub/programas/metadisc/
  • Metawin63 http://www.metawinsoft.com/
  • MIX64 www.mix-for-meta-analysis.info/
  • MLwin65 http://www.bristol.ac.uk/cmm/software/mlwin/
  • Open Meta Analyst66 http://www.cebm.brown.edu/openmeta/
  • RevMan67 www.cochrane.org/
  • R68http://www.r-project.org/ Packages: bayesmeta,58 blme,69 boot,70, 71 BRugs,72 Mad,73 mada,74 meta,75 gemtc,76 metafor,77 metagen,78 metaLik,79, 80 metamisc,81 metaplus,82 metaSEM,83 metatest,52, 84 metaxa,54 mvmeta,85 mvtmeta,86 netmeta87) R2WinBUGS,88 rjugs,72 rmeta.89
  • SAS90 http://www.sas.com/technologies/analytics/statistics/stat/ Macros: marandom.sas,91 PROC IML,92 PROC MIXED,93, 94 PROC GLIMMIX,95 SASBUGS,96 RASmacro,97 PROC MCMC.98
  • Stata99 www.stata.com / Routines: bootstrap,100 gllamm,101 metaan,102 metareg,103 metan,104 mvmeta,105 xtreg.106
  • SPSS107 http://www.spss.co.in/ Macros: meanes.sps,108 metaf.sps,109 metareg.sps.110
  • BUGS,111 OpenBUGS,55 WinBUGS56 www.mrc-bsu.cam.ac.uk/bugs/.

3.1 CI for the overall effect size

3.1.1 WT methods

  1. Wald-type normal distribution (WTz) CIs (method 1)
The WTz approach is the most popular technique for calculating a CI for μ,11 and a 95% CI is given by
urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0009
where z0.975 is the 0.975 quantile of the standard normal distribution. Any τ2 estimator can be used when computing a WTz CI.50, 113, 114 This method often has coverage probability considerably below nominal 0.95 level14, 15, 45, 115-121 when k is small and/or τ2 is large.13, 45, 52, 79, 118, 119, 122-125 Brockwell and Gordon13 stated that the greatest source of error in the method is the use of a normal approximation for urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0010. Despite the widespread use of the WTz method, it ignores uncertainty of the estimates of τ2 and vi in urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0011.
  1. Wald-type t-distribution (WTt) CIs (method 2)
A slight modification of the WTz CI is the WTt approach, where the t distribution with k − 1 degrees of freedom is used, as opposed to the normal distribution. Although the two distributions converge asymptotically, the t quantile is larger than the z quantile associated with the WTz method. Hence, the WTt approach results in a wider CI and was proposed in order to increase the coverage probability, especially when the number of studies is small.117, 126 A 95% CI can be obtained by
urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0012
where tk−1,0.975 is the 0.975 quantile of the t distribution with k − 1 degrees of freedom. Any τ2 estimator can be used to compute a WTt CI.50, 113
  1. Quantile approximation (WTqa) CIs (method 3)
Brockwell and Gordon122 proposed the WTqa method as an alternative to WTz method in an attempt to achieve better coverage. The method resembles WTz and WTt, but instead of using normal or t distributions, it approximates the 0.025 and 0.975 quantiles of the distribution of the statistic
urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0013
that are required for the 95% CI for μ. Hence, the WTqa uses different quantiles than the ones used in the WTz and WTt approaches. Let bk be the quantile approximation function, which monotonically decreases as a nonlinear function of k; then, a 95% CI is calculated as
urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0014
The quantiles bk are estimated via a Monte Carlo simulation process of samples of the M statistic with bk equal to the average of 0.025 and 0.975 absolute quantiles of the distribution, thus accounting for any small asymmetry in the distribution of M around zero.122 To obtain the function bk, Brockwell and Gordon fit a regression equation for the quantiles as a function of k. The resulting regression equation (for k = 1, 2, …, 30) is
urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0015

However, both number of studies k and the magnitude of τ2 may impact on the performance of the WTqa method,122 and changes in the distribution of the within-study variances can importantly impact on bk.127 Although WTqa approach has been criticized on the grounds that it is, at best, very difficult to obtain suitable critical values bk that apply to all meta-analyses,127 we include it in this paper for completeness. As a conservative approach, Jackson and Bowden127 suggested the use of the standard normal quantile instead, and to assess the robustness of the findings via a sensitivity analysis of alternative quantiles. Brockwell and Gordon122 developed the WTqa method using the DerSimonian and Laird11 estimator of τ2, but WTqa could, in principle, be implemented for any alternative τ2 estimator. However, it is not advised to develop the WTqa CI further.127

3.1.2 Hartung-Knapp/Sidik-Jonkman CIs (methods 4 and 5)

Hartung and Knapp14 and Sidik and Jonkman15 independently introduced the Hartung-Knapp/Sidik-Jonkman (HKSJ) CI (method 4) to handle meta-analyses that include a small number of studies. This method is based on the S statistic, which follows a t distribution with k − 1 degrees of freedom,
urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0016
with
urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0017
and urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0018 where Qgen is the generalized Q statistic:
urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0019
A 95% CI for μ is given by
urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0020
Although the method does not take into account the uncertainty in τ2, the use of a different statistical approximation to the usual WT CI may improve accuracy. Also, the HKSJ method can be applied with any τ2 estimator and is exact for known variance components.14 For meta-analysis software where this method is not available yet, IntHout et al16 suggested an approach to convert WTz CIs easily to HKSJ CIs. The extension to meta-regression was investigated by Knapp and Hartung,125 and a generalization of the method to multivariate meta-analysis was explored by Jackson and Riley.128

When all variance components, including the between-study variances, are fixed and known, the expected value of Qgen is k − 1, which equals the degrees of freedom of the associated χ2 distribution.129-131 Hence, the small-sample adjustment q will tend to be close to 1. However, q may in fact turn out to be much smaller than 1, such as in cases where the effect sizes are very homogeneous or when the number and/or size of studies is small. This leads to a narrower CI than the WTt approach and can also lead to a narrower CI compared with the WTz method.132, 133 Although the t quantile is always larger than the z quantile associated with the WTz method, explaining in part why the HKSJ CI performs better than the WTz CI, in the case of urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0021, the HKSJ CI will be narrower than the WTz CI.

Wiksten et al132 show that if urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0022, then we estimate urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0023, and further that the variance of the estimate of μ simplifies to urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0024, with urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0025 and Q ≤ (k − 1) when the DerSimonian and Laird11 estimator of τ2 is used. Therefore, the variance of the estimated effect size is always smaller or equal for the HKSJ method than the WTz method when this estimator of the between-study variance is zero. The possibility that the variance of the estimated effect size from the HKSJ method can be smaller than the variance of the WTz method was discussed by Knapp and Hartung,125 who proposed a simple modification to the procedure. The authors125 suggested using q* instead of q (method 5)
urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0026
to ensure more conservative results. However, this practice may be overly conservative, leading to loss of power.134, 135
Sidik and Jonkman15 recommend using the HKSJ CI, but instead of urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0027, they suggest applying the sandwich variance estimator:
urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0028

This is a robust estimator of urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0029, where the inverse of the study weights ( urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0030) is estimated through the squared sample residuals ( urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0031) from the data, rather than assuming urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0032.

However, Sidik and Jonkman120 state that urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0033 is biased when k is small, and hence, they suggest a bias corrected estimator of urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0034 (see Sidik and Jonkman120 for details). An alternative approach based on the expected information and on appropriately modified degrees of freedom of the t distribution was suggested by Kenward and Roger.136 These alternative expressions for urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0035 could also be used in WT CIs but have not been adopted in practice, so we do not explore their use further here.

3.1.3 Likelihood-based methods

  1. Profile likelihood CIs (method 6)
The profile likelihood (PL) method has been established in meta-analysis by Hardy and Thompson43 and is based on the likelihood ratio statistic, which unlike the WTz approach allows for asymmetric intervals. For yi~Ν(μ, vi + τ2), the log-likelihood function of the parameter vector (μ, τ2) is given by
urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0036
Maximum likelihood estimates of (μ, τ2) can be found by maximizing lnL(μ, τ2) under the restriction τ2 ≥ 0. The PL function is based on the log-likelihood function and uses an iterative process that provides CIs for μ that allow for the fact that τ2 needs to be estimated as well. Since the PL approach profiles over τ2, it accounts for, but does not fully allow for, the uncertainty in τ2. This is because asymptotic results are required when using this method. However, the PL method is anticipated to be more accurate than the WT methods in smaller samples.
The profile log-likelihood for μ is defined as
urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0037
where urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0038(μ) is the maximum likelihood estimator for τ2as μ varies.43 A 95% CI for μ can be obtained as the values that satisfy (see Hardy and Thompson43 equation 11)
urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0039
where urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0040 is the 0.05 quantile of the χ2 distribution with 1 degree of freedom. It has been shown that for small k and τ2, iterative algorithms are less likely to converge to a single value.13
  1. Higher order likelihood inference methods (methods 7 and 8)

As Reid137 explains, the main asymptotic properties of likelihood-based inference include (a) consistent, asymptotically normal, and efficient maximum likelihood estimators; (b) an asymptotically normally distributed score statistic with mean zero; and (c) an asymptotic chi-squared distributed likelihood ratio statistic. For example, the PL CI in the previous section relies upon the third of these standard results. As Reid137 also explains, higher order asymptotic results for likelihood-based inference are also available. Some higher order likelihood inference methods have recently been applied to meta-analysis, which is a situation where they may be thought to be especially valuable. This is because the number of studies is often small, so that the commonly used “lower-order” asymptotic approximations to the likelihood function will be inadequate. Higher order likelihood–based methods therefore have the potential to produce more accurate results in meta-analysis, and several proposals for this have been made. We briefly summarize the methods here, but the details are technical, and so we refer the reader to the articles cited below for more information.

The Bartlett-type correction of the likelihood ratio statistic was first introduced by Bartlett (method 7).138 Noma17 explains how to apply this to random-effects meta-analysis and so use a higher-order approximation than the PL method above. Noma17 also explains how to use the score statistic to compute CIs and subsequently derives a higher order Bartlett-type adjustment to this score. Skovgaard139 proposed an alternative higher order approximation to the profile log-likelihood (method 8), and Guolo79 explains how to apply this to random-effects meta-analysis. For details on the method, we direct the reader elsewhere.79, 140 The higher order asymptotic methods have higher degree of accuracy, but in some cases (eg, when the between-study variance is close to zero), they may produce numerically unstable maximum likelihood estimates due to the discontinuity of the statistic.79, 140-142 In such cases, a bias reduction approach is suggested.143 Hence, the Bartlett-type correction (method 7) and the Skovgaard statistic (method 8) are the two main proposals for higher order approximations when using methods based on the PL.79

3.1.4 Henmi and Copas CIs (method 9)

Henmi and Copas (HC)30 propose an alternative strategy for obtaining intervals for μ that are less sensitive to publication bias than the widely used WTz method. Since the fixed-effect estimates assign larger weight to bigger studies, and study size is one component among others that is associated with the overall effect size in the presence of publication bias, this method centres the CI on a fixed-effect estimate. This is because the fixed-effect estimates are less sensitive to publication bias than the random-effects estimates. To allow for heterogeneity, they first estimate the variance of the fixed-effect estimate under the random-effects model as
urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0042
Henmi and Copas30 then derive an approximation to the resulting pivot G that is used for making inferences about μ
urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0043
assuming that the DerSimonian and Laird11 estimator of the between-study variance is used. Hence, approximate CIs can be computed. This can be thought of as a hybrid approach, where the fixed-effect estimate is accompanied by a CI that allows for between-study heterogeneity under the assumptions made in the random-effects model. A limitation of the approach is that the fixed-effect estimate is not fully efficient under the random-effects model, but Henmi and Copas30 argue that it is “better to use a method that is more robust to publication bias, even if this means sacrificing some efficiency under the standard setting.” A much simpler, but less accurate, way to implement Henmi and Copas' idea would be to assume that the pivot G approximately follows a standard normal distribution, but this would ignore all uncertainty in the between-study variance. This simpler approach could also be used with alternative estimators of the between-study variance. Alternatively, one could apply the IVher model suggested by Doi et al,144 which uses quasi-likelihood approaches and is performed under the fixed-effect assumption. Doi et al144 show that the IVher model favors larger trials, retains the nominal coverage probability, and exhibits lower variance of the overall effect size as opposed to the random-effects model irrespective of the degree of the between-study heterogeneity.

3.1.5 Biggerstaff and Tweedie CIs (method 10)

Biggerstaff and Tweedie (BT)53 proposed the use of different study-specific weights to those more conventionally used in the random-effects model, and estimated μ along with its variance using the weight urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0045, so as to acknowledge for τ2 variability in the computation of CI for μ. Acknowledging the uncertainty in the weights allows greater uncertainty in the estimation of μ.53 The urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0046 weights are the expected value of the random-effects weights (calculated using the estimated τ2) rather than the usual random-effects observed weights:
urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0047
The urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0048 weights depend on the density form of τ2 and were derived using the DerSimonian and Laird11 estimator for the between-study variance. Alternative estimators could also be used, in principle, when using this method, provided that their distribution, and so the expected weights used by the method, can be evaluated. The variance of urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0049 is estimated as
urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0050
Assuming normality, a 95% CI can be obtained as
urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0051
Biggerstaff and Tweedie53 use an approximate distribution to obtain the expected weights, but this has been improved upon by Preuß and Ziegler54 who used the exact weights through the exact cumulative distribution function of Q, where urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0052145 Biggerstaff and Tweedie53 provided the algorithm to implement the method in SAS.

3.1.6 Resampling methods

  1. Zeng and Lin CIs (method 11)

Zeng and Lin (ZL)146 examine the distribution of the estimated overall effect size under the random-effects model and find that it is not asymptotically normally distributed for a finite number of studies k. This makes intuitive sense, because the textbook result that a linear combination of normal random variables is normally distributed requires that the coefficients in this linear combination are constants. When estimating the overall effect size however, these coefficients are proportional to the weights and so are functions of the estimated between-study variance. We require a large number of studies in order to estimate this variance accurately enough to treat the weights as fixed constants.

Recognizing that the estimated overall effect size is not asymptotically normally distributed for small k, Zeng and Lin146 suggest a resampling procedure to obtain the distribution of this estimate, assuming that the DerSimonian and Laird11 estimator of τ2 is to be used. Briefly, they simulate values of τ2 using the DerSimonian and Laird11 estimating equation (where the individual study results used in this estimation are simulated from the fitted random-effects model). They then simulate estimated average effect sizes using the sampled τ2 to calculate the weights in the estimating equation for urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0053 (as given in section 2.1, where the individual study results used in this estimation are simulated from the random-effects model centred at the estimated overall effect and where the between study variance is taken to be the sampled value used to compute the weights). By repeating both aspects of this sampling process B times, B2 estimates provide an empirical distribution of estimated overall effects that can be used to compute CIs and make inferences.146 This resampling procedure could be modified to accommodate alternative estimators of τ2, by instead calculating alternative estimates at the first stage, but this idea would need to be critically evaluated before it could be accepted.
  1. Bootstrap CIs (methods 12 and 13)

Nonparametric bootstrapping is a way to approximate the sampling distribution of a statistic by resampling, from the sample itself, with replacement. Parametric bootstrapping instead samples from a fitted model. Both forms of bootstrapping can be used to make a variety of inferences but are most usually used to quantify the uncertainty in point estimates through the computation of standard errors and CIs. Briefly, bootstrap datasets are sampled (either nonparametrically, method 12,147, 148 or parametrically, method 13),149 from which the bootstrap statistics (the statistic of interest calculated using the bootstrap datasets) are calculated. Then, the empirical distribution of the bootstrap statistics is taken to approximate the distribution of the statistic of interest. Hence, measures of the uncertainty in the statistic, such as standard errors and CIs, can be calculated from this empirical distribution. In our context, this statistic is the estimated overall effect size.

There is a variety of ways in which the bootstrap samples can be sampled under the random-effects model. For example, we could either sample estimated effect sizes and their standard errors directly or instead sample the individual patient data in situations where this is available (this can readily be derived for dichotomous outcome data from the frequency and sample size). A full discussion of all the possibilities is beyond the scope of this paper, but Van Den Noortgate and Onghena150 describe four different bootstrapping procedures, where two of these are parametric and the other two are nonparametric. We refer the reader to this paper for full details of the sampling methods used. Parametric bootstrap CIs have also been advocated by Turner et al149 and nonparametric bootstrap CIs by Efron.151
  1. Follmann and Proschan CIs (method 14)

Permutation tests have been suggested primarily to assess the true statistical significance of an observed finding under the null hypothesis of the absence of effect, especially in meta-analyses with a small number of studies.152 This method can be extended and used for calculating CIs for the effect size. These tests are especially appropriate when the included studies in a meta-analysis may not be considered randomly sampled from a larger population of studies. Confidence intervals can be constructed by inverting hypothesis tests, where parameter values that are not rejected by the hypothesis test lie within the corresponding CI.

Follmann and Proschan (FP)117 begin by considering a permutation method for testing the null hypothesis H0 : μ = 0. Their argument assumes that the distributions of the outcome data yi are symmetric117 and this is implied by the random-effects model. Under the null hypothesis, the sign of yi is equally likely to be positive or negative for a symmetric yi. There are 2k possible permutations of the signs of the values of the outcome data yi. We take Zp = ( urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0054, urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0055,..., urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0056) to be the pth of these 2k permutations; for example, with k = 5 studies, (+1, +1, −1, −1, +1) is one of the 32 possible permutations. We define urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0057 to be the pth permutation of the outcome data corresponding to Zp. The central idea is that under the null hypothesis, H0 : μ = 0 and because the distributions of the yi are symmetric, all 2k permutations Xp are equally likely. Hence, Follman and Proschan117 propose the null distribution where all values of
urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0058
are equally likely, where urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0059 are the normalized (sum to one) random-effects weights described in section 2.1, where the between-study variance is estimated using Xp as outcome data. Hence, urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0060is the estimated average effect size using the pth permutation of signs of the outcome data. The two-sided P value based on the group permutation method proposed by Follmann and Proschan117 is simply the proportion of the absolute values of urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0061 that are more than or equal to the absolute value of the estimated average effect under the random-effects model using the observed data. Follman and Proschan117 describe their procedure in terms of the DerSimonian and Laird11 estimator of the between-study variance, but, in principle, alterative estimators could be used. If k is too large for all permutations to be evaluated, then the permutation distribution can be approximated by instead simulating a large number of permutations.117

The procedure described above can be extended so that we instead test the hypothesis H0 : μ = c, and invert this hypothesis test to give the bounds of CIs. For further details on the FP method, we refer the reader to Follman and Proschan.117 This method has been suggested as an alternative approach to the WT and likelihood-based approaches that assume normality of the observed effects, but it can be computationally demanding. The discrete nature of the permutation distribution will ensure that the CI maintains the desired coverage probability, but in general, this coverage probability will be larger than the nominal level.

3.1.7 Bayesian credible intervals (method 15)

Bayesian credible intervals (CrIs) for the overall effect size can be obtained within a Bayesian framework using specialized software and the Markov chain Monte Carlo (MCMC) technique, such as WinBUGS153 or SAS PROC MCMC. Some advantages of the Bayesian approach include (1) incorporation of uncertainty in model parameters (μ, τ2); (2) derivation of CrIs from the posterior distribution; and (3) use of informative prior distributions on the model parameters. However, the use of informative priors for the effect size parameters has been discouraged by some researchers due to potential inclusion of bias.154 The use of vague priors allows the analysis to remain data driven. On the contrary, the use of informative priors for the between-study variance has been suggested to increase confidence in the overall effect size, especially when few studies are included in a meta-analysis.20, 21 Informative priors for τ2 under several treatment comparison types and outcome settings are available for dichotomous data20 and for continuous data.21 Friede et al155 suggest Bayesian CrIs perform well even in rare diseases with a small number of studies when the appropriate prior for τ2 is applied. In rare diseases and small populations, the use of half-normal priors, with expectation 0 and variance 0.25 or 1 for τ2, has been recommended when log-odds ratios are used to measure the effect size.155 Vague priors can also be applied for τ2, but caution is needed, as results are sensitive to the prior specification, especially when the number of studies is small.156 This is because the choice of prior may impact on the estimation of the between-study variance and consequently on the estimated overall effect size and the width of its CI. Other difficulties that have been associated with the derivation of Bayesian CrIs include the complication of determining whether convergence is achieved, the need to burn-in when using MCMC, and the impact of MC error. Alternative methods to implement a Bayesian meta-analysis are available by using numerical integration, importance sampling, and data augmentation as described by Turner et al57 and Rhodes et al.157 For practical application, the R package bayesmeta is available.58

3.2 Comparative evaluation of the methods

The properties of the 15 CI approaches have been evaluated in 31 research papers, including 30 simulation studies and 32 real data evaluations of the methods (for simulation scenarios and study characteristics, see Appendices S3-S4). Published articles suggested that the different approaches can provide noticeably different or even conflicting results and their performance can vary regarding coverage and CI width. Below, we discuss the comparative results as presented in the identified studies. However, it is hard to compare simultaneously all 15 CI approaches, as they have never all been compared under the same conditions and simulation scenarios. The presentation of results follows the same CI presentation order with section 3.1.

3.2.1 WT methods (methods 1, 2, and 3)

The performance of the popular WTz method has been assessed in several studies, and it is poor when compared with other methods. Simulations suggest that the WTz performs worse in terms of coverage for small numbers of studies (k < 16) compared with the PL and the WTt methods, whereas for large k, all three methods perform well.158 The performance of the WTz method though does not only depend on the number of studies but also depend on the τ2 estimator employed and its magnitude.45 The WTz coverage has been found to differ by up to 0.05 between different τ2 estimators, up to 0.30 between meta-analysis samples, and up to 0.20 across between-study variance values ranging from small to large τ2. Coverage has been found to be as low as 0.65 (at 0.95 nominal level) when I2 (defined as the percentage of the total variability in a set of effect sizes that is due to between-study variability beyond what is expected by within-study random error) is 90% and two or three studies are included in a meta-analysis, but it tends towards the nominal level as the number of studies increases.113

To increase coverage, the t distribution can be used, which produces wider CIs than those obtained by the standard normal distribution, especially when τ2 and k are small.15, 117 The coverage probability is therefore higher with the WTt approach, but it depends on the estimator and the magnitude of τ2, as well as on the number of studies.45 Simulation showed that the WTt CI is less affected by the number of studies compared with the WTz CI.52 Although WTt coverage may be more robust to changes in the τ2 magnitude compared with WTz when few studies are included in a meta-analysis, it has been found to differ by up to 0.05 depending on the τ2 estimator used and the number of studies.113 For large meta-analysis samples (eg, k ≥ 20), the coverage of the 95% WTt CI may be below the nominal level, but it becomes conservative (close to 1) when k is small.113, 122, 158

Alternatively, the WTqa method is easy to implement and produces intervals with better coverage in comparison with the WTz method.122 A simulation study45 showed that different estimators of the between-study variance may impact on coverage and that the WTqa method is associated with higher coverage than WTz and HKSJ CIs, but the HKSJ method produced values closer to the nominal level. The same study showed that the WTqa method has similar coverage to the WTt method. For small k, coverage of the WTt method is well above the nominal level and higher than that for the WTqa method, but as k increases, the differences in coverage are not so important.122 Simulations have also shown that WTqa outperforms WTz, PL, and ZL approaches but it is very conservative.146

3.2.2 HKSJ methods (method 4 and 5)

The HKSJ approach (method 4) is often preferred, as in the case of small k, it is conservative and on average produces wider CIs with more adequate type I error compared with the WTz method.16, 119, 141, 152, 159 The HKSJ method provides exact inference when all study sizes are equal and the random-effects model is true, resulting in better inference than WTz,160 but also provides more accurate inference in small meta-analyses with different study sizes.14, 15, 116 Several studies suggested that the HKSJ method has coverage close to the nominal level and that it is not influenced by the magnitude or estimator of τ2.16, 45, 113, 115, 119, 121, 125, 132, 135, 155 Nevertheless, Knapp and Hartung125 recommend using the PM161 estimator for the between-study variance along with the HKSJ method to obtain CIs for μ so as to get a cohesive approach based on Qgen.161, 162 Sanchez-Meca and Marin Martinez45 recommend using the HKSJ method, as it is additionally insensitive to the number of trials. Simulation studies suggest that HKSJ has good coverage when the effect measure is the log-odds ratio,15, 119 the standardized mean difference,121 the mean difference, and the risk difference.118 The coverage of the 95% HKSJ CI is generally better than the WTz and WTt coverages, but it is suboptimal in meta-analyses with binary outcomes and rare events, as shown in simulated meta-analyses where the odds-ratio was used as the measure of effect.113

A real-life data study of 920 Cochrane meta-analyses with k ≥ 3 showed that the WTz method yielded more often statistically significant results compared with the HKSJ method (45% vs 35% of meta-analyses).163 IntHout et al16 found similar results in their real-life data study with 434 Cochrane meta-analyses with dichotomous data (43% vs 34%) and 255 Cochrane meta-analyses with continuous data (51% vs 40%). It is recommended that caution is needed when fewer than five studies of unequal sizes are included in the meta-analysis.16 Wiksten et al132 in their empirical evaluation including 157 meta-analyses with dichotomous data and k ≥ 4 found that in the presence of heterogeneity (using the DerSimonian and Laird11 estimator [ urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0062] or the Cochran Q statistic [P < 0.10]),11, 164 the P value for the overall effect size was typically greater when using the HKSJ than the WTz method. However, they comment that the HKSJ method is not always more conservative when urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0063.

It has been shown that in the absence of heterogeneity, the coverage of HKSJ may be smaller than the WTz coverage providing narrower CIs.15, 115, 118, 121, 132, 133, 165 This was more prevalent in cases with rare events.132 Jackson et al133 raise a variety of concerns about the use of the HKSJ method, including (1) the implications of the modification for any given meta-analysis are hard to predict, (2) HKSJ can result in shorter CIs for the overall effect size than the WTz method, and (3) the coverage of the HKSJ CI might be anticipated to be low when urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0064. However, in simulation studies conducted by Röver et al,135 Viechtbauer et al,134 and Sanchez-Meca and Marin Martinez,45 HKSJ worked well even in the absence of heterogeneity. This is in line with the simulations by Gonnermann et al,166 but in the presence of only two studies for τ2 = 0, HKSJ is associated with very low power compared with WTz (15% vs 60%), which may be due to the wider CI, whereas for mild to moderate τ2, both methods have poor control of type I error. A simulation study compared HKSJ with the small sample modification suggested by Knapp and Hartung125 and indicated that the use of the modified HKSJ (method 5) is preferable when few studies of varying size and precision are available.135 Another simulation study suggested the use of the modified HKSJ approach instead of the common HKSJ and WTz approaches when dichotomous data are considered.167 However, for few studies (and particularly for k = 2) and as the between-study variance decreases, the modified HKSJ tends to be overconservative, and selection between the methods is a matter of power vs type I error.133-135

3.2.3 Likelihood-based methods (methods 6, 7, and 8)

The PL method is often preferred to the WTz method, as it is associated with a higher coverage closer to the nominal level, even when k is relatively small.122, 143 Jackson et al158 showed that the PL method performed well and better than the WTz and WTt methods in meta-analyses with few studies (k ≤ 8) with coverage close to the nominal level. However, coverage decreases as τ2 increases and/or k decreases.43 Simulations suggest that the PL CI is less affected by the number of studies in a meta-analysis compared with the WTz CI, but both WTz and PL have poor coverage control, as they yield values below the nominal level.52

Simulations found that the Bartlett-type correction CI (method 7) improves coverage properties over the WTz, WTt, and PL methods that their coverage deviates the nominal level as τ2 increases and/or k decreases.17, 52 Although the Bartlett-type correction CI has a satisfactory power compared with the WTz, WTt, and PL CIs52 and performs well when τ2 = 0,142 caution is needed for k ≤ 5, as it tends to be overconservative.17 The Skovgaard statistic CI (method 8) is associated with coverage closer to the nominal level compared with the WTz and PL CIs, which is remarkable for small k.17, 79, 141 Both the Skovgaard statistic CI and the Bartlett-type correction CI perform very satisfactorily regarding coverage and yield similar results.141

3.2.4 HC and BT (methods 9 and 10)

Simulations showed that in the absence of publication bias and for k > 10, the HC method yields better coverage than WTz, HKSJ, PL, and BT methods, whereas for k < 10, the HKSJ and PL methods perform best.30 The same study showed that when publication bias is present and for k > 10, HC improved coverage compared with WTz, HKSJ, PL, and BT methods and showed less bias than the fixed-effect model. Also, the WTz and BT methods have comparable coverage probabilities with coverage below the nominal level,54, 122 but coverage is increased for the exact weights.54

3.2.5 Resampling methods (methods 11, 12, 13, and 14)

Zeng and Lin146 showed that the ZL CI outperforms both WTz and PL CIs for small k in terms of coverage. Another simulation study showed that the FP CI controls coverage better than WTz, WTt, and PL and is closely followed by the Bartlett-type correction CI, but the latter is slightly more powerful especially for small k.52 The same study showed that the FP CI and the Bartlett-type correction CI were less affected by the number of studies than WTz, PL, and WTt methods.

3.2.6 Bayesian CrIs (method 15)

Simulation studies showed that Bayesian intervals produce intervals with coverage closer to the nominal level compared with the HKSJ, modified HKSJ, and PL CIs,155, 168 and they tend to be smaller than the HKSJ CI even in situations with similar or larger coverage.155 However, the performance of the Bayesian CrIs may vary depending on the prior assigned to the between-study variance.156

3.3 Prediction intervals

One of the most important aims in clinical decision-making is the prediction of the possible effect size in an individual setting. A prediction interval provides a predicted range for the true effect size in a new study, and its calculation is recommended to be conducted under the random-effects model.24, 153 Assuming the random effects are normally distributed, an ad hoc 95% prediction interval can be obtained by
urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0065
To date, this is the standard prediction interval approach used in meta-analysis. We call this prediction interval ad hoc, because τ2 is unknown, and currently, there is no exact distributional form available. The use of a t distribution instead of a normal distribution reflects the uncertainty resulting from the estimation of the heterogeneity. Higgins et al24 also presented this ad hoc 95% prediction interval but instead using quantiles from the tk − 2 distribution. However, we suggest using the t distribution with consistent degrees of freedom for both CIs (eg, see the WTt and HKSJ methods) and prediction intervals. This is because when there is truly no heterogeneity (τ2 = 0), the overall effect size and the true effect size in a new study are identical, so that CIs (for the overall effect size) and prediction intervals (for the true effect in a new study) should be identical. Taking the estimated between-study variance of zero to be the true value therefore gives rise to the intuition that CIs and prediction intervals should be identical when urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0066. In fact, in the metafor R package,77 the CIs, and prediction intervals are always computed in a consistent manner: When a WTz CI is computed, then a prediction interval is also calculated using a standard normal distribution, whereas when a HKSJ CI is computed, then both CI and prediction interval are computed using the tk − 1 distribution. Hence, when urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0067, the CI and prediction interval will coincide, as intuition suggests that they should. To date, other routines, including the meta75 R package, the Stata metan,104 and Stata mvmeta,105 commands calculate a prediction interval using a t distribution with k − 2 degrees of freedom. Another advantage of using the tk − 1 distribution is that for k = 2, where a CI for the overall effect size is available, prediction intervals can be calculated. However, this is not the case when the tk − 2 distribution is used. Prediction intervals come especially naturally from a Bayesian approach, but at the price of specifying priors.156

It is worth noting that the prediction interval does not inform the statistical significance of urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0068; it instead describes the region within which the true study effects of new studies are expected to be found. A prediction interval can help understand the uncertainty about whether an intervention is expected to work and reflects the potential effect in future study participants.169 Prediction intervals are particularly helpful when excess between-study heterogeneity exists, and the combination of individual studies into an overall effect size would not be advisable. IntHout et al170 found that in more than 70% of the statistically significant meta-analyses in the Cochrane Database of Systematic Reviews with urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0069, the 95% prediction interval suggested that the effect size in a new study could be null or even in the opposite direction from the overall result in some patient populations. The prediction interval can also be used to calculate the probability that a new trial will have a negative result and to improve the calculations of the power of a new trial. Conclusions drawn from a prediction interval are based on the assumption the study-effects are normally distributed. The prediction interval estimation will be imprecise if the estimates of the overall effect size and τ2 are away from the true parameter. Partlett and Riley171 assessed the performance of the ad hoc 95% prediction interval in a simulation study, and they concluded that the method is only accurate when heterogeneity is large (I2 > 30%) and the study sizes are similar. However, for small heterogeneity and different study sizes, the coverage of prediction interval can be as low as 78% depending on the between-study variance estimator.171 Lee and Thompson172 highlight the importance when calculating a prediction interval to allow for potential skewing and heavy tails in the random-effects distributions. Prediction intervals can be implemented in several software, such as R (using, for example, metafor77 and meta75 packages) Stata (using, for example, metan104 and mvmeta105 commands).

4 DISCUSSION

The estimation of the overall effect size is one of the primary aims in meta-analysis. Therefore, the computation of a CI/CrIs is crucial in order to interpret the uncertainty in the estimated overall effect size. Wald-type methods, and in particular the WTz CI using the DerSimonian and Laird11 estimator for the between-study variance, are commonly used and are the default option in several meta-analysis software (eg, RevMan).67 However, the accuracy of these standard CI methods is not optimal, as the coverage probability associated with these CIs can deviate considerably from the nominal coverage probability in small meta-analyses.12-18 This is not surprising as the WT methods rely upon large-sample approximations requiring many studies to be included in a meta-analysis. However, meta-analyses often include a small number of studies, and large-sample approximations can be inaccurate.19-23 Perhaps because of this property, several other CI methods have been proposed to improve the standard WT CIs, including likelihood-based and resampling methods and, more recently, higher order likelihood inference methods. In the present study, we provide a comprehensive review of the CIs for the overall effect size under the random-effects model.

Our review identified 15 methods for calculating a CI for the overall effect size, each of which has different statistical properties. The selection of a method for computing a CI should be based on its statistical performance according to the corresponding meta-analysis' characteristics, as well as on the method's computational and conceptual complexity. Usually, one of these comes at the price of another. For example, the likelihood-based methods are associated with coverage closer to the nominal level compared with the commonly used WTz method but are computationally more demanding than the WTz CI. Also, the use of some methods (eg, ZL) is limited in meta-analyses, due to complex calculations with standard software or their unavailability in statistical software. Simulations have assessed the performance of various methods and showed that it mostly depends on the magnitude of the between-study variance and number of studies in a meta-analysis. However, additional items should be considered when selecting a CI. These may include the type of outcome data and the study size.

The selection of the most preferable methods to calculate a CI for the overall effect size can be mostly based on coverage, as this measure was the only one consistently reported across the identified studies. The 15 methods identified in this review have never all been compared in one simulation study under the same conditions, and hence, making any clear recommendations about these methods would be difficult. Also, none of the methods had an optimal coverage across all settings. Therefore, we can only offer tentative recommendations based on the available evidence, but these depend on the study findings, their simulation scenarios, and the CIs examined. It would require an extensive simulation study to assess the performance of all of these methods, under the same, realistic settings. Future studies should evaluate the CIs for all relevant properties, including coverage, precision, complexity, and power of the corresponding tests. In addition, further research is necessary to make judgements on the performance of CIs. In particular, a comprehensive simulation study informed by real-life data included in a meta-analysis would help determine the factors that impact the performance of the CI methods. Factors to consider in this analysis may include number and size of studies, baseline risk variability, magnitude and estimator of the between-study variance, frequency of events in dichotomous outcome data, type of outcome data, choice of effect size, distribution of effect sizes, sensitivity to small-study effects or publication bias, and different meta-analytical approaches (eg, Mantel–Haenszel, Peto, or one-step methods).

To date, limited evidence exists to inform which method performs best, especially when studies are few in number (<5), and given that the Bayesian intervals have not been assessed extensively in comparative studies. Overall, studies suggest that the HKSJ method has one of the best performance profiles. It performs well even in meta-analyses with fewer than 10 studies28 and is robust to the use of different estimators for the between-study variance and to changes in the magnitude of the between-study variance.45, 113 However, it should be considered that HKSJ is not always conservative compared with a fixed-effect meta-analysis.132 If the estimated between-study variance is zero, the variance of the estimated overall effect size can be inaccurately small, and hence, the HKSJ CI will be too narrow.132 In such cases, it has been suggested to use the modified HKSJ to avoid inaccurate narrow CIs.125 Also, caution is needed in meta-analyses with rare events, where the HKSJ coverage has been found to be as low as 85%113 and meta-analyses with fewer than five studies.28 In the case of few studies, the modified HKSJ has been suggested,135 but in the case of k = 2, the modified HKSJ tends to be overly conservative.135, 166 The likelihood-based methods, and in particular the higher order methods Skovgaard statistic CI and Bartlett-type correction CI, are also associated with good coverage properties.141 However, the higher order likelihood methods have never been compared directly with HKSJ, which would help make informed decisions on the CI selection. Alternatively, Bayesian intervals may be considered preferable to frequentist intervals in situations where prior information is available and can be considered suitable for use in the meta-analysis. Bender et al28 recommend the use of the HKSJ method as a standard approach but in case studies have considerably different precisions the modified HKSJ should be preferred. The same authors also suggest that when reliable prior information on the between-study variance is available, then the Bayesian intervals with (weakly) informative prior distributions for the heterogeneity should be preferred.

The computation of prediction intervals in meta-analysis is also valuable, as they provide additional information about the overall effect size, and we believe that they should be used more frequently. We propose to use k − 1 degrees of freedom rather than k − 2 to calculate prediction intervals, so that the CIs using a t distribution (eg, WTt and HKSJ CIs) and prediction intervals are identical when urn:x-wiley:17592879:media:jrsm1319:jrsm1319-math-0070. Although some concerns have been raised about prediction intervals, including their actual coverage probability of the true effect in a new study and their sensitivity to distributional assumptions,171, 172 their advantages outweigh their disadvantages, as they are a nice and easy way for people to interpret the implications of the between-study heterogeneity implied by their fitted model. A comprehensive simulation study assessing the different types of prediction intervals under a variety of meta-analytical scenarios and different between-study variance estimators would help critically examine the issues associated with the calculation and interpretation of prediction intervals.

In conclusion, there are multiple methods to compute a CI for the overall effect size, and none of the methods clearly performs best across all meta-analytical settings. We hope that bringing them all together in one place will facilitate investigators in forming their own judgements about the most appropriate method for their needs. Overall, based on the existing literature and consensus among the coauthors of this paper, we tentatively suggest the application of the HKSJ method as a standard approach, at least in a meta-analysis with five or more studies. We recommend conducting a sensitivity analysis using a variety of methods (with at least two to three methods) to assess the robustness of findings and conclusions, especially in a meta-analysis with fewer than 10 studies. It should be highlighted that these results refer to normally distributed true study-specific effects, and simulation studies are necessary to compare the performance of the 15 methods described in this review. For example, Kontopantelis and Reeves173 used various nonnormal distributions for the effect sizes and compared the WTz, WTqa, HKSJ, PL, BT, and FP methods. The authors showed that simulation results were broadly consistent across different effect size distributions (normally distributed, skew-normal, and extreme nonnormal study effects) with PL providing the best coverage, but with wide CI. Also, the FP method provided coverage close to the nominal level, regardless the included number of studies, at the expense of highly lengthy CIs. The HKSJ method had a consistent 94% coverage for nonnormal study effects and small heterogeneity, but with larger heterogeneity, the FP method performed better than the HKSJ CI. For a small number of studies (≤5), the WTz and PL methods performed best, with WTz outperforming PL only when the between-study variance was small. However, more simulation studies with nonnormal true study-specific effects are required to draw robust conclusions for the 15 CIs across different meta-analytical scenarios.

We also recommend the calculation of prediction intervals as a supplement to a CI to illustrate the degree of heterogeneity, particularly when large between-study heterogeneity is present. However, caution is needed for small between-study heterogeneity and unequal study sizes. In this case, it is advisable to prefer prediction intervals derived in Bayesian framework using, for example, informative prior distributions.20, 21 Should any new methods become available, we recommend that these are compared with most, or ideally all, of the methods described in this review, and under the same circumstances both using real-life data and simulation studies. In Appendices S6 to S10, we present a selection of the identified methods for computing a CI to four illustrative, and contrasting, real data examples.174 We hope that our codes presented in Appendices S8A-C and S9 can help to make this possible. This will help obtain a clearer picture about the performance of these methods when these are compared with each other.

List of Abbreviations

  • BT
  • Biggerstaff and Tweedie
  • CI
  • confidence interval
  • CrIs
  • credible intervals
  • FP
  • Follmann and Proschan
  • HC
  • Henmi and Copas
  • HKSJ
  • Hartung-Knapp/Sidik-Jonkman
  • IQWiG
  • Institute for Quality and Efficiency in Health Care
  • PL
  • profile likelihood
  • WT
  • Wald-type
  • WTqa
  • quantile approximation
  • WTt
  • Wald-type with a t distribution
  • WTz
  • Wald-type with a normal distribution
  • ZL
  • Zeng and Lin
  • AVAILABILITY OF DATA AND MATERIALS

    All data are from a published study and are available in the supporting file.

    DETAILS OF CONTRIBUTORS

    Areti Angeliki Veroniki, Dan Jackson, Ralf Bender, Oliver Kuss, Dean Langan, Julian P.T. Higgins, Guido Knapp, and Georgia Salanti contributed to the conception and design of the study and helped to draft the manuscript. Areti Angeliki Veroniki searched PubMed, screened identified articles for eligibility, contacted original study authors when needed, and conducted the statistical analysis. All authors read and approved the final manuscript.

    FUNDING INFORMATION

    This work did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. Ralf Bender is employed by the Institute for Quality and Efficiency in Health Care, Cologne, Germany.

    ACKNOWLEDGMENTS

    We thank Dr. Jack Bowden and Dr. Wolfgang Viechtbauer for providing their feedback on a previous draft of the paper. We thank Annamaria Guolo and Michael Preuss for responding to our questions regarding their approaches. We also thank Shazia Siddiqui and Myanca Rodrigues for formatting the manuscript.

      CONFLICT OF INTEREST

      The authors declare that there is no conflict of interest.

        The full text of this article hosted at iucr.org is unavailable due to technical difficulties.