Volume 4, Issue 1, pp. 26-35
Special Issue Paper

Issues relating to confounding and meta-analysis when including non-randomized studies in systematic reviews on the effects of interventions

Jeffrey C. Valentine (corresponding author)

College of Education and Human Development, University of Louisville, Louisville, KY, U.S.A.

Correspondence to: Dr. Jeffrey Valentine, 309 College of Education and Human Development, University of Louisville, Louisville, KY 40292, U.S.A. E-mail: [email protected]

Simon G. Thompson

Department of Public Health, University of Cambridge, Cambridge, U.K.
First published: 06 November 2012

Abstract

Background

Confounding caused by selection bias is often a key difference between non-randomized studies (NRS) and randomized controlled trials (RCTs) of interventions.

Key methodological issues

In this third paper of the series, we consider issues relating to the inclusion of NRS in systematic reviews on the effects of interventions. We discuss whether potential biases from confounding in NRS can be accounted for, the limitations of current methods for attempting to do so, the different contexts of NRS and RCTs, the problems these issues create for reviewers, and a research agenda for the future.

Guidance

Reviewers who are considering whether or not to include NRS in meta-analyses must weigh a number of factors. Including NRS may allow a review to address outcomes or pragmatic implementations of an intervention not studied in RCTs, but it will also increase the workload for the review team, as well as their required technical repertoire. Furthermore, the results of a synthesis involving NRS will likely be more difficult to interpret, and less certain, relative to the results of a synthesis involving only randomized studies. When both randomized and non-randomized evidence are available, we favor a strategy of including NRS and RCTs in the same systematic review but synthesizing their results separately.

Conclusion

Including NRS will often make the limitations of the evidence derived from RCTs more apparent, thereby guiding inferences about generalizability, and may help with the design of the next generation of RCTs. Copyright © 2012 John Wiley & Sons, Ltd.

As discussed in the first paper in this series, review authors should always consider whether or not to include non-randomized studies (NRS) in a systematic review and explicitly justify their decision (Reeves et al.). The reasons why review authors might want to consider including NRS in a systematic review are varied. When an adverse event is rare or occurs long after the intervention, including NRS in systematic reviews may be desirable because randomized controlled trials (RCTs) often have inadequate power to detect a difference in harm between intervention and control groups and commonly do not follow up participants in the long term (Reeves et al., 2009; Loke et al., 2011). Another reason to include NRS in a systematic review is that there might be no or very few RCTs, so that there is a need to synthesize the best available evidence. As such, rigorous reviews of NRS are often necessary. In view of these considerations, this paper focuses on issues arising when considering whether to include NRS in systematic reviews on the effectiveness of interventions. Specifically, the purpose of this paper is to explore three issues. First, we address the question of whether it is advisable to combine different NRS types, or to combine NRS with RCTs, in the same meta-analysis. We then discuss the ways in which study quality assessments for NRS might be made, focusing on the risk of confounding in studies that rely on non-random allocation of participants to groups. Finally, we examine the impact of including NRS on the expertise required to undertake a review and provide some recommendations for future research.

From the outset, an important distinction must be drawn between a meta-analysis and a systematic review. We believe this distinction is important because, as will be seen, we are cautious about combining NRS of different types and combining NRS with RCTs, in a meta-analysis, but do see the potential for added value in including NRS in a systematic review. We view the latter as a systematic and transparent approach to the collection and evaluation of the literature on a specific research question. When we use the term “meta-analysis,” we are referring specifically to quantitative analysis of the results of multiple studies and note that a systematic review need not include a meta-analysis, nor must a systematic review be limited to a single meta-analysis. Further, not all meta-analyses are based on a systematic review of the literature (although in most cases they should be).

An experiment involves the deliberate introduction of an intervention to evaluate its effects. Philosophers of science (since at least Hume, 1739/1740, and especially Mill, 1843) have understood that assertions of causality rely on establishing the following: (i) temporal priority (i.e., the cause precedes the effect); (ii) an observed association between the cause and the effect; and (iii) the ruling out of alternative non-causal explanations. Building on this work in epidemiology, Hill set out nine aspects of an association that should be considered before deciding that it is most likely to be attributable to causation (Hill, 1965). When all of the concerns described by Hume and Mill are addressed, RCTs provide data that allow for the estimation of average causal effects and therefore represent the best study design available to quantify the effects of interventions. This is not to suggest that RCTs are free from design and implementation problems that cloud their interpretation – RCTs just have fewer of these. Attrition, for example, is a common source of bias that affects RCTs and may result in groups for which the expectation of baseline equivalence no longer holds. Thus, even RCTs are not free from risk of bias, and it is for this reason that Cochrane reviews routinely evaluate the risk of bias for included studies, even though the great majority of studies included in Cochrane reviews are RCTs (Higgins et al., 2008).

Like RCTs, NRS are used to investigate the effects of interventions (Higgins et al., 2012; Reeves et al., 2009). NRS are diverse in the following sense: (i) they encompass a great variety of designs, many of which exploit information in a different way from RCTs; and (ii) the designs that fall under the NRS umbrella vary in terms of their inferential strength. Regardless of the specific design, valid interpretation of the resulting statistics depends in part on the extent to which the construction of the counterfactual condition avoids biasing estimates of treatment effects. In between-groups designs, one primary concern is that the participants in one group might differ systematically from participants in another group. Provided it is carried out appropriately (i.e., with adequate concealment of the allocation sequence), the randomization scheme utilized by RCTs creates the expectation of baseline group equivalence on all measured and unmeasured variables. In NRS, allocation to groups depends on other factors, often unknown. Confounding occurs when selection bias gives rise to imbalances between intervention and control groups on prognostic factors, that is, the distributions of the factors differ between groups and the factors are associated with the outcome. The worry is that these differences account for some or all of the observed treatment effect (or make it more difficult to detect real treatment effects). Susceptibility to this problem (also known as allocation bias) is widely regarded as the principal difference between NRS and RCTs (Shrier, 2012).

1 Different non-randomized study designs

In this section, we provide a brief overview of some of the major types of NRS designs that are used to investigate the effects of interventions (specifically, the regression discontinuity design, the interrupted time series design, and the instrumental variable approach) and conclude by focusing the discussion on the NRS design that is most frequently included in meta-analyses alongside RCTs (the non-randomized controlled study).

1.1 Regression discontinuity

The regression discontinuity design was originally described by Thistlethwaite and Campbell (1960), later elaborated by Campbell and Stanley (1963), and has been used in a variety of applications in program evaluation and economics since that time (Shadish et al., 2002; Trochim, 1984). In this design, the assignment of participants to the treatment group depends on their score on a covariate. Participants are assigned to the treatment group if their scores on the covariate exceed a specific threshold (e.g., all those with severity ratings greater than 5 receive a certain treatment, and those with severity ratings of 5 or below do not). The covariate is usually correlated with the outcome variable, and this correlation is exploited to obtain an estimate of the treatment effect. As in a randomized experiment, the allocation mechanism in a regression discontinuity design is completely known, and as such, the design provides a strong basis for inferences (although randomized experiments are usually more efficient).

The treatment effect in a regression discontinuity design is evaluated by examining the predictions of two regression functions at the treatment assignment point (referred to as the average treatment effect at the cutoff). Suppose that the effect of the treatment is to decrease scores on the outcome for everyone who receives it. Then, the intercept for the regression function for the treatment group will differ from the intercept for the regression function for the control group. In other words, the regression functions will be discontinuous at the point of treatment assignment (giving the technique its name). Choice of inappropriate regression functions (e.g., linear versus non-linear) may result in biased treatment effect estimates.
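To make this estimation concrete, the following sketch (a minimal illustration in Python with hypothetical simulated data; all variable names are our own) fits the two regression functions in a single model containing a treatment indicator, the cutoff-centered covariate, and their interaction, so that the coefficient on the treatment indicator estimates the discontinuity at the cutoff.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical data: severity governs assignment, as in the example
# above (severity ratings greater than 5 receive the treatment).
n = 500
severity = rng.uniform(0, 10, n)
treated = (severity > 5).astype(float)
outcome = 10 + 0.8 * severity - 2.0 * treated + rng.normal(0, 1, n)

# Center the covariate at the cutoff so that the coefficient on the
# treatment indicator is the average treatment effect at the cutoff.
centered = severity - 5
X = sm.add_constant(np.column_stack([treated, centered, treated * centered]))
fit = sm.OLS(outcome, X).fit()
print(fit.params[1])  # estimated discontinuity (treatment effect at cutoff)
```

A linear specification is assumed on each side of the cutoff; as noted above, if the true functions are non-linear, this model would yield a biased estimate, and polynomial or local-linear specifications would be needed.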

1.2 Interrupted time series

In its simplest form, the interrupted time series design is a single-group approach that relies on the following: (i) knowing exactly when an intervention has been introduced; (ii) having a large number of observations both prior to and after the introduction of the intervention; and (iii) assuming that any changes observed after the introduction of the intervention are attributable to the intervention itself and not to changes in other conditions. Thus, interrupted time series designs use units of observation as their own controls. Most commonly, the treatment effect in an interrupted time series study is evaluated by examining changes to the level and slope of the outcome variable. This is performed statistically (in a way that is similar to that for the regression discontinuity design, but more complex because of the autocorrelation of the observations), visually, or both.
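The sketch below illustrates this logic with a minimal segmented regression on hypothetical data (not a full time series analysis): one term captures the change in level and another the change in slope at the intervention point. Heteroskedasticity-and-autocorrelation-consistent standard errors are used as a crude acknowledgment of autocorrelation; a fuller analysis would model the error structure directly (e.g., with ARIMA errors).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Hypothetical monthly series: 24 observations before and 24 after
# the intervention, introduced at time t0.
t = np.arange(48)
t0 = 24
after = (t >= t0).astype(float)            # change in level
time_since = np.where(t >= t0, t - t0, 0)  # change in slope
y = 50 + 0.2 * t - 3.0 * after - 0.3 * time_since + rng.normal(0, 1, 48)

X = sm.add_constant(np.column_stack([t, after, time_since]))
# HAC standard errors as a rough correction for autocorrelated errors.
fit = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 3})
print(fit.params[2], fit.params[3])  # estimated level change, slope change
```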

1.3 Instrumental variable techniques

Occasionally, instrumental variable techniques have been applied to NRS to obtain estimates of what is known as the local average treatment effect (Stukel et al., 2007). Instrumental variable analysis relies on the availability of a suitable “instrument,” specifically one that is not directly associated with the outcome of interest (i.e., the variable is exogenous) but is associated with assignment to study conditions (i.e., the variable is relevant; Bowden and Turkington, 1990; Angrist et al., 1996). When these conditions are satisfied, the effect size for the intervention can be estimated without bias (in large samples). However, if the assumptions are not met, then the estimates are biased to an unknown degree (in part because the extent of bias depends on unobservables), and as a practical matter, one rarely has certainty about whether the instrument is in fact valid (Imbens and Rosenbaum, 2005, put it nicely when they said that finding good instruments is an art, not a science). For these and other reasons, instrumental variable analyses are rarely used in meta-analyses of treatment effects.
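The basic logic can be sketched with hypothetical simulated data, as below: an unmeasured confounder biases the naive comparison, while two-stage least squares using a valid instrument approximately recovers the effect. This manual implementation is for illustration only; its second-stage standard errors are not valid, and dedicated 2SLS routines should be used in practice.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

# Hypothetical setting: u is an unmeasured confounder, z an instrument
# that affects treatment receipt but not the outcome directly.
n = 2000
u = rng.normal(0, 1, n)
z = rng.binomial(1, 0.5, n)
treat = (0.8 * z + u + rng.normal(0, 1, n) > 0.9).astype(float)
y = 1.0 * treat + u + rng.normal(0, 1, n)  # true effect = 1.0

# Naive regression is confounded by u and over-estimates the effect.
naive = sm.OLS(y, sm.add_constant(treat)).fit()

# Two-stage least squares: predict treatment from the instrument,
# then regress the outcome on the predicted treatment.
stage1 = sm.OLS(treat, sm.add_constant(z)).fit()
stage2 = sm.OLS(y, sm.add_constant(stage1.fittedvalues)).fit()
print(naive.params[1], stage2.params[1])  # biased vs. IV estimate
```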

1.4 Non-randomized controlled study

The NRS design that is most comparable with the usual, parallel-group RCT can be referred to as a non-equivalent control group study (Shadish et al., 2002) or, more commonly in medical contexts, as a non-randomized controlled study. In this design, treatment and control groups are formed in a non-random way. This can range from a haphazard scheme that might be functionally random (e.g., allocation by the last digit of an identifier) to a scheme that is more clearly problematic (e.g., allowing participants to choose their condition). This design is more common than the other three we have discussed, and because of its prevalence and structural similarity to the RCT, it is the design that researchers are most likely to consider synthesizing together with RCTs.

Given that the only structural difference between RCTs and non-randomized controlled studies is the method of allocation, selection bias is the one additional validity threat that does not also affect a well-conducted RCT. Probably the most common statistical method used to address selection bias in non-randomized controlled studies is adjustment for baseline covariates by analysis of covariance (ANCOVA). ANCOVA represents an attempt to estimate the average treatment effect as if the groups had been equivalent on the covariate(s) at baseline. The adjusted means arising from an ANCOVA can be thought of as providing a “better” estimate of the treatment effect and, along with the unadjusted standard deviations, can be used to compute a covariate-adjusted standardized mean difference suitable for meta-analysis.
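A minimal sketch of this computation, on hypothetical data, is shown below: the coefficient on the group indicator from the ANCOVA is the covariate-adjusted mean difference, which is then divided by the pooled unadjusted standard deviation to give the covariate-adjusted standardized mean difference.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)

# Hypothetical non-randomized groups that differ at baseline.
n = 200
group = np.repeat([0, 1], n)
baseline = rng.normal(0, 1, 2 * n) + 0.5 * group  # selection on baseline
post = 0.4 * group + 0.6 * baseline + rng.normal(0, 1, 2 * n)

# ANCOVA: regress the outcome on group plus the baseline covariate.
X = sm.add_constant(np.column_stack([group, baseline]))
fit = sm.OLS(post, X).fit()
adjusted_diff = fit.params[1]  # covariate-adjusted mean difference

# Standardize by the pooled *unadjusted* standard deviation, as in the text.
sd_pooled = np.sqrt((post[group == 0].var(ddof=1)
                     + post[group == 1].var(ddof=1)) / 2)
print(adjusted_diff / sd_pooled)  # covariate-adjusted standardized mean difference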

Two additional considerations regarding non-randomized controlled studies that use ANCOVA are worth noting. First, the interpretation of an ANCOVA is easiest when there is limited correlation between the treatment variable and the covariate. In some applications, this will not be the case. Imagine for example that the correlation between treatment condition and a covariate is high (e.g., >0.70). In these cases, there may be very little overlap between the treatment and control groups, and as such, the adjusted means might relate to a population that does not actually exist. In addition, there is an important distinction between asserting that ANCOVA results in a “better” estimate of a treatment effect and finding out whether it approximates the effect that would have been obtained from a randomized experiment. As we address in more detail later, there is usually no way of knowing what all of the important covariates are, and we will not know if we have omitted important covariates from our model.

Given this discussion, one of the advantages of RCTs over non-randomized controlled studies should be clear: for any study design to give rise to valid inferences, its underlying assumptions have to be met, and the assumptions for all NRS are inherently less transparent and less testable than those required for RCTs. As a result, in any given study, the extent to which researchers have been successful in approximating the results of an RCT will be unknown.

2 Key methodological issues (Box 1)

Box 1: Key methodological issues

  • Non-randomized studies may measure outcomes or be conducted in contexts not addressed well in randomized experiments.
  • The risk of confounding by selection bias is a key difference between randomized and non-randomized controlled studies.
  • The effects of selection bias can be large and unpredictable in direction.
  • The factors that may govern selection bias are often poorly measured or even unmeasurable.
  • The ability to control selection bias by statistical adjustment may be limited.
  • Non-randomized and randomized studies may differ with respect to contextual factors, leading to differences between their results.

Various research studies have attempted to investigate the conditions under which non-randomized controlled studies can approximate the results of RCTs. In medical contexts, the results of this work are not promising, suggesting that the bias associated with not randomly allocating participants to groups can be large, unpredictable in direction, and difficult to remove. For example, in empirical studies where non-randomized studies were derived from within a large RCT (Deeks et al., 2003), the biases could lead to consistent over-estimations or under-estimations of treatment effects, for both concurrently and historically controlled studies. A variety of strategies were attempted for adjusting for selection bias, but none was satisfactory in most situations. Although the results of non-randomized studies and RCTs do not necessarily differ (Britton et al., 1998), it appears difficult to anticipate whether they will or not in a particular application (MacLehose et al., 2000).

In the social sciences, there is research that is somewhat more encouraging about the possibility of NRS approximating the results of RCTs. For example, Shadish et al. (2008) randomly assigned participants to be in either an RCT or an NRS and then compared the results of the randomized experiment with results obtained from the NRS by several different estimation strategies (e.g., ANCOVA, propensity score matching). In general, the estimation strategies performed quite well. For example, ANCOVA (using covariates identified as predictors of the dependent variable via backwards selection) resulted in bias reductions of 84% and 94% for the two outcome variables, with most absolute differences between the RCT-adjusted and NRS-adjusted estimates being very small (e.g., standardized mean differences of <0.05). It should be noted that Shadish et al. had a richer set of covariates available to them than is probably typical and that propensity score matching with “convenient” matching variables (i.e., sex, age, marital status, and race) performed poorly. These findings suggest the benefits of careful planning. In another example, Cook et al. (2008) found three studies that employed both randomized experiment and regression discontinuity designs. In all three cases, the effects from the RCT were similar in magnitude to those from the regression discontinuity design.

Although this research field is still evolving, our best evidence at this point suggests that there may be identifiable conditions under which NRS are more likely to yield results that approximate RCTs. In addition to careful selection of potential covariates, selecting participants from the same local pool (i.e., all participants are from school A, as opposed to a design in which all treatment students are from school A and all comparison students are from school B) probably serves to reduce both observed and unobserved differences between groups at baseline.

These methods further depend on adjusting for precise and valid measurements of important confounding factors, and this can be thought of as a necessary condition for generating believable estimates from non-randomized controlled study designs (Cook et al., 2008).

As well as the issue of precise and valid measurement, a more difficult problem for systematic review authors in some research areas is identifying the important confounding factors that should have been measured by researchers. The goal of measuring potential confounding factors is to model the allocation process (i.e., the process by which groups are formed) to account for variables other than the treatment that are associated with the outcome. This is a difficult judgment because what constitutes an “important confounding factor” almost certainly depends on the specific research question being posed, the way in which comparison groups are formed, and the outcome being assessed. There is unfortunately little empirical research guiding the choice of important confounders, and as such, a deep level of substantive and methodological expertise is needed. The implication of these considerations is that determining what researchers should have measured and controlled for will rarely be an unambiguous question.

3 Statistical considerations (Box 2)

Box 2: Statistical considerations

  • The extent of selection bias arising from confounding may vary across the included non-randomized studies.
  • The extent of selection bias in non-randomized studies is hard to judge, especially from published information.
  • Some information on bias by confounding may be gained when individual studies present effect estimates adjusted for different sets of confounders.
  • A meta-analysis may give a precise estimate of average bias, rather than an estimate of the intervention's effect.
  • Heterogeneity between study results may reflect differential biases rather than true differences in an intervention's effect.

We have already noted that some NRS (e.g., interrupted time series) employ an inferential logic that may make their results difficult to combine with those of RCTs. Even for non-randomized controlled study designs (the NRS type most closely resembling a RCT), complications can arise when different sets of variables are adjusted for across studies. Although estimates of effect can be naively averaged across studies by inverse variance weighting, this may only provide a precise estimate of the average bias rather than a useful estimate of the intervention's effect (Egger et al., 2001). For these and other reasons, the synthesis of NRS can be controversial, and the simultaneous synthesis of NRS and RCTs even more so. In addition, the results of a synthesis of NRS are almost always more ambiguous than those of a synthesis involving only RCTs.
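For reference, inverse variance weighting in its simplest (fixed-effect) form is sketched below with hypothetical numbers. The point made in the text is that if each input estimate carries its own bias, this procedure yields a precise estimate of the effect plus the weighted average bias, not of the effect itself.

```python
import numpy as np

def inverse_variance_pool(estimates, variances):
    """Fixed-effect inverse-variance weighted average and its variance."""
    w = 1.0 / np.asarray(variances)
    est = np.sum(w * np.asarray(estimates)) / np.sum(w)
    return est, 1.0 / np.sum(w)

# Hypothetical study estimates (e.g., log odds ratios) and variances.
# If study i's estimate is effect + b_i, the pooled value converges to
# effect + (weighted average of b_i), however narrow its confidence interval.
est, var = inverse_variance_pool([0.30, 0.45, 0.25], [0.02, 0.05, 0.03])
print(est, np.sqrt(var))
```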

It is important for individuals conducting a systematic review and meta-analysis to list the confounders that have been adjusted for in each study in the attempt to control selection bias. Doing so may provide an opportunity to compare unadjusted and adjusted effect estimates or to compare estimates according to different degrees of control for potential confounders. Sometimes, this information can be used across studies to judge what would have been the effect of adjusting for further confounders in studies that only present unadjusted (or minimally adjusted) estimates. For example, adjustment for a baseline measure of the outcome (in an analysis of covariance) is always useful, in both decreasing bias and increasing precision.

The NRS selected for potential inclusion in a systematic review may use multiple regression, propensity score, or instrumental variable methods to adjust for confounding. Multiple regression and propensity score analysis (by either regression adjustment or matching) generally give similar results (D'Agostino, 1998). Both rely on baseline measurements that are thorough (i.e., all of the important variables are measured), precise, and valid. Neither can adjust for unmeasured, unmeasurable, or poorly measured factors that may be involved in the allocation process in an NRS, and the resulting bias from residual confounding may still be large and unpredictable in direction (Deeks et al., 2003). Attempting to estimate the maximum possible effect of confounding (Chiba et al., 2007) is not usually practicable – sophisticated analysis cannot rescue poor design.
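The sketch below (hypothetical simulated data) illustrates why multiple regression and propensity score adjustment often agree: when allocation depends only on measured covariates, both approximately recover the effect. The same code would return biased answers from both methods if allocation also depended on an unmeasured variable, which is the central worry raised above.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)

# Hypothetical measured baseline covariates that drive allocation.
n = 1000
x1, x2 = rng.normal(0, 1, n), rng.normal(0, 1, n)
p_treat = 1 / (1 + np.exp(-(0.8 * x1 - 0.5 * x2)))
treat = rng.binomial(1, p_treat).astype(float)
y = 0.5 * treat + 0.7 * x1 + 0.3 * x2 + rng.normal(0, 1, n)

# Adjustment 1: multiple regression on the measured covariates.
mr = sm.OLS(y, sm.add_constant(np.column_stack([treat, x1, x2]))).fit()

# Adjustment 2: propensity score estimated by logistic regression,
# then used as a regression adjustment.
ps = sm.Logit(treat, sm.add_constant(np.column_stack([x1, x2]))).fit(disp=0)
psr = sm.OLS(y, sm.add_constant(np.column_stack([treat, ps.predict()]))).fit()

# Both recover the effect here only because allocation depends solely
# on measured covariates; an unmeasured confounder would bias both.
print(mr.params[1], psr.params[1])
```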

Heterogeneity between study results is more likely when a meta-analysis includes NRS. Heterogeneity can result from differential biases affecting the studies, as well as from differences in context between NRS and RCTs if both types of study are included in the same meta-analysis. Thus, it will always be sensible to explore potential sources of heterogeneity between studies, to compare results between randomized studies and NRS when both types are included, and to adopt a random-effects meta-analysis approach that acknowledges the unexplained heterogeneity of results between studies. Even so, it has to be understood that the confidence interval around a summary meta-analytic estimate in these circumstances represents only identifiable statistical variation and does not fully reflect the uncertainty due to the unknown direction and magnitude of biases in each study (Turner et al., 2009).
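For completeness, a minimal implementation of the familiar DerSimonian-Laird random-effects approach is sketched below with hypothetical estimates. As the text emphasizes, the resulting confidence interval still reflects only identifiable statistical variation, not the unknown biases in each study.

```python
import numpy as np

def dersimonian_laird(estimates, variances):
    """Random-effects pooling with the DerSimonian-Laird tau^2 estimate."""
    y, v = np.asarray(estimates), np.asarray(variances)
    w = 1.0 / v
    fixed = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - fixed) ** 2)            # Cochran's Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)     # between-study variance
    w_re = 1.0 / (v + tau2)
    est = np.sum(w_re * y) / np.sum(w_re)
    return est, np.sqrt(1.0 / np.sum(w_re)), tau2

# Hypothetical mix of RCT and NRS estimates showing extra heterogeneity.
print(dersimonian_laird([0.10, 0.15, 0.40, 0.55], [0.01, 0.02, 0.02, 0.03]))
```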

4 Guidance for review authors (Box 3)

Box 3: Additional guidance for reviewers

  • Reviewers should identify the procedures used in the original studies to limit confounding by selection bias.
  • Important potential confounders should be listed, together with the extent to which the studies have addressed their comparability between groups (by either design or analysis).
  • The uncertain nature and extent of selection bias should render conclusions from meta-analyses including non-randomized studies more cautious.

In some cases, meta-analyzing NRS and RCTs together may be unobjectionable. For example, if the NRS is a regression discontinuity design and the effect of the intervention is linear, then estimates from the regression discontinuity design and the RCTs are expected to be similar. We believe such situations are relatively rare. As such, for the reasons described earlier, in most cases, particular caution is warranted when considering including RCTs and NRS in the same meta-analysis. If NRS are limited to non-randomized controlled study designs that have drawn intervention and control participants from the same local pool and if the effect sizes arising from the studies are adjusted for important confounding factors that have been carefully identified and measured a priori, then it might be defensible to include these studies in the same meta-analysis along with RCTs. However, two other points merit consideration. First, the studies should generally be similar in terms of other aspects of risk of bias (e.g., performance, detection, and attrition bias). Otherwise, one might be comparing poorly executed RCTs with well-executed non-randomized controlled studies (or vice versa).

Further, the studies should be similar in other respects as well, which can be defined in terms of the population, intervention, comparator, and outcome (PICO) (MacLehose et al., 2000). This can often be a difficult judgment (Valentine et al., 2011). Lipsey and Wilson (2001) put it well when they said that intervention studies tend to have “personalities” and, at least in some contexts, there is good reason to believe that the characteristics of RCTs and non-randomized controlled study designs tend to differ in many more ways than just how the comparison groups are formed. As an example, in Kownacki and Shadish's (1999) review of Alcoholics Anonymous (AA) programs, three RCTs and nine NRS were included. A meta-analysis of the RCTs suggested a moderate negative effect for AA, whereas a meta-analysis of the NRS suggested a moderate positive effect. At first sight, this might be taken as evidence of bias in the NRS. However, all three RCTs involved participants who had been ordered to participate in treatment, whereas eight of the nine NRS involved participants who had volunteered for treatment. Part of the theory of action underlying AA is that participants must attend of their own volition, and therefore, it could be plausibly argued that court-mandated treatment should not be effective. This example points to the difficulties involved in untangling differential allocation (or other) bias from PICO differences between study designs.

Other studies suggest additional reasons to be cautious about combining RCTs and NRS in the same meta-analysis. For example, in their review of school-based prevention efforts, Valentine et al. (2009) found that RCTs differed from NRS along a number of dimensions. Among other differences, interventions that were studied using RCTs were more concentrated (i.e., more contact time per week, but lasting fewer weeks) relative to interventions studied using NRS, were more likely to be implemented by the study authors (or those working closely with them), and their published evaluations were more likely to be authored by the intervention developers. This relatively high degree of developer involvement is potentially problematic. For example, across a large number of literature reviews, Petrosino and Soydan (2005) observed effect sizes that depended on who conducted the evaluation. When interventions were evaluated by independent researchers, the effect size was smaller than when evaluated by the developers or their colleagues. This finding could be due to a degree of role conflict when authors are studying their own intervention, or it may simply be that program developers are the best implementers of their own interventions (Valentine et al., 2011). Regardless, in the context of school-based prevention efforts, one might expect RCTs to yield larger effects than NRS because of the additional involvement of the program developers, and this difference in the precise nature of the intervention makes it more difficult to justify including the RCTs and the NRS in the same meta-analysis. Although such explanations will always remain somewhat speculative, similar expectations may apply in other contexts as well (Lipsey, 1992).

Assessing the risk of bias caused by confounding is not straightforward. Confounders relating to the research question of interest need to be identified at the protocol stage and, ideally, classified by their importance. Although some suggestions are available (Valentine and Cooper, 2008), there is no established way to do this. An assessment of the risk of confounding should consider whether each confounder was measured, how precisely each was measured (e.g., binary, ordered categories, continuously scaled), how each was taken into account in the design and analysis strategy (e.g., matching, multiple regression, propensity scores), and the underlying assumptions about the association between the confounder and the outcome (e.g., linearity of the functional form). An example of how this was performed in a meta-analysis of observational studies is provided by Thompson et al. (2011). For an optimal assessment in a systematic review, this information needs to be reported consistently for all included studies. The Cochrane NRS Methods Group worksheet provides a tool for doing this (Cochrane Non-Randomized Studies Methods Group, 2010). If information is not available for some primary studies, a uniform judgment about the risk of confounding across included studies cannot be made.

We have also already discussed the possibility that the contexts of NRS may differ from those of RCTs in ways that are important. Therefore, including NRS in the same systematic review can add to our understanding of the totality of the evidence, for example by pointing out gaps in the evidence (such as understudied populations). As such, even though we are skeptical of simplistic efforts to include RCTs and NRS in the same meta-analysis and suggest that this should be carried out only cautiously if at all, we generally do believe that there is additional value to be gained by including NRS in the same systematic review as RCTs. NRS can be included for descriptive purposes only (e.g., by creating a forest plot that omits the overall summary effect size) or could be synthesized separately from RCTs. Descriptive forest plots could include a “traffic light” indicator (as provided by RevMan 5 in a separate table) to show the risk of different biases for individual studies, alongside their point estimates, allowing visual inspection of any obvious relationship between effect size and risk of bias (see Valentine et al., 2010, for other suggestions on how the results of included studies can be presented descriptively).
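A descriptive forest plot of this kind is straightforward to produce outside RevMan. The sketch below (hypothetical studies and risk-of-bias judgments, using matplotlib) colors each point estimate with a traffic-light judgment and deliberately omits a summary diamond.

```python
import matplotlib.pyplot as plt

# Hypothetical study-level estimates with 95% CIs and an overall
# risk-of-bias judgment per study; no summary estimate is drawn.
studies = ["RCT 1", "RCT 2", "NRS 1", "NRS 2", "NRS 3"]
est = [0.12, 0.20, 0.35, 0.48, 0.30]
lo = [0.01, 0.05, 0.15, 0.20, 0.02]
hi = [0.23, 0.35, 0.55, 0.76, 0.58]
rob = ["green", "green", "yellow", "red", "yellow"]  # "traffic light"

fig, ax = plt.subplots()
ypos = range(len(studies))[::-1]
for y, e, l, h, c in zip(ypos, est, lo, hi, rob):
    ax.plot([l, h], [y, y], color="gray")         # confidence interval
    ax.plot(e, y, "s", color=c, markersize=8)     # point estimate, colored by risk of bias
ax.set_yticks(list(ypos))
ax.set_yticklabels(studies)
ax.axvline(0, linestyle=":", color="black")       # line of no effect
ax.set_xlabel("Standardized mean difference")
plt.show()
```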

The additional expertise required to synthesize NRS should not be underestimated. RCTs make relatively few assumptions and, as such, represent the easiest type of study with which to work. We have already touched on some of the different assumptions required for different NRS designs, and it should be clear that assessing these is difficult. In part, this difficulty is due to the way information is presented in study reports (e.g., incompletely and in summary form). However, part of the difficulty is related to the assumptions themselves, which are inherently less transparent and more ambiguous than those for RCTs. Furthermore, virtually all studies face pragmatic implementation issues. However, those facing NRS can be particularly difficult to address, in part because researchers are less familiar with them and therefore do not anticipate their occurrence. Among these are fuzzy cutoffs in the regression discontinuity design (i.e., when the treatment is approximately, but not exactly, determined by the cutoff score) and delayed implementation in interrupted time series designs (e.g., when a treatment is implemented slowly in a population, with no formal accounting of when treated units actually start receiving treatment). In addition, for some aspects of risk of bias, such as whether the important confounders have been measured, as discussed earlier, a deep understanding of the context of the research question is required. Teams wishing to synthesize NRS will need even more topic-specific, statistical, and methodological expertise than would be required for a similar review of RCTs (consider, for example, the difficulties in determining whether a valid instrument for an instrumental variable analysis has been identified). This will require additional training of review authors or assembling a multidisciplinary team with the requisite expertise.

Furthermore, interpreting the results of RCTs and NRS can be difficult. If both kinds of designs yield similar results, then this might reasonably be taken as suggesting a lack of substantial bias in the NRS. However, if different designs yield dissimilar answers, it will usually be unclear whether this is due to selection (or other) biases in the NRS, differences in PICO aspects of the research questions being addressed between study designs, or both. Sometimes, reviewers will be in a position to explore reasons for the discrepancy, but they will usually not be able to arrive at definitive statements regarding the source of the disagreement. This can be a disconcerting situation for review authors and for their readers. That said, if NRS yield different answers from RCTs, this situation does not go away by ignoring the NRS altogether. In other words, the problem exists regardless of whether reviewers decide to review the NRS or not – it is just evident when they collect and synthesize the NRS and hidden when they do not. Our preference in these situations is to give more weight to the RCTs when interpreting the overall body of evidence, unless a plausible PICO difference between study designs is evident. When this is the case, one important role for a systematic review is to help inform the design of the next generation of RCTs so that researchers can begin to untangle the different contextual factors.

Risk of bias assessments for other kinds of NRS will need more development and elaboration. Consider as an example the interrupted time series design. For inferences regarding the effects of the intervention to be valid, several conditions need to be met. Among these is that no other changes were introduced along with the intervention that might explain the apparent intervention effect. Violations can be hard to diagnose in the absence of a deep understanding of the broader context of the study; furthermore, this is not a relevant concern in the analysis of randomized experiments. As such, reviewers wishing to include interrupted time series designs will need to develop specific items that assess this and other relevant dimensions unique to the design.

Finally, review authors should emphasize that confidence intervals from NRS (for both individual studies and any combined estimate) reflect only sampling error, and that the true uncertainty around point estimates will inevitably be larger because of unknown biases.

5 Research priorities (Box 4)

Box 4: Research priorities

  • How to use individual participant data from one or more studies to gain insight into the extent of confounding in all studies.
  • How to gain empirical evidence on selection bias from studies of many meta-analyses and use this in an individual meta-analysis.
  • How to use elicited expert opinion on the effect of confounding in individual studies in an overall meta-analysis.
  • How to represent quantitatively the increased uncertainty of meta-analytic conclusions from potentially biased primary studies.
  • Ascertaining the degree of selection bias in NRS of adverse outcomes.

There is greater scope for successfully addressing selection and other biases in NRS when individual participant data (IPD) are available. For example, even when IPD are available for just one or a few studies, this can provide the opportunity to investigate the effects of adjusting for different confounders or to compare partially and more fully adjusted estimates of effect, thus providing insight into the potential selection biases in other studies available only in published summary form. If IPD are available from all studies, results from studies that can only provide partial adjustment and those that allow fuller adjustment for confounding factors can be combined using a bivariate meta-analysis approach. This has been used to synthesize estimates of risk factor-disease associations across multiple observational studies for which different confounders were available (The Fibrinogen Studies Collaboration, 2009). In using such a technique, there is an assumption of exchangeability between studies in the effects of confounding. Similar methods have been proposed for when relevant correlations in before-after studies are not reported in all studies (Abrams et al., 2000). As such, more routine reporting and warehousing of IPD is a priority, and methods for analyses that integrate IPD and summary statistics should continue to be developed.

Methodology for addressing the uncertain effects of biases is also developing. Elicited expert opinions can be used formally to judge the magnitude and uncertainty of biases in each study, and these bias distributions can then be incorporated into a bias-adjusted meta-analysis. This work has recently been extended specifically for use in observational epidemiological studies, where confounding is often the prime concern (Thompson et al., 2011), and it could be adapted to NRS of interventions. These methods separate internal biases (relating to the rigor of study design and execution) from external biases (relating to the representativeness, or generalizability, of the studies to the target context). The uncertainty in the final meta-analytic estimate includes both the sampling variation from the original study estimates and the uncertainties in the elicited biases.
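One simple version of such a bias-adjusted meta-analysis is sketched below with hypothetical elicited values: each study's estimate is shifted by the elicited mean bias, and the elicited uncertainty about that bias is added to the study's sampling variance before pooling. This is only a stylized rendering of the general approach described above, not the full method of Thompson et al. (2011).

```python
import numpy as np

def bias_adjusted_pool(estimates, variances, bias_means, bias_vars):
    """Pool estimates after subtracting elicited mean biases and adding
    elicited bias variances to each study's sampling variance."""
    y = np.asarray(estimates) - np.asarray(bias_means)
    v = np.asarray(variances) + np.asarray(bias_vars)
    w = 1.0 / v
    return np.sum(w * y) / np.sum(w), np.sqrt(1.0 / np.sum(w))

# Hypothetical elicited opinions: the NRS (studies 3 and 4) are judged
# likely biased upward, with considerable uncertainty about how much.
print(bias_adjusted_pool(
    estimates=[0.10, 0.15, 0.40, 0.55],
    variances=[0.01, 0.02, 0.02, 0.03],
    bias_means=[0.0, 0.0, 0.15, 0.15],
    bias_vars=[0.0, 0.0, 0.05, 0.05]))
```

Note how the elicited bias variances widen the pooled confidence interval, directly addressing the point above that conventional intervals reflect only sampling error.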

These ideas would have a stronger basis if bias estimates could be informed by empirical evidence, rather than judgment alone. For example, Welton et al. (2009) used an estimate of the degree of bias introduced by inadequate allocation concealment to create evidence-based priors that were then used in a different meta-analysis (on the effects of two different classes of drugs on schizophrenia). The empirical evidence needed to derive evidence-based priors may come from the BRANDO study (Savovic et al., 2010) in due course; this study has combined data from all previous methodological reviews of the biases arising from sub-optimal features of study design, for example, sequence generation, allocation concealment, and blinding. However, it is likely that some (or even most) identifiable biases are context specific, and as such, the extent to which identified biases operate in the same way and to the same extent from one research context to the next is questionable. Further, it should be noted that although techniques for representing additional uncertainty due to the presence of biases are of interest to researchers and advanced users, they need to be developed further and applied more widely before being integrated into the usual systematic review practice.

Finally, we have already noted that RCTs are typically not designed to detect evidence of adverse effects (e.g., they are not adequately powered for rare events and do not have sufficiently long follow-up periods). This characteristic of RCTs may increase the need for synthesis of NRS. In this regard, there is recent evidence that confounding from allocation may not, on average, cause bias in the estimation of adverse effects. Golder et al. (2011) carried out a systematic review of 19 studies that reported 58 meta-analyses of adverse effects and compared the effect sizes obtained by combining the RCTs and the NRS separately; unadjusted for confounding factors, the overall ratio of odds ratios comparing NRS with RCTs was 1.03 (95% confidence interval 0.93 to 1.15). This result suggests that there may be less concern over direct confounding by indication in NRS that investigate adverse outcomes. However, it is clear that much more research needs to be conducted on this question.
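A ratio of odds ratios of the kind reported by Golder et al. can be computed from two pooled log odds ratios, as sketched below (hypothetical numbers, and assuming the two pooled estimates are independent, which is plausible when different studies contribute to each design group).

```python
import numpy as np

def ratio_of_odds_ratios(log_or_nrs, se_nrs, log_or_rct, se_rct):
    """Ratio of pooled odds ratios (NRS vs. RCT) with a 95% CI."""
    diff = log_or_nrs - log_or_rct
    se = np.sqrt(se_nrs ** 2 + se_rct ** 2)
    return np.exp(diff), np.exp([diff - 1.96 * se, diff + 1.96 * se])

# Hypothetical pooled log odds ratios for one adverse outcome.
print(ratio_of_odds_ratios(0.25, 0.10, 0.20, 0.12))
```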

6 Conclusion

In this paper, we considered the effects of confounding by selection bias with respect to the inclusion of NRS in systematic reviews and meta-analyses. We discussed whether such biases can be accounted for, the limitations of current methods, the different contexts of NRS and RCTs, the consequent practical issues for review authors, and research priorities for the future. Review authors who are considering whether or not to include NRS in meta-analyses, alone or in combination with RCTs, must weigh a number of factors. Including NRS will increase the workload for the review team, as well as the technical expertise required. Furthermore, the results of a synthesis involving NRS will likely be more difficult to interpret, and less certain, relative to the results of a synthesis involving only RCTs. We favor a strategy of including NRS and RCTs in the same systematic review, but synthesizing their results separately. Including NRS will often make the limitations of the evidence derived from RCTs more apparent, thereby guiding inferences about generalizability, and may help with the design of the next generation of RCTs.

Funding

The workshop was supported financially by the Agency for Healthcare Research and Quality and by a grant from the Cochrane Collaboration Discretionary Fund. The views expressed in this article are those of the authors and not necessarily those of the funding bodies, The Cochrane Collaboration or its registered entities, committees or working groups, or The Campbell Collaboration.

Acknowledgements

Barnaby Reeves, Peter Tugwell, and George Wells commented on earlier drafts of this paper in their role as editors of this series of papers. We are grateful to all of the workshop participants (see the Appendix to the first paper in this series), all of whom contributed to the discussions providing the foundation for this paper. The views expressed here are those of the authors and do not necessarily reflect a consensus of the workshop participants.

Disclosures

JV contributes to the Cochrane Collaboration and to the Campbell Collaboration.
