Volume 76, Issue 4 p. 444-447
Editorial

Hundreds of thousands of zombie randomised trials circulate among us

J. P. A. Ioannidis

Professor

Departments of Medicine, Epidemiology and Population Health, Biomedical Data Science, and Statistics, Meta-Research Innovation Center at Stanford (METRICS), Stanford University, Stanford, CA, USA

Correspondence to: J. P. A. Ioannidis

E-mail: [email protected]

First published: 30 October 2020
This editorial accompanies a paper by Carlisle. Anaesthesia 2021; 76: 472–9.

In this issue, Anaesthesia publishes an overview of the journal's experience in trying to unearth submitted randomised controlled trials (RCTs) that appear to be false, as well as those whose data lack credibility so blatantly that they can be called ‘zombies’ [1]. The important work of John Carlisle, who single-handedly led this task over several years, has major implications for medical science and its publication systems. Carlisle probed the submitted papers for all 526 trials considered by Anaesthesia over 3 years (February 2017–March 2020) and also obtained individual level data from the authors of 153 trials. Availability of individual level data increased the odds of a trial being called false or zombie by 47-fold and 79-fold, respectively. Among the 153 trials examined in depth, 44% were deemed to be false and 26% were deemed to be zombies. Very few false and zombie trials could be detected based just on the originally submitted manuscript, without the benefit of accessing individual level data.

How widespread is the problem?

The rates of false trials and zombies are probably underestimates for the 153 trials examined in depth. Carlisle spent a lot of time on this exercise, perhaps investing more effort in each trial than many, if not most, of its authors. Nevertheless, the ability to detect major or even fatal flaws depends on whether there is any hint signalling these flaws and whether one can identify these hints. Summary group data presented in manuscripts offer few clues that something major might have gone wrong. In previous work [2-4], Carlisle had shown how examining the reported baseline characteristics per randomised group may identify occasional trials with implausible data. This led to a few major discoveries, for example, the one that led to the retraction, re-analysis and (debatable) republication of the study by Estruch et al., arguably the most prestigious trial in nutrition research [5, 6]. Individual level data offer far more opportunities to probe data veracity, but still do not tell the full story. Most trials classified as zombies had data spreadsheets with blatant problems. If we accept that these data had been manipulated or fabricated, then the manipulator/fabricator was silly to do something so egregious and easily caught, for example, cutting and pasting large data segments across the spreadsheet or reporting even numbers for all outcomes. Carlisle caught the obvious mess, the uneducated sloppiness. Sophisticated sloppiness and more dexterous manipulators would still remain undetected. Moreover, even though Carlisle is extremely experienced in doing these analyses, the effort resembles dissecting a corpse in search of proof of poisoning. Even the most experienced eye may miss these diagnoses. There is no comprehensive textbook of manipulations. Compiling such a trials forensics guide is not straightforward.
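
To make the baseline-characteristics idea concrete, the sketch below shows one simplified way such a check could be run from a published summary table: recover a p-value for each baseline variable from the reported per-group means, standard deviations and sample sizes, then ask whether the collection of p-values looks compatible with genuine random allocation. The numbers and variable names are hypothetical, and the use of Welch's t-test with a simple uniformity check is an illustrative assumption, not Carlisle's exact published method.

```python
# Illustrative check of reported baseline characteristics (hypothetical numbers).
import numpy as np
from scipy import stats

# Each entry: (mean_group1, sd1, n1, mean_group2, sd2, n2) from a baseline table.
baseline = {
    "age (years)": (54.2, 8.1, 60, 54.3, 8.0, 60),
    "weight (kg)": (71.5, 9.4, 60, 71.4, 9.5, 60),
    "height (cm)": (168.2, 6.3, 60, 168.1, 6.2, 60),
}

p_values = []
for m1, s1, n1, m2, s2, n2 in baseline.values():
    # Welch's t-test reconstructed from the summary statistics alone.
    _, p = stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2, equal_var=False)
    p_values.append(p)

print("Baseline p-values:", np.round(p_values, 3))

# Under genuine randomisation, baseline p-values should be roughly uniform on
# (0, 1); a pile-up near 1 (groups implausibly similar) or near 0 is a hint
# worth investigating, not proof of fabrication.
print("Departure from uniformity (Kolmogorov-Smirnov p-value):",
      round(stats.kstest(p_values, "uniform").pvalue, 3))
```

With only a handful of baseline variables such a test has little power; in practice one would pool information across many variables, or many trials, before drawing any conclusion.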

Could these data be extrapolated to all RCTs across medicine? The 153 trials investigated in depth were chosen deliberately, based on whether the summary data and manuscript raised any suspicion (2017–2019) or (in the last year) whether they came from ‘suspect’ countries that submitted many trials (Egypt, China, India, Iran, Japan, South Korea and Turkey). Therefore, they represent a biased, select sample. Not surprisingly, higher numbers of false and zombie trials were detected from these countries.

Data origin

Focusing on papers submitted during the last year from countries where spreadsheets were routinely sought, and limiting the analysis to trials where individual level data could be scrutinised, the rates of false trials were very high: 100% (7/7) in Egypt; 75% (3/4) in Iran; 54% (7/13) in India; 46% (22/48) in China; 40% (2/5) in Turkey; 25% (5/20) in South Korea; and 18% (2/11) in Japan. In most of these countries, except for Turkey and South Korea, all or almost all of the false trials were zombies. The WHO International Clinical Trials Registry lists (as of 26 September 2020): 6668 trials from Egypt; 27,064 from Iran; 33,134 from India; 54,746 from China; 8362 from Turkey; 19,680 from South Korea; and 51,095 from Japan. If one extrapolates the proportions of false trials observed in Anaesthesia in 2019–2020 for these countries across their WHO registry entries, one can estimate almost 90,000 registered false trials from these countries, including some 50,000 zombies. In fact, Anaesthesia only considers for publication trials that have been registered. Perhaps the majority of trials from these seven countries were not even registered [7, 8] (with variability in the registration rates across countries and over time). Moreover, the proportion of false and zombie trials may be higher among studies that don't even bother to register than among registered ones. If these assumptions are true, then 200,000–300,000 false trials and 100,000–200,000 zombies may be lurking from these seven countries alone.
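
The first step of this extrapolation is simple arithmetic: apply each country's observed false-trial proportion to its WHO registry count and sum. The sketch below reproduces it; the per-country zombie proportions are not reported in full, so only the "almost 90,000 registered false trials" figure is recomputed, and the rounding is an assumption.

```python
# Back-of-envelope reproduction of the extrapolation described above.

# Observed false-trial proportions among 2019-2020 Anaesthesia submissions
# with individual level data, as reported in the text.
false_rate = {
    "Egypt":       7 / 7,
    "Iran":        3 / 4,
    "India":       7 / 13,
    "China":       22 / 48,
    "Turkey":      2 / 5,
    "South Korea": 5 / 20,
    "Japan":       2 / 11,
}

# Trials listed in the WHO International Clinical Trials Registry
# for each country, as of 26 September 2020.
who_registered = {
    "Egypt":       6_668,
    "Iran":        27_064,
    "India":       33_134,
    "China":       54_746,
    "Turkey":      8_362,
    "South Korea": 19_680,
    "Japan":       51_095,
}

estimated_false = {c: false_rate[c] * who_registered[c] for c in false_rate}
for country, n in estimated_false.items():
    print(f"{country}: ~{n:,.0f} registered false trials")

# The total comes to roughly 87,000, i.e. "almost 90,000" registered false trials.
print(f"Total: ~{sum(estimated_false.values()):,.0f}")
```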

These estimates obviously have uncertainty, but they cover only a subset of the total false and zombie trials, since only seven countries were considered in these calculations. Trials from other countries were not as systematically examined by Carlisle, because his prior suspicion that they were false or zombie was lower. This may be appropriate, and the rates of false and zombie trials may indeed be much lower elsewhere. However, even then, there may be many tens, if not hundreds, of thousands of additional false and zombie trials from other countries. Moreover, as discussed above, a clean spreadsheet does not mean that a trial is free from manipulation. For example, it could just be that manipulators in the USA or Europe are more sophisticated and can hide fraud better than their colleagues in China or Egypt.

The number of trials submitted to Anaesthesia from countries known for their strong tradition in research seems low at first glance. In fact, the total number of trials submitted from the six most prolific countries in that group (France, Germany, Australia, USA, UK and Canada) is only 90, as opposed to 96 trials submitted from China alone. Of these six countries, only the USA (n = 147,442 trials) outperforms China (n = 54,746) in the number of trials entered in the WHO registry, and this includes trials registered over many years. In recent years, the gap has narrowed or even reversed. In many fields of research, China has outpaced the USA and other traditional research powers in publication volume [9, 10]. For example, as of 27 September 2020, the number of COVID-19 papers from China in LitCOVID is more than double the number of papers from the USA. A staggering 1551 trials on COVID-19 were registered by 19 May 2020 in clinicaltrials.gov, with a massive contribution coming from China [11]. Other countries, like the other six highlighted in the Anaesthesia corpus, are also becoming massive producers of clinical research. Most of this research may come from inexperienced investigators and from environments with suboptimal safeguards for the quality or integrity of the research. Several empirical studies highlight major deficits in trials from China and less developed countries and suggest that these studies may eventually disseminate biased inferences, for example, more statistically significant results and larger estimates of treatment benefits [12-14].

Studies done in the USA or Europe are not bulletproof either. Most likely, fewer of them would show egregious spreadsheet errors. Nevertheless, much academic research in these countries remains of low quality. Furthermore, in trials funded by industry and entering regulatory review, superficial errors would be detrimental to drug licensing. These trials could nevertheless be zombies for other, more subtle, but even more dangerous reasons including: choice of clinically irrelevant outcomes; misleading choice of comparators and non-inferiority designs; sophisticated but misleading statistical analyses; selective reporting; spin and other forms of dissemination bias; and collusion within the medical-industrial complex [15]. Such trials were probably under-represented in the Anaesthesia cohort, given the nature of the specialty of anaesthesia and of its interventions.

Research value

Are false and zombie trials, as defined by Carlisle, a total waste, or do they still have some value despite their flaws? First, one has to acknowledge that Carlisle did not provide an inter-observer agreement estimate for calling a trial false or zombie. Different editors and analysts may feel differently about the seriousness of detected problems. Some observed peculiarities and anomalies in the data may still be chance findings without any manipulation involved. However, most look very unlikely to reflect randomness.

A critical issue is whether the detection of a zombie signature invalidates the entire dataset. Perhaps only part of the data are spurious and the rest are valid. For example, one cannot exclude that some data were entered by an inexperienced student or even by a colleague with unethical behaviour, while the rest of the data are entirely valid. In the past, we have documented situations where fraudulent data were entered in a trial and this resulted in loss of trust not only in that trial itself but also in the entire organisation running many trials [16]. The specific trial results and conclusions were not affected by the thin slice of fraudulent data, and the other trials of the same investigators should certainly not have suffered, since intentions, methods and conduct were excellent. Clinical research is difficult and error-prone. The perfect trial does not exist, but imperfect trials may still offer some useful evidence and guidance.

Ultimately, however, the issue is not just whether the specific flawed data make a difference in the specific dataset. This could be addressed by excluding the questionable data and rerunning the analyses; this is how a debatable effort was made to salvage the study by Estruch et al. from definitive retraction [6, 7]. The deeper problem is that the detection of fatal flaws in some data, or in some aspect of the study, may justifiably call into question whether additional flaws, fatal mistakes or manipulations have happened in other study aspects that cannot be probed with whatever data are made available. Many aspects of clinical research remain non-transparent, even when individual level data are shared. Shared datasets are a final version that may have undergone tens of iterations of cleansing and manipulation since the first data points were entered upon initial collection. The fidelity of the randomisation, allocation concealment, blinding, visit schedule, data collection, data elicitation and many other key features of how a trial was actually run in the field remains largely unknown. A trial with a stupendous zombie pattern on a spreadsheet may or may not have had equally stupendous problems in other aspects of its design, conduct and write-up. Major problems may cluster in the same trials, even if the extent of clustering of major flaws is unknown in any single trial.

The intriguing editorial experience of Anaesthesia has implications across medical research. First, editorial review of trials should require access to individual level data, which would demand proper resources and time commitment. In personal communication, Carlisle confided that he devoted, on average, 3 h to each trial that provided spreadsheets, but less experienced editors and analysts may not be as efficient. Moreover, most journals currently run on very thin editorial operations.

Second, one dreads to think of other study designs, for example, observational research, that are even less likely to be regulated and more likely to be sloppy than randomised trials. In all, access to individual level data should become far more common in medical research. Counterarguments have been raised, but solutions exist to address them [17]. Sharing of individual level data from trials has already become standard in a few journals, such as British Medical Journal and PLoS Medicine, and the results of these trials appear robust to re-analysis [18]. Even if editors cannot look at every single detail, availability of data will enhance post-publication review options and allow many other fruitful uses of these data.

Third, systematic reviewers should be aware that false and zombie trials are very common and may be even more common in specific fields and in trials coming from some countries or teams. Many systematic reviews may need to tone down their confidence in their conclusions.

Finally, funders and regulators can also promote transparency by providing incentives for data sharing and/or penalties against hiding datasets. Similarly, the academic reward system should take transparency, openness and research rigour more seriously into account in its appraisals for hiring and promoting faculty [19]. There are already too many zombies circulating among us.

Acknowledgements

The Meta-Research Innovation Center at Stanford (METRICS) has been supported by a grant from the Laura and John Arnold Foundation and the work of JI is also supported by an unrestricted gift from Sue and Bob O’Donnell. No other competing interests declared.