Editorials

What’s holding up the big data revolution in healthcare?

BMJ 2018; 363 doi: https://doi.org/10.1136/bmj.k5357 (Published 28 December 2018) Cite this as: BMJ 2018;363:k5357
Kiret Dhindsa, postdoctoral fellow¹
Mohit Bhandari, professor²
Ranil R Sonnadara, associate professor²

¹Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada
²Department of Surgery, McMaster University, Hamilton, Ontario, Canada

Correspondence to: K Dhindsa dhindsj@mcmaster.ca

Poor data quality, incompatible datasets, inadequate expertise, and hype

Big data refers to datasets that are too large or complex to analyse with traditional methods.1 Instead we rely on machine learning—self updating algorithms that build predictive models by finding patterns in data.2 In recent years, a so called “big data revolution” in healthcare has been promised3 4 5 so often that researchers are now asking why this supposed inevitability has not happened.6 Although some technical barriers have been correctly identified,7 there is a deeper issue: many of the data are of poor quality and in the form of small, incompatible datasets.

Current practices around collection, curation, and sharing of data make it difficult to apply machine learning to healthcare on a large scale. We need to develop, evaluate, and adopt modern health data standards that guarantee data quality, ensure that datasets from different institutions are compatible for pooling, and allow timely access to datasets by researchers and others. These prerequisites for machine learning have not yet been met.

Part of the problem is that the hype surrounding machine learning obscures the reality that it is just a tool for data science with its own requirements and limitations. The hype also fails to acknowledge that all healthcare tools must work within a wide range of human constraints, from the molecular to the social and political. Each of these will limit what can be achieved: even big technical advances may have only modest effects when integrated into the complex framework of clinical practice and healthcare delivery.

Although machine learning is the state of the art in predictive big data analytics, it is still susceptible to poor data quality,8 9 sometimes in uniquely problematic ways.2 Machine learning, including its more recent incarnation deep learning,10 performs tasks involving pattern recognition (generally a combination of classification, regression, dimensionality reduction, and clustering11). The ability to detect even the most subtle patterns in raw data is a double edged sword: machine learning algorithms, like humans, can easily be misdirected by spurious and irrelevant patterns.12
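
To make these four tasks concrete, here is a minimal sketch in Python using scikit-learn on synthetic data; every dataset, parameter, and variable name below is illustrative and none comes from the studies cited in this editorial.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LinearRegression, LogisticRegression
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)

    # Classification: predict a discrete label (eg, diagnosis yes/no).
    classifier = LogisticRegression(max_iter=1000).fit(X, y)

    # Regression: predict a continuous quantity (eg, length of stay).
    outcome = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=200)
    regressor = LinearRegression().fit(X, outcome)

    # Dimensionality reduction: compress many variables into a few components.
    X_2d = PCA(n_components=2).fit_transform(X)

    # Clustering: group similar patients without using any labels.
    clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_2d)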

For example, medical imaging datasets are often riddled with annotations—made directly on the images—that carry information about specific diagnostic features found by clinicians. This is disastrous in the machine learning context, where an algorithm trained using datasets that include annotated images will seem to perform extremely well on standard tests but fail to work in a real world scenario where similar annotations are not available. Since these algorithms find patterns regardless of how meaningful those patterns are to humans, the rule “garbage in, garbage out” may apply even more than usual.13
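
This failure mode is easy to reproduce. In the toy simulation below (all data are synthetic; the “annotation” is simply an invented saturated pixel added to positive cases), a classifier scores almost perfectly while the artefact is present and falls to near chance once it is removed:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(42)
    n_pixels = 64

    def make_images(labels, annotated):
        # Each "image" is a flat vector of noisy pixels.
        X = rng.normal(size=(len(labels), n_pixels))
        X[:, 0] += 0.3 * labels       # weak genuine disease signal
        if annotated:
            X[:, -1] += 5.0 * labels  # clinician's mark on positive cases
        return X

    y_train = rng.integers(0, 2, size=1000)
    clf = LogisticRegression(max_iter=1000).fit(make_images(y_train, True), y_train)

    y_test = rng.integers(0, 2, size=1000)
    print("annotated test images:", clf.score(make_images(y_test, True), y_test))
    print("clean test images:", clf.score(make_images(y_test, False), y_test))
    # Typically near 1.00 on annotated images and barely above 0.5 on clean ones.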

Even if we had good data, would we have enough? Healthcare data are currently distributed across multiple institutions, collected using different procedures, and formatted in different ways. Machine learning algorithms recognise patterns by exploiting sources of variance in large datasets. But inconsistencies across institutions mean that combining datasets to achieve the required size easily introduces an insurmountable degree of non-predictive variability. This makes it all too easy for a machine learning algorithm to miss the truly important patterns and latch onto the stronger patterns introduced by institutional differences.
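
Again, a toy simulation illustrates the point. The hospitals, measurement offsets, and disease prevalences below are all invented; the model trained on naively pooled data leans on the site specific offset rather than the weak clinical signal, and so fails at an unseen site:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(7)

    def hospital(n, prevalence, offset):
        # Same five biomarkers everywhere, but each site's equipment and
        # protocols shift every measurement by a site specific offset.
        y = (rng.random(n) < prevalence).astype(int)
        X = rng.normal(size=(n, 5)) + offset
        X[:, 0] += 0.4 * y  # weak genuine disease signal
        return X, y

    Xa, ya = hospital(500, prevalence=0.2, offset=0.0)  # hospital A
    Xb, yb = hospital(500, prevalence=0.8, offset=2.0)  # hospital B: more disease

    clf = LogisticRegression(max_iter=1000).fit(
        np.vstack([Xa, Xb]), np.concatenate([ya, yb]))

    # At a new hospital the offset differs and the learnt shortcut breaks.
    Xc, yc = hospital(500, prevalence=0.5, offset=1.0)
    print("accuracy at unseen hospital:", clf.score(Xc, yc))  # near chance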

Holistic solution

A holistic solution to problems of data quality and quantity would include adoption of consistent health data standards across institutions, complete with new data sharing policies that ensure ongoing protection of patient privacy. If healthcare leaders see an opportunity to advance patient care with big data and machine learning, they must take the initiative to establish new data policies in consultation with clinicians, data scientists, patients, and the public.

Improved data management is clearly necessary if machine learning algorithms are to generate models that can transition successfully from the laboratory to clinical practice. How should we go about it? Effective data management requires specialist training in data science and information technology, and detailed knowledge of the nuances associated with data types, applications, and domains, including how they relate to machine learning. This points to a growing role for data management specialists and knowledge engineers who can pool and curate datasets; such experts may become as essential to modern healthcare as imaging technicians are now.14 Clinicians will also need training as collectors of health data and users of machine learning tools.15
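
As a hypothetical sketch of that curation work, the snippet below maps two invented hospital export formats onto one shared schema before pooling; the column names and coded values are illustrative only and are not drawn from any existing standard.

    import pandas as pd

    # Two invented export formats for the same kind of record.
    site_a = pd.DataFrame({"patient": [1, 2],
                           "glucose_mgdl": [90.0, 140.0],
                           "sex": ["M", "F"]})
    site_b = pd.DataFrame({"pid": [3, 4],
                           "gluc_mmol_l": [5.0, 7.8],
                           "gender_code": [1, 2]})

    # Map site B onto site A's schema: rename fields, convert units,
    # and translate coded values before pooling.
    harmonised_b = pd.DataFrame({
        "patient": site_b["pid"],
        "glucose_mgdl": site_b["gluc_mmol_l"] * 18.016,  # mmol/L to mg/dL
        "sex": site_b["gender_code"].map({1: "M", 2: "F"}),
    })

    pooled = pd.concat([site_a, harmonised_b], ignore_index=True)
    print(pooled)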

To truly realise the potential of big data in healthcare we need to bring together up-to-date data management practices, specialists who can maximise the usability and quality of health data, and a new policy framework that recognises the need for data sharing. Until then, the big data revolution (or at least a realistic version of it) remains on hold.

Footnotes

  • Competing interests: We have read and understood BMJ policy on declaration of interests and declare the following: RRS reports board membership for Compute Ontario, SHARCNET, and SOSCIP—all non-profit advanced research computing organisations. MB reports personal fees from Stryker, Sanofi, Ferring, and Pendopharm and grants from Acumed, DJO, and Sanofi outside the submitted work.

  • Provenance and peer review: Not commissioned; externally peer reviewed.

References