The role of machine learning in clinical research: transforming the future of evidence generation
E. Hope Weissler, Tristan Naumann, Tomas Andersson, Rajesh Ranganath, Olivier Elemento, Daniel F. Freitag, James Benoit, Michael C. Hughes, Faisal Khan, Paul Slater, Khader Shameer, Matthew Roe, Emmette Hutchison, Scott H. Kollins, Zhaoling Meng, Jennifer L. Wong, Lesley Curtis, Erich Huang, Marzyeh Ghassemi
Received 2021 Apr 30; Accepted 2021 Jul 26; Collection date 2021.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
Interest in the application of machine learning (ML) to the design, conduct, and analysis of clinical trials has grown, but the evidence base for such applications has not been surveyed. This manuscript reviews the proceedings of a multi-stakeholder conference to discuss the current and future state of ML for clinical research. Key areas of clinical trial methodology in which ML holds particular promise and priority areas for further investigation are presented alongside a narrative review of evidence supporting the use of ML across the clinical trial spectrum.
Conference attendees included stakeholders, such as biomedical and ML researchers, representatives from the US Food and Drug Administration (FDA), artificial intelligence technology and data analytics companies, non-profit organizations, patient advocacy groups, and pharmaceutical companies. ML contributions to clinical research were highlighted in the pre-trial phase, cohort selection and participant management, and data collection and analysis. Particular attention was paid to the operational and philosophical barriers to ML in clinical research. Peer-reviewed evidence was noted to be lacking in several areas.
Conclusions
ML holds great promise for improving the efficiency and quality of clinical research, but substantial barriers remain, the surmounting of which will require addressing significant gaps in evidence.
Keywords: Clinical trials as topic; Machine learning; Artificial intelligence; Research design; Research ethics
Interest in machine learning (ML) for healthcare has increased rapidly over the last 10 years. Though the academic discipline of ML has existed since the mid-twentieth century, improved computing resources, data availability, novel methods, and increasingly diverse technical talent have accelerated the application of ML to healthcare. Much of this attention has focused on applications of ML in healthcare delivery; however, applications of ML that facilitate clinical research are less frequently discussed in the academic and lay press (Fig. 1 ). Clinical research is a wide-ranging field, with preclinical investigation and observational analyses leading to traditional trials and trials with pragmatic elements, which in turn spur clinical registries and further implementation work. While indispensable to improving healthcare and outcomes, clinical research as currently conducted is complex, labor intensive, expensive, and may be prone to unexpected errors and biases that can, at times, threaten its successful application, implementation, and acceptance.
The number of healthcare-related publications was determined by searching “(“machine learning” or “artificial intelligence”) and (“healthcare”)”, and the number of clinical research–related publications was determined by searching “(“machine learning” or “artificial intelligence”) and (“clinical research”).”
Machine learning has the potential to help improve the success, generalizability, patient-centeredness, and efficiency of clinical trials. Various ML approaches are available for managing large and heterogeneous sources of data, identifying intricate and occult patterns, and predicting complex outcomes. As a result, ML has value to add across the spectrum of clinical trials, from preclinical drug discovery to pre-trial planning through study execution to data management and analysis (Fig. 2 ). Despite the relative lack of academic and lay publications focused on ML-enabled clinical research (vis-à-vis the attention to ML in care delivery), the profusion of established and start-up companies devoting significant resources to the area indicates a high level of interest in, and burgeoning attempts to apply, ML to clinical research, and specifically to clinical trials.
Areas of machine learning contribution to clinical research. Machine learning has the potential to contribute to clinical research through increasing the power and efficiency of pre-trial basic/translational research and enhancing the planning, conduct, and analysis of clinical trials
Key ML terms and principles may be found in Table 1 . Many of the ML applications discussed in this article rely on deep neural networks, a subtype of ML in which interactions between multiple (sometimes many) hidden layers of the mathematical model enable complex, high-dimensional tasks, such as natural language processing, optical character recognition, and unsupervised learning. In January 2020, a diverse group of stakeholders, including leading biomedical and ML researchers, along with representatives from the US Food and Drug Administration (FDA), artificial intelligence technology and data analytics companies, non-profit organizations, patient advocacy groups, and pharmaceutical companies convened in Washington, DC, to discuss the role of ML in clinical research. In the setting of relatively scarce published data about ML application to clinical research, the attendees at this meeting offered significant personal, institutional, corporate, and regulatory experience pertaining to ML for clinical research. Attendees gave presentations in their areas of expertise, and effort was made to invite talks covering the entire spectrum of clinical research with presenters from multiple stakeholder groups for each topic. Subjects about which presentations were elicited in advance were intentionally broad and included current and planned applications of ML to clinical research, guidelines for the successful integration of ML into clinical research, and approaches to overcoming the barriers to implementation. Regular discussion periods generated additional areas of interest and concern and were moderated jointly by experts in ML, clinical research, and patient care. During the discussion periods, attendees focused on current issues in ML, including data biases, logistics of prospective validation, and the ethical issues associated with machines making decisions in a research context. 
This article provides a summary of the conference proceedings, outlining ways in which ML is currently being used for various clinical research applications in addition to possible future opportunities. It was generated through a collaborative writing process in which drafts were iterated through continued debate about unresolved issues from the conference itself. For many of the topics covered, no consensus about best practices was reached, and a diversity of opinions is conveyed in those instances. This article also serves as a call for collaboration between clinical researchers, ML experts, and other stakeholders from academia and industry in order to overcome the significant remaining barriers to its use, helping ML in clinical research to best serve all stakeholders.
Key terms related to machine learning in clinical research
Term | Definition |
---|---|
Machine learning (ML) | A mathematical model that is able to improve its performance on a task by exposure to data. |
Deep neural networks | ML models with one or more latent (hidden) layers allowing for the generation of non-linear output and complex interactions between layers. Deep neural networks power “deep learning,” which enables tasks, such as image recognition, natural language processing (NLP), and complex predictions. Subtypes of deep neural networks are classified based on the relationship between hidden layers and include convolutional, recurrent, gated graph, and generative adversarial neural networks. |
Training, test, and validation sets | Training set: Dataset from which the model learns the optimal parameters to accomplish the task. Test set: Dataset on which the performance of a trained, parameterized model is evaluated. Validation set: Dataset that is used to evaluate the model’s performance during training. Differs from a test set in that it is used during training to establish hyperparameters of the model. |
Supervised learning | A subset of ML in which the outcomes to be learned by the model (“labels”) are provided in the training set. For example, teaching a model to identify breast cancer patients for study inclusion would require training the model on a training set containing labeled patients with and without breast cancer prior to validating that model on a new set of patients with and without breast cancer. |
Unsupervised learning | A subset of ML in which there are no pre-specified labels for the model to learn to predict; instead, models identify hidden patterns in the data. |
Natural language processing (NLP) | A form of artificial intelligence that enables the understanding of language. Much modern NLP uses deep neural networks in which words and their relationships to each other are encoded in a set of highly dimensional vectors, enabling the model to parse the meaning of new pieces of text it is presented with. |
The role of ML in preclinical drug discovery and development research
Successful clinical trials require significant preclinical investigation and planning, during which promising candidate molecules and targets are identified and the investigational strategy to achieve regulatory approval is defined. Missteps in this phase can delay the identification of promising drugs or doom clinical trials to eventual failure. ML can help researchers leverage previous and ongoing research to decrease the inefficiencies of the preclinical process.
Drug target identification, candidate molecule generation, and mechanism elucidation
ML can streamline the process and increase the success of drug target identification and candidate molecule generation through synthesis of massive amounts of existing research, elucidation of drug mechanisms, and predictive modeling of protein structures and future drug target interactions [ 1 ]. Fauqueur et al. demonstrated the ability to identify specific types of gene-disease relationships from large databases even when relevant data-points were sparse [ 2 ], while Jia et al. were able to extract drug-gene-mutation interactions from the text of scientific manuscripts [ 3 ]. This work, along with other efforts to render extremely large amounts of biomedical data interpretable by humans [ 4 , 5 ], helps researchers leverage and avoid duplicating prior work in order to target more promising avenues for further investigation. Once promising areas of investigation have been identified, ML also has a role to play in the generation of possible candidate molecules, for instance through use of a gated graph neural network to optimize molecules within the constraints of a target biological system [ 6 ]. In situations in which a drug candidate performs differently in vivo than expected, ML can synthesize and analyze enormous amounts of data to better elucidate the drug’s mechanism, as Madhukar et al. showed by applying a Bayesian ML approach to an anti-cancer compound [ 7 ]. This type of work helps increase the chance that drugs are tested in populations most likely to benefit from them. In the case of the drug evaluated by Madhukar et al., a better understanding of its mechanism facilitated new clinical trials in a cancer type (pheochromocytoma) more likely to respond to the drug (rather than prostate and endometrial cancers, among others).
Interpretation of large amounts of highly dimensional data generated during in vitro translational research (including benchtop biological, chemical, and biochemical investigation) informs the choice of certain next steps over others, but this process of interpretation and integration is complex and prone to bias and error. Aspuru-Guzik has led several successful efforts to use experimental output as input for autonomous ML-powered laboratories, integrating ML into the planning, interpretation, and synthesis phases of drug development [ 8 , 9 ]. More recently, products of ML-enabled drug development have approached human testing. For example, an obsessive-compulsive personality disorder drug purportedly developed using AI-based methods is scheduled to begin phase I trials this year. The lay press reports that the drug was selected from among only 250 candidates and developed in only 12 months compared with the 2000+ candidates and nearly five years of development more typically required [ 10 ]. However, due to the lack of peer-reviewed publications about the development of this drug, the details of its development cannot be confirmed or leveraged for future work.
Clinical study protocol optimization
As therapeutic compounds approach human trials, ML can help maximize the success and efficiency of trials during the planning phase by applying simulation techniques to large amounts of data from prior trials to facilitate trial protocol development. For instance, study simulation may optimize the choice of treatment regimens for testing, as shown by reinforcement learning approaches to Alzheimer’s disease and to non-small cell lung cancer [ 11 , 12 ]. A start-up company called Trials.AI allows investigators to upload protocols and uses natural language processing to identify potential pitfalls and barriers to successful trial completion (such as problematic inclusion/exclusion criteria or outcome measures) [ 13 ]. Unfortunately, the performance of these example models has not been evaluated in a peer-reviewed manner, and they therefore offer only conceptual promise that ML in research planning can help ensure that a given trial design is optimally suited to the stakeholders’ needs.
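The cited reinforcement learning work is considerably more sophisticated, but the underlying idea of adaptively allocating simulated participants among candidate regimens can be sketched with a toy epsilon-greedy bandit; the regimen names and response probabilities below are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical response probabilities for three candidate dosing regimens
true_response = {"low": 0.30, "mid": 0.55, "high": 0.45}
arms = list(true_response)

counts = {a: 0 for a in arms}
successes = {a: 0 for a in arms}

def choose_arm(eps=0.1):
    """Epsilon-greedy: explore with probability eps, else exploit the best estimate."""
    if rng.random() < eps or all(counts[a] == 0 for a in arms):
        return rng.choice(arms)
    return max(arms, key=lambda a: successes[a] / max(counts[a], 1))

# Simulate adaptive allocation over 2000 virtual participants
for _ in range(2000):
    arm = choose_arm()
    counts[arm] += 1
    successes[arm] += rng.random() < true_response[arm]

# The regimen with the most allocations is the simulation's preferred candidate
best = max(arms, key=lambda a: counts[a])
```

In a real planning exercise, the "response" would come from a simulator fit to prior trial data rather than fixed probabilities, and the allocation policy would be far more carefully designed.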
In summary, there are clear opportunities to use ML to improve the efficiency and yield of preclinical investigation and clinical trial planning. However, most peer-reviewed reports of ML use in this capacity focus on preclinical research and development rather than clinical trial planning. This may be due to the greater availability of suitable large, highly dimensional datasets in translational settings in addition to greater potential costs, risks, and regulatory hurdles associated with ML use in clinical trial settings. Peer-reviewed evidence of ML application to clinical trial planning is needed in order to overcome these hurdles.
The role of ML in clinical trial participant management
Clinical trial participant management includes the selection of target patient populations, patient recruiting, and participant retention. Unfortunately, despite the significant resources generally devoted to participant management, including time, planning, and trial coordinator effort, patient drop-out and non-adherence often cause studies to exceed allowable time or cost or to fail to produce usable data. In fact, it has been estimated that between 33.6% and 52.4% of phase 1–3 clinical trials that support drug development fail to proceed to the next trial phase, leading to a 13.8% overall chance that a drug tested in phase I reaches approval [ 14 ]. ML approaches can facilitate more efficient and fair participant identification, recruitment, and retention.
Selection of patient populations for investigation
Improved selection of specific patient populations for trials may decrease the sample size required to observe a significant effect. Put another way, improvements to patient population selection may decrease the number of patients exposed to interventions from which they are unlikely to derive benefit. This area remains challenging: prior work has found that, for the top-selling medications, between 3 and 24 patients fail to respond for every 1 who does, meaning that many patients are exposed to harmful side effects without receiving the intended benefit [ 15 ]. In addition to facilitating patient population selection through the rapid analysis of large databases of prior research (as discussed above), unsupervised ML of patient populations can identify patterns in patient features that can be used to select patient phenotypes that are most likely to benefit from the proposed drug or intervention [ 16 ]. Unstructured data is critical to phenotyping and identifying representative cohorts, indicating that considering additional data for patients is a crucial step toward identifying robust, representative cohorts [ 17 ]. For example, unsupervised learning of electronic health record (EHR) and genetic data from 11,210 patients elucidated three different subtypes of diabetes mellitus type II with distinct phenotypic expressions, each of which may have a different need for and response to a candidate therapy [ 18 ]. Bullfrog AI is a start-up that has sought to capitalize on the promise of targeted patient population selection, analyzing clinical trial data sets “to predict which patients will respond to a particular therapy in development, thereby improving inclusion/exclusion criteria and ensuring primary study outcomes are achieved” [ 19 ].
Though appealing in principle, this unsupported claim conflates outcome prediction (which is unlikely to succeed and runs counter to the intent of clinical research) with cohort selection (which would ideally identify patients on the basis of therapeutically relevant subtypes). Successfully identifying more selective patient populations does carry potential pitfalls: first, trials may be less likely to generate important negative data about subgroups that would not benefit from the intervention; and second, trials may miss subgroups who would have benefitted from the intervention, but whom the ML model missed. These potential pitfalls may be more likely to affect rural, remote, or underserved patient subgroups with more limited healthcare interactions. These two pitfalls carry possible implications for drug/device development regulatory approval and commercialization, as pivotal trials in more highly selected, and less representative, patient subgroups may require balancing the benefits of greater trial success with the drawbacks of more limited indications for drug/device use.
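As a minimal illustration of the unsupervised phenotyping idea discussed above, a clustering model can surface candidate subgroups in tabular patient features. The cohort below is synthetic, with three planted subgroups; real phenotyping would involve far richer data (including unstructured sources) and careful clinical validation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic cohort: three latent subgroups with different feature profiles
# (columns might represent, e.g., a lab value, systolic BP, and BMI)
cohort = np.vstack([
    rng.normal(loc=[5.5, 120, 25], scale=1.0, size=(100, 3)),  # subgroup A
    rng.normal(loc=[7.0, 140, 32], scale=1.0, size=(100, 3)),  # subgroup B
    rng.normal(loc=[9.0, 155, 38], scale=1.0, size=(100, 3)),  # subgroup C
])

# Standardize features so no single scale dominates the distance metric
X = StandardScaler().fit_transform(cohort)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Each cluster is a candidate phenotype to examine for differential response
sizes = np.bincount(labels)
```

In practice, the number of clusters is unknown and must itself be selected and validated, and clusters only become useful once they are shown to correspond to therapeutically relevant subtypes.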
Participant identification and recruitment
Once the specific cohort has been selected, natural language processing (NLP) has shown promise in identification of patients matching the desired phenotype, which is otherwise a labor-intensive process. For instance, a cross-modal inference learning model algorithm jointly encodes enrollment criteria (text) and patient records (tabular data) into a shared latent space, matching patients to trials using EHR data in a significantly more efficient manner than other machine learning approaches [ 20 ]. Some commercial entities offer similar services, including Mendel.AI and Deep6AI, though peer-reviewed evidence of their development and performance metrics is unavailable, raising questions about how these approaches perform [ 21 , 22 ]. A potential opportunity of this approach is that it allows trialists to avoid relying on the completeness of structured data fields for participant identification, which has been shown to significantly bias trial cohorts [ 23 , 24 ]. Unfortunately, to the extent that novel ML approaches to patient identification rely on EHRs, biases in the EHR data may affect the algorithms’ performances, leading to replacement of one source of bias (underlying the completeness of structured data) with another (underlying the generation of EHR documentation).
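The cross-modal model cited above learns a joint latent space with deep networks; as a deliberately simplified stand-in, eligibility criteria and patient notes can be embedded in a shared TF-IDF space and patients ranked by cosine similarity. All of the text below is invented, and note the key limitation visible even here: a bag-of-words model cannot distinguish "heart failure" from "no history of heart failure":

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

criteria = "adults with type 2 diabetes on metformin, no history of heart failure"

notes = [
    "58-year-old with type 2 diabetes, well controlled on metformin",
    "42-year-old with heart failure and reduced ejection fraction",
    "65-year-old with hypertension, no diabetes",
]

# Fit one vocabulary over criteria and notes so both live in the same space
vec = TfidfVectorizer().fit([criteria] + notes)
sims = cosine_similarity(vec.transform([criteria]), vec.transform(notes))[0]

ranked = sims.argsort()[::-1]  # patient indices, best match first
```

Deep models address exactly this weakness by encoding word order, negation, and the tabular structure of EHR data, which is why the cited approach outperforms simpler matching.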
Participant retention, monitoring, and protocol adherence
Two broad approaches are available to improve participant retention and protocol adherence using ML-assisted methods. The first is to use ML to collect and analyze large amounts of data to identify and intervene upon participants at high risk of study non-compliance. The second approach is to use ML to decrease participant study burden and thereby improve participants’ experiences.
AiCure is a commercial entity focused on protocol adherence, using facial recognition technology to confirm that patients take their assigned medication. AiCure was demonstrated to be more effective than a modified directly observed therapy strategy at detecting and improving patient adherence in both a schizophrenia trial and an anticoagulation trial among patients with a history of recent stroke [ 25 , 26 ]. Unfortunately, AiCure’s model development and validation process has not been published, heightening concerns that it may perform differently in different patient subgroups, as has been demonstrated in other areas of computer vision [ 27 ]. Furthermore, these approaches, though promising, face potential barriers to implementation: their perceived invasion of privacy may not be acceptable to all research participants, and selecting for patients with access to and comfort with the necessary devices and technology may introduce bias.
The other approach to improving participant retention uses ML to reduce the trial burden for participants using passive data collection techniques (methods will be discussed further in the “Data collection and management” section) and by extracting more information from available data generated during clinical practice and/or by study activities. Information created during routine clinical care can be processed using ML methods to yield data for investigational purposes. For instance, generative adversarial network modeling of slides stained with hematoxylin and eosin in the standard clinical fashion can detect which patients require more intensive and expensive multiplexed imaging, rather than subjecting all participants to that added burden [ 28 ]. NLP can also facilitate repurposing of clinical documentation for study use, such as auto-populating study case report forms, often through reliance on the Unified Medical Language System [ 29 , 30 ]. Patients also create valuable content outside of the clinical trial context that ML can process into study data to reduce the burden of data collection for trial participants, such as natural language processing of social media posts to identify serious drug reactions with high fidelity [ 31 ]. Patient data from wearable devices have proven to be able to correlate participant activity with the International Parkinson and Movement Disorders Society Unified Parkinson’s Disease Rating Scale, distinguish between neuropsychiatric symptomatology patterns, and identify patient falls [ 32 – 34 ].
In summary, although ML and NLP have shown promise across a broad range of activities related to improving the management of participants in clinical trials, the implications of these applications of ML/NLP in regard to clinical trial quality and participant experience are unclear. Studies comparing different approaches to participant management are a necessary next step toward identifying best practices.
Data collection and management
The use of ML in clinical trials can change the data collection, management, and analysis techniques required. However, ML methods can help address some of the difficulties associated with missing data and collecting real-world data.
Collection, processing, and management of data from wearable and other smart devices
Patient-generated health data from wearable and other mobile/electronic devices can supplement or even replace study visits and their associated traditional data collection in certain situations. Wearables and other devices may enable the validation and use of new, patient-centered biomarkers. Developing new “digital biomarkers” from the data collected by a mobile device’s various sensors (such as cameras, audio recorders, accelerometers, and photoplethysmograms) often requires ML processing to derive actionable insights because the data yielded from these devices can be sparse as well as variable in quality, availability, and synchronicity. Using the relatively large and complex data yielded by wearables and other devices for research purposes therefore requires specialized data collection, storage, validation, and analysis techniques [ 34 – 37 ]. For instance, a deep neural network was used to process input from a mobile single-lead electrocardiogram platform [ 38 ], a random forest model was used to process audio output from patients with Parkinson’s disease [ 39 ], and a recurrent neural network was used to process accelerometer data from patients with atopic dermatitis [ 40 ]. These novel digital biomarkers may facilitate the efficient conduct and patient-centeredness of clinical trials, but this approach carries potential pitfalls. As has been shown to occur with an electrocardiogram classification model, ML processing of wearable sensor output to derive research endpoints introduces the possibility of corrupt results if the ML model is subverted by intentionally or unintentionally modified sensor data (though this risk exists with any data regardless of processing technique) [ 41 ]. Because of the complexity involved, software intended to diagnose, monitor, or treat medical conditions is regulated by the FDA, and the FDA has processes and guidance related to biomarker validation and qualification for use in regulatory trials.
Beyond the development of novel digital biomarkers, other device-related opportunities in patient centricity include the ability to export data and analytics back to participants to facilitate education and insight. Barriers to implementation of ML processing of device data include better defining how previously validated clinical endpoints and patient-centric digital biomarkers overlap as well as understanding participant opinions about privacy in relation to the sharing and use of device data. FDA approval of novel biomarkers will also be required. Researchers interested in leveraging the power of these devices must explain to patients their risks and benefits both for ethical and privacy-related reasons and because implementation without addressing participant concerns has the potential to worsen participant recruitment and retention [ 42 ].
Study data collection, verification, and surveillance
An appealing application of ML, specifically NLP, to study data management is to automate data collection into case report forms, decreasing the time, expense, and potential for error associated with human data extraction, whether in prospective trials or retrospective reviews. Though this use requires overcoming variable data structures and provenances, it has shown early promise in cancer [ 43 , 44 ], epilepsy [ 30 ], and depression [ 45 ], among other areas [ 29 ]. Regardless of how data have been collected, ML can power risk-based monitoring approaches to clinical trial surveillance, enabling the prevention and/or early detection of site failure, fraud, and data inconsistencies or incompleteness that may delay database lock and subsequent analysis. For instance, even when humans collect data into case report forms (often transmitted in PDF form), the adequacy of the collected data for outcome ascertainment can be assessed by combining optical character recognition with NLP [ 46 ]. Suspicious data patterns in clinical trials, or incorrect data in observational studies, can be identified by applying auto-encoders to distinguish plausible from implausible data [ 47 ].
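The cited auto-encoder approach to distinguishing plausible from implausible data can be illustrated with a simpler linear analogue: fit a low-dimensional model to plausible records, then flag records that the model reconstructs poorly. Here PCA stands in for an auto-encoder, and the "vital sign" data are synthetic:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Plausible records: correlated "vital sign"-style features
# (e.g., systolic BP, diastolic BP, heart rate)
plausible = rng.multivariate_normal(
    mean=[120, 80, 70],
    cov=[[25, 15, 5], [15, 16, 4], [5, 4, 9]],
    size=500,
)

# Keep 2 of 3 components: the model learns the correlation structure
pca = PCA(n_components=2).fit(plausible)

def reconstruction_error(X):
    """Per-record squared error after projecting to and from the PCA subspace."""
    recon = pca.inverse_transform(pca.transform(X))
    return ((X - recon) ** 2).sum(axis=1)

# Flag anything reconstructed worse than 99% of plausible records
threshold = np.percentile(reconstruction_error(plausible), 99)

# A record violating the learned correlations scores far above the threshold
suspicious = np.array([[120.0, 200.0, 70.0]])  # diastolic implausibly high
flag = reconstruction_error(suspicious)[0] > threshold
```

An auto-encoder generalizes this idea to non-linear structure, which matters for real clinical data where plausibility constraints are rarely linear.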
Endpoint identification, adjudication, and detection of safety signals
ML can also be applied to data processing. Semi-automated endpoint identification and adjudication offers the potential to reduce time, cost, and complexity compared with the current approach of manual adjudication of events by a committee of clinicians, because while endpoint adjudication has traditionally been a labor-intensive process, sorting and classifying events lies well within the capabilities of ML. For instance, IQVIA Inc. has described the ability to automatically process some adverse events related to drug therapies using a combination of optical character recognition and NLP, though this technique has not been described in peer-reviewed publications [ 48 ]. A potential barrier to implementation of semi-automated event adjudication is that endpoint definitions and the data required to support them often change from trial to trial, which theoretically requires re-training a classification model for each new trial (which is not a viable approach). More recently, efforts have been made to standardize outcomes in the field of cardiovascular research, though not all trials adhere to these outcomes. Trial data have not been pooled to facilitate model training for cardiovascular endpoints, and most fields have not yet undertaken similar efforts [ 49 ]. Further efforts in this area will require true consensus about event definitions, consistent use of those definitions, and a willingness of stakeholders to share adequate data for model training from across multiple trials.
Approaches to missing data
ML can be used in several different ways to address the problem of missing data, across multiple causes for data missingness, data-related assumptions and goals, and data collection and intended analytic methods. Possible goals may be to impute specific estimates of the missing covariate values directly or to average over many possible values from some learned distribution to compute other quantities of interest. While the latest methods are evolving and more systematic comparisons are needed, some early evidence suggests more complex ML methods may not always be of benefit over simpler imputation methods, such as population mean imputation [ 50 ]. Applications of missing value techniques include analysis of sparse datasets, such as registries, EHR data, ergonomic data, and data from wearable devices [ 51 – 54 ]. Although these techniques can help mitigate the negative effects of data missingness or scarcity, over-reliance on data augmentation methods may lead to the development of models with limited applicability to new, imperfect datasets. Therefore, a more meaningful approach would be to apply ML to improve data collection during the conduct of research itself.
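The contrast between simple and model-based imputation can be sketched on synthetic data. When covariates are strongly correlated, as in the construction below, an iterative imputer recovers missing values far better than a column mean, though, as noted above, this advantage does not always hold on real, messier data:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

rng = np.random.default_rng(0)

# Two strongly correlated covariates; the second is sometimes missing
x1 = rng.normal(size=(200, 1))
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=(200, 1))
X = np.hstack([x1, x2])

X_missing = X.copy()
mask = rng.random(200) < 0.3          # ~30% missing at random
X_missing[mask, 1] = np.nan

# Simple approach: fill with the column mean
mean_filled = SimpleImputer(strategy="mean").fit_transform(X_missing)
# Model-based approach: predict missing values from the other covariate
iter_filled = IterativeImputer(random_state=0).fit_transform(X_missing)

# Compare recovered values against the known ground truth
mean_err = np.mean((mean_filled[mask, 1] - X[mask, 1]) ** 2)
iter_err = np.mean((iter_filled[mask, 1] - X[mask, 1]) ** 2)
```

The construction is deliberately favorable to the iterative method; with weakly correlated covariates or non-ignorable missingness, the gap shrinks or reverses, consistent with the early evidence cited above.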
Data analysis
Data collected in clinical trials, registries, and clinical practices are fertile sources for hypothesis generation, risk modeling, and counterfactual simulation, and ML is well suited for these efforts. For instance, unsupervised learning can identify phenotypic clusters in real-world data that can be further explored in clinical trials [ 55 , 56 ]. Furthermore, ML can potentially improve the ubiquitous practice of secondary trial analyses by more powerfully identifying treatment heterogeneity while still providing some protection (although incomplete) against false-positive discoveries, uncovering more promising avenues for future study [ 57 , 58 ]. Additionally, ML is effectively used to generate risk predictions in retrospective datasets that can subsequently be prospectively validated. For instance, using a random forest model in COMPANION trial data, researchers were able to improve discrimination between patients who would do better or worse following cardiac resynchronization therapy compared with a multivariable logistic regression [ 59 ]. This demonstrates the ability of random forests to model interactions between features that are not captured by simpler models.
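The advantage of random forests over logistic regression in capturing feature interactions, as in the COMPANION analysis, can be reproduced in miniature on synthetic data with an XOR-style interaction that a linear model cannot represent. Data and models here are illustrative only:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

X = rng.normal(size=(2000, 2))
# Outcome driven purely by an XOR-style interaction between the two features:
# neither feature is predictive on its own
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

lr = LogisticRegression().fit(X_tr, y_tr)
rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

lr_auc = roc_auc_score(y_te, lr.predict_proba(X_te)[:, 1])  # near chance
rf_auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])  # near perfect
```

Logistic regression can only recover such interactions if they are explicitly engineered as features, whereas tree ensembles discover them automatically, which is precisely the discrimination advantage reported in the cited analysis.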
While predictive modeling is an important and necessary task, the derivation of real-world evidence from real-world data (i.e., making causal inferences) remains a highly sought-after (and very difficult) goal toward which ML offers some promise. Proposed techniques include optimal discriminant analysis, targeted maximum likelihood estimation, and ML-powered propensity score weighting [ 60 – 64 ]. A particularly intriguing technique involves use of ML to enable counterfactual policy estimation, in which existing data can be used to make predictions about outcomes under circumstances that do not yet, or could not, exist [ 65 ]. For instance, trees of predictors can offer survival estimates for heart failure patients under the conditions of receiving or not receiving a heart transplant and reinforcement learning suggests improved treatment policies on the basis of prior sub-optimal treatments and outcomes [ 66 , 67 ]. Unfortunately, major barriers to implementation are a lack of interoperability between EHR data structures and fraught data sharing agreements that limit the amount of data available for model training [ 68 ].
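The flavor of propensity-score-based causal estimation can be sketched on synthetic data: a confounder drives both treatment assignment and outcome, a logistic propensity model is fit by gradient ascent, and inverse-probability weighting recovers an estimate much closer to the true effect than the naive comparison. This is a minimal, assumption-laden illustration (the data-generating process and all parameters are invented), not any of the cited methods:

```python
import math
import random

random.seed(1)
sigmoid = lambda v: 1 / (1 + math.exp(-v))

# Synthetic confounded data: z drives both treatment and outcome;
# the true treatment effect is 1.0.
n = 2000
z = [random.gauss(0, 1) for _ in range(n)]
t = [1 if random.random() < sigmoid(2 * zi) else 0 for zi in z]
y = [1.0 * ti + 2.0 * zi + random.gauss(0, 0.5) for ti, zi in zip(t, z)]

# A naive treated-vs-untreated comparison is confounded by z.
n_treated = sum(t)
naive = (sum(yi for yi, ti in zip(y, t) if ti) / n_treated
         - sum(yi for yi, ti in zip(y, t) if not ti) / (n - n_treated))

# Fit a logistic propensity model P(t = 1 | z) by gradient ascent.
a, b = 0.0, 0.0
for _ in range(500):
    ga = sum(ti - sigmoid(a + b * zi) for ti, zi in zip(t, z)) / n
    gb = sum((ti - sigmoid(a + b * zi)) * zi for ti, zi in zip(t, z)) / n
    a, b = a + 1.0 * ga, b + 1.0 * gb

# Inverse-probability weighting; propensities are clipped ("trimmed")
# to trade a little bias for much lower variance.
ps = [min(max(sigmoid(a + b * zi), 0.05), 0.95) for zi in z]
ipw = (sum(ti * yi / pi for ti, yi, pi in zip(t, y, ps)) / n
       - sum((1 - ti) * yi / (1 - pi) for ti, yi, pi in zip(t, y, ps)) / n)

print(f"naive estimate: {naive:.2f} (confounded)")
print(f"IPW estimate:   {ipw:.2f} (true effect = 1.0)")
```

The cited ML-powered variants replace the simple logistic propensity model with more flexible learners, but the weighting logic is unchanged; note that valid causal conclusions still rest on untestable assumptions such as no unmeasured confounding.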
In summary, there are many effective ML approaches to clinical trial data management, processing, and analysis but fewer techniques for improving the quality of data as they are generated and collected. As data availability and quality are the foundations of ML approaches, the conduct of high-quality trials remains of utmost importance to enable higher-level ML processing.
Barriers to the integration of ML techniques in clinical research
Both operational and philosophical barriers prevent the full potential of ML for clinical research from being realized. Although, as previously discussed, ML offers promising ways to improve the quality and efficiency of clinical research for patients and other stakeholders, it is also a high-risk proposition: a flawed model can propagate errors or biases through multiple research contexts and into the corpus of biomedical evidence. Both types of barrier require attention at each stage of model development and use, in a way that maximizes stakeholder confidence in the process and its results. If not managed in a robust and transparent manner, operational barriers can aggravate and reinforce philosophical concerns. For instance, inadequate training data and poor model calibration can lead to racial bias in model application, as has been noted in ML for melanoma identification [ 27 ]. Stakeholders, including regulatory agencies, funding sources, researchers, participants, and industry partners, must collaborate to fully integrate ML into clinical research. The wider ML community espouses “FAT ML” principles (fairness, accountability, and transparency), along with responsibility, explainability, accuracy, and auditability, all of which should be applied to ML in clinical research, as discussed below.
Operational barriers to ML in clinical research
The development of ML algorithms and their deployment for clinical research use is a multi-stage, multi-disciplinary process. The first step is to assemble a team with the clinical and ML domain expertise necessary for success. Failing to assemble such a team, or to communicate openly within it, increases the risk of either developing a model that distorts clinical reality or applying an ML technique that is inappropriate to the available data and research question at hand [ 69 ]. For instance, a model to predict mortality created without any clinical team members may identify intubation as predictive of mortality, which is certainly true but likely clinically useless. Collaboration is necessary and valuable for both the data science and clinical science components of the team but may require additional up-front, cross-disciplinary training, transparency, and trust to fully operationalize.
The choice and availability of data for algorithm development and validation is both a stubborn and highly significant barrier to ML integration into clinical research, though its full discussion is outside the scope of this manuscript. Many recent ML models, especially deep neural networks, require large amounts of data to train and validate. To ensure generalizability beyond the training data set, developers should use multiple data sources during this process because a number of documented cases demonstrated that algorithms performed significantly differently in validation data sets compared with training data sets [ 70 ]. Because data used in clinical research are often patient related and generated by institutions (in the case of EHR data) or companies (in the case of clinical trial data) at a significant cost, owners of data may be reluctant to share. Even when they are willing to share data, variation in data collection and storage techniques can hamper interoperability. Large datasets, such as MIMIC, eICU, and the UK Biobank, are good resources when other real-world data cannot be obtained [ 71 – 73 ], but any single data source is inadequate to yield a model that is ready for use, especially because training on retrospective data (such as MIMIC and UK Biobank) does not always translate well to prospective applications. For example, Nestor et al. demonstrated the importance of considering year of care in MIMIC due to temporal drift, and Gong et al. demonstrated methods for feature aggregation across large temporal changes, such as EHR transitions [ 70 , 74 ]. Furthermore, certain disease states and patient types are less likely to be well represented in data generated for the purpose of clinical care. For example, while MIMIC is widely used because of its public availability, models trained on its ICU population are unlikely to generalize to many applications outside critical care. 
These issues with data availability and quality are intimately associated with problems surrounding reproducibility and replicability [ 75 ], which are more difficult to achieve in ML-driven clinical research for a number of reasons in addition to data availability, including the role of randomness in many ML techniques and the computational expense of model replication. The ongoing difficulties with reproducibility and replicability of ML-driven clinical research threaten to undermine stakeholder confidence in ML integration into clinical research.
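One small but concrete mitigation for the randomness problem is rigorous seed management: when every stochastic step of a pipeline draws from an explicitly seeded generator, a reported result can be re-run bit-for-bit. A minimal sketch (the pipeline below is a toy stand-in, not a real training procedure):

```python
import random

def train_and_evaluate(seed):
    """Toy stand-in for a stochastic pipeline: data simulation plus a
    random subsample (as in a random train/test split)."""
    rng = random.Random(seed)  # all randomness flows from one seed
    data = [rng.gauss(50, 10) for _ in range(100)]
    sample = rng.sample(data, 30)
    return sum(sample) / len(sample)

# A fixed seed makes the whole pipeline exactly repeatable ...
run_1 = train_and_evaluate(seed=42)
run_2 = train_and_evaluate(seed=42)
print(run_1 == run_2)  # True

# ... while different seeds give (slightly) different results, which is
# why reported metrics should be summarized across several seeds.
print(train_and_evaluate(seed=43))
```

Seeding does not solve deeper replicability problems (data access, compute cost, hardware nondeterminism in deep learning), but it makes the randomness-driven component of variation auditable.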
Philosophical barriers to ML in clinical research
Explainability refers to the concept that the processes underlying algorithmic output should be explainable to algorithm users in terms they understand. A large body of research has been devoted to techniques for accomplishing this, including attention scores and saliency maps, but concerns about the performance and suitability of these techniques persist [ 76 – 79 ]. Though appealing in principle, explainability is the subject of significant debate: requiring it may interfere unnecessarily with the ability of ML to contribute positively to clinical care and research, and explanations may lead researchers to incorrectly trust fundamentally flawed models. Proponents of this argument instead champion trustworthiness. Its advocates note that many aspects of clinical medicine (and of clinical research) that are not well or widely understood, such as laboratory assays, the complete mechanisms of certain medications, and statistical tests, continue to be used because they have been shown to work reliably and well, even if how or why remains opaque to many end users [ 80 ]. This philosophical barrier has more recently become an operational barrier as well with the passage of the European Union’s General Data Protection Regulation, which requires that automated decision-making algorithms provide “meaningful information about the logic involved.”
Part of the focus on explainability and trustworthiness stems from a desire to understand whether ML algorithms introduce bias into model output, as was notably shown to be the case in a highly publicized series of ProPublica articles about recidivism prediction algorithms [ 81 ]. Bias in clinical research–focused algorithms could be equally damaging, for instance, if an algorithm recommended non-representative study cohorts on the basis of lower predicted participant drop-out.
Guidelines toward overcoming operational and philosophical barriers to ML in clinical research
Because the operational problems previously detailed can potentiate the philosophical tangles of ML use in clinical research, many of the ways to overcome these hurdles overlap. The first and foremost of these issues concerns data provenance, quality, and access. The open-access data sources previously discussed (MIMIC, UK Biobank) are good places to start but inadequate on their own; enhanced access both to data and to the technical expertise required to analyze them is needed. Attempts to render health data interoperable have been ongoing for decades, yielding data standard development initiatives and systems, such as the PCORnet Common Data Model [ 82 ], FHIR [ 83 ], i2b2 [ 84 ], and OMOP [ 85 ]. Recently, regulation requiring health data interoperability through use of core data classes and elements has been enacted by the US Department of Health and Human Services and the Centers for Medicare and Medicaid Services on the basis of the 21st Century Cures Act [ 85 , 86 ]. Where barriers to data sharing persist, other options to increase the amount of data available include federated data and cloud-based data access, in which developers can train and validate models on data that they do not own or directly interact with [ 87 – 89 ]. This approach has become increasingly common in certain fields, such as genomics and informatics, as evidenced by large consortia, such as eMERGE and OHDSI [ 90 , 91 ].
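One common federated pattern is federated averaging: each site trains locally and shares only model parameters, which a coordinating server averages into a global model. The following self-contained sketch illustrates the idea on a simple linear model with synthetic per-site data (all numbers are illustrative; real deployments add secure aggregation, privacy accounting, and handling of non-identically distributed sites):

```python
import random

# Three "sites" hold private patient-level data drawn from the same
# underlying relationship y = 3x + 1; raw records never leave a site.
rng = random.Random(0)
def make_site_data(n):
    return [(x, 3 * x + 1 + rng.gauss(0, 0.1))
            for x in (rng.gauss(0, 1) for _ in range(n))]

sites = [make_site_data(200) for _ in range(3)]

w, b = 0.0, 0.0                      # global model parameters
for _ in range(50):                  # federated averaging rounds
    local_models = []
    for data in sites:
        lw, lb = w, b                # each site starts from the global model
        for _ in range(10):          # a few local gradient steps (lr = 0.1)
            gw = sum(2 * (lw * x + lb - y) * x for x, y in data) / len(data)
            gb = sum(2 * (lw * x + lb - y) for x, y in data) / len(data)
            lw, lb = lw - 0.1 * gw, lb - 0.1 * gb
        local_models.append((lw, lb))
    # Only parameters are shared; the server averages them.
    w = sum(m[0] for m in local_models) / len(local_models)
    b = sum(m[1] for m in local_models) / len(local_models)

print(f"federated model: y = {w:.2f}x + {b:.2f} (true: y = 3x + 1)")
```

The global model converges to the same relationship each site would have found on pooled data, without any site exposing its records.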
Recently, a group of European universities and pharmaceutical companies joined forces to create “MELLODDY,” in which large amounts of drug development data are shared while protecting companies’ proprietary information, though no academic publications have yet been produced [ 91 ]. “Challenges” in which teams compete to accomplish ML tasks, such as early sepsis prediction or more complete characterization of breast cancer cell lines, often yield useful models, which can then be distributed to participating health institutions for validation in their local datasets [ 92 – 95 ].
Algorithm validation can help ensure that ML models are appropriate for their intended clinical research use while also increasing stakeholder confidence in the use of ML in clinical research. Though the specifics continue to be debated, published best practices for specific use cases are emerging [ 96 ]; recent suggestions to standardize such reporting in a one-page “model card” are notable [ 97 ]. Model characteristics that could be reported include the intended use cohort, intended outcome of interest, required input data structure and necessary transformations, model type and structure, training cohort specifics, consequences of model application outside of intended use, and the algorithm’s management of uncertainty. Performance metrics that are useful for algorithm evaluation in clinical contexts include receiver-operating characteristic and precision-recall curves, calibration, net benefit, and the c-statistic for benefit [ 92 ]. Depending on the intended use case, the most appropriate metrics to report or to optimize will differ; for instance, a model intended to identify patients at high risk for protocol non-adherence may have a higher tolerance for false-positives than one intended to simulate study drug dosages for trial planning. Consensus decisions about obligatory metrics for certain model structures and use cases are required to ensure that models with similar intended uses can be compared with one another. Developers will also need to specify how often these metrics should be re-evaluated to assess for model drift. Ideally, evaluation of high-stakes clinical research models should be overseen by a neutral third party, such as a regulatory agency.
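Two of these metrics are simple to compute from first principles, which can help demystify reported numbers. The sketch below (toy data, illustrative only) computes the c-statistic via its rank interpretation and a crude binned calibration check:

```python
def roc_auc(y_true, y_score):
    """AUC via the Mann-Whitney statistic: the probability that a
    randomly chosen positive outranks a randomly chosen negative."""
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def calibration_bins(y_true, y_score, n_bins=4):
    """Mean predicted risk vs. observed event rate per score bin;
    large gaps indicate poor calibration even when AUC is high."""
    bins = [[] for _ in range(n_bins)]
    for y, s in zip(y_true, y_score):
        bins[min(int(s * n_bins), n_bins - 1)].append((y, s))
    return [(sum(s for _, s in b) / len(b), sum(y for y, _ in b) / len(b))
            for b in bins if b]

# Toy held-out labels and predicted risks.
y_true  = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
y_score = [0.1, 0.2, 0.2, 0.3, 0.35, 0.4, 0.6, 0.65, 0.7, 0.8, 0.9, 0.95]

print(f"AUC: {roc_auc(y_true, y_score):.2f}")
for predicted, observed in calibration_bins(y_true, y_score):
    print(f"mean predicted {predicted:.2f} vs observed rate {observed:.2f}")
```

Discrimination and calibration measure different failure modes, which is why both (and use-case-specific metrics such as net benefit) belong in a model card rather than a single headline number.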
To foster trustworthiness even in the absence of explainability, it is essential that the model development and validation processes be transparent , including the reporting of model uncertainty. This may allow more advanced consumers to evaluate the model from a technical standpoint while at the very least helping less-advanced users to identify situations in which a model’s output should be approached with caution. For instance, understanding the source, structure, and drawbacks of the data used for model training and validation will provide insight into how the model’s output might be affected by the quality of the underlying data. Trustworthiness may also be built by running ML models in clinical research contexts in parallel with traditional research methods to show that the ML methods perform at least as well as traditional approaches. Though the importance of these principles may appear self-evident, the large number of ML models being used commercially for clinical research without reporting of the models’ development and performance characteristics suggests more work is needed to align stakeholders in this regard. Even while writing this manuscript, in which peer-reviewed publications were used whenever available, we encountered many cases in which the only “evidence” supporting a model’s performance was a commercial entity’s promotional material. In several other instances, the peer-reviewed articles available to support a commercial model’s performance offered no information at all about the model’s development or validation, which, as discussed earlier, is crucial to engendering trustworthiness. Another concerning aspect of commercial ML-enabled clinical research solutions is private companies’ and health care systems’ practice of training, validating, and applying models using patient data under the guise of quality improvement initiatives, thereby avoiding the need for ethical/institutional review board approval or patient consent [ 93 ].
This practice puts the entire field of ML development at risk of generating biased models and/or losing stakeholder buy-in (as occurred in dramatic fashion with the UK’s “Care.data” initiative) [ 94 ] and illustrates the need to build a more reasonable path toward ethical data sharing and more stringent processes surrounding model development and validation.
Although no FDA guidance is yet available specific to ML in clinical research, guidance on ML in clinical care and commentary from FDA representatives suggest several possible features of a regulatory approach to ML in clinical research. For instance, the FDA’s proposed ML-specific modifications to the “Software as a Medical Device” Regulations (SaMD) draw a distinction between fixed algorithms that were trained using ML techniques but frozen prior to deployment and those that continue to learn “in the wild.” These latter algorithms may more powerfully take advantage of the large amounts of data afforded by ongoing use but also pose additional risks of model drift with the potential need for iterative updates to the algorithm. In particular, model drift should often be expected because models that are incorporated into the decision-making process will inherently change the data they are exposed to in the future. The proposed ML-specific modifications to SaMD guidance outline an institution or organization-level approval pathway that would facilitate these ongoing algorithm updates within pre-approved boundaries (Fig. 3 ).
FDA-proposed workflow to regulate machine learning algorithms under the Software as a Medical Device framework. From: Proposed regulatory framework for modifications to artificial intelligence/machine learning (AI/ML)-based software as a medical device: Discussion paper and request for feedback. https://www.fda.gov/media/122535/download . Accessed 17 May 2020
The optimal frequency of model re-evaluation by the FDA has yet to be determined (and may vary based on the model type, training set, and intended use), but clearly some form of recurrent review will be needed, prompted either by a certain time period, certain events (for instance, a global pandemic), or both. Discussion with representatives from the FDA indicates that ML in clinical research is viewed as a potentially high-risk use case due to the potential to propagate errors or biases through the algorithm into research studies; however, its potential opportunities are also widely appreciated. Until formalized guidance about ML in clinical research is released, the FDA has clearly stated a willingness to work with sponsors and stakeholders on a case-by-case basis to determine the appropriate role of ML in research intended to support a regulatory application. However, this regulatory uncertainty could potentially stifle sponsors’ and stakeholders’ willingness to invest in ML for clinical research until guidance is drafted. This, in turn, may require additional work at a legislative level to provide a framework for further FDA guidance.
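Between formal reviews, drift can be screened for automatically by comparing the current input distribution against the training-time distribution. One common screening statistic is the population stability index (PSI); the sketch below applies it to synthetic data (feature, sample sizes, and alert threshold are all illustrative):

```python
import math
import random

random.seed(0)

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a reference (training-time)
    and a current feature distribution; >0.2 is a common alert level."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins
    def shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        # Additive smoothing avoids log(0) for empty bins.
        return [(c + 0.5) / (len(values) + 0.5 * n_bins) for c in counts]
    p, q = shares(expected), shares(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

baseline = [random.gauss(120, 15) for _ in range(5000)]  # e.g., systolic BP at training time
stable   = [random.gauss(120, 15) for _ in range(5000)]  # same population later
shifted  = [random.gauss(135, 15) for _ in range(5000)]  # population has drifted

print(f"PSI, no drift: {psi(baseline, stable):.3f}")
print(f"PSI, drifted:  {psi(baseline, shifted):.3f}")
```

A low PSI does not prove the model is still valid (the outcome-feature relationship can change without the inputs moving), so distribution checks complement, rather than replace, periodic re-validation on labeled data.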
Concerns of bias are central to clinical research even when ML is not involved: clinical research and care have long histories of gender, racial, and socioeconomic bias [ 95 , 96 ]. The ability of ML to potentiate and perpetuate bias in clinical research, possibly without study teams’ awareness, must be actively managed. To the extent that bias can be identified, it can often be addressed and reduced; a worst-case scenario is application of a model with unknown bias in a new cohort with high-stakes results. As with much of ML in clinical research, data quality and quantity are critical in combating bias. No single perfect dataset exists, especially as models trained on real-world data will replicate the intentional or unintentional biases of the clinicians and researchers who generated those data [ 97 ]. Therefore, training models on more independent and diverse datasets decreases the likelihood of occult bias [ 98 ]. Additionally, bias reduction can be approached through the model construction itself, such as by de-biasing word embeddings and using counterfactual fairness [ 99 – 102 ]. Clinical research teams may pre-specify certain subgroups of interest in which the algorithm must perform equally well [ 103 ]. Finally, while ML raises the specter of reinforcing and more efficiently operationalizing historical discrimination, ML may help us de-bias clinical research and care by monitoring and drawing attention to bias [ 98 ]. Bias reduction is an area of ML in clinical research in which multi-disciplinary collaboration is especially vital and powerful: clinical scientists may be able to share perspective on long-standing biases in their domains of expertise, while more diverse teams may offer innovative insights into de-biasing ML models.
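Pre-specified subgroup checks of the kind cited above can be operationalized very simply: compute each metric within each subgroup and flag gaps that exceed a pre-specified tolerance. A hedged sketch on simulated predictions (group names, error rates, and the 0.05 tolerance are all hypothetical):

```python
import random

rng = random.Random(0)

def simulate(group, n, error_rate):
    """Hypothetical held-out rows: (subgroup, predicted label, true label)."""
    return [(group, (lbl + (rng.random() < error_rate)) % 2, lbl)
            for lbl in (rng.randint(0, 1) for _ in range(n))]

def accuracy(rows):
    return sum(pred == label for _, pred, label in rows) / len(rows)

# In a real study the subgroups (and the tolerance below) would be
# pre-specified in the analysis plan, not chosen after seeing results.
results = simulate("group_a", 500, error_rate=0.10) \
        + simulate("group_b", 500, error_rate=0.25)

overall = accuracy(results)
print(f"overall accuracy: {overall:.2f}")
for group in ("group_a", "group_b"):
    rows = [r for r in results if r[0] == group]
    gap = accuracy(rows) - overall
    flag = " <-- exceeds 0.05 tolerance" if abs(gap) > 0.05 else ""
    print(f"{group}: accuracy {accuracy(rows):.2f} (gap {gap:+.2f}){flag}")
```

An overall metric can look acceptable while masking a large subgroup disparity, which is exactly the failure mode pre-specified subgroup reporting is meant to surface.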
Conclusions

While traditional double-blinded, randomized, controlled clinical trials with their associated statistical methodologies remain the gold standard for biomedical evidence generation, augmentation with ML techniques offers the potential to improve the success and efficiency of clinical research, increasing its positive impact for all stakeholders. To the extent that ML-enabled clinical research can improve the efficiency and quality of biomedical evidence, it may save human lives and reduce human suffering, introducing an ethical imperative to explore this possibility. Realizing this potential will require overcoming issues with data structure and access, definitions of outcomes, transparency of development and validation processes, objectivity of certification, and the possibility of bias. The potential applications of ML to clinical research currently outstrip its actual use, both because few prospective studies are available about the relative effectiveness of ML versus traditional approaches and because change requires time, energy, and cooperation. Stakeholder willingness to integrate ML into clinical research relies in part on robust responses to issues of data provenance, bias, and validation as well as confidence in the regulatory structure surrounding ML in clinical research. The use of ML algorithms whose development has been opaque and without peer-reviewed publication must be addressed. The attendees of the January 2020 conference on ML in clinical research represent a broad swath of stakeholders with differing priorities and clinical research–related challenges, but all in attendance agreed that communication and collaboration are essential to implementation of this promising technology.
Transparent discussion about the potential benefits and drawbacks of ML for clinical research and the sharing of best practices must continue not only in the academic community but in the lay press and government as well to ensure that ML in clinical research is applied in a fair, ethical, and open manner that is acceptable to all.
Acknowledgements
The authors would like to acknowledge the contributions of Peter Hoffmann and Brooke Walker to the editing and preparation of this manuscript.
Abbreviations
EHR: Electronic health record
FDA: US Food and Drug Administration
ML: Machine learning
NLP: Natural language processing
SaMD: Software as a Medical Device
Authors’ contributions
All authors contributed to the conception and design of the work and the analysis and interpretation of the data consisting of reports (peer-reviewed and otherwise) concerning the development, performance, and use of ML in clinical research. EHW drafted the work. All authors substantively revised the work. The author(s) read and approved the final manuscript.
Funding

Funding support for the meeting was provided through registration fees from Amgen Inc., AstraZeneca, Bayer AG, Boehringer-Ingelheim, Cytokinetics, Eli Lilly & Company, Evidation, IQVIA, Janssen, Microsoft, Pfizer, Sanofi, and Verily. No government funds were used for this meeting.
Availability of data and materials
Not applicable
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Competing interests
EHW has nothing to disclose.
TN has nothing to disclose.
TA is an employee of AstraZeneca.
RR has nothing to disclose.
OE is a co-founder of and holds equity in OneThree Biotech and Volastra Therapeutics and is a scientific advisor for and holds equity in Freenome and Owkin.
YL has nothing to disclose.
DF is an employee of Bayer AG, Germany.
JB has nothing to disclose.
MH reports personal fees from Duke Clinical Research Institute, non-financial support from RGI Informatics, LLC, and grants from Oracle Labs.
FK is an employee of AstraZeneca.
PS has nothing to disclose.
SK is an employee of AstraZeneca; has served as an advisor for Kencor Health and OccamzRazor; has received consulting fees from Google Cloud (Alphabet), McKinsey, and LEK Consulting; was an employee of Philips Healthcare; and has a patent (Diagnosis and Classification of Left Ventricular Diastolic Dysfunction Using a Computer) issued to MSIP.
MR reports grants from the American College of Cardiology, American Heart Association, Bayer Pharmaceuticals, Familial Hypercholesterolemia Foundation, Ferring Pharmaceuticals, Myokardia, and Patient Centered Outcomes Research Institute; grants and personal fees from Amgen, AstraZeneca, and Sanofi Aventis; personal fees from Janssen Pharmaceuticals, Elsevier Publishers, Regeneron, Roche-Genetech, Eli Lilly, Novo Nordisk, Pfizer, and Signal Path; and is an employee of Verana Health.
EH is an employee of AstraZeneca.
SK reports personal fees from Holmusk.
UB is an employee of Boehringer-Ingelheim.
ZM has nothing to disclose.
JW reports being an employee of Sanofi US.
LC has nothing to disclose.
EH reports personal fees from Valo Health and is a founder of (with equity in) kelaHealth and Clinetic.
MG has nothing to disclose.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 1. Senior AW, Evans R, Jumper J, Kirkpatrick J, Sifre L, Green T, Qin C, Žídek A, Nelson AWR, Bridgland A, Penedones H, Petersen S, Simonyan K, Crossan S, Kohli P, Jones DT, Silver D, Kavukcuoglu K, Hassabis D. Improved protein structure prediction using potentials from deep learning. Nature. 2020;577(7792):706–710. doi: 10.1038/s41586-019-1923-7.
- 2. Fauqueur J, Thillaisundaram A, Togia T. Constructing large scale biomedical knowledge bases from scratch with rapid annotation of interpretable patterns. Proceedings of the 18th BioNLP Workshop and Shared Task; 2019.
- 3. Jia R, Wong C, Poon H. Document-level N-ary relation extraction with multiscale representation learning. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); 2019; Minneapolis: Association for Computational Linguistics. https://ui.adsabs.harvard.edu/abs/2019arXiv190402347J .
- 4. Dezso Z, Ceccarelli M. Machine learning prediction of oncology drug targets based on protein and network properties. BMC Bioinformatics. 2020;21(1):104. doi: 10.1186/s12859-020-3442-9.
- 5. Bagherian M, Sabeti E, Wang K, Sartor MA, Nikolovska-Coleska Z, Najarian K. Machine learning approaches and databases for prediction of drug-target interaction: a survey paper. Brief Bioinform. 2021;22(1):247–69. doi: 10.1093/bib/bbz157.
- 6. Liu Q, Allamanis M, Brockschmidt M, Gaunt AL. Constrained graph variational autoencoders for molecule design. NeurIPS 2018. 2018;arXiv:1805.09076:7806–15.
- 7. Madhukar NS, Khade PK, Huang L, Gayvert K, Galletti G, Stogniew M, Allen JE, Giannakakou P, Elemento O. A Bayesian machine learning approach for drug target identification using diverse data types. Nat Commun. 2019;10(1):5221. doi: 10.1038/s41467-019-12928-6.
- 8. Langner S, Hase F, Perea JD, Stubhan T, Hauch J, Roch LM, et al. Beyond ternary OPV: high-throughput experimentation and self-driving laboratories optimize multicomponent systems. Adv Mater. 2020;32(14):e1907801. doi: 10.1002/adma.201907801.
- 9. Granda JM, Donina L, Dragone V, Long DL, Cronin L. Controlling an organic synthesis robot with machine learning to search for new reactivity. Nature. 2018;559(7714):377–381. doi: 10.1038/s41586-018-0307-8.
- 10. Koh D. Sumitomo Dainippon Pharma and Exscientia achieve breakthrough in AI drug discovery. Portland, ME: Healthcare IT News; 2020.
- 11. Romero K, Ito K, Rogers JA, Polhamus D, Qiu R, Stephenson D, Mohs R, Lalonde R, Sinha V, Wang Y, Brown D, Isaac M, Vamvakas S, Hemmings R, Pani L, Bain LJ, Corrigan B; Alzheimer's Disease Neuroimaging Initiative; Coalition Against Major Diseases. The future is now: model-based clinical trial design for Alzheimer's disease. Clin Pharmacol Ther. 2015;97(3):210–214. doi: 10.1002/cpt.16.
- 12. Zhao Y, Zeng D, Socinski MA, Kosorok MR. Reinforcement learning strategies for clinical trials in nonsmall cell lung cancer. Biometrics. 2011;67(4):1422–1433. doi: 10.1111/j.1541-0420.2011.01572.x.
- 13. trials.ai. 2019 [cited 2021 February 2]. Available from: trials.ai .
- 14. Wong CH, Siah KW, Lo AW. Estimation of clinical trial success rates and related parameters. Biostatistics. 2019;20(2):273–286. doi: 10.1093/biostatistics/kxx069.
- 15. Schork NJ. Personalized medicine: time for one-person trials. Nature. 2015;520(7549):609–611. doi: 10.1038/520609a.
- 16. Glicksberg BS, Miotto R, Johnson KW, Shameer K, Li L, Chen R, Dudley JT. Automated disease cohort selection using word embeddings from electronic health records. Pac Symp Biocomput. 2018;23:145–156.
- 17. Liao KP, Cai T, Savova GK, Murphy SN, Karlson EW, Ananthakrishnan AN, Gainer VS, Shaw SY, Xia Z, Szolovits P, Churchill S, Kohane I. Development of phenotype algorithms using electronic medical records and incorporating natural language processing. BMJ. 2015;350:h1885. doi: 10.1136/bmj.h1885.
- 18. Li L, Cheng WY, Glicksberg BS, Gottesman O, Tamler R, Chen R, et al. Identification of type 2 diabetes subgroups through topological analysis of patient similarity. Sci Transl Med. 2015;7(311):311ra174. doi: 10.1126/scitranslmed.aaa9364.
- 19. Our Solution. 2021 [cited 2021 February 2]. Available from: https://www.bullfrogai.com/our-solution/ .
- 20. Zhang X, Xiao C, Glass LM, Sun J. DeepEnroll: patient-trial matching with deep embedding and entailment prediction. Proceedings of the Web Conference 2020. Taipei: Association for Computing Machinery; 2020. pp. 1029–1037.
- 21. Calaprice-Whitty D, Galil K, Salloum W, Zariv A, Jimenez B. Improving clinical trial participant prescreening with artificial intelligence (AI): a comparison of the results of AI-assisted vs standard methods in 3 oncology trials. Ther Innov Regul Sci. 2020;54(1):69–74. doi: 10.1007/s43441-019-00030-4.
- 22. How it works. 2019 [cited 2021 February 2]. Available from: https://deep6.ai/how-it-works/ .
- 23. Vassy JL, Ho YL, Honerlaw J, Cho K, Gaziano JM, Wilson PWF, Gagnon DR. Yield and bias in defining a cohort study baseline from electronic health record data. J Biomed Inform. 2018;78:54–59. doi: 10.1016/j.jbi.2017.12.017.
- 24. Weber GM, Adams WG, Bernstam EV, Bickel JP, Fox KP, Marsolo K, Raghavan VA, Turchin A, Zhou X, Murphy SN, Mandl KD. Biases introduced by filtering electronic health records for patients with “complete data”. J Am Med Inform Assoc. 2017;24(6):1134–1141. doi: 10.1093/jamia/ocx071.
- 25. Bain EE, Shafner L, Walling DP, Othman AA, Chuang-Stein C, Hinkle J, Hanina A. Use of a novel artificial intelligence platform on mobile devices to assess dosing compliance in a phase 2 clinical trial in subjects with schizophrenia. JMIR Mhealth Uhealth. 2017;5(2):e18. doi: 10.2196/mhealth.7030.
- 26. Labovitz DL, Shafner L, Reyes Gil M, Virmani D, Hanina A. Using artificial intelligence to reduce the risk of nonadherence in patients on anticoagulation therapy. Stroke. 2017;48(5):1416–1419. doi: 10.1161/STROKEAHA.116.016281.
- 27. Adamson AS, Smith A. Machine learning and health care disparities in dermatology. JAMA Dermatol. 2018;154(11):1247–1248. doi: 10.1001/jamadermatol.2018.2348.
- 28. Burlingame EA, Margolin AA, Gray JW, Chang YH. SHIFT: speedy histopathological-to-immunofluorescent translation of whole slide images using conditional generative adversarial networks. Proc SPIE Int Soc Opt Eng. 2018;10581. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6166432/ .
- 29. Han J, Chen K, Fang L, Zhang S, Wang F, Ma H, Zhao L, Liu S. Improving the efficacy of the data entry process for clinical research with a natural language processing-driven medical information extraction system: quantitative field research. JMIR Med Inform. 2019;7(3):e13331. doi: 10.2196/13331.
- 30. Fonferko-Shadrach B, Lacey AS, Roberts A, Akbari A, Thompson S, Ford DV, Lyons RA, Rees MI, Pickrell WO. Using natural language processing to extract structured epilepsy data from unstructured clinic letters: development and validation of the ExECT (extraction of epilepsy clinical text) system. BMJ Open. 2019;9(4):e023232. doi: 10.1136/bmjopen-2018-023232.
- 31. Gavrielov-Yusim N, Kurzinger ML, Nishikawa C, Pan C, Pouget J, Epstein LB, et al. Comparison of text processing methods in social media-based signal detection. Pharmacoepidemiol Drug Saf. 2019;28(10):1309–1317. doi: 10.1002/pds.4857.
- 32. Barnett I, Torous J, Staples P, Sandoval L, Keshavan M, Onnela JP. Relapse prediction in schizophrenia through digital phenotyping: a pilot study. Neuropsychopharmacology. 2018;43(8):1660–1666. doi: 10.1038/s41386-018-0030-z.
- 33. Chaudhuri S, Oudejans D, Thompson HJ, Demiris G. Real-world accuracy and use of a wearable fall detection device by older adults. J Am Geriatr Soc. 2015;63(11):2415–2416. doi: 10.1111/jgs.13804.
- 34. Chen R, Jankovic F, Marinsek N, Foschini L, Kourtis L, Signorini A, et al. Developing measures of cognitive impairment in the real world from consumer-grade multimodal sensor streams. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Anchorage: Association for Computing Machinery; 2019. pp. 2145–2155.
- 35. Yurtman A, Barshan B, Fidan B. Activity recognition invariant to wearable sensor unit orientation using differential rotational transformations represented by quaternions. Sensors (Basel). 2018;18(8):2725. https://pubmed.ncbi.nlm.nih.gov/30126235/ .
- 36. Lu K, Yang L, Seoane F, Abtahi F, Forsman M, Lindecrantz K. Fusion of heart rate, respiration and motion measurements from a wearable sensor system to enhance energy expenditure estimation. Sensors (Basel). 2018;18(9):3092. https://pubmed.ncbi.nlm.nih.gov/30223429/ .
- 37. Cheung YK, Hsueh PS, Ensari I, Willey JZ, Diaz KM. Quantile coarsening analysis of high-volume wearable activity data in a longitudinal observational study. Sensors (Basel). 2018;18(9):3056. https://pubmed.ncbi.nlm.nih.gov/30213093/ .
- 38. Hannun AY, Rajpurkar P, Haghpanahi M, Tison GH, Bourn C, Turakhia MP, Ng AY. Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nat Med. 2019;25(1):65–69. doi: 10.1038/s41591-018-0268-3. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 39. Ozkanca Y, Ozturk MG, Ekmekci MN, Atkins DC, Demiroglu C, Ghomi RH. Depression screening from voice samples of patients affected by Parkinson's disease. Digit Biomark. 2019;3(2):72–82. doi: 10.1159/000500354. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 40. Moreau A, Anderer P, Ross M, Cerny A, Almazan TH, Peterson B, Moreau A, Anderer P, Ross M, Cerny A, Almazan TH, Peterson B. Detection of nocturnal scratching movements in patients with atopic dermatitis using accelerometers and recurrent neural networks. IEEE J Biomed Health Inform. 2018;22(4):1011–1018. doi: 10.1109/JBHI.2017.2710798. [ DOI ] [ PubMed ] [ Google Scholar ]
- 41. Han X, Hu Y, Foschini L, Chinitz L, Jankelson L, Ranganath R. Deep learning models for electrocardiograms are susceptible to adversarial attack. Nat Med. 2020;26(3):360–363. doi: 10.1038/s41591-020-0791-x. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 42. Doerr M, Maguire Truong A, Bot BM, Wilbanks J, Suver C, Mangravite LM. Formative evaluation of participant experience with mobile econsent in the app-mediated Parkinson mPower study: a mixed methods study. JMIR Mhealth Uhealth. 2017;5(2):e14. doi: 10.2196/mhealth.6521. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 43. Savova GK, Danciu I, Alamudun F, Miller T, Lin C, Bitterman DS, Tourassi G, Warner JL. Use of natural language processing to extract clinical cancer phenotypes from electronic medical records. Cancer Res. 2019;79(21):5463–5470. doi: 10.1158/0008-5472.CAN-19-0579. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 44. Malke JC, Jin S, Camp SP, Lari B, Kell T, Simon JM, Prieto VG, Gershenwald JE, Haydu LE. Enhancing case capture, quality, and completeness of primary melanoma pathology records via natural language processing. JCO Clin Cancer Inform. 2019;3:1–11. doi: 10.1200/CCI.19.00006. [ DOI ] [ PubMed ] [ Google Scholar ]
- 45. Vaci N, Liu Q, Kormilitzin A, De Crescenzo F, Kurtulmus A, Harvey J, et al. Natural language processing for structuring clinical text data on depression using UK-CRIS. Evid Based Ment Health. 2020;23(1):21–26. doi: 10.1136/ebmental-2019-300134. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 46. Tian Q, Liu M, Min L, An J, Lu X, Duan H. An automated data verification approach for improving data quality in a clinical registry. Comput Methods Programs Biomed. 2019;181:104840. doi: 10.1016/j.cmpb.2019.01.012. [ DOI ] [ PubMed ] [ Google Scholar ]
- 47. Estiri H, Murphy SN. Semi-supervised encoding for outlier detection in clinical observation data. Comput Methods Programs Biomed. 2019;181:104830. doi: 10.1016/j.cmpb.2019.01.002. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 48. Glass LMSG, Patil R. AI in clinical development: improving safety and accelerating results. [White paper] 2019. [ Google Scholar ]
- 49. Hicks KA, Mahaffey KW, Mehran R, Nissen SE, Wiviott SD, Dunn B, Solomon SD, Marler JR, Teerlink JR, Farb A, Morrow DA, Targum SL, Sila CA, Hai MTT, Jaff MR, Joffe HV, Cutlip DE, Desai AS, Lewis EF, Gibson CM, Landray MJ, Lincoff AM, White CJ, Brooks SS, Rosenfield K, Domanski MJ, Lansky AJ, McMurray J, Tcheng JE, Steinhubl SR, Burton P, Mauri L, O'Connor CM, Pfeffer MA, Hung HMJ, Stockbridge NL, Chaitman BR, Temple RJ, Standardized Data Collection for Cardiovascular Trials Initiative (SCTI) 2017 Cardiovascular and stroke endpoint definitions for clinical trials. Circulation. 2018;137(9):961–972. doi: 10.1161/CIRCULATIONAHA.117.033502. [ DOI ] [ PubMed ] [ Google Scholar ]
- 50. Liu Y, Gopalakrishnan V. An overview and evaluation of recent machine learning imputation methods using cardiac imaging data. Data (Basel). 2017;2(1):8. https://pubmed.ncbi.nlm.nih.gov/28243594/ . [ DOI ] [ PMC free article ] [ PubMed ]
- 51. Phung S, Kumar A, Kim J. A deep learning technique for imputing missing healthcare data. Conf Proc IEEE Eng Med Biol Soc. 2019;2019:6513–6516. doi: 10.1109/EMBC.2019.8856760. [ DOI ] [ PubMed ] [ Google Scholar ]
- 52. Qiu YL, Zheng H, Gevaert OJ. A deep learning framework for imputing missing values in genomic data. 2018. [ Google Scholar ]
- 53. Feng T, Narayanan S. 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) 2019. Imputing missing data in large-scale multivariate biomedical wearable recordings using bidirectional recurrent neural networks with temporal activation regularization. [ DOI ] [ PubMed ] [ Google Scholar ]
- 54. Luo Y, Szolovits P, Dighe AS, Baron JM. 3D-MICE: integration of cross-sectional and longitudinal imputation for multi-analyte longitudinal clinical data. J Am Med Inform Assoc. 2018;25(6):645–653. doi: 10.1093/jamia/ocx133. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 55. Ngufor C, Warner MA, Murphree DH, Liu H, Carter R, Storlie CB, et al. Identification of Clinically meaningful plasma transfusion subgroups using unsupervised random forest clustering. AMIA Annu Symp Proc. 2017;2017:1332–1341. [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 56. Tomic A, Tomic I, Rosenberg-Hasson Y, Dekker CL, Maecker HT, Davis MM. SIMON, an automated machine learning system, reveals immune signatures of influenza vaccine responses. J Immunol. 2019;203(3):749–759. doi: 10.4049/jimmunol.1900033. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 57. Watson JA, Holmes CC. Machine learning analysis plans for randomised controlled trials: detecting treatment effect heterogeneity with strict control of type I error. Trials. 2020;21(1):156. doi: 10.1186/s13063-020-4076-y. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 58. Rigdon J, Baiocchi M, Basu S. Preventing false discovery of heterogeneous treatment effect subgroups in randomized trials. Trials. 2018;19(1):382. doi: 10.1186/s13063-018-2774-5. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 59. Kalscheur MM, Kipp RT, Tattersall MC, Mei C, Buhr KA, DeMets DL, Field ME, Eckhardt LL, Page CD. Machine learning algorithm predicts cardiac resynchronization therapy outcomes: lessons from the companion trial. Circ Arrhythm Electrophysiol. 2018;11(1):e005499. doi: 10.1161/CIRCEP.117.005499. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 60. Linden A, Yarnold PR. Combining machine learning and propensity score weighting to estimate causal effects in multivalued treatments. J Eval Clin Pract. 2016;22(6):871–881. doi: 10.1111/jep.12610. [ DOI ] [ PubMed ] [ Google Scholar ]
- 61. Schuler MS, Rose S. Targeted maximum likelihood estimation for causal inference in observational studies. Am J Epidemiol. 2017;185(1):65–73. doi: 10.1093/aje/kww165. [ DOI ] [ PubMed ] [ Google Scholar ]
- 62. Wendling T, Jung K, Callahan A, Schuler A, Shah NH, Gallego B. Comparing methods for estimation of heterogeneous treatment effects using observational data from health care databases. Stat Med. 2018;37(23):3309–3324. doi: 10.1002/sim.7820. [ DOI ] [ PubMed ] [ Google Scholar ]
- 63. Schomaker M, Luque-Fernandez MA, Leroy V, Davies MA. Using longitudinal targeted maximum likelihood estimation in complex settings with dynamic interventions. Stat Med. 2019;38(24):4888–4911. doi: 10.1002/sim.8340. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 64. Pirracchio R, Petersen ML, van der Laan M. Improving propensity score estimators’ robustness to model misspecification using super learner. Am J Epidemiol. 2015;181(2):108–119. doi: 10.1093/aje/kwu253. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 65. Gottesman O, Johansson F, Komorowski M, Faisal A, Sontag D, Doshi-Velez F, Celi LA. Guidelines for reinforcement learning in healthcare. Nat Med. 2019;25(1):16–18. doi: 10.1038/s41591-018-0310-5. [ DOI ] [ PubMed ] [ Google Scholar ]
- 66. Yoon J, Zame WR, Banerjee A, Cadeiras M, Alaa AM, van der Schaar M. Personalized survival predictions via trees of predictors: an application to cardiac transplantation. PLoS One. 2018;13(3):e0194985. doi: 10.1371/journal.pone.0194985. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 67. Komorowski M, Celi LA, Badawi O, Gordon AC, Faisal AA. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nat Med. 2018;24(11):1716–1720. doi: 10.1038/s41591-018-0213-5. [ DOI ] [ PubMed ] [ Google Scholar ]
- 68. Ghassemi M, Naumann T, Schulam P, Beam AL, Chen IY, Ranganath R. Practical guidance on artificial intelligence for health-care data. Lancet Digit Health. 2019;1(4):e157–e159. doi: 10.1016/S2589-7500(19)30084-6. [ DOI ] [ PubMed ] [ Google Scholar ]
- 69. Wiens J, Saria S, Sendak M, Ghassemi M, Liu VX, Doshi-Velez F, Jung K, Heller K, Kale D, Saeed M, Ossorio PN, Thadaney-Israni S, Goldenberg A. Do no harm: a roadmap for responsible machine learning for health care. Nat Med. 2019;25(9):1337–1340. doi: 10.1038/s41591-019-0548-6. [ DOI ] [ PubMed ] [ Google Scholar ]
- 70. Nestor B, McDermott M, Chauhan G, et al. Rethinking clinical prediction: why machine learning must consider year of care and feature aggregation. arXiv preprint 2018;arXiv:181112583.
- 71. Johnson AE, Pollard TJ, Shen L, Lehman LW, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3(1):160035. doi: 10.1038/sdata.2016.35. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 72. Pollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG, Badawi O. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Sci Data. 2018;5(1):180178. doi: 10.1038/sdata.2018.178. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 73. UK Biobank. www.ukbiobank.ac.uk . Accessed 22 Mar 2021.
- 74. Gong JJ, Naumann T, Szolovits P, Guttag JV. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Halifax: Association for Computing Machinery; 2017. Predicting clinical outcomes across changing electronic health record systems; pp. 1497–1505. [ Google Scholar ]
- 75. Beam AL, Manrai AK, Ghassemi M. Challenges to the reproducibility of machine learning models in health care. JAMA. 2020;323(4):305–6. 10.1001/jama.2019.20866. [ DOI ] [ PMC free article ] [ PubMed ]
- 76. Adebayo J, Gilmer J, Muelly M, Goodfellow I, Hardt M, Kim B. Proceedings of the 32nd International Conference on Neural Information Processing Systems. Montréal: Curran Associates Inc.; 2018. Sanity checks for saliency maps; pp. 9525–9536. [ Google Scholar ]
- 77. Wiegreffe S, Pinter Y. Attention is not not explanation. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) Hong Kong, China: Association for Computational Linguistics; 2019. [ Google Scholar ]
- 78. Jain S, Wallace BC. Attention is not explanation. NAACL-HLT; 2019. [ Google Scholar ]
- 79. Serrano S, Smith NA. Is attention interpretable? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2931–2951, Stroudsburg, PA, USA, 2019. Association for Computational Linguistics.
- 80. Sendak M, Elish MC, Gao M, Futoma J, Ratliff W, Nichols M, et al. Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. Barcelona: Association for Computing Machinery; 2020. “The human body is a black box”: supporting clinical decision-making with deep learning; pp. 99–109. [ Google Scholar ]
- 81. Angwin JLJ, Mattu S, Kirchner L. Machine bias. ProPublica. 2016. [ Google Scholar ]
- 82. Qualls LG, Phillips TA, Hammill BG, Topping J, Louzao DM, Brown JS, et al. Evaluating foundational data quality in the National Patient-Centered Clinical Research Network (PCORnet(R)) EGEMS (Wash DC) 2018;6(1):3. doi: 10.5334/egems.199. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 83. Bosca D, Moner D, Maldonado JA, Robles M. Combining archetypes with fast health interoperability resources in future-proof health information systems. Stud Health Technol Inform. 2015;210:180–184. [ PubMed ] [ Google Scholar ]
- 84. Klann JG, Abend A, Raghavan VA, Mandl KD, Murphy SN. Data interchange using i2b2. J Am Med Inform Assoc. 2016;23(5):909–915. doi: 10.1093/jamia/ocv188. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 85. Overhage JM, Ryan PB, Reich CG, Hartzema AG, Stang PE. Validation of a common data model for active safety surveillance research. J Am Med Inform Assoc. 2012;19(1):54–60. doi: 10.1136/amiajnl-2011-000376. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 86. 21st Century Cures Act: Interoperability, information blocking, and the ONC Health IT Certification Program [updated 1 May 2020]. Available from: https://www.federalregister.gov/documents/2020/05/01/2020-07419/21st-century-cures-act-interoperability-information-blocking-and-the-onc-health-it-certification . Accessed 16 May 2020.
- 87. Oh M, Park S, Kim S, Chae H. Machine learning-based analysis of multi-omics data on the cloud for investigating gene regulations. Brief Bioinform. 2020. Epub 2020/04/01. 10.1093/bib/bbaa032. [ DOI ] [ PubMed ]
- 88. Czeizler E, Wiessler W, Koester T, Hakala M, Basiri S, Jordan P, Kuusela E. Using federated data sources and Varian Learning Portal framework to train a neural network model for automatic organ segmentation. Phys Med. 2020;72:39–45. doi: 10.1016/j.ejmp.2020.03.011. [ DOI ] [ PubMed ] [ Google Scholar ]
- 89. Zerka F, Barakat S, Walsh S, Bogowicz M, Leijenaar RTH, Jochems A, Miraglio B, Townend D, Lambin P. Systematic review of privacy-preserving distributed machine learning from federated databases in health care. JCO Clin Cancer Inform. 2020;4:184–200. doi: 10.1200/CCI.19.00047. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 90. McCarty CA, Chisholm RL, Chute CG, Kullo IJ, Jarvik GP, Larson EB, et al. The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Med Genomics. 2011;4(1):13. doi: 10.1186/1755-8794-4-13. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 91. Boyce RD, Ryan PB, Noren GN, Schuemie MJ, Reich C, Duke J, et al. Bridging islands of information to establish an integrated knowledge base of drugs and health outcomes of interest. Drug Saf. 2014;37(8):557–567. doi: 10.1007/s40264-014-0189-0. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 92. van Klaveren D, Steyerberg EW, Serruys PW, Kent DM. The proposed ‘concordance-statistic for benefit’ provided a useful metric when modeling heterogeneous treatment effects. J Clin Epidemiol. 2018;94:59–68. doi: 10.1016/j.jclinepi.2017.10.021. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 93. Robbins RBE. An invisible hand: patients aren’t being told about the AI systems advising their care. STAT. 2020. [ Google Scholar ]
- 94. Sterckx S, Rakic V, Cockbain J, Borry P. “You hoped we would sleep walk into accepting the collection of our data”: controversies surrounding the UK care.data scheme and their wider relevance for biomedical research. Med Health Care Philos. 2016;19(2):177–190. doi: 10.1007/s11019-015-9661-6. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- 95. Committee on Understanding and Eliminating Racial and Ethnic Disparities in Health Care . Confronting racial and ethnic disparities in health care. Washington (DC): National Academies Press; 2003. [ PubMed ] [ Google Scholar ]
- 96. Criado PC. Invisible women. New York: Harry N. Abrams; 2019. [ Google Scholar ]
- 97. Zhang H, Lu AX, Abdalla M, McDermott M, Ghassemi M. Proceedings of the ACM Conference on Health, Inference, and Learning. Toronto: Association for Computing Machinery; 2020. Hurtful words: quantifying biases in clinical contextual word embeddings; pp. 110–120. [ Google Scholar ]
- 98. Chen IY, Joshi S, Ghassemi M. Treating health disparities with artificial intelligence. Nat Med. 2020;26(1):16–17. doi: 10.1038/s41591-019-0649-2. [ DOI ] [ PubMed ] [ Google Scholar ]
- 99. Bolukbasi T, Chang K-W, Zou J, Saligrama V, Kalai A. Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona: Curran Associates Inc.; 2016. Man is to computer programmer as woman is to homemaker? debiasing word embeddings; pp. 4356–4364. [ Google Scholar ]
- 100. Kusner, Matt, Loftus, Joshua, Russell, Chris and Silva, Ricardo. Counterfactual fairness Conference. Proceedings of the 31st International Conference on Neural Information Processing Systems Conference. Long Beach, California, USA Publisher: Curran Associates Inc; 2017:4069–4079.
- 101. Hardt M, Price E, Srebro N. Proceedings of the 30th International Conference on Neural Information Processing Systems. Barcelona: Curran Associates Inc.; 2016. Equality of opportunity in supervised learning; pp. 3323–3331. [ Google Scholar ]
- 102. Ustun B, Liu Y, Parkes D. Fairness without harm: decoupled classifiers with preference guarantees. In: Kamalika C, Ruslan S, editors. Proceedings of the 36th International Conference on Machine Learning; Proceedings of Machine Learning Research: PMLR %J Proceedings of Machine Learning Research. 2019. pp. 6373–6382. [ Google Scholar ]
- 103. Noseworthy PA, Attia ZI, Brewer LC, Hayes SN, Yao X, Kapa S, Friedman PA, Lopez-Jimenez F. Assessing and Mitigating bias in medical artificial intelligence: the effects of race and ethnicity on a deep learning model for ECG analysis. Circ Arrhythm Electrophysiol. 2020;13(3):e007988. doi: 10.1161/CIRCEP.119.007988. [ DOI ] [ PMC free article ] [ PubMed ] [ Google Scholar ]
- Review Article
- Published: 04 July 2022
Shifting machine learning for healthcare from development to deployment and from models to data
- Angela Zhang ORCID: orcid.org/0000-0003-0906-6770 1 , 2 , 3 , 4 ,
- Lei Xing ORCID: orcid.org/0000-0003-2536-5359 5 ,
- James Zou ORCID: orcid.org/0000-0001-8880-4764 4 , 6 &
- Joseph C. Wu ORCID: orcid.org/0000-0002-6068-8041 1 , 3 , 7 , 8
Nature Biomedical Engineering, volume 6, pages 1330–1345 (2022)
- Biomedical engineering
- Computational science
- Machine learning
- Medical imaging
- Translational research
In the past decade, the application of machine learning (ML) to healthcare has helped drive the automation of physician tasks as well as enhancements in clinical capabilities and access to care. This progress has emphasized that, from model development to model deployment, data play central roles. In this Review, we provide a data-centric view of the innovations and challenges that are defining ML for healthcare. We discuss deep generative models and federated learning as strategies to augment datasets for improved model performance, as well as the use of the more recent transformer models for handling larger datasets and enhancing the modelling of clinical text. We also discuss data-focused problems in the deployment of ML, emphasizing the need to efficiently deliver data to ML models for timely clinical predictions and to account for natural data shifts that can deteriorate model performance.
In the past decade, machine learning (ML) for healthcare has been marked by particularly rapid progress. Initial groundwork has been laid for many healthcare needs that promise to improve patient care, reduce healthcare workload, streamline healthcare processes and empower the individual 1 . In particular, ML for healthcare has been successful in the translation of computer vision through the development of image-based triage 2 and second readers 3 . There has also been rapid progress in the harnessing of electronic health records 4 , 5 (EHRs) to predict the risk and progression of many diseases 6 , 7 . A number of software platforms for ML are beginning to make their way into the clinic 8 . In 2018, IDx-DR, which detects diabetic retinopathy, was the first ML system for healthcare that the United States Food and Drug Administration approved for clinical use 8 . Babylon 9 , a chatbot triage system, has partnered with the United Kingdom's National Health Service. Furthermore, Viz.ai 10 , 11 has rolled out its triage technology to more than 100 hospitals in the United States.
As ML systems begin to be deployed in clinical settings, the defining challenge of ML in healthcare has shifted from model development to model deployment. In bridging the gap between the two, another trend has emerged: the importance of data. We posit that large, well-designed, well-labelled, diverse and multi-institutional datasets drive performance in real-world settings far more than model optimization 12 , 13 , 14 , and that these datasets are critical for mitigating racial and socioeconomic biases 15 . We realize that such rich datasets are difficult to obtain, owing to clinical limitations of data availability, patient privacy and the heterogeneity of institutional data frameworks. Similarly, as ML healthcare systems are deployed, the greatest challenges in implementation arise from problems with the data: how to efficiently deliver data to the model to facilitate workflow integration and make timely clinical predictions? Furthermore, once implemented, how can model robustness be maintained in the face of the inevitability of natural changes in physician and patient behaviours? In fact, the shift from model development to deployment is also marked by a shift in focus: from models to data.
In this Review, we build on previous surveys 1 , 16 , 17 and take a data-centric approach to reviewing recent innovations in ML for healthcare. We first discuss deep generative models and federated learning as strategies for creating larger and enhanced datasets. We also examine the more recent transformer models for handling larger datasets. We end by highlighting the challenges of deployment, in particular, how to process and deliver usable raw data to models, and how data shifts can affect the performance of deployed models.
Deep generative models
Generative adversarial networks (GANs) are among the most exciting innovations in deep learning in the past decade. They offer the capability to create large amounts of synthetic yet realistic data. In healthcare, GANs have been used to augment datasets 18 , alleviate the problems of privacy-restricted 19 and unbalanced datasets 20 , and perform image-modality-to-image-modality translation 21 and image reconstruction 22 (Fig. 1 ). GANs aim to model and sample from the implicit density function of the input data 23 . They consist of two networks that are trained in an adversarial process under which one network, the ‘generator’, generates synthetic data while the other network, the ‘discriminator’, discriminates between real and synthetic data. The generative model aims to implicitly learn the data distribution from a set of samples to further generate new samples drawn from the learned distribution, while the discriminator pushes the generator network to sample from a distribution that more closely mirrors the true data distribution.
a , GANs can be used to augment datasets to increase model performance and anonymize patient data. For example, they have been used to generate synthetic images of benign and malignant lesions from real images 183 . b , GANs for translating images acquired with one imaging modality into another modality 51 . Left to right: input CT image, generated MR image and reference MR image. c , GANs for the denoising and reconstruction of medical images 184 . Left, low-dose CT image of a patient with mitral valve prolapse, serving as the input into the GAN. Right, corresponding routine-dose CT image and the target of the GAN. Middle, GAN-generated denoised image resembling that obtained from routine-dose CT imaging. The yellow arrows indicate a region that is distinct between the input image (left) and the target denoised image (right). d , GANs for image classification, segmentation and detection 39 . Left, input image of T2 MRI slice from the multimodal brain-tumour image-segmentation benchmark dataset. Middle, ground-truth segmentation of the brain tumour. Right, GAN-generated segmentation image. Yellow, segmented tumour; blue, tumour core; and red, Gd-enhanced tumour core. e , GANs can model a spectrum of clinical scenarios and predict disease progression 66 . Top: given an input MR image (denoted by the arrow), DaniGAN can generate images that reflect neurodegeneration over time. Bottom, difference between the generated image and the input image. ProGAN, progressive growing of generative adversarial network; DaniNet, degenerative adversarial neuroimage net. Credit: Images (‘Examples’) reproduced with permission from: a , ref. 183 , Springer Nature Ltd; b , ref. 51 , under a Creative Commons licence CC BY 4.0 ; c , ref. 184 , Wiley; d , ref. 39 , Springer Nature Ltd; e , ref. 66 , Springer Nature Ltd.
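The adversarial objective described above can be made concrete with a few lines of code. The following is a minimal NumPy sketch (function names are ours, for illustration only): the discriminator is penalized for misclassifying real versus synthetic samples, while the non-saturating generator loss rewards the generator when the discriminator is fooled.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # Binary cross-entropy: reward the discriminator for scoring real
    # samples near 1 and synthetic (generated) samples near 0.
    eps = 1e-12  # avoid log(0)
    return -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))

def generator_loss(d_fake):
    # Non-saturating generator loss: the generator is rewarded when the
    # discriminator assigns its synthetic samples scores near 1 (is fooled).
    eps = 1e-12
    return -np.mean(np.log(d_fake + eps))
```

In practice, gradient steps on these two losses alternate, which is the adversarial training process that ideally converges towards the Nash equilibrium discussed above.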
Over the years, a multitude of GANs have been developed to overcome the limitations of the original GAN (Table 1 ), and to optimize its performance and extend its functionalities. The original GAN 23 suffered from unstable training and low image diversity and quality 24 . In fact, training two adversarial models is, in practice, a delicate and often difficult task. The goal of training is to achieve a Nash equilibrium between the generator and the discriminator networks. However, simultaneously obtaining such an equilibrium for networks that are inherently adversarial is difficult and, if achieved, the equilibrium can be unstable (that is, it can be suddenly lost after model convergence). This has also led to sensitivity to hyperparameters (making the tuning of hyperparameters a precarious endeavour) and to mode collapse, which occurs when the generator produces a limited and repeated number of outputs. To remedy these limitations, changes have been made to GAN architectures and loss functions. In particular, the deep convolutional GAN (DCGAN 25 ), a popular GAN often used for medical-imaging tasks, aimed to combat instability by introducing key architecture-design decisions, including the replacement of fully connected layers with convolutional layers, and the introduction of batch normalization (to standardize the inputs to a layer when training deep neural networks) and ReLU (rectified linear unit) activation. The Laplacian pyramid of adversarial networks GAN (LAPGAN 26 ) and the progressively growing GAN (ProGAN 27 ) build on DCGAN to improve training stability and image quality. Both LAPGAN and ProGAN start with a small image, which promotes training stability, and progressively grow the image into a higher-resolution image.
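The two DCGAN design choices mentioned above, batch normalization and ReLU activation, are simple to state. A minimal NumPy sketch (illustrative, not DCGAN's actual implementation):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Standardize each feature over the batch dimension (axis 0);
    # introduced to stabilize the training of deep networks.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

def relu(x):
    # Rectified linear unit: pass positives through, zero out negatives.
    return np.maximum(x, 0.0)
```

Both operations are applied between layers; normalizing layer inputs keeps activations in a well-conditioned range, which is part of what made DCGAN-style training more stable.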
The conditional GAN (cGAN 28 ) and the auxiliary classifier GAN (AC-GAN 29 ) belong to a subtype of GANs that enable the model to be conditioned on external information to create synthetic data of a specific class or condition. This was found to improve the quality of the generated samples and increase the capability to handle the generation of multimodal data. The pix2pix GAN 30 , which is conditioned on images, allows for image-to-image translation (also across imaging modalities) and has been popular in healthcare applications.
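Class conditioning in the style of cGAN and AC-GAN is commonly implemented by appending a label encoding to the generator's latent input; a hypothetical minimal sketch (one-hot concatenation is one common choice, not the only one):

```python
import numpy as np

def conditioned_latent(z, class_idx, n_classes):
    # Append a one-hot class label to the latent noise vector so the
    # generator can be steered to synthesize a sample of a chosen class.
    onehot = np.zeros(n_classes)
    onehot[class_idx] = 1.0
    return np.concatenate([z, onehot])
```

The discriminator can be conditioned in the same way, so that both networks learn class-specific distributions.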
A recent major architectural change to GANs involves attention mechanisms. Attention was first introduced to facilitate language translation and has rapidly become a staple in deep-learning models, as it can efficiently capture longer-range global and spatial relations from input data. The incorporation of attention into GANs has led to the development of self-attention GANs (SAGANs) 31 , 32 and BigGAN 33 ; the latter scales up SAGAN to achieve state-of-the-art performance.
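The scaled dot-product self-attention that SAGAN incorporates can be sketched as follows. This is a simplified, unparameterized version (it omits the learned query/key/value projections used in practice): every position attends to every other, which is how attention captures the long-range relations mentioned above.

```python
import numpy as np

def softmax(s):
    # Numerically stable softmax over the last axis.
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    # x: (positions, features). Each output position is an
    # attention-weighted mixture of the features at all positions.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)  # pairwise similarity of positions
    weights = softmax(scores)      # each row sums to 1
    return weights @ x
```

Unlike a convolution, whose receptive field is local, the weight matrix here is dense over all positions, at a cost quadratic in the number of positions.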
Another primary strategy to mitigate the limitations of GANs involves improving the loss function. Early GANs used the Jensen-Shannon divergence and the Kullback-Leibler divergence as loss functions to minimize the difference between the distribution of the synthetic generated dataset and that of the real dataset. However, the Jensen-Shannon divergence was found to fail in scenarios where there is little or no overlap between the distributions, while the minimization of the Kullback-Leibler divergence can lead to mode collapse. To address these problems, a number of GANs have used alternative loss functions. The most popular are arguably the Wasserstein GAN (WGAN 34 ) and the Wasserstein GAN with gradient penalty (WGAN-GP 35 ). The Wasserstein distance measures the minimum cost of transporting one distribution onto the other; it remains informative even when the distributions do not overlap and has been shown to provide smoother gradients. Additional popular strategies that improve GAN performance without modifying the model architecture include spectral normalization and varying how frequently the discriminator is updated (with respect to the update frequency of the generator).
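To make the contrast concrete, here is a small NumPy sketch of the one-dimensional Wasserstein distance and the WGAN critic loss (illustrative only; real WGAN training also enforces a Lipschitz constraint on the critic, for example via the gradient penalty of WGAN-GP):

```python
import numpy as np

def wasserstein_1d(a, b):
    # For equal-sized one-dimensional samples, the 1-Wasserstein distance
    # reduces to the mean absolute difference between sorted samples:
    # optimal transport matches quantile to quantile.
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

def critic_loss(d_real, d_fake):
    # WGAN critic loss: minimizing it maximizes E[D(real)] - E[D(fake)],
    # an estimate of the Wasserstein distance between the distributions.
    return float(np.mean(d_fake) - np.mean(d_real))
```

Note that `wasserstein_1d` of two disjoint samples grows smoothly with their separation, whereas the Jensen-Shannon divergence saturates once the supports no longer overlap, which is the gradient pathology the WGAN losses avoid.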
The explosive progress of GANs has spawned many more offshoots of the original GAN, as documented by the diverse models that now populate the GAN Model Zoo 36 .
Augmenting datasets
In the past decade, many deep-learning models for medical-image classification 3 , 37 , segmentation 38 , 39 and detection 40 have achieved physician-level performance. However, the success of these models is ultimately beholden to large, diverse, balanced and well-labelled datasets. This is a bottleneck that extends across domains, yet it is particularly restrictive in healthcare applications, where collecting comprehensive datasets comes with unique obstacles. In particular, large amounts of standardized clinical data are difficult to obtain, and this is exacerbated by the reality that clinical data often reflect the patient population of one or a few institutions (with the data sometimes overrepresenting common diseases or healthy populations, making the sampling of rarer conditions more difficult). Datasets with high class imbalance or insufficient variability can lead to poor model performance, generalization failures, unintentional modelling of confounders 41 and propagation of biases 42 . To mitigate these problems, clinical datasets can be augmented by using standard data-manipulation techniques, such as the flipping, rotation, scaling and translation of images 43 . However, these methods yield limited increases in performance and generate highly correlated training data.
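The standard data-manipulation techniques mentioned above amount to simple geometric transforms of each image. A minimal sketch (the helper names are our own, and the image is a toy 2 × 2 array standing in for pixel intensities):

```python
def flip_horizontal(image):
    """Mirror each row of a 2D image (a list of lists)."""
    return [row[::-1] for row in image]

def rotate_90(image):
    """Rotate a 2D image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def augment(image):
    """Generate simple augmented variants of one image.
    Each variant is highly correlated with the original,
    which limits the diversity it adds to a training set."""
    return [flip_horizontal(image), rotate_90(image)]

image = [[1, 2],
         [3, 4]]
```

Because every variant is a deterministic transform of the same source image, such augmentation cannot introduce genuinely new anatomy or pathology, which motivates the GAN-based augmentation discussed next.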
GANs offer potent solutions to these problems, as they can be used to augment training data to improve model performance. For example, a convolutional neural network (CNN) for the classification of liver lesions, trained on both synthetically and traditionally augmented data, boosted performance by 10% with respect to a CNN trained only on traditionally augmented datasets 18 . Moreover, when generating synthetic data across data classes, developing a generator for each class can result in higher model performance 20 , 44 , as was shown via the comparison of two variants of GANs (a DCGAN that generated labelled examples for each of three lesion classes separately, and an AC-GAN that incorporated class conditioning to generate labelled examples) 18 .
The aforementioned studies involved class-balanced datasets but did not address medical data with either simulated or real class imbalances. In an assessment of the capability of GANs to alleviate the shortcomings of unbalanced chest-X-ray datasets 20 , it was found that training a classifier on real unbalanced datasets that had been augmented with DCGANs outperformed models that were trained with the unbalanced and balanced versions of the original dataset. Although there was an increase in classification accuracy across all classes, the greatest increase in performance was seen in the most imbalanced classes (pneumothorax and oedema), which had just one-fourth the number of training cases as the next class.
Protecting patient privacy
The protection of patient privacy is often a leading concern when developing clinical datasets 45 . Sharing patient data when generating multi-institution clinical datasets can pose a risk to patient privacy 46 . Even if privacy protocols are followed, patient characteristics can sometimes be inferred from the ML model and its outputs 47 , 48 . In this regard, GANs may provide a solution. Data created by GANs cannot be attributed to a single patient, as they synthesize data that reflect the patient population in aggregate. GANs have thus been used as a patient-anonymization tool to generate synthetic data for model training 9 , 49 . Although models trained on synthetic data alone can perform poorly, models trained on synthetic data and fine-tuned with 10% real data achieved performance similar to that of models trained on real datasets augmented with synthetic data 19 . Similarly, using synthetic data generated from GANs to train an image-segmentation model was sufficient to achieve 95% of the accuracy of the same model trained on real data 49 . Hence, using synthetic data during model development can mitigate potential patient-privacy violations.
Image-to-image translation
One exciting use of GANs involves image-to-image translation. In healthcare, this capability has been used to translate between imaging modalities—between computed tomography (CT) and magnetic resonance (MR) images 21 , 49 , 50 , 51 , between CT and positron emission tomography (PET) 52 , 53 , 54 , between MR and PET 55 , 56 , 57 , and between T1 and T2 MR images 58 , 59 . Translation between imaging modalities can reduce the need for additional costly and time-intensive image acquisitions, can be used in scenarios where imaging is not possible (as is the case for MR imaging in individuals with metal implants) and can expand the types of training data that can be created from image datasets. There are two predominant strategies for image-to-image translation: paired-image training (with pix2pix 30 ) and unpaired training (with CycleGAN 60 ). For example, pix2pix was used to generate synthetic CT images for accurate MR-based dose calculations for the pelvis 61 . Similarly, using paired magnetic resonance angiography and MR images, pix2pix was modified to generate a model for the translation of T1 and T2 MR images to retrospectively inspect vascular structures 62 .
Obtaining paired images can be difficult in scenarios involving moving organs or multimodal medical images that are in three dimensions and do not have cross-modality paired data. In such cases, one can use CycleGAN 60 , which handles image-to-image translation on unpaired images. A difficulty with unpaired images is the lack of ground-truth labels for evaluating the accuracy of the predictions (yet real cardiac MR images have been used to compare the performance of segmentation models trained on synthetic cardiac MR images translated from CT images 49 ). Another common problem is the need to avoid geometric distortions that destroy anatomical structures. Limitations with geometric distortions can be overcome by using two auxiliary mappings to constrain the geometric invariance of synthetic data 21 .
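The central idea behind CycleGAN, cycle consistency, can be sketched with toy one-dimensional 'generators' (the linear maps and function names below are purely illustrative, not the published model): translating an input to the other modality and back should recover the original, which penalizes mappings that distort structure.

```python
def cycle_consistency_loss(G, F, x_batch, y_batch):
    """L1 cycle-consistency loss: translating x -> y -> x (and y -> x -> y)
    should recover the original input, discouraging distortions that
    destroy the underlying structure."""
    loss = 0.0
    for x in x_batch:
        loss += abs(F(G(x)) - x)   # forward cycle: x -> G(x) -> F(G(x))
    for y in y_batch:
        loss += abs(G(F(y)) - y)   # backward cycle: y -> F(y) -> G(F(y))
    return loss / (len(x_batch) + len(y_batch))

# Toy 'generators' acting on scalar intensities; perfect inverses give zero loss.
G = lambda x: 2 * x + 1      # stand-in for, say, CT -> MR translation
F = lambda y: (y - 1) / 2    # stand-in for the reverse MR -> CT translation
F_bad = lambda y: y / 2      # an imperfect inverse incurs a positive loss
```

In the real model this loss is computed pixel-wise on images and added to the adversarial losses of the two generators.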
Opportunities
In the context of clinical datasets, GANs have primarily been used to augment or balance the datasets, and to preserve patient privacy. Yet a burgeoning application of GANs is their use to systematically explore the entire terrain of clinical scenarios and disease presentations. Indeed, GANs can be used to generate synthetic data to combat model deterioration in the face of domain shifts 63 , 64 , for example, by creating synthetic data that simulate variable lighting or camera distortions, or that imitate data collected from devices from different vendors or from different imaging modalities. Additionally, GANs can be used to create data that simulate the full spectrum of clinical scenarios and disease presentations, from dangerous and rare clinical scenarios such as incorrect surgery techniques 63 , to modelling the spectrum of brain-tumour presentation 19 , to exploring the disease progression of neurodegenerative diseases 65 , 66 .
However, GANs can suffer from training instability and low image diversity and quality. These limitations could hamper the deployment of GANs in clinical practice. For example, one hope for image-to-image translation in healthcare involves the creation of multimodality clinical images (from CT and MR, for example) for scenarios in which only one imaging modality is possible. However, GANs are currently limited in the size and quality of the images that they can produce. This raises the question of whether these images can realistically be used clinically when medical images are typically generated at high resolution. Moreover, there may be regulatory hurdles involved in approving ML healthcare models that have been trained on synthetic data. This is further complicated by the current inability to robustly evaluate and control the quality of GANs and of the synthetic data that they generate 67 . Still, in domains unrelated to healthcare, GANs have been used to make tangible improvements to deployed models 68 . These successes may lay a foundation for the real-world application of GANs in healthcare.
Federated learning
When using multi-institutional datasets, model training is typically performed centrally: data siloed in individual institutions are aggregated into a single server. However, data used in such ‘centralized training’ represent only a fraction of the vast amount of clinical data that could be harnessed for model development, because openly sharing and exchanging patient data is restricted by many legal, ethical and administrative constraints; in fact, in many jurisdictions, patient data must remain local.
Federated learning is a paradigm for training ML models when decentralized data are used collaboratively under the orchestration of a central server 69 , 70 (Fig. 2 ). In contrast to centralized training, where data from various locations are moved to a single server to train the model, federated learning allows for the data to remain in place. At the start of each round of training, the current copy of the model is sent to each location where the training data are stored. Each copy of the model is then trained and updated using the data at each location. The updated models are then sent from each location back to the central server, where they are aggregated into a global model. The subsequent round of training follows, the newly updated global model is distributed again, and the process is repeated until model convergence or training is stopped. At no point do the data leave a particular location or institution, and only individuals associated with an institution have direct access to its data. This mitigates concerns about privacy breaches, minimizes costs associated with data aggregation, and allows training datasets to quickly scale in size and diversity. The successful implementation of federated learning could transform how deep-learning models for healthcare are trained. Here we focus on two applications: cross-silo federated learning and cross-device federated learning (Table 2 ).
Multiple institutions collaboratively train an ML model. Federated learning begins when each institution notifies a central server of their intention to participate in the current round of training. Upon notification, approval and recognition of the institution, the central server sends the current version of the model to the institution (step 1). Then, the institution trains the model locally using the data available to it (step 2). Upon completion of local training, the institution sends the model back to the central server (step 3). The central server aggregates all of the models that have been trained locally by each of the individual institutions into a single updated model (step 4). This process is repeated in each round of training until model training concludes. At no point during any of the training rounds do patient data leave the institution (step 5). The successful implementation of federated learning requires healthcare-specific federated learning frameworks that facilitate training, as well as institutional infrastructure for communication with the central server and for locally training the model.
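The aggregation logic described above (step 4) can be sketched in a few lines of Python. The weighted averaging follows the FedAvg scheme, with each site's contribution weighted by its sample count; the function and variable names are our own:

```python
def federated_averaging(local_models, sample_counts):
    """Aggregate locally trained models into a global model by averaging
    each parameter, weighted by each site's number of training samples
    (the aggregation step of FedAvg)."""
    total = sum(sample_counts)
    n_params = len(local_models[0])
    global_model = []
    for i in range(n_params):
        weighted = sum(m[i] * n for m, n in zip(local_models, sample_counts))
        global_model.append(weighted / total)
    return global_model

def training_round(global_model, institutions, local_update):
    """One round of federated learning: send the model out (step 1),
    train locally (step 2), collect the updates (step 3) and aggregate
    (step 4). `institutions` holds each site's private data; only model
    parameters -- never patient data -- are communicated."""
    local_models, counts = [], []
    for data in institutions:
        local_models.append(local_update(list(global_model), data))
        counts.append(len(data))
    return federated_averaging(local_models, counts)
</antml>```

In practice each round would run over the network via a federated learning framework; this sketch only makes explicit that the server ever sees model parameters, not records.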
Cross-silo federated learning
Cross-silo federated learning is an increasingly attractive solution to the shortcomings of centralized training 71 . It has been used to leverage EHRs to train models to predict hospitalization due to heart disease 72 , to promote the development of ‘digital twins’ or ‘Google for patients’ 73 , and to develop a Coronavirus disease 2019 (COVID-19) chest-CT lesion segmenter 74 . Recent efforts have focused on empirically evaluating model-design parameters, and on logistical decisions to optimize model performance and overcome the unique implementation challenges of federated learning, such as bottlenecks in protecting privacy and in tackling the statistical heterogeneity of the data 75 , 76 .
Compared with centralized training, one concern of federated learning is that models may encounter more severe domain shifts or overfitting. However, models trained through federated learning were found to achieve 99% of the performance of traditional centralized training even with imbalanced datasets or with relatively few samples per institution, thus showing that federated learning can be realistically implemented without sacrificing performance or generalization 77 , 78 .
Although federated learning offers greater privacy protection because patient data are no longer being transmitted, there are risks of privacy breaches 79 . Communicating model updates during the training process can reveal sensitive information to third parties or to the central server. In certain instances, data leakage can occur, such as when ML models ‘memorize’ datasets 80 , 81 , 82 and when access to model parameters and updates can be used to infer the original dataset 83 . Differential privacy 84 can further reinforce privacy protection for federated learning 70 , 85 , 86 . Selective parameter sharing 87 and the sparse vector technique 88 are two strategies for achieving greater privacy, but at the expense of model performance (this is consistent with differential-privacy findings in domains outside of medicine and healthcare 80 , 89 ).
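A common mechanism behind differentially private federated learning is to clip each local update and add calibrated noise before it is shared. The sketch below illustrates the idea only; the parameter names are our own and the noise calibration shown is a simplification, not a formal privacy accounting:

```python
import random

def clip_update(update, clip_norm):
    """Clip an update's L2 norm to bound any single contribution's influence."""
    norm = sum(u * u for u in update) ** 0.5
    if norm > clip_norm:
        return [u * clip_norm / norm for u in update]
    return update

def privatize_update(update, clip_norm, noise_scale, rng):
    """Clip, then add Gaussian noise before the update leaves the site.
    Larger noise gives stronger privacy but degrades model performance,
    mirroring the privacy-utility trade-off noted in the text."""
    clipped = clip_update(update, clip_norm)
    return [u + rng.gauss(0.0, noise_scale * clip_norm) for u in clipped]
```

With `noise_scale` set to zero the mechanism reduces to plain clipping, which makes the trade-off explicit: privacy protection comes entirely from the added noise.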
Another active area of research for federated learning in healthcare involves the handling of data that are neither independent nor identically distributed (non-IID data). Healthcare data are particularly susceptible to this problem, owing to a higher prevalence of certain diseases in certain institutions (which can cause label-distribution skew) or to institution-specific data-collection techniques (leading to ‘same label, different features’ or to ‘same features, different label’). Many federated learning strategies assume IID data, but non-IID data can pose a very real problem in federated learning; for example, it can cause the popular federated learning algorithm FedAvg 70 to fail to converge 90 . The predominant strategies for addressing this issue have involved the reframing of the data to achieve a uniform distribution (consensus solutions) or the embracing of the heterogeneity of the data 69 , 91 , 92 (pluralistic solutions). In healthcare, the focus has been on consensus solutions involving data sharing (a small subset of training data is shared among all institutions 93 , 94 ).
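Label-distribution skew of the kind described above can be quantified simply, for example as the total-variation distance between each institution's label distribution and the pooled distribution. The sketch below (helper names our own) illustrates this on toy data:

```python
from collections import Counter

def label_distribution(labels):
    """Empirical label distribution at one institution."""
    counts = Counter(labels)
    n = len(labels)
    return {label: c / n for label, c in counts.items()}

def label_skew(site_labels, all_classes):
    """Total-variation distance between each site's label distribution and
    the pooled (global) distribution -- a simple measure of the
    label-distribution skew that can make FedAvg fail to converge."""
    pooled = label_distribution([l for site in site_labels for l in site])
    skews = []
    for site in site_labels:
        dist = label_distribution(site)
        tv = 0.5 * sum(abs(dist.get(c, 0.0) - pooled.get(c, 0.0))
                       for c in all_classes)
        skews.append(tv)
    return skews
```

Sites whose disease mix matches the pooled population score near zero; a site that only ever sees one diagnosis scores high, flagging where consensus or pluralistic strategies may be needed.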
Cross-device federated learning to handle health data from individuals
‘Smart’ devices can produce troves of continuous, passive and individualized health data that can be leveraged to train ML models and deliver personalized health insights for each user 1 , 16 , 39 , 95 , 96 . As smart devices become increasingly widespread, and as computing and sensor technology become more advanced and cheaper to mass-produce, the amount of health data will grow exponentially. This will accentuate the challenges of aggregating large quantities of data into a single location for centralized training and exacerbate privacy concerns (such as any access to detailed individual health data by large corporations or governments).
Cross-device federated learning was developed to address the increasing amounts of data that are being generated ‘at the edge’ (that is, by decentralized smart devices), and has been deployed on millions of smart devices; for example, for voice recognition (by Apple, for the voice assistant Siri 97 ) and to improve query suggestions (by Google, for the Android operating system 98 ).
The application of cross-device federated learning to train healthcare models for smart devices is an emerging area of research. For example, using a human-activity-recognition dataset, a global model (FedHealth) was pre-trained using 80% of the data before being deployed to be locally trained and then aggregated 99 . The aggregated model was then sent back to each user and fine-tuned on user-specific data to develop a personalized model for the user. Model personalization resolves issues arising from the highly different probability distributions of data across users and the global model. This training strategy outperformed non-federated learning by nearly 5.3%.
Limitations and opportunities
In view of the initial promises and successes of federated learning, the next few years will be defined by progress towards the implementation of federated learning in healthcare. This will require a high degree of coordination across institutions at each step of the federated learning process. Before training, medical data will need to undergo normalization and standardization. This can be challenging, owing to differences in how data are collected, stored, labelled and partitioned across institutions. Current data pre-processing pipelines could be adapted to create multi-institutional training datasets, yet in federated learning this responsibility shifts from a central entity to each individual institution. Hence, methods to streamline and validate these processes across institutions will be essential for the successful implementation of federated learning.
Another problem concerns the inability of the developer of the model to directly inspect data during model development. Data inspection is critical for troubleshooting and for identifying mislabelled data as well as general trends. Tools (such as Federated Analytics, developed by Google 100 ) that use GANs to create synthetic data resembling the original training data 101 and that derive population-level summary statistics from the data can be helpful. However, it is currently unclear whether tools that have been developed for cross-device settings can be applied to cross-silo healthcare settings while preserving institutional privacy.
Furthermore, federated learning will require robust frameworks for the implementation of federated networks. Much of this software is proprietary, and many of the open-source frameworks are intended primarily for research use. The primary concerns of federated learning can be addressed by frameworks designed to reinforce patient privacy, facilitate model aggregation and tackle the challenges of non-IID data.
One main hurdle is the need for each participating healthcare institution to acquire the necessary infrastructure. This entails ensuring that each institution has the same federated learning framework and version, that stable and encrypted network communication is available to send and receive model updates from the central server, and that the computing capabilities (institutional graphics processing units or access to cloud computing) are sufficient to train the model. Although most large healthcare institutions may have the necessary infrastructure in place, it has typically been optimized to store and handle data centrally. Adapting this infrastructure to the requirements of federated learning will take coordinated effort and time.
A number of federated learning initiatives in healthcare are underway. Specifically, the Federated Tumour Segmentation Initiative (a collaboration between Intel and the University of Pennsylvania) trains lesion-segmentation models collaboratively across 29 international healthcare institutions 102 . This initiative focuses on finding the optimal algorithm for model aggregation, as well as on ways to standardize training data from various institutions. In another initiative (a collaboration of NVIDIA and several institutions), federated learning was used to train mammography-classification models 103 . These efforts may establish blueprints for coordinated federated networks applied to healthcare.
Natural language processing
Harnessing natural language processing (NLP)—the automated understanding of text—has been a long-standing goal for ML in healthcare 1 , 16 , 17 . NLP has enabled the automated translation of doctor–patient interactions to notes 5 , 104 , 105 , the summarization of clinical notes 106 , the captioning of medical images 107 , 108 and the prediction of disease progression 6 , 7 . However, the inability to efficiently train models using the large datasets needed to achieve adept natural-language understanding has limited progress. In this section, we provide an overview of two recent innovations that have transformed NLP: transformers and transfer learning for NLP. We also discuss their applications in healthcare.
Transformers
When modelling sequential data, recurrent neural networks (RNNs) have been the predominant choice of neural network. In particular, long short-term memory networks 109 and gated recurrent units 110 were staple RNNs for modelling EHR data, as these networks can model the sequential nature of clinical data 111 , 112 and clinical text 5 , 104 , 105 , 113 . However, RNNs harbour several limitations 114 . Namely, RNNs process data sequentially and not in parallel. This restricts the size of the input datasets and of the networks, which limits the complexity of the features and the range of relations that can be learned 114 . Hence, RNNs are difficult to train, deploy and scale, and are suboptimal for capturing long-range and global patterns in data. However, learning global or long-range relationships is often needed when learning language representations. For example, sentences far removed from a word may be important for providing context for the word, and previous clinical events can inform clinical decisions made years later. For a period, CNNs, which are adept at parallelization, were used to overcome some of the limitations of RNNs 115 , but were found to be inefficient when modelling longer global dependencies.
In 2017, a research team at Google (the Google Brain team) released the transformer, a landmark model that has revolutionized NLP 116 . Compared with RNN and CNN models, transformers are more parallelizable and less computationally complex at each layer, and thus can handle larger training datasets and learn longer-range and global relations. The use of only attention layers for the encoders and decoders, forgoing RNNs or CNNs entirely, was critical to the success of transformers. Attention was introduced and refined 117 , 118 to handle bottlenecks in sequence-to-sequence RNNs 110 , 119 . Attention modules allow models to globally relate different positions of a sequence to compute a richer representation of the sequence 116 , and do so in parallel, allowing for increased computing efficiency and for the embedding of longer-range relations in the input sequence (Fig. 3 ).
a , The original transformer model performs language translation, and contains encoders that convert the input into an embedding and decoders that convert the embedding into the output. b , The transformer model uses attention mechanisms within its encoders and decoders. The attention module is used in three places: in the encoder (for the input sentence), in the decoder (for the output sentence) and in the encoder–decoder in the decoder (for embeddings passed from the encoder). c , The key component of the transformer block is the attention module. Briefly, attention is a mechanism to determine how much weight to place on input features when creating embeddings for downstream tasks. For NLP, this involves determining how much importance to place on surrounding text when creating a representation for a particular word. To learn the weights, the attention mechanism assigns a score to each pair of words from an input phrase to determine how strongly the words should influence the representation. To obtain the score, the transformer model first decomposes the input into three vectors: the query vector ( Q ; the word of interest), the key vector ( K ; surrounding words) and the value vector ( V ; the contents of the input) (1). Next, the dot product is taken between the query and key vector (2) and then scaled to stabilize training (3). The SoftMax function is then applied to normalize the scores and ensure that they add to 1 (4). The output SoftMax score is then multiplied by the value vector to apply a weighted focus to the input (5). The transformer model has multiple attention mechanisms (termed attention heads); each learn a separate representation for the same word, which therefore increases the relations that can be learned. Each attention head is composed of stacked attention layers. The output of each attention mechanism is concatenated into a single matrix (6) that is fed into the downstream feed-forward layer. 
d , e , Visual representation of what is learned 185 . Lines relate the query (left) to the words that are attended to the most (right). Line thickness denotes the magnitude of attention, and colours represent the attention head. d , The learned attention in one attention-mechanism layer of one head. e , Examples of what is learned by each layer of each attention head. Certain layers learn to attend to the next words (head 2, layer 0) or to the previous word (head 0, layer 0). f , Workflow for applying a transformer language model to a clinical task. Matmul, matrix multiplication; (CLS), classification token placed at the start of a sentence to store the sentence-level embedding; (SEP), separation token placed at the end of a sentence. BERT, bidirectional encoder representations from transformers; MIMIC, multiparameter intelligence monitoring in intensive care.
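The attention computation walked through in the caption (steps 2–5) can be written compactly. The sketch below implements single-head scaled dot-product attention in plain Python for illustration only; it omits the learned projection matrices that produce Q, K and V from the input embeddings in a real transformer:

```python
import math

def softmax(scores):
    """Normalize scores into attention weights that sum to 1 (step 4)."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention over sequences of d-dimensional vectors.
    For each query: score every key via a dot product (step 2), scale by
    sqrt(d) to stabilize training (step 3), softmax the scores (step 4),
    then take the weighted sum of the value vectors (step 5)."""
    d = len(K[0])
    outputs = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs
```

Every query attends to every position in parallel, which is what lets transformers capture the long-range relations that RNNs struggle with. A multi-head model simply runs several such computations with different learned projections and concatenates the results (step 6).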
Transfer learning for NLP
Simultaneous and subsequent work following the release of the transformer resolved another main problem in NLP: the formalization of the process of transfer learning. Transfer learning has been used most extensively in computer vision, owing to the success of the ImageNet challenge, which made pre-trained CNNs widely available 120 . Transfer learning has enabled the broader application of deep learning in healthcare 17 , as researchers can fine-tune a pre-trained CNN adept at image classification on a smaller clinical dataset to accomplish a wide spectrum of healthcare tasks 3 , 37 , 121 , 122 . Until recently, robust transfer learning for NLP models was not possible, which limited the use of NLP models in domain-specific applications. A series of recent milestones have enabled transfer learning for NLP. The identification of the ideal pre-training language tasks for deep-learning NLP models (for example, masked-language modelling, in which missing words are predicted from the surrounding context, and next-sentence prediction, in which the model predicts whether two sentences follow one another) was solved by universal language model fine-tuning (ULM-FiT 123 ) and embeddings from language models (ELMo 124 ). The generative pre-trained transformer (GPT 125 ) from OpenAI and the bidirectional encoder representations from transformers (BERT 126 ) from Google Brain then applied the methods formalized by ULM-FiT and ELMo to transformer models, delivering pre-trained models that achieved unprecedented capabilities on a series of NLP tasks.
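Masked-language modelling can be sketched as follows. This is a simplified illustration of the idea rather than BERT's exact scheme, which also occasionally replaces a selected token with a random word or leaves it unchanged; the function name and token strings are our own:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Create a masked-language-modelling example: hide a fraction of the
    tokens behind a [MASK] symbol. The model is then trained to predict
    each hidden token from the surrounding context."""
    rng = rng or random.Random()
    inputs, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append('[MASK]')
            targets.append(token)   # the model must recover this token
        else:
            inputs.append(token)
            targets.append(None)    # no loss is computed at this position
    return inputs, targets
```

Because the targets come from the text itself, pre-training needs no labels, which is what makes it feasible to learn from corpora of millions of clinical notes.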
Transformers for the understanding of clinical text
Following the success of transformers for NLP, their potential to handle domain-specific text, specifically clinical text, was quickly assessed. The performances of the transformer-based model BERT, the RNN-based model ELMo and traditional word-vector embeddings 127 , 128 at clinical-concept extraction (the identification of medical problems, tests and treatments) from EHR data were evaluated 106 . BERT outperformed traditional word vectors by a substantial margin and was more computationally efficient than ELMo (it achieved higher performance with fewer training iterations) 129 , 130 , 131 , 132 . Pre-training on a dataset of 2 million clinical notes (the Multiparameter Intelligent Monitoring in Intensive Care dataset 132 ; MIMIC-III) increased the performance of all NLP models. This suggests that contextual embeddings encode valuable semantic information not accounted for in traditional word representations 106 . However, the performance of MIMIC-III BERT began to decline after reaching its optimum; this is perhaps indicative of the model losing information learned from the large open corpus and converging to a model similar to one initialized from scratch 106 . Hence, there may be a fine balance between learning from a large open-domain corpus and from a domain-specific clinical corpus. This may be a critical consideration when applying pre-trained models to healthcare tasks.
To facilitate the further application of clinically pre-trained BERT 129 to downstream clinical tasks, a BERT pre-trained on large clinical datasets was publicly released. Because transformers and deep NLP models are resource-intensive to train (training the BERT model can cost US$50,000–200,000 133 ; and pre-training BERT on clinical datasets required 18 d of continuous training, an endeavour that may be out of the reach of many institutions), openly releasing pre-trained clinical models can facilitate widespread advancements of NLP tasks in healthcare. Other large and publicly available clinically pre-trained models (Table 3 ) are ClinicalBERT 130 , BioBERT 134 and SciBERT 135 .
The release of clinically pre-trained models has spurred downstream clinical applications. ClinicalBERT, a BERT model pre-trained on MIMIC-III data using masked-language modelling and next-sentence prediction, was evaluated on the downstream task of predicting 30 d readmission 130 . Compared with previous models 136 , 137 , ClinicalBERT can dynamically predict readmission risk during a patient’s stay and uses clinical text rather than structured data (such as laboratory values, or codes from the international classification of diseases). This shows the power of transformers to unlock clinical text, a comparatively underused data source in EHRs. Similarly, clinical text from EHRs has been harnessed using SciBERT for the automated extraction of symptoms from COVID-19-positive and COVID-19-negative patients to identify the most discriminative clinical presentations 138 . ClinicalBERT has also been adapted to extract anginal symptoms from EHRs 139 . Others have used enhanced clinical-text understanding for the automatic labelling and summarization of clinical reports. BioBERT and ClinicalBERT have been harnessed to extract labels from radiology text reports, enabling an automatic clinical summarization tool and labeller 140 . Transformers have also been used to improve clinical questioning and answering 141 , in clinical voice assistants 142 , 143 , in chatbots for patient triage 144 , 145 , and in medical-image-to-text translation and medical-image captioning 146 .
Transformers for the modelling of clinical events
In view of their adeptness at modelling the sequential nature of clinical text, transformers have also been harnessed to model the sequential nature of clinical events 147 , 148 , 149 , 150 , 151 . A key challenge of modelling clinical events is properly capturing long-term dependencies—that is, previous clinical procedures that may preclude future downstream interventions. Transformers are particularly adept at exploring longer-range relationships and were recently used to develop BEHRT 152 , which leverages the parallels between sequences in natural language and clinical events in EHRs to portray diagnoses as words, visits as sentences and a patient’s medical history as a document 152 . When used to predict the likelihood of 301 conditions in future visits, BEHRT achieved an 8–13.2% improvement over the existing state-of-the-art EHR model 152 . BEHRT was also used to predict the incidence of heart failure from EHR data 153 .
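The reframing used by BEHRT can be illustrated with a toy example. The token conventions below mirror BERT-style inputs and the diagnosis codes are invented; the published model additionally encodes age, visit position and other metadata:

```python
def patient_to_sequence(visits):
    """Flatten a patient's history into a BEHRT-style token sequence:
    diagnoses act as 'words', visits as 'sentences' separated by a
    [SEP] token, and the full history as a 'document'."""
    sequence = ['[CLS]']
    for visit in visits:
        sequence.extend(visit)
        sequence.append('[SEP]')
    return sequence

# Hypothetical history: two visits, each a list of diagnosis codes.
history = [['hypertension', 'type2-diabetes'], ['heart-failure']]
```

Once histories are expressed this way, the same masked-token pre-training and fine-tuning machinery developed for natural language applies directly to EHR data.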
Data-limiting factors in the deployment of ML
The past decade of research in ML in healthcare has focused on model development, and the next decade will be defined by model deployment into clinical settings 42 , 45 , 46 , 154 , 155 . In this section, we discuss two data-centric obstacles in model deployment: how to efficiently deliver raw clinical data (Table 4 ) to models, and how to monitor and correct for natural data shifts that deteriorate model performance.
Delivering data to models
A main obstacle to model deployment is the efficient transformation of raw, unstructured and heterogeneous clinical data into the structured inputs that ML models expect. During model development, pre-processed structured data are fed directly into the model. During deployment, however, minimizing the delay between the acquisition of raw data and the delivery of structured inputs requires an adept data pipeline for collecting data from their source, and for ingesting, preparing and transforming the data (Fig. 4 ). An ideal pipeline would be high-throughput, have low latency and scale to a large number of data sources. A lack of optimization at any of these stages can introduce major inefficiencies and delay the model’s predictions. In what follows, we detail the challenges of building a pipeline for clinical data and give an overview of its key components.
Delivering data to a model is a key bottleneck in obtaining timely and efficient inferences. ML models require input data that are organized, standardized and normalized, often in tabular format. Therefore, it is critical to establish a pipeline for organizing and storing heterogeneous clinical data. The data pipeline involves collecting, ingesting and transforming clinical data from an assortment of data sources. Data can be housed in data lakes, in data warehouses or in both. Data lakes are central repositories that store all forms of data, raw and processed, without any predetermined organizational structure. Data in data lakes can exist as a mix of binary data (for example, images), structured data, semi-structured data (such as tabular data) and unstructured data (for example, documents). By contrast, data warehouses store cleaned, enriched, transformed and structured data with a predetermined organizational structure.
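As a concrete sketch of the lake-to-warehouse step, the illustration below (field names and schema are hypothetical, not drawn from any specific EHR product) shapes semi-structured records, as they might sit in a data lake, into the fixed tabular schema a warehouse and a downstream model would expect:

```python
from datetime import datetime

# Raw, semi-structured records as they might land in a data lake
# (field names are invented; real EHR feeds vary widely in shape).
raw_records = [
    {"patient_id": "A1", "event": "vitals", "hr": "88",
     "ts": "2021-03-01T10:15:00"},
    {"patient_id": "A1", "event": "lab", "name": "glucose", "value": "5.4",
     "ts": "2021-03-01T11:00:00"},
]

def to_warehouse_row(rec):
    """Clean and shape one raw record into the fixed schema that a
    warehouse (and a downstream ML model) expects."""
    return {
        "patient_id": rec["patient_id"],
        "event_type": rec["event"],
        "timestamp": datetime.fromisoformat(rec["ts"]),
        # Numeric payload normalized into a single float column
        "value": float(rec.get("hr") or rec.get("value")),
    }

structured = [to_warehouse_row(r) for r in raw_records]
print(structured[0]["value"])  # 88.0
```

A real pipeline would, of course, validate types and units and log records that fail to parse, rather than assuming clean inputs.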
The fundamental challenge of creating an adept data pipeline arises from the need to anticipate the heterogeneity of the data. ML models often require a set of specific clinical inputs (for example, blood pressure and heart rate), which must be extracted from a suite of dynamically changing health data. However, extracting the relevant inputs is difficult. Clinical data vary in volume and velocity (the rate at which data are generated), prompting the question of how frequently data should be collected. Furthermore, clinical data can vary in veracity (data quality), thus requiring different pre-processing steps. Moreover, the majority of clinical data exist in an unstructured format, a problem compounded by the availability of hundreds of EHR products, each with its own clinical terminology, technical specifications and capabilities 156 . Precisely extracting data from such a spectrum of unstructured EHR frameworks therefore becomes critical.
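The many-EHR-products problem is commonly handled with per-source adapters that map each vendor's export onto one canonical input schema. A minimal sketch follows (the vendor names, field names and units are invented for illustration; real mappings are typically driven by terminology standards rather than hand-written lambdas):

```python
def f_to_c(f):
    """Convert a temperature from Fahrenheit to Celsius."""
    return round((f - 32.0) * 5.0 / 9.0, 1)

# Hypothetical per-vendor adapters: the same clinical concepts appear
# under different field names and units in each EHR product's export.
ADAPTERS = {
    "vendor_a": lambda r: {"heart_rate": r["HR"],
                           "temp_c": r["TempC"]},
    "vendor_b": lambda r: {"heart_rate": r["pulse_bpm"],
                           "temp_c": f_to_c(r["temp_f"])},  # reports °F
}

def extract_inputs(vendor, record):
    """Map a vendor-specific record onto the model's fixed input schema."""
    return ADAPTERS[vendor](record)

# Two differently shaped records describing the same physiological state
print(extract_inputs("vendor_a", {"HR": 72, "TempC": 37.0}))
print(extract_inputs("vendor_b", {"pulse_bpm": 72, "temp_f": 98.6}))
```

The design choice here is that heterogeneity is isolated at the pipeline's edge: everything downstream of `extract_inputs` sees one schema, so adding a new EHR source means writing one adapter rather than touching the model.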
Data heterogeneity must be carefully accounted for when designing the data pipeline, as it can influence throughput, latency and other performance factors. The pipeline starts with data ingestion (by which raw clinical data are moved from the data source into the pipeline), a primary bottleneck in the throughput of the pipeline. In particular, handling peaks of data generation may require the design and implementation of scalable ways to support a variable number of connected objects 157 . Such data-elasticity issues can be addressed with software frameworks that scale up or down in real time to use computing resources in cloud data centres more effectively 158 .
After data enter the pipeline, the data-preparation stage cleanses, denoises, standardizes and shapes them into structured data that are ready for consumption by the ML system. In studies that developed data pipelines for healthcare data 156 , 159 , 160 , the data-preparation stage was found to govern the latency of the pipeline, which depended on the efficiency of the data queue, the streaming of the data and the database storing the computation results.
A final consideration is how data should move throughout the data pipeline; specifically, whether data should move in discrete batches or in continuous streams. Batch processing involves collecting and moving source data periodically, whereas stream processing involves sourcing, moving and processing data as soon as they are created. Batch processing has the advantages of being high-throughput, comprehensive and economical (and hence may be advantageous for scalability), whereas stream processing occurs in real time (and thus may be required for time-sensitive predictions). Many healthcare systems use a combination of batch processing and stream processing 160 .
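To make the batch-versus-stream distinction concrete, the sketch below is our own illustration (the `handle` function is a hypothetical stand-in for real cleaning and featurization): batch processing trades per-record latency for throughput, whereas stream processing handles each record as it arrives.

```python
def handle(records):
    """Stand-in for cleaning/featurizing; real pipelines do far more."""
    return [r * 2 for r in records]

def batch_process(source, batch_size):
    """Collect records into fixed-size batches before processing:
    high throughput, but a record may wait until its batch fills."""
    batch = []
    for record in source:
        batch.append(record)
        if len(batch) == batch_size:
            yield handle(batch)
            batch = []
    if batch:
        yield handle(batch)  # flush the partial final batch

def stream_process(source):
    """Process each record as soon as it arrives: minimal latency,
    at the cost of per-record overhead."""
    for record in source:
        yield handle([record])

events = range(1, 8)
print(list(batch_process(events, 3)))     # [[2, 4, 6], [8, 10, 12], [14]]
print(list(stream_process(range(1, 4))))  # [[2], [4], [6]]
```

In a hybrid deployment of the kind the text describes, the batch path might rebuild historical features overnight while the stream path scores time-sensitive events such as incoming vitals.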
Established data pipelines are being harnessed to support real-time healthcare modelling. In particular, Columbia University Medical Center, in collaboration with IBM, is streaming physiological data from patients with brain injuries to predict adverse neurological complications up to 48 h before existing methods can 161 . Similarly, Yale School of Medicine has used a data pipeline to support real-time data acquisition for predicting the number of beds available, handling care for inpatients and patients in the intensive care unit (such as managing ventilator capacity) and tracking the number of healthcare providers exposed to COVID-19 161 . However, optimizing the components of the data pipeline, particularly for numerous concurrent ML healthcare systems, remains a challenging task.
Deployment in the face of data shifts
A main obstacle in deploying ML systems for healthcare has been maintaining model robustness when faced with data shifts 162 . Data shifts occur when differences or changes in healthcare practices or in patient behaviour cause the deployment data to differ substantially from the training data, resulting in the distribution of the deployment data diverging from the distribution of the training data. This can lead to a decline in model performance. Also, failure to correct for data shifts can lead to the perpetuation of algorithmic biases, missing critical diagnoses 163 and unnecessary clinical interventions 164 .
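One common way of surveilling for such distributional divergence (our addition, not a method from the works cited here: the population stability index is a widely used drift-monitoring statistic, and the thresholds in the comment are rough conventions rather than standards) is to compare a feature's training and deployment histograms:

```python
import numpy as np

def population_stability_index(train, deployed, bins=10):
    """PSI between a feature's training and deployment distributions.
    Rough convention (varies by team): < 0.1 stable,
    0.1-0.25 moderate shift, > 0.25 major shift."""
    edges = np.histogram_bin_edges(train, bins=bins)
    p, _ = np.histogram(train, bins=edges)
    q, _ = np.histogram(deployed, bins=edges)
    # Normalize and smooth empty bins so the logarithm is defined
    p = np.clip(p / p.sum(), 1e-6, None)
    q = np.clip(q / q.sum(), 1e-6, None)
    return float(np.sum((q - p) * np.log(q / p)))

rng = np.random.default_rng(0)
train_ages = rng.normal(63, 10, 5000)  # e.g. an inpatient population
same = rng.normal(63, 10, 5000)        # deployment matches training
shifted = rng.normal(45, 10, 5000)     # e.g. a younger outpatient population
print(population_stability_index(train_ages, same))     # near 0
print(population_stability_index(train_ages, shifted))  # large
```

Tracked per feature over time, a statistic like this can flag when a deployed model is scoring a population it was not trained on, before performance metrics (which require labels) reveal the decline.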
In healthcare, data shifts are common occurrences and exist primarily along the axes of institutional differences (such as local clinical practices, or different instruments and data-collection workflows), epidemiological shifts, temporal shifts (for example, changes in physician and patient behaviours over time) and differences in patient demographics (such as race, gender and age). A recent case study 165 characterizing data shifts caused by institutional differences reported that pneumothorax classifiers trained on individual institutional datasets declined in performance when evaluated on data from external institutions. Similar phenomena have been observed in a number of studies 41 , 163 , 166 . Institutional differences are among the most patent causes of data shifts because they frequently harbour underlying differences in patient demographics, disease incidence and data-collection workflows. For example, in an analysis of chest-X-ray classifiers and their potential to generalize to other institutions, it was found that one institution collected chest X-rays using portable radiographs, whereas another used stationary radiographs 41 . This led to differences in disease prevalence (33% vs 2% for pneumonia) and patient demographics (average age of 63 vs 45), as portable radiographs were primarily used for inpatients who were too sick to be transported, whereas stationary radiographs were used primarily in outpatient settings. Similarly, another study found that different image-acquisition and image-processing techniques caused the deterioration of the performance of breast-mammography classifiers to random performance (areas under the receiver operating characteristic curve of 0.4–0.6) when evaluated on datasets from four external institutions and countries 163 . However, it is important to note that the models evaluated were trained on data collected during the 1990s and were externally tested on datasets created in 2014–2017. 
The decline in performance owing to temporal shifts is particularly relevant; if deployed today, models that have been trained on older datasets would be making inferences on newly generated data.
Studies that have characterized temporal shifts provide insights into the conditions under which deployed ML models should be re-evaluated. An evaluation of models that used data collected over a period of 9 years found that model performance deteriorated substantially, drifting towards overprediction as early as one year after model development 167 . For the MIMIC-III dataset 132 (commonly used for the development of models to predict clinical outcomes), an assessment of the effects of temporal shifts on model performance showed that, whereas all models experienced a moderate decline over time, the most significant drop in performance was caused by a shift in clinical practice, when the EHR system transitioned from CareVue to MetaVision 164 . A modern-day analogy would be how ML systems for COVID-19 (ref. 168 ) that were trained on data 169 acquired during the early phase of the pandemic, before the availability of COVID-19 vaccines, would perform when deployed in the face of shifts in disease incidence and presentation.
Data shifts and model deterioration can also occur when models are deployed on patients with gender, racial or socioeconomic backgrounds that are different from those of the patient population that the model was trained on. In fact, it has been shown that ML models can be biased against individuals of certain races 170 or genders 42 , or particular religious 171 or socioeconomic 15 backgrounds. For example, a large-scale algorithm used in many health institutions to identify patients for complex health needs underpredicted the health needs of African American patients and failed to triage them for necessary care 172 . Using non-representative or non-inclusive training datasets can constitute an additional source of gender, racial or socioeconomic biases. Popular chest-X-ray datasets used to train classifiers have been shown to be heavily unbalanced 15 : 67.6% of the patients in these datasets are Caucasian and only 8.98% are under Medicare insurance. Unsurprisingly, the performance of models trained with these datasets deteriorates for non-Caucasian subgroups, and especially for Medicare patients 15 . Similarly, skin-lesion classifiers that were trained primarily on images of one skin tone decrease in performance when evaluated on images of different skin tones 173 ; in this case, the drop in performance could be attributed to variations in disease presentation that are not captured when certain patient populations are not adequately represented in the training dataset 174 .
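Audits of the kind reported in these studies can start from a simple per-subgroup performance report. The sketch below is our own illustration with toy data (real audits would use held-out clinical datasets, confidence intervals and multiple metrics); it computes AUC separately for each demographic group to surface gaps:

```python
def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) formulation: the fraction
    of positive/negative pairs that the model ranks correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def subgroup_audit(labels, scores, groups):
    """AUC per demographic subgroup, to surface performance gaps."""
    report = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        report[g] = auc([labels[i] for i in idx], [scores[i] for i in idx])
    return report

# Toy example: the model ranks group "a" perfectly but group "b" poorly
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.8, 0.3, 0.6, 0.5, 0.4, 0.7]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
report = subgroup_audit(labels, scores, groups)
print(report["a"], report["b"])  # 1.0 0.25
```

A model with a strong aggregate AUC can still fail a subgroup badly, which is exactly the failure mode that aggregate-only evaluation hides.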
These findings exemplify two underlying limitations of ML models: they can propagate existing healthcare biases at scale, and insufficient diversity in training datasets can prevent model outputs from generalizing adequately to different patient populations. Training models on multi-institutional datasets is among the most effective ways of combating model deterioration 15 , and directly correcting existing biases in the training data can also mitigate their impact 171 . Data shifts themselves can be addressed proactively during model development 175 , 176 , 177 , 178 or retroactively, by surveilling for them during model deployment 179 . A proactive attitude towards recognizing and addressing potential biases and data shifts will remain imperative.
Outlook
Substantial progress in the past decade has laid a foundation of knowledge for the application of ML to healthcare. In pursuing the deployment of ML models, it is clear that success is dictated by how data are collected, organized, protected, moved and audited. In this Review, we have highlighted methods that can address these challenges. The emphasis will eventually shift to how to build the tools, infrastructure and regulations needed to efficiently deploy innovations in ML in clinical settings. A central challenge will be the implementation and translation of these advances into healthcare in the face of their current limitations: for instance, GANs applied to medical images are currently limited by image resolution and image diversity, and can be challenging to train and scale; federated learning promises to alleviate problems associated with small single-institution datasets, yet it requires robust frameworks and infrastructure; and large language models trained on large public datasets can subsume racial and ethnic biases 171 .
Another central consideration is how to handle the regulatory assessment of ML models for healthcare applications. Current regulation and approval processes are being adapted to meet the emerging needs; in particular, initiatives are attempting to address data shifts and patient representation in the training datasets 165 , 180 , 181 . However, GANs, federated learning and transformer models add complexities to the regulatory process. Few healthcare-specific benchmarking datasets exist to evaluate the performance of these ML systems during clinical deployment. Moreover, the assessment of the performance of GANs is hampered by the lack of efficient and robust metrics to evaluate, compare and control the quality of synthetic data.
Notwithstanding the challenges, the fact that analogous ML technologies are being used daily by millions of individuals in other domains, most prominently in smartphones 100 , search engines 182 and self-driving vehicles 68 , suggests that the challenges of deployment and regulation of ML for healthcare can also be addressed.
Topol, E. J. High-performance medicine: the convergence of human and artificial intelligence. Nat. Med. 25 , 44–56 (2019).
Gulshan, V. et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316 , 2402–2410 (2016).
Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542 , 115–118 (2017).
Rajkomar, A. et al. Scalable and accurate deep learning with electronic health records. npj Digit. Med. 1 , 18 (2018).
Rajkomar, A. et al. Automatically charting symptoms from patient-physician conversations using machine learning. JAMA Intern. Med. 179 , 836–838 (2019).
Henry, K. E., Hager, D. N., Pronovost, P. J. & Saria, S. A targeted real-time early warning score (TREWScore) for septic shock. Sci. Transl. Med. 7 , 299ra122 (2015).
Komorowski, M., Celi, L. A., Badawi, O., Gordon, A. C. & Faisal, A. A. The Artificial Intelligence Clinician learns optimal treatment strategies for sepsis in intensive care. Nat. Med. 24 , 1716–1720 (2018).
Abràmoff, M. D., Lavin, P. T., Birch, M., Shah, N. & Folk, J. C. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. npj Digit. Med. 1 , 39 (2018).
Iacobucci, G. Babylon Health holds talks with ‘significant’ number of NHS trusts. Brit. Med. J. 368 , m266 (2020).
Hale, C. Medtronic to distribute Viz.ai’s stroke-spotting AI imaging software. Fierce Biotech (23 July 2019); https://www.fiercebiotech.com/medtech/medtronic-to-distribute-viz-ai-s-stroke-spotting-ai-imaging-software
Hassan, A. E. et al. Early experience utilizing artificial intelligence shows significant reduction in transfer times and length of stay in a hub and spoke model. Interv. Neuroradiol. 26 , 615–622 (2020).
Ting, D. S. W. et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. JAMA 318 , 2211–2223 (2017).
McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577 , 89–94 (2020).
Liu, Y. et al. A deep learning system for differential diagnosis of skin diseases. Nat. Med. 26 , 900–908 (2020).
Seyyed-Kalantari, L., Liu, G., McDermott, M., Chen, I. Y. & Ghassemi, M. CheXclusion: fairness gaps in deep chest X-ray classifiers. Pac. Symp. Biocomput. 26 , 232–243 (2021).
Yu, K.-H., Beam, A. L. & Kohane, I. S. Artificial intelligence in healthcare. Nat. Biomed. Eng. 2 , 719–731 (2018).
Esteva, A. et al. A guide to deep learning in healthcare. Nat. Med. 25 , 24–29 (2019).
Frid-Adar, M. et al. GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification. Neurocomputing 321 , 321–331 (2018).
Shin, H.-C. et al. Medical image synthesis for data augmentation and anonymization using generative adversarial networks. In Simulation and Synthesis in Medical Imaging SASHIMI 2018 (eds Gooya, A., Goksel, O., Oguz, I. & Burgos, N.) 1–11 (Springer Cham, 2018).
Salehinejad, H., Valaee, S., Dowdell, T., Colak, E. & Barfett, J. Generalization of deep neural networks for chest pathology classification in X-rays using generative adversarial networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 990–994 (IEEE, 2018).
Zhang, Z., Yang, L. & Zheng, Y. Translating and segmenting multimodal medical volumes with cycle-and shape-consistency generative adversarial network. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition 9242–9251 (IEEE, 2018).
Xu, F., Zhang, J., Shi, Y., Kang, K. & Yang, S. A fast low-rank matrix factorization method for dynamic magnetic resonance imaging restoration. In 5th International Conference on Big Data Computing and Communications (BIGCOM) 38–42 (IEEE, 2019).
Goodfellow, I. J. et al. Generative adversarial networks. In Advances in Neural Information Processing Systems 27 (eds Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. & Weinberger, K. Q.) Paper 1384 (Curran, 2014).
Wang, Z., She, Q. & Ward, T. E. Generative adversarial networks in computer vision: a survey and taxonomy. ACM Comput. Surv. 54 , 1–38 (2021).
Radford, A., Metz, L. & Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. Preprint at https://arxiv.org/abs/1511.06434v2 (2016).
Denton, E. L., Chintala, S. & Fergus, R. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems 28 (eds Cortes, C., Lawrence, N., Lee, D., Sugiyama, M. & Garnett, R.) Paper 903 (Curran, 2015).
Karras, T., Aila, T., Laine, S. & Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations 2018 Paper 447 (ICLR, 2018).
Mirza, M. & Osindero, S. Conditional generative adversarial nets. Preprint at https://arxiv.org/abs/1411.1784v1 (2014).
Odena, A., Olah, C. & Shlens, J. Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the 34th International Conference on Machine Learning (eds. Precup, D. & Teh, Y. W.) 2642–2651 (PMLR, 2017).
Isola, P., Zhu, J.-Y., Zhou, T. & Efros, A. A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 5967–5976 (2018).
Zhang, H., Goodfellow, I., Metaxas, D. & Odena, A. Self-attention generative adversarial networks. In Proceedings of the 36th International Conference on Machine Learning (eds. Chaudhuri, K. & Salakhutdinov, R.) 7354–7363 (PMLR, 2019).
Wu, Y., Ma, Y., Liu, J., Du, J. & Xing, L. Self-attention convolutional neural network for improved MR image reconstruction. Inf. Sci. 490 , 317–328 (2019).
Brock, A., Donahue, J. & Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations Paper 564 (ICLR, 2019).
Arjovsky, M., Chintala, S. & Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning (eds. Precup, D. & Teh, Y. W.) 214–223 (PMLR, 2017).
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V. & Courville, A. C. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems 30 (eds. Guyon, I. et al.) Paper 2945 (Curran, 2017).
Hindupur, A. The-gan-zoo. https://github.com/hindupuravinash/the-gan-zoo (2018).
Rajpurkar, P. et al. Deep learning for chest radiograph diagnosis: a retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS Med. 15 , e1002686 (2018).
Ouyang, D. et al. Video-based AI for beat-to-beat assessment of cardiac function. Nature 580 , 252–256 (2020).
Xue, Y., Xu, T., Zhang, H., Long, L. R. & Huang, X. SegAN: adversarial network with multi-scale L1 loss for medical image segmentation. Neuroinformatics 16 , 383–392 (2018).
Haque, A., Milstein, A. & Fei-Fei, L. Illuminating the dark spaces of healthcare with ambient intelligence. Nature 585 , 193–202 (2020).
Zech, J. R. et al. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS Med. 15 , e1002683 (2018).
Zou, J. & Schiebinger, L. AI can be sexist and racist — it’s time to make it fair. Nature 559 , 324–326 (2018).
Perez, L. & Wang, J. The effectiveness of data augmentation in image classification using deep learning. Preprint at https://arxiv.org/abs/1712.04621v1 (2017).
Madani, A., Moradi, M., Karargyris, A. & Syeda-Mahmood, T. Semi-supervised learning with generative adversarial networks for chest X-ray classification with ability of data domain adaptation. In IEEE 15th International Symposium on Biomedical Imaging (ISBI) 1038–1042 (IEEE, 2018).
He, J. et al. The practical implementation of artificial intelligence technologies in medicine. Nat. Med. 25 , 30–36 (2019).
Kelly, C. J., Karthikesalingam, A., Suleyman, M., Corrado, G. & King, D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 17 , 195 (2019).
Rocher, L., Hendrickx, J. M. & de Montjoye, Y.-A. Estimating the success of re-identifications in incomplete datasets using generative models. Nat. Commun. 10 , 3069 (2019).
Schwarz, C. G. et al. Identification of anonymous MRI research participants with face-recognition software. N. Engl. J. Med. 381 , 1684–1686 (2019).
Chartsias, A., Joyce, T., Dharmakumar, R. & Tsaftaris, S. A. Adversarial image synthesis for unpaired multi-modal cardiac data. in Simulation and Synthesis in Medical Imaging (eds. Tsaftaris, S. A., Gooya, A., Frangi, A. F. & Prince, J. L.) 3–13 (Springer International Publishing, 2017).
Emami, H., Dong, M., Nejad-Davarani, S. P. & Glide-Hurst, C. K. Generating synthetic CTs from magnetic resonance images using generative adversarial networks. Med. Phys . https://doi.org/10.1002/mp.13047 (2018).
Jin, C.-B. et al. Deep CT to MR synthesis using paired and unpaired data. Sensors 19 , 2361 (2019).
Bi, L., Kim, J., Kumar, A., Feng, D. & Fulham, M. In Molecular Imaging, Reconstruction and Analysis of Moving Body Organs, and Stroke Imaging and Treatment (eds. Cardoso, M. J. et al.) 43–51 (Springer International Publishing, 2017).
Ben-Cohen, A. et al. Cross-modality synthesis from CT to PET using FCN and GAN networks for improved automated lesion detection. Eng. Appl. Artif. Intell. 78 , 186–194 (2019).
Armanious, K. et al. MedGAN: medical image translation using GANs. Comput. Med. Imaging Graph. 79 , 101684 (2020).
Choi, H. & Lee, D. S. Alzheimer’s Disease Neuroimaging Initiative. Generation of structural MR images from amyloid PET: application to MR-less quantification. J. Nucl. Med. 59 , 1111–1117 (2018).
Wei, W. et al. Learning myelin content in multiple sclerosis from multimodal MRI through adversarial training. In Medical Image Computing and Computer Assisted Intervention — MICCAI 2018 (eds. Frangi, A. F., Schnabel, J. A., Davatzikos, C., Alberola-López, C. & Fichtinger, G.) 514–522 (Springer Cham, 2018).
Pan, Y. et al. Synthesizing missing PET from MRI with cycle-consistent generative adversarial networks for Alzheimer’s disease diagnosis. In Medical Image Computing and Computer Assisted Intervention — MICCAI 2018 (eds. Frangi, A. F., Schnabel, J. A., Davatzikos, C., Alberola-López, C. & Fichtinger, G.) 455–463 (Springer Cham, 2018).
Welander, P., Karlsson, S. & Eklund, A. Generative adversarial networks for image-to-image translation on multi-contrast MR images - a comparison of CycleGAN and UNIT. Preprint at https://arxiv.org/abs/1806.07777v1 (2018).
Dar, S. U. H. et al. Image synthesis in multi-contrast MRI with conditional generative adversarial networks. IEEE Trans. Med. Imaging 38 , 2375–2388 (2019).
Zhu, J.-Y., Park, T., Isola, P. & Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV) (IEEE, 2017); https://doi.org/10.1109/iccv.2017.244
Maspero, M. et al. Dose evaluation of fast synthetic-CT generation using a generative adversarial network for general pelvis MR-only radiotherapy. Phys. Med. Biol. 63 , 185001 (2018).
Olut, S., Sahin, Y.H., Demir, U., Unal, G. Generative adversarial training for MRA image synthesis using multi-contrast MRI. In PRedictive Intelligence in MEdicine. PRIME 2018. Lecture Notes in Computer Science (eds Rekik, I., Unal, G., Adeli, E. & Park, S.) (Springer Cham, 2018); https://doi.org/10.1007/978-3-030-00320-3_18
Chen, R. J., Lu, M. Y., Chen, T. Y., Williamson, D. F. K. & Mahmood, F. Synthetic data in machine learning for medicine and healthcare. Nat. Biomed. Eng. 5 , 493–497 (2021).
Kanakasabapathy, M. K. et al. Adaptive adversarial neural networks for the analysis of lossy and domain-shifted datasets of medical images. Nat. Biomed. Eng. 5 , 571–585 (2021).
Bowles, C., Gunn, R., Hammers, A. & Rueckert, D. Modelling the progression of Alzheimer’s disease in MRI using generative adversarial networks. In Medical Imaging 2018: Image Processing (eds. Angelini, E. D. & Landman, B. A.) 397–407 (International Society for Optics and Photonics, 2018).
Ravi, D., Alexander, D.C., Oxtoby, N.P. & Alzheimer’s Disease Neuroimaging Initiative. Degenerative adversarial neuroImage nets: generating images that mimic disease progression. In Medical Image Computing and Computer Assisted Intervention — MICCAI 2019. Lecture Notes in Computer Science. (eds Shen, D. et al) 164–172 (Springer, 2019).
Borji, A. Pros and cons of GAN evaluation measures. Comput. Vis. Image Underst. 179 , 41–65 (2019).
Vincent, J. Nvidia uses AI to make it snow on streets that are always sunny. The Verge https://www.theverge.com/2017/12/5/16737260/ai-image-translation-nvidia-data-self-driving-cars (2017).
Kairouz, P. et al. Advances and open problems in federated learning. Found. Trends Mach. Learn. https://doi.org/10.1561/2200000083 (2021)
McMahan, B., Moore, E., Ramage, D., Hampson, S. & Aguera y Arcas, B. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (eds. Singh, A. & Zhu, J.) 1273–1282 (ML Research Press, 2017).
Li, X. et al. Multi-site fMRI analysis using privacy-preserving federated learning and domain adaptation: ABIDE results. Med. Image Anal. 65 , 101765 (2020).
Brisimi, T. S. et al. Federated learning of predictive models from federated Electronic Health Records. Int. J. Med. Inform. 112 , 59–67 (2018).
Lee, J. et al. Privacy-preserving patient similarity learning in a federated environment: development and analysis. JMIR Med. Inform. 6 , e20 (2018).
Dou, Q. et al. Federated deep learning for detecting COVID-19 lung abnormalities in CT: a privacy-preserving multinational validation study. npj Digit. Med. 4 , 60 (2021).
Silva, S. et al. Federated learning in distributed medical databases: meta-analysis of large-scale subcortical brain data. In 2019 IEEE 16th International Symposium on Biomedical Imaging ISBI 2019 18822077 (IEEE, 2019).
Sheller, M. J., Reina, G. A., Edwards, B., Martin, J. & Bakas, S. Multi-institutional deep learning modeling without sharing patient data: a feasibility study on brain tumor segmentation. Brainlesion 11383 , 92–104 (2019).
Sheller, M. J. et al. Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data. Sci. Rep. 10 , 12598 (2020).
Sarma, K. V. et al. Federated learning improves site performance in multicenter deep learning without data sharing. J. Am. Med. Inform. Assoc. 28 , 1259–1264 (2021).
Li, W. et al. Privacy-preserving federated brain tumour segmentation. In Machine Learning in Medical Imaging (eds. Suk, H.-I., Liu, M., Yan, P. & Lian, C.) 133–141 (Springer International Publishing, 2019).
Shokri, R., Stronati, M., Song, C. & Shmatikov, V. Membership inference attacks against machine learning models. In IEEE Symposium on Security and Privacy SP 2017 3–18 (IEEE, 2017).
Fredrikson, M., Jha, S. & Ristenpart, T. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security 1322–1333 (Association for Computing Machinery, 2015).
Zhang, C., Bengio, S., Hardt, M., Recht, B. & Vinyals, O. Understanding deep learning (still) requires rethinking generalization. Commun. ACM 64 , 107–115 (2021).
Zhu, L., Liu, Z. & Han, S. Deep leakage from gradients. In Advances in Neural Information Processing Systems 32 (eds Wallach, H. et al.) Paper 8389 (Curran, 2019)
Abadi, M. et al. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security 308–318 (Association for Computing Machinery, 2016).
Brendan McMahan, H. et al. A general approach to adding differential privacy to iterative training procedures. Preprint at https://arxiv.org/abs/1812.06210v2 (2018).
McMahan, H. B., Ramage, D., Talwar, K. & Zhang, L. Learning differentially private recurrent language models. In ICLR 2018 Sixth International Conference on Learning Representations Paper 504 (ICLR, 2018).
Shokri, R. & Shmatikov, V. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security 1310–1321 (Association for Computing Machinery, 2015).
Lyu, M., Su, D. & Li, N. Understanding the sparse vector technique for differential privacy. Proc. VLDB Endow. 10 , 637–648 (2017).
Hitaj, B., Ateniese, G. & Perez-Cruz, F. Deep models under the GAN: information leakage from collaborative deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security 603–618 (Association for Computing Machinery, 2017).
Li, X., Huang, K., Yang, W., Wang, S. & Zhang, Z. On the convergence of FedAvg on non-IID data. In ICLR 2020 Eighth International Conference on Learning Representations Paper 261 (ICLR, 2020).
Smith, V., Chiang, C.-K., Sanjabi, M. & Talwalkar, A. S. Federated multi-task learning. In Advances in Neural Information Processing Systems 30 (eds Guyon, I. et al.) Paper 2307 (Curran, 2017).
Xu, J. et al. Federated learning for healthcare informatics. J. Healthc. Inform. Res. 5 , 1–19 (2021).
Huang, L. et al. LoAdaBoost: loss-based AdaBoost federated machine learning with reduced computational complexity on IID and non-IID intensive care data. PLoS ONE 15 , e0230706 (2020).
Zhao, Y. et al. Federated learning with non-IID data. Preprint at https://arxiv.org/abs/1806.00582v1 (2018).
Torres-Soto, J. & Ashley, E. A. Multi-task deep learning for cardiac rhythm detection in wearable devices. npj Digit. Med. 3 , 116 (2020).
Turakhia, M. P. et al. Rationale and design of a large-scale, app-based study to identify cardiac arrhythmias using a smartwatch: The Apple Heart Study. Am. Heart J. 207 , 66–75 (2019).
Synced. Apple reveals design of its on-device ML system for federated evaluation and tuning SyncedReview https://syncedreview.com/2021/02/19/apple-reveals-design-of-its-on-device-ml-system-for-federated-evaluation-and-tuning (2021).
McMahan, B. & Ramage, D. Federated learning: collaborative machine learning without centralized training data Google AI Blog https://ai.googleblog.com/2017/04/federated-learning-collaborative.html (2017).
Chen, Y., Qin, X., Wang, J., Yu, C. & Gao, W. FedHealth: a federated transfer learning framework for wearable healthcare. IEEE Intell. Syst. 35 , 83–93 (2020).
Ramage, D. & Mazzocchi, S. Federated analytics: collaborative data science without data collection Google AI Blog https://ai.googleblog.com/2020/05/federated-analytics-collaborative-data.html (2020).
Augenstein, S. et al. Generative models for effective ML on private, decentralized datasets. In ICLR 2020 Eighth International Conference on Learning Representations Paper 1448 (ICLR, 2020).
Pati, S. et al. The federated tumor segmentation (FeTS) challenge. Preprint at https://arxiv.org/abs/2105.05874v2 (2021).
Flores, M. Medical institutions collaborate to improve mammogram assessment AI with Nvidia Clara federated learning The AI Podcast https://blogs.nvidia.com/blog/2020/04/15/federated-learning-mammogram-assessment/ (2020).
Kannan, A., Chen, K., Jaunzeikare, D. & Rajkomar, A. Semi-supervised learning for information extraction from dialogue. In Proc. Interspeech 2018 2077–2081 (ISCA, 2018); https://doi.org/10.21437/interspeech.2018-1318
Chiu, C.-C. et al. Speech recognition for medical conversations. Preprint at https://arxiv.org/abs/1711.07274v2 ; https://doi.org/10.1093/jamia/ocx073 (2017).
Si, Y., Wang, J., Xu, H. & Roberts, K. Enhancing clinical concept extraction with contextual embeddings. J. Am. Med. Inform. Assoc. 26 , 1297–1304 (2019).
Shin, H.-C. et al. Learning to read chest X-rays: recurrent neural cascade model for automated image annotation. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2016); https://doi.org/10.1109/cvpr.2016.274
Wang, X., Peng, Y., Lu, L., Lu, Z. & Summers, R. M. TieNet: text-image embedding network for common thorax disease classification and reporting in chest X-rays. In IEEE/CVF Conference on Computer Vision and Pattern Recognition 2018 (IEEE, 2018); https://doi.org/10.1109/cvpr.2018.00943
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9 , 1735–1780 (1997).
Cho, K. et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (eds Moschitti, A., Pang, B. & Daelemans, W.) 1724–1734 (Association for Computational Linguistics, 2014).
Lipton, Z. C., Kale, D. C., Elkan, C. & Wetzel, R. Learning to diagnose with LSTM recurrent neural networks. Preprint at https://arxiv.org/abs/1511.03677v7 (2015).
Choi, E., Bahadori, M. T., Schuetz, A., Stewart, W. F. & Sun, J. Doctor AI: predicting clinical events via recurrent neural networks. JMLR Workshop Conf. Proc. 56 , 301–318 (2016).
Zhu, Paschalidis & Tahmasebi. Clinical concept extraction with contextual word embedding. Preprint at https://doi.org/10.48550/arXiv.1810.10566 (2018).
Cho, K., van Merriënboer, B., Bahdanau, D. & Bengio, Y. On the properties of neural machine translation: encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (eds Wu, D., Carpuat, M., Carreras, X. & Vecchi, E. M.) 103–111 (Association for Computational Linguistics, 2014).
Gehring, J., Auli, M., Grangier, D., Yarats, D. & Dauphin, Y. N. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning (eds Precup, D. & Teh, Y. W.) 1243–1252 (PMLR, 2017).
Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems 30 (eds Guyon, I. et al.) Paper 3058 (Curran, 2017).
Bahdanau, D., Cho, K. H. & Bengio, Y. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations ICLR 2015 (ICLR, 2015).
Luong, T., Pham, H. & Manning, C. D. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (eds Màrquez, L., Callison-Burch, C. & Su, J.) 1412–1421 (Association for Computational Linguistics, 2015); https://doi.org/10.18653/v1/d15-1166
Sutskever, I., Vinyals, O. & Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27 (eds Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. & Weinberger, K. Q.) Paper 1610 (Curran, 2014).
Krizhevsky, A., Sutskever, I. & Hinton, G. E. in Advances in Neural Information Processing Systems 25 (eds Bartlett, P. et al.) 1097–1105 (Curran, 2012).
Kiani, A. et al. Impact of a deep learning assistant on the histopathologic classification of liver cancer. npj Digit. Med. 3 , 23 (2020).
Park, S.-M. et al. A mountable toilet system for personalized health monitoring via the analysis of excreta. Nat. Biomed. Eng. 4 , 624–635 (2020).
Howard, J. & Ruder, S. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (eds Gurevych, I. & Miyao, Y.) 328–339 (Association for Computational Linguistics, 2018).
Peters, M. E. et al. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Walker, M., Ji, H. & Stent, A.) 2227–2237 (Association for Computational Linguistics, 2018).
Brown, T. et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33 (eds. Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F. & Lin, H.) 1877–1901 (Curran, 2020).
Kenton, J. D. M.-W. C. & Toutanova, L. K. BERT: pre-training of deep bidirectional transformers for language understanding. in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Burstein, J., Doran, C. & Solorio, T.) 4171–4186 (Association for Computational Linguistics, 2019).
Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient estimation of word representations in vector space. Preprint at https://arxiv.org/abs/1301.3781v3 (2013).
Pennington, J., Socher, R. & Manning, C. GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (eds Moschitti, A., Pang, B., Daelemans, W.) 1532–1543 (Association for Computational Linguistics, 2014).
Alsentzer, E. et al. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop (eds Rumshisky, A., Roberts, K., Bethard, S. & Naumann, T.) 72–78 (Association for Computational Linguistics, 2019).
Huang, K., Altosaar, J. & Ranganath, R. ClinicalBERT: modeling clinical notes and predicting hospital readmission. Preprint at https://arxiv.org/abs/1904.05342v3 (2019).
Peng, Y., Yan, S. & Lu, Z. Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. In Proceedings of the 18th BioNLP Workshop and Shared Task (eds Demner-Fushman, D., Bretonnel Cohen, K., Ananiadou, S. & Tsujii, J.) 58–65 (Association for Computational Linguistics, 2019).
Johnson, A. E. W. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3 , 160035 (2016).
Sharir, O., Peleg, B. & Shoham, Y. The cost of training NLP models: a concise overview. Preprint at https://arxiv.org/abs/2004.08900v1 (2020).
Lee, J. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36 , 1234–1240 (2020).
CAS Google Scholar
Beltagy, I., Lo, K. & Cohan, A. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (eds Inui, K., Jiang, J., Ng, V. & Wan, X.) 3615–3620 (Association for Computational Linguistics, 2019).
Futoma, J., Morris, J. & Lucas, J. A comparison of models for predicting early hospital readmissions. J. Biomed. Inform. 56 , 229–238 (2015).
Caruana, R. et al. Intelligible models for healthcare: predicting pneumonia risk and hospital 30-day readmission. In Proc. 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 1721–1730 (Association for Computing Machinery, 2015).
Wagner, T. et al. Augmented curation of clinical notes from a massive EHR system reveals symptoms of impending COVID-19 diagnosis. Elife 9 , e58227 (2020).
Eisman, A. S. et al. Extracting angina symptoms from clinical notes using pre-trained transformer architectures. AMIA Annu. Symp. Proc. 2020 , 412–421 (American Medical Informatics Association, 2020).
Smit, A. et al. Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (eds Webber, B., Cohn, T., He, Y. & Liu, Y.) 1500–1519 (Association for Computational Linguistics, 2020).
Soni, S. & Roberts, K. Evaluation of dataset selection for pre-training and fine-tuning transformer language models for clinical question answering. In Proc. 12th Language Resources and Evaluation Conference 5532–5538 (European Language Resources Association, 2020).
Sezgin, E., Huang, Y., Ramtekkar, U. & Lin, S. Readiness for voice assistants to support healthcare delivery during a health crisis and pandemic. npj Digit. Med. 3 , 122 (2020).
Sakthive, V., Kesaven, M. P. V., William, J. M. & Kumar, S. K. M. Integrated platform and response system for healthcare using Alexa. Int. J. Commun. Computer Technol. 7 , 14–22 (2019).
Comstock, J. Buoy Health, CVS MinuteClinic partner to send patients from chatbot to care. mobihealthnews https://www.mobihealthnews.com/content/buoy-health-cvs-minuteclinic-partner-send-patients-chatbot-care (2018).
Razzaki, S. et al. A comparative study of artificial intelligence and human doctors for the purpose of triage and diagnosis. Preprint at https://doi.org/10.48550/arXiv.1806.10698 (2018).
Xiong, Y., Du, B. & Yan, P. Reinforced transformer for medical image captioning. In Machine Learning in Medical Imaging (eds. Suk, H.-I., Liu, M., Yan, P. & Lian, C.) 673–680 (Springer International Publishing, 2019).
Meng, Y., Speier, W., Ong, M. K. & Arnold, C. W. Bidirectional representation learning from transformers using multimodal electronic health record data to predict depression. IEEE J. Biomed. Health Inform. 25 , 3121–3129 (2021).
Choi, E. et al. Learning the graphical structure of electronic health records with graph convolutional transformer. Proc. Conf. AAAI Artif. Intell. 34 , 606–613 (2020).
Li, F. et al. Fine-tuning bidirectional encoder representations from transformers (BERT)–based models on large-scale electronic health record notes: an empirical study. JMIR Med. Inform. 7 , e14830 (2019).
Rasmy, L., Xiang, Y., Xie, Z., Tao, C. & Zhi, D. Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. npj Digital Medicine 4 , 86 (2021).
Shang, J., Ma, T., Xiao, C. & Sun, J. Pre-training of graph augmented transformers for medication recommendation. in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (ed. Kraus, S.) 5953–5959 (International Joint Conferences on Artificial Intelligence Organization, 2019); https://doi.org/10.24963/ijcai.2019/825
Li, Y. et al. BEHRT: transformer for electronic health records. Sci. Rep. 10 , 7155 (2020).
Rao, S. et al. BEHRT-HF: an interpretable transformer-based, deep learning model for prediction of incident heart failure. Eur. Heart J. 41 (Suppl. 2), ehaa946.3553 (2020).
Qian, X. et al. Prospective assessment of breast cancer risk from multimodal multiview ultrasound images via clinically applicable deep learning. Nat. Biomed. Eng. 5 , 522–532 (2021).
Xing, L., Giger, M. L. & Min, J. K. Artificial Intelligence in Medicine: Technical Basis and Clinical Applications (Academic Press, 2020).
Reisman, M. EHRs: the challenge of making electronic data usable and interoperable. P. T. 42 , 572–575 (2017).
Cortés, R., Bonnaire, X., Marin, O. & Sens, P. Stream processing of healthcare sensor data: studying user traces to identify challenges from a big data perspective. Procedia Comput. Sci. 52 , 1004–1009 (2015).
Zhang, F., Cao, J., Khan, S. U., Li, K. & Hwang, K. A task-level adaptive MapReduce framework for real-time streaming data in healthcare applications. Future Gener. Comput. Syst. 43–44 , 149–160 (2015).
El Aboudi, N. & Benhlima, L. Big data management for healthcare systems: architecture, requirements, and implementation. Adv. Bioinformatics 2018 , 4059018 (2018).
Ta, V.-D., Liu, C.-M. & Nkabinde, G. W. Big data stream computing in healthcare real-time analytics. In IEEE International Conference on Cloud Computing and Big Data Analysis (ICCCBDA) 37–42 (ieeexplore.ieee.org, 2016).
Data-Driven Healthcare Organizations Use Big Data Analytics for Big Gains White Paper (IBM Software, 2017); https://silo.tips/download/ibm-software-white-paper-data-driven-healthcare-organizations-use-big-data-analy
Futoma, J., Simons, M., Panch, T., Doshi-Velez, F. & Celi, L. A. The myth of generalisability in clinical research and machine learning in health care. Lancet Digit. Health 2 , e489–e492 (2020).
Wang, X. et al. Inconsistent performance of deep learning models on mammogram classification. J. Am. Coll. Radiol. 17 , 796–803 (2020).
Nestor, B., McDermott, M. B. A. & Boag, W. Feature robustness in non-stationary health records: caveats to deployable model performance in common clinical machine learning tasks. Preprint at https://doi.org/10.48550/arXiv.1908.00690 (2019).
Wu, E. et al. How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals. Nat. Med . https://doi.org/10.1038/s41591-021-01312-x (2021).
Barish, M., Bolourani, S., Lau, L. F., Shah, S. & Zanos, T. P. External validation demonstrates limited clinical utility of the interpretable mortality prediction model for patients with COVID-19. Nat. Mach. Intell. 3 , 25–27 (2020).
Davis, S. E., Lasko, T. A., Chen, G., Siew, E. D. & Matheny, M. E. Calibration drift in regression and machine learning models for acute kidney injury. J. Am. Med. Inform. Assoc. 24 , 1052–1061 (2017).
Wang, G. et al. A deep-learning pipeline for the diagnosis and discrimination of viral, non-viral and COVID-19 pneumonia from chest X-ray images. Nat. Biomed. Eng. 5 , 509–521 (2021).
Ning, W. et al. Open resource of clinical data from patients with pneumonia for the prediction of COVID-19 outcomes via deep learning. Nat. Biomed. Eng. 4 , 1197–1207 (2020).
Koenecke, A. et al. Racial disparities in automated speech recognition. Proc. Natl Acad. Sci. USA 117 , 7684–7689 (2020).
Abid, A., Farooqi, M. & Zou, J. Large language models associate muslims with violence. Nat. Mach. Intell. 3 , 461–463 (2021).
Obermeyer, Z., Powers, B., Vogeli, C. & Mullainathan, S. Dissecting racial bias in an algorithm used to manage the health of populations. Science 366 , 447–453 (2019).
Adamson, A. S. & Smith, A. Machine learning and health care disparities in dermatology. JAMA Dermatol . 154 , 1247–1248 (2018).
Han, S. S. et al. Classification of the clinical images for benign and malignant cutaneous tumors using a deep learning algorithm. J. Invest. Dermatol. 138 , 1529–1538 (2018).
Subbaswamy, A., Adams, R. & Saria, S. Evaluating model robustness and stability to dataset shift. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics (eds. Banerjee, A. & Fukumizu, K.) 2611–2619 (PMLR, 2021).
Izzo, Z., Ying, L. & Zou, J. How to learn when data reacts to your model: performative gradient descent. In Proceedings of the 38th International Conference on Machine Learning (eds. Meila, M. & Zhang, T.) 4641–4650 (PMLR, 2021).
Ghorbani, A., Kim, M. & Zou, J. A Distributional framework for data valuation. In Proceedings of the 37th International Conference on Machine Learning (eds. Iii, H. D. & Singh, A.) 3535–3544 (PMLR, 2020).
Zhang, L., Deng, Z., Kawaguchi, K., Ghorbani, A. & Zou, J. How does mixup help with robustness and generalization? In International Conference on Learning Representations 2021 Paper 2273 (ICLR, 2021).
Schulam, P. & Saria, S. Can you trust this prediction? Auditing pointwise reliability after learning. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics (eds. Chaudhuri, K. & Sugiyama, M.) 1022–1031 (PMLR, 2019).
Liu, X. et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat. Med. 26 , 1364–1374 (2020).
Cruz Rivera, S. et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Nat. Med. 26 , 1351–1363 (2020).
Nayak, P. Understanding searches better than ever before. Google The Keyword https://blog.google/products/search/search-language-understanding-bert/ (2019).
Baur, C., Albarqouni, S. & Navab, N. in OR 2.0 Context-Aware Operating Theaters, Computer Assisted Robotic Endoscopy, Clinical Image-Based Procedures, and Skin Image Analysis (eds Stoyanov, D. et al.) 260–267 (Springer International Publishing, 2018).
Kang, E., Koo, H. J., Yang, D. H., Seo, J. B. & Ye, J. C. Cycle-consistent adversarial denoising network for multiphase coronary CT angiography. Med. Phys. 46 , 550–562 (2019).
Vig, J. A multiscale visualization of attention in the transformer model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations (Costa-jussà, M. R. & Alfonseca, E.) 37–42 (Association for Computational Linguistics, 2019).
Acknowledgements
This work was supported in part by the National Institutes of Health via grants F30HL156478 (to A.Z.), R01CA227713 (to L.X.), R01CA256890 (to L.X.), P30AG059307 (to J.Z.), U01MH098953 (to J.Z.), P01HL141084 (to J.C.W.), R01HL163680 (to J.C.W.), R01HL130020 (to J.C.W.), R01HL146690 (to J.C.W.) and R01HL126527 (to J.C.W.); by the National Science Foundation grant CAREER1942926 (to J.Z.); and by the American Heart Association grant 17MERIT3361009 (to J.C.W.). Figures were created with BioRender.com.
Author information
Authors and Affiliations
Stanford Cardiovascular Institute, School of Medicine, Stanford University, Stanford, CA, USA
Angela Zhang & Joseph C. Wu
Department of Genetics, School of Medicine, Stanford University, Stanford, CA, USA
Angela Zhang
Greenstone Biosciences, Palo Alto, CA, USA
Department of Computer Science, Stanford University, Stanford, CA, USA
Angela Zhang & James Zou
Department of Radiation Oncology, School of Medicine, Stanford University, Stanford, CA, USA
Department of Biomedical Informatics, School of Medicine, Stanford University, Stanford, CA, USA
Departments of Medicine, Division of Cardiovascular Medicine Stanford University, Stanford, CA, USA
Joseph C. Wu
Department of Radiology, School of Medicine, Stanford University, Stanford, CA, USA
Contributions
A.Z. and J.C.W. drafted the manuscript. All authors contributed to the conceptualization and editing of the manuscript.
Corresponding authors
Correspondence to Angela Zhang or Joseph C. Wu .
Ethics declarations
Competing interests
J.C.W. is a co-founder and scientific advisory board member of Greenstone Biosciences. The other authors declare no competing interests.
Peer review
Peer review information
Nature Biomedical Engineering thanks Pearse Keane, Faisal Mahmood and Hadi Shafiee for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Zhang, A., Xing, L., Zou, J. et al. Shifting machine learning for healthcare from development to deployment and from models to data. Nat. Biomed. Eng 6 , 1330–1345 (2022). https://doi.org/10.1038/s41551-022-00898-y
Received : 24 January 2021
Accepted : 03 May 2022
Published : 04 July 2022
Issue Date : December 2022
DOI : https://doi.org/10.1038/s41551-022-00898-y
This article is cited by
Dreamer: a computational framework to evaluate readiness of datasets for machine learning
- Meysam Ahangaran
- Vijaya B. Kolachalama
BMC Medical Informatics and Decision Making (2024)
Predicting 3-month poor functional outcomes of acute ischemic stroke in young patients using machine learning
- Lamia Mbarek
- Siding Chen
- Yongjun Wang
European Journal of Medical Research (2024)
Leverage machine learning to identify key measures in hospital operations management: a retrospective study to explore feasibility and performance of four common algorithms
- Wantao Zhang
- Huajun Zhang
A vision–language foundation model for the generation of realistic chest X-ray images
- Christian Bluethgen
- Pierre Chambon
- Akshay S. Chaudhari
Nature Biomedical Engineering (2024)
The limits of fair medical imaging AI in real-world generalization
- Haoran Zhang
- Marzyeh Ghassemi
Nature Medicine (2024)