Replicability & Reproducibility in ML

Key Idea: What is Replicability & Reproducibility in ML?

Actionable advances across scientific disciplines, including ML, are built on trust, which requires sharing enough information about how a study was carried out for others to meaningfully replicate or reproduce it. In principle, this process facilitates self-correction across disciplines and weeds out literature that doesn't hold up to due scrutiny... but this doesn't always happen.

Watch: The Replication Crisis

Crash Course Statistics, Episode #31

Cite as: Brungard B. et al. Crash Course Statistics, Episode #31. Vol 31. Complexly; 2018.

Is there a reproducibility crisis in science?

Cite as: Is there a reproducibility crisis in science? Nature; 2016.

Based on the work of Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016).

Watch this episode of Crash Course Statistics, hosted by Adriene Hill, and the accompanying video offered by Nature.

In this episode of Crash Course Statistics, Adriene Hill discusses the Replication Crisis across scientific disciplines. While the video refers to general study design replicability and reproducibility of statistical analysis, these concepts are largely applicable to machine learning contexts.

A few key points:

  • Study Replication: Re-running studies to confirm results

  • Reproducible Analyses: Analyses described in enough detail that others can repeat your methods on the same or a similar dataset.

  • In 2016, a vast majority of researchers surveyed by Nature indicated that they believed the sciences were experiencing a reproducibility crisis: an inability to reproduce findings from a large number of published studies across disciplines.

  • Contributing factors to the replicability crisis are thought to include:

        • Publication bias, driven by reward-based structures in academic and industrial institutions.

        • Data Dredging (or "p-hacking") and lack of clarity around the role of statistical significance in context

              • "Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold ... a p-value, or statistical significance, does not measure the size of an effect or the importance of a result." - American Statistical Association

        • Lack of clear disclosures around data analysis and preprocessing methods.

              • While good papers contain detailed and explicit descriptions of their methods (i.e. disclosures), this is not always the standard across journals or across industry settings.

  • Ideas for mitigating the crisis include:

        • More study replication, enabled by detailed methods disclosures and more funding for replication efforts

        • More publication of quality research that yields null results, to make p-hacking less enticing

        • Promotion of data availability practices to emphasize reproducibility and transparency
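The risk posed by data dredging can be made concrete with a minimal simulation (a sketch assuming NumPy and SciPy are available): if you run enough hypothesis tests on pure noise, roughly 5% will come out "significant" at the conventional p < 0.05 threshold, which is why a lone p-value below the threshold proves little on its own.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Run 1,000 "experiments" comparing two groups drawn from the SAME
# distribution: any "significant" difference is a false positive.
n_tests = 1000
false_positives = 0
for _ in range(n_tests):
    a = rng.normal(size=30)
    b = rng.normal(size=30)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

# With a 0.05 threshold, roughly 5% of these null comparisons "succeed".
print(f"{false_positives} of {n_tests} null tests were 'significant'")
```

A researcher who runs many such comparisons and reports only the "significant" ones is p-hacking, whether or not they intend to.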

Read: AI/ML Is Struggling with a Replication Crisis

Screenshot from Heaven WD. AI is wrestling with a replication crisis. MIT Technology Review. Published online November 12, 2020.

Review this MIT Technology Review piece, written by Will Douglas Heaven.

In this piece, Will Douglas Heaven outlines the factors that contribute to the inability to replicate or reproduce findings in the AI/ML literature, including the lack of standardized data- and code-sharing practices as well as the competing interests of academic research and industry product showcasing.

A few key points:

  • Across the field of ML, concerns have been voiced that the ambition of novel experiments and model development supersedes a dedication to sound, reproducible methodology.

  • Lack of transparency and disclosures about model development methodology prevents new ML models and techniques from being adequately assessed for robustness, bias, and safety.

  • Because the application of ML moves quickly from controlled environments to real-world scenarios that directly impact people’s lives, detailed disclosure practices that enable replication by different researchers across settings can expose errors and oversights sooner, making deployed ML models better for everyone.

  • Known barriers to replication, and strategies to address them:

        • Lack of access to code

              • Code sharing practices, like those instituted by Papers with Code, are becoming more commonplace in the ML literature, enabling greater opportunity for replication. However, sharing code alone is not enough to rerun an experiment: without metadata and disclosures describing how models were trained and tuned, shared code can be useless.

        • Lack of access to data

              • Data used in influential models are often proprietary (e.g., Facebook user data) or subject to regulatory protections (e.g., protected health information). If researchers and ML practitioners are unable to share their data, it may instead be possible to provide directions so that others can build similar datasets, or to give a few independent auditors access to the data to verify results.

        • Lack of access to hardware

              • While the majority of ML research runs on computers available to the average lab, wealthy private tech companies can carry out further research on large and expensive collections of computers that few academic or public institutions have the resources to access. This is a problem for replication; however, some point to the value of delayed replication: ML methodologies that are initially too computationally expensive for other researchers to replicate are often made more efficient (and thus more accessible for replication) as they are developed.

        • Other strategies:

              • Implement checklists of required materials for ML conference paper submissions, including code and detailed descriptions of experiments.

                    • When a checklist was introduced to the submission process for NeurIPS (a prominent conference on Neural Information Processing Systems) in 2019, the number of submissions including code jumped from less than 50% to around 75%.

              • Establishment of ML reproducibility challenges

              • Incentivizing replication efforts in academic and industry settings

              • Building replication work and disclosure documentation training into ML coursework
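The point that code must travel with metadata can be sketched concretely. The snippet below is an illustrative sketch only (the model name, hyperparameter values, and output file name are hypothetical): it records the hyperparameters, random seed, and environment details of a run alongside the code, so that someone rerunning the shared code has the disclosures needed to reproduce it.

```python
import json
import platform
import random
import sys

# Hypothetical training configuration -- in a real project these would be
# the actual values used to produce the reported results.
config = {
    "model": "logistic_regression",  # illustrative name, not a real study
    "random_seed": 42,
    "learning_rate": 0.01,
    "train_test_split": 0.8,
}

# Record the environment alongside the hyperparameters, since results
# can differ across Python and library versions.
disclosure = {
    "config": config,
    "python_version": sys.version.split()[0],
    "platform": platform.platform(),
}

random.seed(config["random_seed"])  # seed every source of randomness you use

# Ship this file with the code and the paper.
with open("run_disclosure.json", "w") as f:
    json.dump(disclosure, f, indent=2)

print(json.dumps(disclosure["config"], indent=2))
```

Even a small file like this answers the questions a replicator asks first: what seed, what hyperparameters, and what environment produced the reported numbers.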

Cite as: Heaven WD. AI is wrestling with a replication crisis. MIT Technology Review. Published online November 12, 2020.

Explore: ML Reproducibility Challenge

Screenshot from ML Reproducibility Challenge 2021

To address identified replicability and reproducibility problems in the field of ML, several entities, journals, and conferences have established "reproducibility challenges" which encourage ML researchers and practitioners to investigate the reproducibility of models and papers.

Take a look at the webpage of one such reproducibility challenge from Papers with Code which challenged members of ML research community to investigate the reproducibility of papers accepted at top ML conferences in 2021, including NeurIPS, ICML, ICLR, ACL-IJCNLP, EMNLP, CVPR, ICCV, AAAI and IJCAI.

Explore: Documenting Reproducibility of Applied ML Research

Screenshot from Irreproducibility in Machine Learning

Sayash Kapoor and Arvind Narayanan, at the Center for Information Technology Policy at Princeton University, are seeking to document the reproducibility of applied ML research via systematic reviews and in-depth code reviews of the ML literature.

Explore their running list of discovered ML reproducibility failures and the accompanying reasons as to why each model was not reproducible.

Read: Challenges to the Reproducibility of Machine Learning Models in Healthcare

Screenshot from Beam et al., 2020

Review this paper, written by Beam et al.

In this paper, Andrew Beam and colleagues discuss the challenges and prospects for increasing ML reproducibility in healthcare contexts.

A few key points:

  • Among other challenges to reproducibility and replication of ML models, restricted access to underlying data and code is especially relevant in healthcare contexts as privacy barriers are important considerations for data sharing.

  • Replication is especially important for ML studies in healthcare that use observational data as these data are often biased. Without due replication and scrutiny of disclosures, ML models could operationalize these biases.

  • ML models should be reproduced, and ideally replicated, before being deployed in clinical settings.

Cite as: Beam AL, Manrai AK, Ghassemi M. Challenges to the Reproducibility of Machine Learning Models in Health Care. JAMA. 2020 Jan 28;323(4):305-306. doi: 10.1001/jama.2019.20866. PMID: 31904799; PMCID: PMC7335677.

Tying it back: The Role of Disclosure in Replicability & Reproducibility

Disclosing dataset characteristics, along with the methods used in dataset preprocessing and model development, enables meaningful model replication and reproducibility efforts. In the absence of these disclosures, ML practitioners risk contributing to an ongoing reproducibility and replication crisis in the ML literature and solution space (which is especially consequential in public health and healthcare contexts).

Exercise: Disclose Dataset Merging & Subsetting Methods

In this exercise, we are going to merge and subset pre-pandemic NHANES datasets from 2017-2020 and report our methods for replication by others.


  1. Make a copy of this Google Colaboratory notebook in your Google Drive account (also found on GitHub)

  2. Follow the annotated directions to merge and preprocess two datasets, then disclose our methods for reproduction by others.

NOTE: This exercise requires the ability to download large XLSX files to a local device.
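As a preview of the disclosure habit this exercise builds, here is a minimal pandas sketch. The DataFrames are synthetic stand-ins, not real NHANES data; real NHANES component files are keyed by the respondent ID column SEQN, and the variable names below (RIDAGEYR for age, BPXSY1 for the first systolic blood pressure reading) follow NHANES conventions. The sketch merges two tables, subsets the result, and prints the details others would need to reproduce it.

```python
import pandas as pd

# Synthetic stand-ins for two NHANES component files. Real NHANES tables
# share the respondent ID column SEQN, which serves as the merge key.
demographics = pd.DataFrame({
    "SEQN": [1, 2, 3, 4],
    "RIDAGEYR": [25, 67, 41, 15],   # age in years
})
blood_pressure = pd.DataFrame({
    "SEQN": [1, 2, 4],
    "BPXSY1": [118, 142, 110],      # first systolic reading (mm Hg)
})

# Disclose the join explicitly: key, join type, and resulting row counts
# all matter to anyone trying to reproduce the analysis.
merged = demographics.merge(blood_pressure, on="SEQN", how="inner")

# Disclose the subset rule explicitly (here: adults only).
adults = merged[merged["RIDAGEYR"] >= 18]

print(f"Merged on SEQN (inner join): {len(merged)} rows")
print(f"After subsetting to RIDAGEYR >= 18: {len(adults)} rows")
```

Note that the inner join silently drops respondents missing from either table (SEQN 3 above), which is exactly the kind of decision a methods disclosure should state outright.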

Share: #ReproducibleML

Thought prompt: Think about the role disclosure plays in ML study replication and reproduction. Do you think that having standardized approaches to model and dataset disclosures in ML would help make reproduction and replication more feasible?

Share your thoughts on Twitter using the hashtags #MDSD4Health #ReproducibleML #ReplicableML

Tag us to join the conversation! @MDSD4Health

For ideas on how to take part in the conversation, check out our Twitter Participation Guide.

Bonus Material!

What's the Difference Between Replicability and Reproducibility?


Studies are considered replicable if an independent party can reach the same conclusion after performing the same set of experiments or analyses on new data.


Studies are considered reproducible if an independent party with access to the same data and code can obtain the same result.

Read more about the history of "replicability" and "reproducibility" as commonly-confused terminology in the sciences here!