ML Generalization

Key Idea: What is Generalization in ML?

Model generalization refers to an ML model's ability to perform well on new data drawn from the same probability distribution as the data used to create the model. If our model does not generalize, for example because it is overfit to our training data, we risk poor performance and increased error when the model is deployed in a real-world setting with real-world consequences.

Explore: Assumptions of ML Generalization


Cite as: Sculley, D. et al. Machine Learning Crash Course: Generalization. Google Developers; 2020.

Figure 1 from Sculley, D. et al. Machine Learning Crash Course: Generalization. Google Developers; 2020.

Review this mini lecture offered by D. Sculley from Google's Machine Learning Crash Course.

In this mini lecture, D. Sculley walks us through the concept of generalization in ML.

A few key points:

  • Test set methodology: splitting a dataset into training and testing sets

      • Training set: a subset of our dataset used to train a model.

      • Test set: a subset of our dataset used to test a trained model.

      • Good performance of a trained model on a test set is a common indicator used to gauge whether the model will be generalizable to new data from the same distribution.

  • Assumptions of ML Generalizability

  1. I.I.D. assumption: Our samples are independent and identically distributed (i.i.d.). Each one is drawn at random from the same distribution, and no draw influences another.

  2. Stationarity assumption: The distribution is stationary and does not change over time.

  3. Assumption of same distribution: We always draw from the same distribution, including for the training, validation, and test sets.

However, these critical assumptions are sometimes violated in practice:

  • An advertisement selection model that chooses which advertisements to display based on the advertisements a user has previously seen (violates the i.i.d. assumption).

  • A model trained on annual retail sales data in which users' purchase behaviors change seasonally (violates the stationarity assumption).
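
To make the test set methodology and the stationarity assumption more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration and does not come from the lecture: the synthetic Gaussian data, the helper names make_data, fit_threshold, and accuracy, the 80/20 split, and the size of the shift. It fits a one-parameter threshold model on a training set, scores it on a held-out test set from the same distribution, and then scores it again on data whose distribution has drifted.

```python
import random

def make_data(n, rng, shift=0.0):
    """Draw n i.i.d. (x, y) pairs; `shift` moves the whole feature distribution."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        # Class 0 is centered at 0 + shift, class 1 at 2 + shift.
        x = rng.gauss(2.0 * y + shift, 1.0)
        data.append((x, y))
    return data

def fit_threshold(train):
    """Fit a one-parameter model: the midpoint between the two class means."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2.0

def accuracy(threshold, data):
    """Fraction of examples for which 'x > threshold' matches the label."""
    return sum((x > threshold) == (y == 1) for x, y in data) / len(data)

rng = random.Random(0)

# Test set methodology: shuffle, then hold out 20% that training never sees.
dataset = make_data(5000, rng)
rng.shuffle(dataset)
split = int(0.8 * len(dataset))
train, test = dataset[:split], dataset[split:]

threshold = fit_threshold(train)
test_acc = accuracy(threshold, test)

# Stationarity violated: later data come from a drifted distribution.
shifted = make_data(1000, rng, shift=1.5)
shifted_acc = accuracy(threshold, shifted)

print(f"held-out test accuracy: {test_acc:.3f}")
print(f"accuracy after drift:   {shifted_acc:.3f}")
```

Under these assumptions, the held-out score is a reasonable estimate only for data that still come from the original distribution; the score on the drifted data is typically noticeably lower, even though the model itself has not changed.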

Explore: ML Generalization & Overfitting

Figure 3 from Generalization: Peril of Overfitting. Google Developers; 2020.

Review this webpage from Google's Machine Learning Crash Course.

This webpage discusses the perils of overfitting models to training datasets.

A few key points:

  • An ML model's goal is to predict well on new, previously unseen data drawn from the same probability distribution. However, because it's impossible to access an entire distribution (the "whole truth"), we can only train the model using a sample of that distribution. (This should sound a lot like the considerations of representative sampling of populations in statistics!)

  • A fundamental tension exists in ML between (1) fitting our data well and (2) fitting the data as simply and generally as possible.

  • The principle of Ockham's razor for simplicity is applied to ML as follows: "The less complex an ML model, the more likely that a good empirical result is not just due to the peculiarities of the sample."

  • Generalization bounds: statistical descriptions of a model's ability to generalize to new data based on factors including the complexity of the model and the model's performance on training data.
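
The tension between fitting the data well and fitting it simply can be sketched in a few lines of Python. This is an illustration under stated assumptions, not an example from the webpage: the synthetic data, the 20% label noise, and the helper names (make_noisy_data, fit_threshold, nearest_label) are all made up for this sketch. It compares a simple model (a single cut point) with a much more flexible one (a 1-nearest-neighbor rule that memorizes every training label).

```python
import random

def make_noisy_data(n, rng, noise=0.2):
    """x is uniform on [0, 1]; the true label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = x > 0.5
        if rng.random() < noise:
            y = not y
        data.append((x, y))
    return data

def fit_threshold(train):
    """Simple model: pick the single cut point with the fewest training errors."""
    candidates = [i / 100 for i in range(101)]
    return min(candidates, key=lambda t: sum((x > t) != y for x, y in train))

def nearest_label(train, x):
    """Flexible model: return the label of the closest training point (1-nearest neighbor)."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

rng = random.Random(0)
train = make_noisy_data(400, rng)
test = make_noisy_data(2000, rng)

threshold = fit_threshold(train)

def simple_model(x):
    return x > threshold

def memorizer(x):
    return nearest_label(train, x)

simple_train = accuracy(simple_model, train)
simple_test = accuracy(simple_model, test)
memo_train = accuracy(memorizer, train)
memo_test = accuracy(memorizer, test)

print(f"simple model: train={simple_train:.3f}  test={simple_test:.3f}")
print(f"memorizer:    train={memo_train:.3f}  test={memo_test:.3f}")
```

In this setup the memorizer fits the training sample perfectly, noise included, yet it does worse than the single cut point on unseen data. Its good empirical result on the training set is largely due to the peculiarities of the sample.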

Read: Promise and Perils of Big Data and AI/ML in Clinical Medicine and Biomedical Research

Figure 2 from Rodriguez et al., 2018

Review this paper written by Rodriguez et al.

This paper, written by Fatima Rodriguez and colleagues, outlines the opportunities for meaningful ML usage in healthcare based on lessons learned from other industries. One key point made by the authors is that test set methodologies can be abused, either by "cherry-picking" testing data on which a model performs well or by testing several models on the testing data and reporting only the best performance. Both practices pose a threat to model generalization when the model is applied in real-world clinical settings.

A few key points:

  • Just as basic statistical literacy has become essential in the pursuit of research and the interpretation of clinical trial results, a basic understanding of ML will become necessary for clinicians to evaluate new developments in automation and augmentation.

  • Just as the p-value has been used and abused to misrepresent the implications or generalizability of findings, so too can oversights in test-set methodology misrepresent how well ML research findings generalize.

  • Understanding the mechanisms behind predicted associations, and the ways in which ML methods can exacerbate bias, will be crucial as these methods gain popularity across the clinical sector. Researchers will need to advocate for the use of ML methods in combination with traditional biological, epidemiological, and biostatistical tools.

  • Clinical and healthcare research must:

      • remain hypothesis-driven

      • guard against spurious observations based on multiple testing

      • carefully consider the construction and characteristics of large datasets on which influential algorithms will be trained

Cite as: Rodriguez F, Scheinker D, Harrington RA. Promise and Perils of Big Data and Artificial Intelligence in Clinical Medicine and Biomedical Research. Circ Res. 2018;123(12):1282-1284. doi:10.1161/CIRCRESAHA.118.314119
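
The multiple-testing concern raised above can be illustrated with a small simulation. This is a sketch under simplifying assumptions and is not taken from the paper: every "model" here is a coin flip with no real skill, and the counts (200 models, a 50-example test set) and the helper name random_model_accuracy are invented for illustration.

```python
import random

def random_model_accuracy(n_examples, rng):
    """Accuracy of a no-skill model that guesses at random on a balanced binary task."""
    return sum(rng.random() < 0.5 for _ in range(n_examples)) / n_examples

rng = random.Random(0)
n_models, n_test = 200, 50

# Score many no-skill models on the same small test set, then report only the best.
test_scores = [random_model_accuracy(n_test, rng) for _ in range(n_models)]
best_reported = max(test_scores)

# The selected model is still just guessing, so score it again on fresh data.
fresh_score = random_model_accuracy(10000, rng)

print(f"best reported test accuracy: {best_reported:.2f}")
print(f"same kind of model on fresh data: {fresh_score:.2f}")
```

Because the best of many noisy scores is selected, the reported test accuracy looks clearly better than chance even though no model has learned anything. On fresh data, the score falls back to roughly 50%.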

Tying it back: The Role of Disclosure in Model Generalization

MDSD4Health Curriculum Video: Model Disclosures & Model Generalizability

Cite as: Shaveet, E. MDSD4Health Curriculum Video: Model Disclosures & Model Generalizability. MDSD4Health; 2022.

Review this brief video from our MDSD4Health YouTube channel about model disclosures and generalizability.

A few key points:

  • If a model trained on a subset of a given dataset (the training set) performs well on a set-aside batch of data from that same dataset (the test set), we can be reasonably assured that it generalizes to the distribution from which these data were sampled.

      • However, this reasoning does not account for error in the assumption that the distribution from which our data were sampled is also the distribution we're interested in. That is, what if our dataset is not actually representative in the ways we think it is or want it to be?

  • A generalizability question to ask before splitting the data is "Does our dataset actually represent the distribution of the population we think it does?"

  • Disclosing key aspects of our training and testing datasets can support others in identifying whether a given model trained on this dataset is appropriate for their desired applications.
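
One way to picture how a disclosure could support this kind of check is the Python sketch below. It is purely hypothetical: the DatasetDisclosure record, its field names, the coverage_gaps helper, and the example values are all invented for illustration and do not describe any existing disclosure standard or the format used in the video.

```python
from dataclasses import dataclass

@dataclass
class DatasetDisclosure:
    """Hypothetical disclosure record; the fields are illustrative only."""
    name: str
    collection_years: tuple  # (first_year, last_year) of data collection
    age_range: tuple         # (min_age, max_age) of people in the dataset
    settings: frozenset      # settings in which the data were collected

def coverage_gaps(disclosure, target_age_range, target_setting, target_year):
    """List the ways an intended use falls outside what the disclosure covers."""
    gaps = []
    min_age, max_age = disclosure.age_range
    if target_age_range[0] < min_age or target_age_range[1] > max_age:
        gaps.append("age range")
    if target_setting not in disclosure.settings:
        gaps.append("setting")
    first_year, last_year = disclosure.collection_years
    if not first_year <= target_year <= last_year:
        gaps.append("collection period")
    return gaps

# Made-up example values, not a real dataset.
disclosure = DatasetDisclosure(
    name="example-cohort",
    collection_years=(2015, 2018),
    age_range=(18, 65),
    settings=frozenset({"outpatient"}),
)

gaps = coverage_gaps(
    disclosure,
    target_age_range=(40, 80),
    target_setting="outpatient",
    target_year=2022,
)
print(gaps)
```

Here the intended population is older than anyone in the disclosed dataset and the data were collected several years earlier, so both would be flagged as reasons to question whether the training distribution matches the distribution of interest. Without the disclosure, neither mismatch would be visible to someone reusing the model.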

Share: #GeneralizableML

Thought prompt: Consider the role of disclosure in model generalizability. If you were to develop your own dataset or model disclosure medium, what kinds of disclosures would you want reported to maximize the ability to test generalizability?

Share your thoughts on Twitter using the hashtags #MDSD4Health #GeneralizableML

Tag us to join the conversation! @MDSD4Health

For ideas on how to take part in the conversation, check out our Twitter Participation Guide.