Model Cards

Key Idea: What are Model Cards?

Model Cards are documents that disclose an ML model's provenance, suggested usage, limitations, and performance metrics. Conceptualized by Mitchell and colleagues, Model Cards provide an opportunity for model developers to disclose a model's intended uses and development limitations to improve model transparency and mitigate discriminatory outcomes perpetuated by bias.

Read: The Concept Paper

Read the Model Cards concept paper written by Mitchell et al.

The paper was published in the proceedings of the ACM Conference on Fairness, Accountability, and Transparency in 2019.

Cite as: Mitchell M, Wu S, Zaldivar A, et al. Model Cards for Model Reporting. In: Proceedings of the Conference on Fairness, Accountability, and Transparency. ACM; 2019:220-229. doi:10.1145/3287560.3287596

Break it Down: Anatomy of a Model Card

Model Cards comprise roughly 30 disclosures across nine categories:

(1) Model Details, (2) Intended Use, (3) Factors, (4) Metrics, (5) Evaluation Data, (6) Training Data, (7) Quantitative Analyses, (8) Ethical Considerations, and (9) Caveats and Recommendations.

Explore a description of each category below and the disclosures contained within each; a compact code skeleton of the full structure follows the category descriptions.

Model Details

    1. Person or organization developing model

What person, organization, or entity developed the model?

    2. Model date

When was the model developed?

    3. Model version

Which version of the model is it, and how does it differ from previous versions?

    4. Model type

What type of model is it? Provide architecture details (e.g., Naive Bayes classifier, Convolutional Neural Network, etc.).

    5. Paper or other resource for more information

Where can resources for more information be found?

    6. Citation details

How should the model be cited?

    7. License

What is the licensing or intellectual property (IP) information?

    8. Feedback on the model

Who can be contacted, and how, for feedback on the model?

Intended Use

    1. Primary intended uses

Was the model developed with general or specific tasks in mind (e.g., plant recognition in the Pacific Northwest, labeling images)?

    2. Primary intended users

Was the model developed for entertainment purposes, for hobbyists, or for enterprise solutions?

    3. Out-of-scope uses

Highlight restrictions or limitations of the model, including inappropriate contexts to which users may try to apply the model (e.g., “not for use on text examples shorter than 100 tokens” or “for use on black-and-white images only; please consider our research group’s full-color-image classifier for color images.”)

Factors

    1. Relevant factors

What are foreseeable salient factors for which model performance may vary, and how were these determined?

NOTE: Relevant factors may include any of the following:

  • Groups: Data instances with similar characteristics (e.g., for human-centric computer vision models, groupings of instances representing people who share one or more cultural, demographic, or phenotypic characteristics may be relevant to disclose.)

  • Instrumentation: Instruments used to capture the input to the model (e.g., for facial recognition models, camera hardware and software utilized in image capture may be relevant to disclose.)

  • Environment: Environment in which a model is deployed (e.g., for facial recognition models that are less accurate under low lighting conditions, specifications across different lighting conditions may be relevant to disclose.)

    2. Evaluation factors

Which factors are being reported for evaluation purposes? Why were these factors chosen? Do the relevant factors and evaluation factors differ due to data availability concerns? (e.g., if Fitzpatrick skin type is a relevant factor for face detection but an evaluation dataset labeled by skin type is not available, this may be relevant to disclose.)

Metrics

    1. Model performance measures

Report relevant performance metrics coupled with an explanation of why those metrics were selected.

    2. Decision thresholds

What, if any, decision thresholds are used? Why were those decision thresholds chosen?

    3. Approaches to uncertainty and variability

How are the measurements and estimations of these metrics calculated? (e.g., standard deviation, variance, confidence intervals, Kullback-Leibler Divergence, etc.). Include details of how these values are approximated (e.g., average of 5 runs, 10-fold cross-validation).

  • For classification models: provide error types that can be derived from a confusion matrix

      • For classification models whose performance metrics are disaggregated by factors: provide confidence intervals for reported metrics by treating confusion matrices as probabilistic models of system performance (see the sketch after this list)

  • For score-based models: report appropriate summary statistics and measures of difference across groups
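The paper does not prescribe a particular computation, but one simple way to produce disaggregated error rates with confidence intervals is a normal approximation over each group's confusion matrix. A minimal sketch, assuming a pandas DataFrame of per-instance predictions; the column names (y_true, y_pred) and the grouping column are illustrative, not from the paper:

```python
import numpy as np
import pandas as pd

def rate_with_ci(successes, trials, z=1.96):
    """Proportion plus a normal-approximation 95% confidence interval."""
    if trials == 0:
        return float("nan"), (float("nan"), float("nan"))
    p = successes / trials
    half_width = z * np.sqrt(p * (1 - p) / trials)
    return p, (max(0.0, p - half_width), min(1.0, p + half_width))

def error_rates_by_group(df, group_col, label_col="y_true", pred_col="y_pred"):
    """False positive/negative rates per group, each with a 95% CI."""
    rows = []
    for group, sub in df.groupby(group_col):
        negatives = sub[sub[label_col] == 0]  # true negatives + false positives
        positives = sub[sub[label_col] == 1]  # true positives + false negatives
        fpr, fpr_ci = rate_with_ci((negatives[pred_col] == 1).sum(), len(negatives))
        fnr, fnr_ci = rate_with_ci((positives[pred_col] == 0).sum(), len(positives))
        rows.append({group_col: group, "n": len(sub),
                     "FPR": fpr, "FPR 95% CI": fpr_ci,
                     "FNR": fnr, "FNR 95% CI": fnr_ci})
    return pd.DataFrame(rows)
```

Bootstrapping or a Bayesian treatment of the confusion matrix are common alternatives when group sample sizes are small.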

Training Data

    1. Datasets

What datasets were used to train the model?

    2. Motivation

Why were these datasets chosen?

    3. Preprocessing

How were the training data preprocessed?


For more robust dataset reporting guidance, see Datasheets for Datasets and Other Dataset Disclosures.

Evaluation Data

    1. Datasets

What datasets were used to evaluate the model?

    2. Motivation

Why were these datasets chosen?

    3. Preprocessing

How were the evaluation data preprocessed?


For more robust dataset reporting guidance, see Datasheets for Datasets and Other Dataset Disclosures.

Quantitative Analyses

    1. Unitary (univariable) results

How did the model perform with respect to each factor?

    2. Intersectional results

How did the model perform with respect to the intersection of evaluated factors? (A sketch of both kinds of disaggregated reporting follows.)
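A minimal sketch of both kinds of reporting, assuming a pandas DataFrame of per-instance results; the factor names (sex, age_band) and column names here are illustrative:

```python
import pandas as pd
from sklearn.metrics import accuracy_score

def disaggregated_accuracy(df, factors, label_col="y_true", pred_col="y_pred"):
    """Accuracy for each group defined by a factor (or intersection of factors)."""
    return (df.groupby(factors)
              .apply(lambda g: accuracy_score(g[label_col], g[pred_col]))
              .rename("accuracy"))

# Unitary results: disaggregate by one factor at a time.
# disaggregated_accuracy(results, ["sex"])

# Intersectional results: every combination of the evaluated factors.
# disaggregated_accuracy(results, ["sex", "age_band"])
```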

Ethical Considerations

    1. Data

Does the model use any data that may be considered sensitive (e.g., protected classes)?

    2. Human life

Is the model intended to inform decisions about matters central to human life or flourishing (health, safety, etc.)? Or could it be used in this way?

    3. Mitigations

What risk mitigation strategies were used during model development?

    4. Risks and harms

What risks may be present in model usage? Identify the potential recipients, likelihood, and magnitude of harms.

NOTE: If these cannot be determined, note that they were considered but remain unknown.

    5. Use cases

Are there any known use cases associated with this model that led to fraught or discriminatory outcomes?

Caveats and Recommendations

List any additional concerns in this section that were not covered in the previous sections.

For example:

  • Did results suggest any further testing?

  • Were there any relevant groups that were not represented in the evaluation dataset?

  • Are there additional recommendations for model use?

  • What are the ideal characteristics of an evaluation dataset for this model?
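Pulling the nine categories together, a model card can be treated as ordinary structured data. The skeleton below is an illustrative plain-Python rendering of the schema above; Mitchell et al. specify the categories and disclosures, not this particular data structure or these field names:

```python
# Illustrative skeleton of a model card; each "..." is a disclosure to fill in.
model_card = {
    "model_details": {"developers": ..., "date": ..., "version": ..., "type": ...,
                      "paper": ..., "citation": ..., "license": ..., "feedback": ...},
    "intended_use": {"primary_uses": ..., "primary_users": ..., "out_of_scope_uses": ...},
    "factors": {"relevant_factors": ..., "evaluation_factors": ...},
    "metrics": {"performance_measures": ..., "decision_thresholds": ...,
                "uncertainty_and_variability": ...},
    "evaluation_data": {"datasets": ..., "motivation": ..., "preprocessing": ...},
    "training_data": {"datasets": ..., "motivation": ..., "preprocessing": ...},
    "quantitative_analyses": {"unitary_results": ..., "intersectional_results": ...},
    "ethical_considerations": {"sensitive_data": ..., "human_life": ..., "mitigations": ...,
                               "risks_and_harms": ..., "use_cases": ...},
    "caveats_and_recommendations": ...,
}
```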

Explore: Example Model Cards

Model cards are published in a variety of formats.

  • interactive webpages

  • appendices in research or concept papers

  • GitHub repositories

  • HTML files

  • etc.

See a few examples of how others have shared model cards below.

Examples from Google

Examples from Salesforce

Read: Creating Model Cards in Python using Scikit-Learn

This Google Cloud blog post provides an overview of how to build out a model card using Google's Model Card Toolkit.

We'll use a version of the sample notebook provided in this post to create our own model card later!

A few key points:

  • Documenting disclosures about an ML model's usage, construction, and limitations via model cards can enhance clarity and shared understanding across stakeholders.

  • While the blog post shows how to leverage scikit-learn to create model cards, the concepts are applicable to other platforms and frameworks (TensorFlow, PyTorch, XGBoost, etc.); a sketch of the toolkit's workflow follows.
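A minimal sketch of that workflow, following the toolkit's scikit-learn demo. The class and attribute names below match the toolkit's documentation at the time of writing, but the API has changed across releases, so check your installed version; the field values are illustrative:

```python
import model_card_toolkit as mctlib

# Point the toolkit at a directory where card assets will be written.
toolkit = mctlib.ModelCardToolkit("model_card_output")
model_card = toolkit.scaffold_assets()  # returns a pre-structured card object

# Fill in disclosures programmatically.
model_card.model_details.name = "Breast Cancer Wisconsin (Diagnostic) Classifier"
model_card.model_details.overview = (
    "Gradient boosting classifier predicting benign vs. malignant tumors."
)
model_card.considerations.limitations = [
    mctlib.Limitation(description="Not validated for clinical use.")
]

toolkit.update_model_card(model_card)  # persist the edits
html = toolkit.export_format()         # render the card as an HTML document
```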

Healthcare Applications: Why do Model Cards Matter in Healthcare Contexts?

ML methods are increasingly used in clinical care to support improvements in diagnostics, treatment selection, adverse event risk stratification, and overall health system efficiency. However, models trained on historical data that capture patterns of health care disparities may perpetuate health inequities.

Model Cards and other model disclosure formats (which enable deliberate communication of a model's origins, development limitations, performance metrics, intended uses, and training/test/validation dataset composition) are vital to informing preliminary adjustment and deployment decisions to mitigate automation harms in healthcare and promote health equity.

Deeper Dive: Ensuring Fairness in Machine Learning to Advance Health Equity

Read this paper written by Rajkomar et al.

This paper written by Alvin Rajkomar and colleagues argues that healthcare organizations and policymakers should go beyond the stance of using ML systems that "do not harm" and proactively design and use ML systems to advance health equity.

A few key points:

  • While well-developed ML models can be an effective resource in healthcare and diagnostics, models trained on historical data which capture patterns of health care disparities may perpetuate health inequities.

  • The implications of healthcare disparity exacerbation via ML led the American Medical Association (AMA) to pass policy recommendations to “promote development of thoughtfully designed, high-quality, clinically validated health care AI [artificial or augmented intelligence, such as machine learning] that … identifies and takes steps to address bias and avoids introducing or exacerbating health care disparities including when testing or deploying new AI tools on vulnerable populations”

  • Recognizing the influence of ML models, healthcare organizations and policymakers could go beyond the AMA's stance of using ML systems that "do not harm" and proactively design and use ML systems to advance health equity

  • To promote fairness in ML, participatory processes are needed that involve key stakeholders and consider distributive justice within clinical and organizational contexts.

  • Principles of distributive justice in ML:

      • Equal Outcomes: Assurance that protected groups have equal benefit in terms of patient outcomes from the deployment of ML models

      • Equal Performance: Assurance that a model's accuracy does not vary between patients in protected and nonprotected groups

      • Equal Allocation: Assurance that resources, as allocated by ML-aided decisions, are distributed equitably across protected and nonprotected groups

Cite as: Rajkomar A, Hardt M, Howell MD, Corrado G, Chin MH. Ensuring Fairness in Machine Learning to Advance Health Equity. Ann Intern Med. 2018;169(12):866. doi:10.7326/M18-1990

Exercise: Create a Model Card for a Healthcare Model

In this exercise, we are going to create a model card for a classifier developed using the Breast Cancer Wisconsin (Diagnostic) Dataset. This script, adapted from Google's Scikit-Learn Model Card Toolkit Demo, generates a model card as an HTML file.
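For orientation before opening the notebook, here is a minimal sketch of the kind of classifier being documented, using scikit-learn's bundled copy of the dataset. It assumes a gradient-boosted classifier like the one in Google's demo; the hyperparameters and split settings here are illustrative, not the notebook's:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Load the dataset that ships with scikit-learn and hold out a test split.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Train the classifier to be documented in the model card.
clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Evaluation output like this feeds the card's metrics and quantitative sections.
print(classification_report(y_test, clf.predict(X_test)))
```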

Directions:

  1. Make a copy of this Google Colaboratory notebook in your Google Drive account (also found on GitHub)

  2. Follow the annotated directions to produce the model card.

Note: This Colab script is adapted from a script developed for Google's Model Card Toolkit Demo. Changes made to the original notebook include addition of comments, annotations, expansion/rephrasing of code block explanations, and re-arrangement of model evaluation steps. All executable code has been left unchanged.

Share: #ModelCards

Thought prompt: Why are model cards relevant in public health contexts? How might they impact how models are built and deployed?

Share your thoughts on Twitter using the hashtags #MDSD4Health #ModelCards

Tag us to join the conversation! @MDSD4Health

For ideas on how to take part in the conversation, check out our Twitter Participation Guide.