Datasheets for Datasets

Key Idea: What are Datasheets for Datasets?

Datasheets for Datasets are documents in which a dataset's creators disclose the motivation, composition, collection process, and recommended uses of a dataset intended for use in machine learning models. Conceptualized by Gebru and colleagues, Datasheets for Datasets document the provenance, creation, and use of ML datasets to improve transparency and mitigate discriminatory outcomes.

Watch: Datasheets for Datasets: Explained

Timnit Gebru on Datasheets for Datasets

Cite as: Gebru T. Datasheets for Datasets. Communications of the ACM; 2021.

Watch this interview with Timnit Gebru, published by Communications of the Association for Computing Machinery (ACM).

In this video, the first author of the Datasheets for Datasets paper, Timnit Gebru, explains the impetus for and implications of this work.

A few key points:

  • Data are the raw materials used in ML models, so how they are collected (and from whom) determines what ML models can be built and how they are used.

  • There are currently no industry standards for how to document ML datasets.

  • The Datasheets for Datasets paper proposes a documentation and disclosure framework that is useful and relevant for ML stakeholders of varying backgrounds.

  • Datasheets enable dataset creators to be intentional throughout the dataset curation process.

      • While datasheets can be created for existing datasets, the paper stresses that there is tremendous value in using these disclosures as a guiding framework throughout the curation process.

Read: The Concept Paper

Review Gebru et al.'s Datasheets for Datasets concept paper, published in Communications of the ACM.

Cite as: Gebru T, Morgenstern J, Vecchione B, et al. Datasheets for datasets. Commun ACM. 2021;64(12):86-92. doi:10.1145/3458723

Break it Down: Anatomy of a Datasheet

Datasheets for Datasets comprise 57 questions across 7 categories:

1) Motivation, 2) Composition, 3) Collection Process, 4) Preprocessing, Cleaning, & Labeling, 5) Uses, 6) Distribution, and 7) Maintenance.

Explore a description of each category below and the question set contained within each.
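To see the overall shape before diving into each category, here is a minimal sketch, in Python, of how the seven categories and their question/answer pairs could be organized and rendered. The category names come from Gebru et al. (2021); the data structure, the small sample of questions shown, and the render_markdown helper are illustrative assumptions, not the paper's official template. A second sketch, after the Composition questions, shows how some disclosures can be computed directly from a dataset rather than typed by hand.

```python
# A minimal datasheet skeleton sketched as a Python structure.
# Category names follow Gebru et al. (2021); everything else here is
# an illustrative assumption, not the paper's official template.

# Each category maps questions (a small sample shown) to the dataset
# creator's free-text answers.
datasheet = {
    "Motivation": {
        "For what purpose was the dataset created?": "",
        "Who funded the creation of the dataset?": "",
    },
    "Composition": {
        "What do the instances that comprise the dataset represent?": "",
        "How many instances are there in total?": "",
    },
    "Collection Process": {
        "How was the data associated with each instance acquired?": "",
    },
    "Preprocessing, Cleaning, & Labeling": {
        "Was any preprocessing/cleaning/labeling of the data done?": "",
    },
    "Uses": {
        "What known tasks has the dataset been used for already?": "",
    },
    "Distribution": {
        "How will the dataset be distributed?": "",
    },
    "Maintenance": {
        "Who will be supporting/hosting/maintaining the dataset?": "",
    },
}


def render_markdown(sheet: dict) -> str:
    """Render the question/answer pairs as a Markdown document."""
    lines = ["# Datasheet"]
    for category, questions in sheet.items():
        lines.append(f"\n## {category}")
        for question, answer in questions.items():
            lines.append(f"\n**{question}**\n\n{answer or '_Not yet answered._'}")
    return "\n".join(lines)


# Fill in an answer, then render.
datasheet["Motivation"]["Who funded the creation of the dataset?"] = "Hypothetical Grant X."
print(render_markdown(datasheet))
```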

Motivation

  1. Dataset purpose

For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled?

  2. Entity affiliation

Who created the dataset (e.g., which team or research group) and on behalf of which entity (e.g., company, institution, organization)?

  3. Funding

Who funded the creation of the dataset? If there is an associated grant, provide the name of the grantor and the grant name and number.

  4. Other comments

Composition

  1. Instance representation

What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., films and ratings; people and interactions; nodes and edges)?

  2. Number of instances

How many instances are there in total (of each type, if appropriate)?

  3. Sample and population specification

Does the dataset contain all possible instances or is it a sample of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? Describe how this representativeness was validated/verified. If it is not representative of the larger set, describe why not (e.g., intentional oversampling, unavailability of other instances, etc.).

  4. Data explication

What data does each instance consist of? Raw data (e.g., unprocessed text or images) or features?

  5. Labels

Is there a label or target associated with each instance?

  6. Missing data

Is any information missing from individual instances? If so, explain why (e.g., not available, etc.). Note: This does not include intentionally removed information, but might include, for example, redacted text.

  7. Instance relationships

Are relationships between individual instances made explicit (e.g., before and after intervention)? Provide a description.

  8. Splitting

Are there recommended data splits for training, development/validation, testing? Provide a description of these splits, explaining the rationale behind them.

  9. Errors, noise, and redundancies

Are there any errors, sources of noise, or redundancies in the dataset?

  10. Self-contained vs. externally reliant

Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? If it links to or relies on external resources:

    1. Are there guarantees that they will exist, and remain constant, over time?

    2. Are there official archival versions of the complete dataset (including the external resources as they existed at the time the dataset was created)?

    3. Are there any restrictions (licenses, fees, etc.) associated with any of the external resources that might apply to a dataset consumer?

  11. Confidentiality & Privacy

Does the dataset contain data that might be considered confidential or private, legally or otherwise (e.g., data that are protected by legal privilege; content of individuals’ non-public communications)?

  12. Emotional implications

Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?

  13. Dataset characteristics

Does the dataset include information about age, gender, etc.? If so, provide the distribution of each characteristic.

  14. Re-identification

Is it possible to identify individuals directly or indirectly (e.g., in combination with other data) from the dataset?

  15. Content sensitivity

Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals race or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers, criminal history)?

  16. Other comments
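Several of the Composition disclosures above (number of instances, missing data, characteristic distributions, redundancies) can be computed directly from a tabular dataset rather than answered from memory. Here is a minimal sketch using pandas; the file name and the demographic column names ("age", "gender") are hypothetical placeholders to adapt to your own data.

```python
# Sketch: computing a few Composition disclosures from a tabular dataset.
# "my_dataset.csv", "age", and "gender" are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("my_dataset.csv")

# Number of instances
print(f"Total instances: {len(df)}")

# Missing data, per column
missing = df.isna().sum()
print("Missing values per column:")
print(missing[missing > 0])

# Dataset characteristics: distribution of demographic columns, if present
for col in ("age", "gender"):
    if col in df.columns:
        print(f"\nDistribution of {col}:")
        print(df[col].value_counts(normalize=True, dropna=False))

# Errors, noise, and redundancies: exact duplicate rows
print(f"\nDuplicate rows: {df.duplicated().sum()}")
```

Answers produced this way still need human framing (e.g., why data are missing), but they anchor the datasheet in verifiable numbers.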

Collection Process

  1. Collection methods

How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings, etc.), reported by individuals (survey responses, etc.), or indirectly inferred/derived from other data (part-of-speech tags, model-based guesses for age or language, etc.)? Were the data validated/verified?

  2. Collection apparatus

What mechanisms or procedures were used to collect the data (hardware apparatuses or sensors, manual human curation, software programs, software APIs, etc.)?

  3. Sampling strategy

If the dataset is a sample from a larger set, what was the sampling strategy (deterministic, probabilistic with specific sampling probabilities, etc.)?

  4. Data curation and collection labor

Who was involved in the data collection or curation process (students, crowdworkers, contractors, etc.) and how were they compensated for their time?

  5. Timeframe

Over what timeframe was the data collected? Does this timeframe correspond to the creation timeframe of the data associated with the instances (e.g., a recent crawl of old news articles)?

  6. Ethical review

Were any ethical review processes conducted (by an institutional review board, etc.)? Provide a description of the processes and outcomes.

  7. Individual/participant involvement

Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (websites, etc.)?

  8. Individual/participant notification

Were individuals whose data appear in the dataset notified about the data collection?

  9. Individual/participant consent

Did the individuals whose data appear in the dataset consent to the collection and use of these data?

  10. Consent revocation

If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses?

  11. Impact analyses

Has an analysis of the potential impact of the dataset and its use on data subjects (data protection impact analysis, etc.) been conducted? If so, provide a description of the analysis and its outcomes.

  12. Other comments

Preprocessing, Cleaning, & Labeling

  1. Preprocessing methods

Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)?

  2. Raw data availability

Was the raw data saved in addition to the preprocessed/cleaned/labeled data for provenance or to support unanticipated future uses?

  3. Preprocessing software

What software was used to preprocess, clean, and label the data?

  4. Other comments

Uses

  1. Prior use

What known tasks has the dataset been used for already?

  2. Use repository

Is there a public repository that links to any or all papers or systems that use the dataset?

  3. Other uses

What other tasks could the dataset be used for?

  4. Preprocessing impact on use

Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? Is there anything a dataset consumer might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other risks or harms (legal risks, financial harms, etc.)? If so, provide a description. What, if anything, could a dataset consumer do to mitigate these risks or harms?

  5. Use restrictions

Are there tasks for which the dataset should not be used?

  6. Other comments

Distribution

  1. External distribution

Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?

  2. Distribution methods

How will the dataset be distributed (tarball, API, GitHub repo, etc.)? Does the dataset have a digital object identifier (DOI)?

  3. Distribution time

When will the dataset be distributed?

  4. Curating entity proprietorship

Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? If so, describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions.

  5. Third-party proprietorship

Have any third parties imposed IP-based or other restrictions on the data associated with the instances?

  6. Export controls and regulatory restrictions

Do any export controls or other regulatory restrictions apply to the dataset or to individual instances?

  7. Other comments

Maintenance

  1. Dataset hosting

Who will be supporting/hosting/maintaining the dataset?

  2. Contact

How can the owner/curator/manager of the dataset be contacted?

  3. Erratum

Is there an erratum associated with this dataset?

  4. Dataset updates

Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? If so, how often, by whom, and how will updates be communicated to dataset consumers (mailing list, GitHub, etc.)?

  5. Retention & Disposition

If the dataset relates to people, are there limits on the retention of the data associated with the instances (for example, were the individuals in question told that their data would be retained for a fixed period of time and then deleted)?

  6. Version control

Will older versions of the dataset continue to be supported/hosted/maintained? If so, describe how. If not, describe how their obsolescence will be communicated to dataset consumers.

  7. Community contributions

If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, will these contributions be validated/verified? Is there a process for communicating/distributing these contributions to dataset consumers? If so, please provide a description.

  8. Other comments

Public Health Applications: Why do Datasheets Matter in Public Health Contexts?

Mitigation of health disparities and promotion of health equity across populations are key foci within public and population health. Datasheets aim to mitigate discriminatory outcomes that result from the opaque use of potentially non-representative data in ML models.

Given the synergy of these ideals, using datasheets for datasets in public health ML contexts presents an opportunity to leverage dataset disclosures to advance:

  • mindful review of datasets and their limitations prior to use in models

  • recognition of data scarcity, which limits detection of multivariate relationships

  • promotion of more equitable insights obtained from models

Deeper Dive: ML and algorithmic fairness in public and population health

Review this paper by Mhasawade et al., published in Nature Machine Intelligence.

In this paper, Vishwali Mhasawade and colleagues describe the current landscape of machine learning and algorithmic fairness in public and population health.

Among their key points is that the limitations of the data that populate ML models must be considered when examining the role of algorithmic determinism in public and population health contexts.

A few key points:

  • Much of ML in health contexts has focused on models developed or deployed for hospital or clinic settings rather than those focused on complex determinants of health outside of healthcare entities.

  • Public health research often focuses on complex relationships and possible mediations between multi-level factors. When enough reliable data are available, ML methods allow us to leverage novel data sources for the interpretable identification and assessment of multi-level factors in health outcomes, but data limitations (including source bias) make this difficult.

  • Current uses of ML in population and public health contexts are diverse; however, each faces challenges in ensuring that insights gleaned are externally valid (largely stemming from limitations of person-generated data sources).

Cite as: Mhasawade V, Zhao Y, Chunara R. Machine learning and algorithmic fairness in public and population health. Nat Mach Intell. 2021;3(8):659-666. doi:10.1038/s42256-021-00373-4

After reading the paper, consider the utility of dataset disclosure mediums (like Datasheets for Datasets) in the development of ML models for public health contexts.

Exercise: Create a Datasheet for a Public Health Dataset!

In this exercise, we are going to create a datasheet for our NHANES data subset using a script that generates and exports a datasheet as a PDF file. A minimal sketch of the general idea appears after the directions below.

Directions:

  1. Make a copy of the DS4DS Google Colaboratory notebook in your Google Drive account (also found on GitHub)

  2. Follow the annotated directions to produce your own datasheet.

    • NOTES:

      1. We are using the NHANES data subset we created in Module 2, so be sure it's downloaded and saved to an easily accessible folder!

      2. Upload the dataset in the cell where you see "⬅️🗂️" to enable automated disclosures.

      3. Manually type disclosures in the cells where you see "⬅️✏️."
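To make the notebook's workflow concrete, here is a minimal sketch of the general idea: compute some disclosures automatically from the NHANES subset, collect manually typed ones, and export everything as a PDF. This is not the actual DS4DS notebook code; the file path, the questions shown, and the choice of the fpdf2 library are assumptions for illustration.

```python
# Sketch of the exercise workflow (NOT the actual DS4DS notebook code).
# "nhanes_subset.csv" stands in for the Module 2 subset you downloaded.
import pandas as pd
from fpdf import FPDF  # pip install fpdf2

df = pd.read_csv("nhanes_subset.csv")

answers = {
    # Automated disclosures (the notebook's upload cell fills these in)
    "How many instances are there in total?": str(len(df)),
    "Is any information missing from individual instances?":
        f"Yes: {int(df.isna().sum().sum())} missing values across all columns.",
    # Manual disclosures (the notebook's typed cells)
    "For what purpose was the dataset created?": "Type your answer here.",
}

pdf = FPDF()
pdf.add_page()
for question, answer in answers.items():
    pdf.set_font("Helvetica", "B", 11)
    pdf.multi_cell(0, 6, question)
    pdf.set_font("Helvetica", "", 11)
    pdf.multi_cell(0, 6, answer)
    pdf.ln(3)
pdf.output("datasheet.pdf")
```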

Share: #Datasheets4Datasets

Thought prompt: Why are datasheets relevant in public health contexts? How might they impact how models are built and deployed?

Share your thoughts on Twitter using the hashtags #MDSD4Health and #Datasheets4Datasets.

Tag us to join the conversation! @MDSD4Health

For ideas on how to take part in the conversation, check out our Twitter Participation Guide.