Mitigating Dataset Harms Requires Stewardship:
Lessons from 1000 Papers

This repository contains supplemental data for the paper Mitigating Dataset Harms Requires Stewardship: Lessons from 1000 Papers.

License information

All files in this repository are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Data collected using the Semantic Scholar API (detailed in the descriptions below) are licensed under the Semantic Scholar Dataset License.

We make the following .csv files available:

msceleb1m_all.csv, dukemtmc_all.csv, lfw_all.csv

These are the full corpora we collected, containing 1,404, 1,393, and 7,732 papers respectively. Each file has the following columns, which reflect information provided by Semantic Scholar:

paperId: the Semantic Scholar id of the paper
cites {dataset id}: for each dataset used to build the corpus, 1 if the paper cites {dataset id} and 0 otherwise (see summary.csv for dataset ids)
title, abstract, year, venue, arxivId, doi
pdfUrl: a URL where the paper may be publicly available, found via Semantic Scholar or arXiv
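As a minimal sketch of how the cites {dataset id} columns might be read, the snippet below counts the papers that cite a given dataset. The header form "cites D1" and the sample rows are assumptions made for illustration; check the actual header of each file and summary.csv for the real dataset ids.

```python
import csv
import io


def is_flag_set(value):
    """Return True if a cell holds the flag 1 (accepts "1" or "1.0")."""
    try:
        return float(value) == 1.0
    except (TypeError, ValueError):
        return False


def count_citing(rows, dataset_id):
    """Count rows whose 'cites <dataset_id>' column is set to 1.

    The column name format is an assumption based on the description above.
    """
    column = f"cites {dataset_id}"
    return sum(1 for row in rows if is_flag_set(row.get(column)))


# Made-up rows standing in for a file such as lfw_all.csv.
sample = io.StringIO(
    "paperId,cites D1,cites D2,title,year\n"
    "abc,1,0,Example paper A,2018\n"
    "def,1,1,Example paper B,2019\n"
    "ghi,0,1,Example paper C,2020\n"
)
rows = list(csv.DictReader(sample))
print(count_citing(rows, "D1"))
```

To run this against a real file, replace the in-memory sample with open("lfw_all.csv", newline="", encoding="utf-8").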

msceleb1m_labeled.csv, dukemtmc_labeled.csv, lfw_labeled.csv

These are the samples of papers that we analyzed, containing 276, 275, and 400 papers respectively. In addition to the columns above, these files have the following columns:

uses dataset or derivative: 1 if we determined that the paper uses a dataset or derivative and 0 otherwise
dataset(s) / model(s) used: a comma-separated list of datasets or models used, denoted by the id provided in summary.csv in brackets (e.g., [D8], [M5])
unable to disambiguate: 1 if we were unable to determine the specific dataset(s) used or whether a dataset was used, and 0 otherwise
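The bracketed ids in the dataset(s) / model(s) used column can be pulled out with a small parser. This is only a sketch: it assumes every id is the letter D (dataset) or M (model) followed by digits, which is inferred from the examples [D8] and [M5] rather than stated in the files.

```python
import re

# Assumed id shape: "D" or "M" followed by digits, wrapped in brackets.
ID_PATTERN = re.compile(r"\[([DM]\d+)\]")


def parse_ids(cell):
    """Return all bracketed ids found in a cell, in order of appearance."""
    return ID_PATTERN.findall(cell or "")


def split_datasets_models(cell):
    """Split the ids in a cell into (dataset ids, model ids)."""
    ids = parse_ids(cell)
    datasets = [i for i in ids if i.startswith("D")]
    models = [i for i in ids if i.startswith("M")]
    return datasets, models


datasets, models = split_datasets_models("[D8], [M5]")
print(datasets, models)
```

An empty or missing cell yields two empty lists, so rows where no dataset was used can be passed through unchanged.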

summary.csv

This is a table summarizing our analysis.

dataset_list.csv

This file lists the 54 face and person recognition datasets we compiled when selecting our three datasets of interest. For each dataset it gives the name, the total number of citations on Semantic Scholar (at the time of collection in August 2020), and the Semantic Scholar Corpus ID, which can be used to access metadata from the Semantic Scholar API.
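As a rough illustration of looking up a paper by Corpus ID, the sketch below builds a request URL for the Semantic Scholar Graph API. The endpoint shape, the "CorpusId:" prefix, and the field names reflect the public API as we understand it today and may differ from what was available in August 2020; the corpus id 12345678 is a placeholder, not a value from dataset_list.csv.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL of the current Semantic Scholar Graph API.
API_BASE = "https://api.semanticscholar.org/graph/v1/paper"


def paper_url(corpus_id, fields=("title", "year", "citationCount")):
    """Build a lookup URL for a paper identified by its Corpus ID."""
    query = urllib.parse.urlencode({"fields": ",".join(fields)})
    return f"{API_BASE}/CorpusId:{int(corpus_id)}?{query}"


def fetch_paper(corpus_id):
    """Fetch paper metadata as a dict (needs network access; not called here)."""
    with urllib.request.urlopen(paper_url(corpus_id), timeout=30) as response:
        return json.load(response)


# Placeholder id for illustration only.
print(paper_url(12345678))
```

Building the URL separately from the request keeps the lookup easy to inspect and to adapt if the API changes.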

citp/mitigating-dataset-harms

Folders and files

Latest commit

History

Repository files navigation

Mitigating Dataset Harms Requires Stewardship:Lessons from 1000 Papers

License information

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Mitigating Dataset Harms Requires Stewardship:
Lessons from 1000 Papers

Packages