The People’s Speech: A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage

Keywords

The People’s Speech, Speech Recognition, Dataset Documentation, Machine Learning

Full Study

Google Drive Link

Institute(s)

Oracle, Intel, NVIDIA, Landing AI, Factored, MLCommons, Harvard University

Year

2021

Abstract

The People’s Speech is a free-to-download, 30,000-hour and growing supervised conversational English speech recognition dataset licensed for academic and commercial usage under CC-BY-SA (with a CC-BY subset). The data is collected by searching the Internet for appropriately licensed audio data with existing transcriptions. We describe our data collection methodology and release our data collection system under the Apache 2.0 license. We show that a model trained on this dataset achieves a 9.98% word error rate on LibriSpeech’s test-clean test set. Finally, we discuss the legal and ethical issues surrounding the creation of a sizable machine learning corpus and plans for continued maintenance of the project under MLCommons’s sponsorship.

Author(s)

Daniel Galvez, Greg Diamos, Juan Ciro, Juan Felipe Cerón, Keith Achorn, Anjali Gopi, David Kanter, Maximilian Lam, Mark Mazumder, Vijay Janapa Reddi

Tool

People’s Speech