Multilingual Spoken Words

Tags

Datasets, Machine Learning, Speech Recognition

Provider

MLCommons

URL

https://mlcommons.org/en/multilingual-spoken-words/

Abstract

The Multilingual Spoken Words Corpus is a large and growing audio dataset of spoken words in 50 languages, collectively spoken by over 5 billion people. It is intended for academic research and commercial applications in keyword spotting and spoken term search, and is licensed under CC-BY 4.0. The dataset contains more than 340,000 keywords, totaling 23.4 million 1-second spoken examples (over 6,000 hours). Use cases range from voice-enabled consumer devices to call center automation.

Paper

Multilingual Spoken Words Corpus