Publication Data
Building Transcribed Speech Corpora Quickly and Cheaply for Many Languages
Abstract: We present a system for quickly and cheaply building
transcribed speech corpora containing utterances from many speakers in a variety of
acoustic conditions. The system consists of a client application running on an Android
mobile device with an intermittent Internet connection to a server. The client
application collects demographic information about the speaker, fetches textual prompts
from the server for the speaker to read, records the speaker’s voice, and uploads the
audio and associated metadata to the server. The system has so far been used to collect
over 3000 hours of transcribed audio in 17 languages around the world.
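
The abstract does not describe how the client copes with an intermittent connection. One common approach, shown here purely as an illustrative sketch and not as the authors' implementation, is a store-and-forward queue: recordings are held locally and uploaded in order whenever the server is reachable. The names Recording, Client, upload, record, and flush are all hypothetical, and the upload callback stands in for whatever transport the real system uses.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict


@dataclass
class Recording:
    prompt: str                # text the speaker was asked to read
    audio: bytes               # recorded audio for that prompt
    metadata: Dict[str, str]   # e.g. speaker demographics (assumed shape)


@dataclass
class Client:
    # Hypothetical transport: returns True on success, False when offline.
    upload: Callable[[Recording], bool]
    pending: Deque[Recording] = field(default_factory=deque)

    def record(self, prompt: str, audio: bytes, metadata: Dict[str, str]) -> None:
        """Store a finished recording locally; no network access is needed."""
        self.pending.append(Recording(prompt, audio, dict(metadata)))

    def flush(self) -> int:
        """Upload queued recordings in order; stop at the first failure.

        Returns the number of recordings uploaded during this call.
        """
        sent = 0
        while self.pending:
            if not self.upload(self.pending[0]):
                break  # still offline; keep the rest for a later attempt
            self.pending.popleft()
            sent += 1
        return sent
```

In this sketch a recording leaves the local queue only after the upload callback reports success, so a dropped connection delays delivery but does not lose data.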
