Speech recognizers are typically trained on data from a standard dialect and
generalize poorly to non-standard dialects. The mismatch arises mainly in the
acoustic realization of words, which is captured by the acoustic models and the
pronunciation lexicon. Standard techniques for addressing this mismatch are
generative in nature and include acoustic model adaptation and expansion of the
lexicon with pronunciation variants, both of which have limited effectiveness.
We present a discriminative pronunciation model whose parameters are learned
jointly with those of the language model. We tease apart the gains from modeling the transitions of
canonical phones, the transduction from surface to canonical phones, and the
language model. We report experiments on African American Vernacular English (AAVE)
using NPR's StoryCorps corpus. Our models improve performance over the baseline
by about 2.1% on AAVE, of which 0.6% is attributable to the pronunciation model.
The model learns the most relevant phonetic transformations for AAVE speech.
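To make the joint formulation concrete, below is a minimal sketch of a discriminative linear model that scores a hypothesis with the three feature groups named above: canonical phone transitions, surface-to-canonical phone transduction, and language model features. The feature templates, the pre-aligned phone sequences, and the structured-perceptron update are illustrative assumptions for exposition, not the paper's exact formulation or training procedure.

```python
from collections import defaultdict

def features(words, canonical, surface):
    """Extract the three feature groups as sparse counts.
    `canonical` and `surface` are assumed to be pre-aligned,
    same-length phone sequences (an illustrative simplification)."""
    f = defaultdict(float)
    # (1) transitions of canonical phones: phone bigrams
    for a, b in zip(canonical, canonical[1:]):
        f[("trans", a, b)] += 1.0
    # (2) transduction from surface to canonical phones: aligned pairs
    for s, c in zip(surface, canonical):
        f[("xduc", s, c)] += 1.0
    # (3) language model: word bigrams as a stand-in for LM features
    for u, v in zip(["<s>"] + words, words + ["</s>"]):
        f[("lm", u, v)] += 1.0
    return f

def score(w, f):
    # Linear score: dot product of weights and sparse features.
    return sum(w[k] * v for k, v in f.items())

def perceptron_update(w, gold, pred, lr=1.0):
    """One structured-perceptron step. All three feature groups share
    a single weight vector, so pronunciation and language model
    parameters are updated jointly."""
    for k, v in features(*gold).items():
        w[k] += lr * v
    for k, v in features(*pred).items():
        w[k] -= lr * v

# Toy example: canonical -ing realized as surface -in (a hypothetical
# AAVE-style reduction), against a competing incorrect hypothesis.
w = defaultdict(float)
gold = (["he", "going"],
        ["hh", "iy", "g", "ow", "ih", "ng"],  # canonical phones
        ["hh", "iy", "g", "ow", "ih", "n"])   # surface phones
pred = (["he", "gone"],
        ["hh", "iy", "g", "ao", "n"],
        ["hh", "iy", "g", "ao", "n"])
perceptron_update(w, gold, pred)
print(score(w, features(*gold)) > score(w, features(*pred)))  # True
```

Because the weight vector is shared across the three feature groups, a single update moves the pronunciation and language model parameters together; the per-component gains can then be teased apart by ablating one feature group at a time.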