Speech transcription of web videos requires first detecting segments with
transcribable speech. We refer to this as segmentation. Commonly used segmentation
techniques are inadequate for domains such as YouTube, where videos may have a
large variety of background and recording conditions. In this work, we investigate
alternative audio features and a discriminative classifier, which together yield a
lower frame error rate (25.3%) on YouTube videos than the commonly used
Gaussian mixture models trained on cepstral features (30.6%). The alternative audio
features perform particularly well in noisy conditions.
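The GMM baseline mentioned above can be illustrated with a minimal sketch of frame-level speech/non-speech classification: fit one Gaussian mixture per class on cepstral-style feature frames, then label each test frame by comparing class log-likelihoods. The feature dimensions, mixture sizes, and synthetic data here are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-ins for 13-dim cepstral feature frames (assumption:
# real systems would use MFCCs extracted from audio).
speech_train = rng.normal(loc=1.0, scale=1.0, size=(500, 13))
nonspeech_train = rng.normal(loc=-1.0, scale=1.0, size=(500, 13))

# One GMM per class, as in a classic GMM-based segmenter.
gmm_speech = GaussianMixture(n_components=4, random_state=0).fit(speech_train)
gmm_nonspeech = GaussianMixture(n_components=4, random_state=0).fit(nonspeech_train)

def classify_frames(frames):
    """Frame-level likelihood comparison: 1 = speech, 0 = non-speech."""
    ll_speech = gmm_speech.score_samples(frames)
    ll_nonspeech = gmm_nonspeech.score_samples(frames)
    return (ll_speech > ll_nonspeech).astype(int)

# Classify a held-out mix of 50 speech-like and 50 non-speech-like frames.
test_frames = np.vstack([rng.normal(1.0, 1.0, size=(50, 13)),
                         rng.normal(-1.0, 1.0, size=(50, 13))])
labels = classify_frames(test_frames)
```

In practice, per-frame decisions like these are usually smoothed (e.g. with an HMM or median filter) before segment boundaries are emitted.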