This paper proposes a novel approach for directly modeling speech at the waveform
level using a neural network. The approach uses the neural network-based
statistical parametric speech synthesis framework with a specially designed output
layer. As acoustic feature extraction is integrated into acoustic model training,
the approach can overcome the limitations of conventional approaches, such as
two-step optimization (feature extraction followed by acoustic modeling), the use
of spectra rather than waveforms as targets, the use of overlapping and shifting
frames as units, and a fixed decision-tree structure. Experimental results show
decision tree structure. Experimental results show that the proposed approach can
directly maximize the likelihood defined at the waveform domain.
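To make the core idea concrete, here is a minimal toy sketch (not the paper's actual output layer or model) of what "maximizing a likelihood defined in the waveform domain" can mean: a simple linear model predicts a per-sample Gaussian mean and log-variance for the raw waveform from context features, and training minimizes the waveform-domain negative log-likelihood directly, with no separate feature-extraction step. All names, dimensions, and the synthetic data below are illustrative assumptions.

```python
import numpy as np

# Toy sketch, NOT the paper's method: a linear "acoustic model" predicts the
# mean and log-variance of a Gaussian over each raw waveform sample, and we
# maximize the waveform-domain log-likelihood by gradient descent.

rng = np.random.default_rng(0)

T, D = 64, 8                        # waveform length, context-feature dim (toy)
x = rng.standard_normal(T)          # synthetic target waveform samples
c = rng.standard_normal((T, D))     # synthetic per-sample context features

W_mu = np.zeros(D)                  # weights predicting the Gaussian mean
W_lv = np.zeros(D)                  # weights predicting the log-variance

def neg_log_likelihood(W_mu, W_lv):
    """Gaussian negative log-likelihood of the waveform samples."""
    mu = c @ W_mu
    log_var = c @ W_lv
    ll = -0.5 * (np.log(2 * np.pi) + log_var
                 + (x - mu) ** 2 / np.exp(log_var))
    return -ll.sum()

# Plain gradient descent on the waveform-domain negative log-likelihood.
lr = 1e-3
for _ in range(200):
    mu = c @ W_mu
    log_var = c @ W_lv
    var = np.exp(log_var)
    # Analytic gradients of the NLL w.r.t. the two weight vectors
    g_mu = c.T @ ((mu - x) / var)
    g_lv = c.T @ (0.5 * (1.0 - (x - mu) ** 2 / var))
    W_mu -= lr * g_mu
    W_lv -= lr * g_lv

print(neg_log_likelihood(W_mu, W_lv))  # lower than at initialization
```

The point of the sketch is only the training objective: the loss is evaluated on raw waveform samples, so the whole model is optimized against the waveform rather than against intermediate spectral features extracted in a separate first step.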