Various neural network architectures have been proposed in the literature to model
2D correlations in the input signal, including convolutional layers, frequency
LSTMs and 2D LSTMs such as time-frequency LSTMs, grid LSTMs and ReNet LSTMs. It has
been argued that frequency LSTMs can model translational variations similar to
CNNs, and 2D LSTMs can model even more variations, but no proper comparison has
been done for speech tasks. While convolutional layers have been a popular
technique in speech tasks, this paper compares convolutional and LSTM architectures
to model time-frequency patterns as the first layer in an LDNN architecture.
This comparison is particularly interesting when the convolutional layer degrades
performance, such as in noisy conditions or when the learned filterbank is not
constant-Q. We find that grid-LDNNs offer the best performance of all
techniques, providing a 1-4% relative improvement over an LDNN and a CLDNN
on three different large-vocabulary Voice Search tasks.