INITIALIZATION MATTERS: ORTHOGONAL PREDICTIVE STATE RECURRENT NEURAL NETWORKS
ICLR (2018) (to appear)
Krzysztof Choromanski, Carlton Downey, Byron Emereth Boots
Learning to predict complex time-series data is a fundamental challenge in a range of disciplines including Machine Learning, Robotics, and Natural Language Processing. Predictive State Recurrent Neural Networks (PSRNNs) (Downey et al., 2017) are a state-of-the-art approach for modeling time-series data that combines the benefits of probabilistic filters and Recurrent Neural Networks in a single model. PSRNNs leverage the concept of Hilbert Space Embeddings of distributions (Smola et al., 2007) to embed predictive states into a Reproducing Kernel Hilbert Space, then estimate, predict, and update these embedded states using Kernel Bayes Rule. Practical implementations of PSRNNs are made possible by the machinery of Random Features (RFs), in which input features are mapped into a new space where dot products approximate the kernel well. Unfortunately, PSRNNs often require a large number of RFs to obtain good results, resulting in large models that are slow to execute and slow to train. Orthogonal Random Features (ORFs) (Yu et al., 2016) are an improvement on RFs that has been shown to decrease the number of RFs required in a number of applications. However, it is not clear that ORFs can be applied to PSRNNs, as PSRNNs rely on Kernel Ridge Regression as a core component of their learning algorithm, and the theoretical guarantees of ORFs do not apply in this setting. In this paper we extend the theory of ORFs to Kernel Ridge Regression and show that ORFs can be used to obtain Orthogonal PSRNNs (OPSRNNs), which are smaller and faster than PSRNNs.
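To make the distinction between RFs and ORFs concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a Gaussian kernel with bandwidth sigma and uses the cosine/sine form of random Fourier features. The ORF weights follow the construction as it is commonly described: orthogonal blocks obtained from a QR decomposition, with rows rescaled by chi-distributed norms so that each row has the same marginal distribution as an unstructured Gaussian row. All function names (gaussian_kernel, rf_weights, orf_weights, feature_map) and the parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def gaussian_kernel(X, Y, sigma=1.0):
    """Exact Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def rf_weights(d, D, sigma, rng):
    """Unstructured random features: D i.i.d. Gaussian rows in dimension d."""
    return rng.standard_normal((D, d)) / sigma


def orf_weights(d, D, sigma, rng):
    """Orthogonal random features: stacked d x d blocks with orthogonal rows."""
    blocks = []
    for _ in range(-(-D // d)):  # ceil(D / d) independent blocks
        G = rng.standard_normal((d, d))
        Q, R = np.linalg.qr(G)
        Q = Q * np.sign(np.diag(R))  # sign fix so Q is uniformly distributed
        # Rescale rows so their norms match those of Gaussian vectors (chi with d dof).
        norms = np.sqrt(rng.chisquare(df=d, size=d))
        blocks.append(norms[:, None] * Q)
    return np.vstack(blocks)[:D] / sigma


def feature_map(X, W):
    """Map inputs so that dot products of features approximate the kernel."""
    proj = X @ W.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(W.shape[0])


# Illustrative comparison on synthetic data (all values are assumptions).
rng = np.random.default_rng(0)
d, D, sigma = 8, 512, 2.0
X = rng.standard_normal((50, d))
K = gaussian_kernel(X, X, sigma)

W_rf = rf_weights(d, D, sigma, rng)
W_orf = orf_weights(d, D, sigma, rng)
Z_rf, Z_orf = feature_map(X, W_rf), feature_map(X, W_orf)

err_rf = np.mean(np.abs(Z_rf @ Z_rf.T - K))
err_orf = np.mean(np.abs(Z_orf @ Z_orf.T - K))
print(f"mean abs kernel error  RF: {err_rf:.4f}  ORF: {err_orf:.4f}")
```

Both feature maps give an unbiased estimate of the kernel. The only difference is that the ORF rows are orthogonal within each block, which is the property credited with reducing the number of features needed. The sketch covers kernel approximation only and does not address the Kernel Ridge Regression setting that the paper analyzes.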