Aurko Roy

Aurko Roy is currently a Research Scientist on the Google Brain team, where he conducts research at the intersection of generative models, structured prediction, and natural language processing. He received his PhD in Algorithms, Combinatorics & Optimization from Georgia Tech in 2017.
Authored Publications
    Hurdles to Progress in Long-form Question Answering
    Kalpesh Krishna
    Mohit Iyyer
    NAACL (2021)
    Abstract: There has been remarkable recent progress in factoid open-domain question answering (QA), where a short phrase or entity is sufficient to answer the question. Much less work has been done on the more challenging task of long-form QA, where the goal is to generate elaborate, paragraph-long answers to more open-ended questions. In this work, we present a new system based on sparse attention and contrastive retriever learning, which achieves state-of-the-art performance on ELI5, a popular long-form QA dataset in the KILT benchmark (Petroni et al. 2020). However, a detailed analysis of our system reveals several concerning trends which are hampering progress in this important area: (1) little to no evidence that our model's generations are actually grounded in the retrieved documents, a desirable property which is not captured by the metrics in the KILT benchmark; (2) a significant training/validation/test set overlap in ELI5, with at least 75% of validation questions having a paraphrased question in the training data; (3) significant issues with the popular evaluation metric ROUGE-L, with a very low margin of improvement (2-5 ROUGE-L) from trivial lower-bound baselines (like input copying) to upper-bound reference baselines; (4) the inherent difficulty of human evaluation in this task, due to the long length of generated answers and unfamiliarity with the topics.
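    Point (3) above compares trivial lower-bound baselines against reference upper bounds under ROUGE-L. The snippet below is a minimal illustrative sketch, not the paper's evaluation code: an LCS-based ROUGE-L F1 and an "input copying" baseline. The question and answer strings are invented for illustration, and real evaluations use proper tokenization and multiple references.

```python
# A minimal sketch of ROUGE-L (longest-common-subsequence F1) and an
# input-copying baseline; whitespace tokenization, no stemming or stopword handling.

def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

# Hypothetical example: the "lower-bound" baseline simply copies the input question.
question = "why do we yawn when we see someone else yawn ?"
reference_answer = "yawning is thought to be socially contagious , likely tied to empathy ."
copy_baseline = question  # no retrieval, no generation

print(rouge_l_f1(copy_baseline, reference_answer))
```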
    Efficient Content-Based Sparse Attention with Routing Transformers
    Ashish Teku Vaswani
    David Grangier
    Transactions of the Association for Computational Linguistics (2021)
    Abstract: Self-attention has recently been adopted for a wide range of sequence modeling problems. Despite its effectiveness, self-attention suffers from quadratic compute and memory requirements with respect to sequence length. Successful approaches to reduce this complexity have focused on attending to local sliding windows or a small set of locations independent of content. Our work proposes to learn dynamic sparse attention patterns that avoid allocating computation and memory to attend to content unrelated to the query of interest. This work builds on two lines of research: it combines the modeling flexibility of prior work on content-based sparse attention with the efficiency gains from approaches based on local, temporal sparse attention. Our model, the Routing Transformer, endows self-attention with a sparse routing module based on online k-means while reducing the overall complexity of attention from O(n^2 d) to O(n^1.5 d) for sequence length n and hidden dimension d. We show that our model outperforms comparable sparse attention models on language modeling on Wikitext-103 (15.8 vs. 18.3 perplexity), as well as on image generation on ImageNet-64 (3.43 vs. 3.44 bits/dim), while using fewer self-attention layers. Additionally, we set a new state of the art on the newly released PG-19 dataset, obtaining a test perplexity of 33.2 with a 22-layer Routing Transformer model trained on sequences of length 8192.
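    As a rough illustration of the routing idea described above, the sketch below assigns queries and keys to shared cluster centroids and lets each query attend only within its cluster, so attention cost scales with cluster size rather than full sequence length. It is a simplified numpy sketch under stated assumptions, not the paper's implementation: the centroids are fixed random vectors rather than updated by online k-means, and all names are placeholders.

```python
import numpy as np

def routing_attention(q, k, v, centroids):
    # q, k, v: (n, d); centroids: (num_clusters, d)
    n, d = q.shape
    # Nearest-centroid assignment for queries and keys (shared routing space).
    q_assign = np.argmin(((q[:, None, :] - centroids[None]) ** 2).sum(-1), axis=-1)
    k_assign = np.argmin(((k[:, None, :] - centroids[None]) ** 2).sum(-1), axis=-1)
    out = np.zeros_like(v)
    for c in range(centroids.shape[0]):
        qi = np.where(q_assign == c)[0]
        ki = np.where(k_assign == c)[0]
        if len(qi) == 0 or len(ki) == 0:
            continue
        scores = q[qi] @ k[ki].T / np.sqrt(d)            # (|qi|, |ki|) scores within the cluster
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)        # softmax over the cluster's keys only
        out[qi] = weights @ v[ki]
    return out

rng = np.random.default_rng(0)
n, d, clusters = 64, 16, 8                               # ~sqrt(n) clusters, as in the paper's analysis
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
centroids = rng.normal(size=(clusters, d))               # stand-in for learned/online k-means centroids
print(routing_attention(q, k, v, centroids).shape)
```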
    Unsupervised Paraphrasing without Translation
    David Grangier
    ACL (2019)
    Abstract: Paraphrasing exemplifies the ability to abstract semantic content from surface forms. Recent work on automatic paraphrasing is dominated by methods leveraging Machine Translation (MT) as an intermediate step. This contrasts with humans, who can paraphrase without being bilingual. This work proposes to learn paraphrasing models from an unlabeled monolingual corpus only. To that end, we propose a residual variant of the vector-quantized variational auto-encoder. We compare with MT-based approaches on paraphrase identification, generation, and training augmentation. Monolingual paraphrasing outperforms unsupervised MT in all settings. Comparisons with supervised MT are more mixed: monolingual paraphrasing is interesting for identification and augmentation, while supervised MT is superior for generation.
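    The sketch below illustrates residual vector quantization, the kind of "residual variant" of VQ-VAE quantization the abstract refers to: each codebook quantizes the residual left over by the previous one, giving a coarse-to-fine discrete code. This is a toy numpy sketch with random, untrained codebooks, not the paper's model.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Quantize vectors z with a stack of codebooks, each encoding the previous residual."""
    residual = z.copy()
    quantized = np.zeros_like(z)
    codes = []
    for codebook in codebooks:                  # codebook: (codebook_size, d)
        dists = ((residual[:, None, :] - codebook[None]) ** 2).sum(-1)
        idx = dists.argmin(-1)                  # nearest code for each vector
        chosen = codebook[idx]
        quantized += chosen                     # accumulate the coarse-to-fine approximation
        residual -= chosen                      # the next codebook models what is left over
        codes.append(idx)
    return quantized, codes

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))                     # 4 latent vectors of dimension 8
codebooks = [rng.normal(size=(16, 8)) for _ in range(3)]  # random stand-ins for learned codebooks
zq, codes = residual_quantize(z, codebooks)
print(np.linalg.norm(z - zq), [c.tolist() for c in codes])
```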
    Improving Interpolation in Autoencoders
    David Berthelot
    Colin Raffel
    Ian Goodfellow
    ICLR (2019)
    Abstract: Autoencoders provide a powerful framework for learning compressed representations by encoding all of the information needed to reconstruct a data point in a latent code. In some cases, autoencoders can "interpolate": by decoding the convex combination of the latent codes for two data points, the autoencoder can produce an output which semantically mixes characteristics from the data points. In this paper, we propose a regularization procedure which encourages interpolated outputs to appear more realistic by fooling a critic network which has been trained to recover the mixing coefficient from interpolated data. We then develop a simple benchmark task where we can quantitatively measure the extent to which various autoencoders can interpolate, and show that our regularizer dramatically improves interpolation in this setting. We also demonstrate empirically that our regularizer produces latent codes which are more effective on downstream tasks, suggesting a possible link between interpolation abilities and learning useful representations.
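    A toy sketch of how the pieces described above fit together: latent codes of two inputs are mixed with a coefficient alpha, the decoded mixture is scored by a critic that tries to recover alpha, and the autoencoder is regularized toward making the critic predict zero. The linear "encoder", "decoder", and "critic" below are random stand-ins used only to show the wiring of the losses, not the trained networks from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_latent = 32, 8
W_enc = rng.normal(size=(d_in, d_latent)) * 0.1
W_dec = rng.normal(size=(d_latent, d_in)) * 0.1
w_critic = rng.normal(size=d_in) * 0.1

encode = lambda x: x @ W_enc
decode = lambda z: z @ W_dec
critic = lambda x: float(np.clip(x @ w_critic, 0.0, 1.0))   # predicts the mixing coefficient

x1, x2 = rng.normal(size=d_in), rng.normal(size=d_in)
alpha = float(rng.uniform(0.0, 0.5))

z_mix = alpha * encode(x1) + (1.0 - alpha) * encode(x2)     # convex combination of latent codes
x_mix = decode(z_mix)

recon_loss = np.mean((decode(encode(x1)) - x1) ** 2)        # ordinary autoencoder reconstruction term
ae_reg = critic(x_mix) ** 2                                  # autoencoder tries to push critic output to 0
critic_loss = (critic(x_mix) - alpha) ** 2                   # critic tries to recover the true alpha

print(recon_loss, ae_reg, critic_loss)
```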
    Fast Decoding in Sequence Models Using Discrete Latent Variables
    Lukasz Kaiser
    Ashish Vaswani
    Niki J. Parmar
    Samy Bengio
    Jakob Uszkoreit
    Noam Shazeer
    ICML (2018)
    Abstract: Auto-regressive sequence models based on deep neural networks, such as RNNs, WaveNet, and the Transformer, are the state of the art on many tasks. However, they lack parallelism and are thus slow for long sequences. RNNs lack parallelism both during training and decoding, while architectures like WaveNet and the Transformer are much more parallel during training but still lack parallelism during decoding. We present a method to extend sequence models using discrete latent variables that makes decoding much more parallel. The main idea behind this approach is to first autoencode the target sequence into a shorter discrete latent sequence, which is generated auto-regressively, and then decode the full sequence from this shorter latent sequence in a parallel manner. We verify that our method works on the task of neural machine translation, where our models are an order of magnitude faster than comparable auto-regressive models. We also introduce a new method for constructing discrete latent variables that allows us to obtain good BLEU scores.
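    The control flow described above can be sketched as follows: an autoregressive loop runs only over a short sequence of discrete latents, and the full-length output is then produced from those latents in a single parallel step. The numpy "models" below are random linear maps used purely to show the structure, not the paper's architecture, and the 8x compression factor is just an example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, compression, vocab, latent_vocab, d = 64, 8, 100, 16, 32
m = n // compression                                        # shorter discrete latent sequence

latent_embed = rng.normal(size=(latent_vocab, d))
W_prior = rng.normal(size=(d, latent_vocab)) * 0.1          # stand-in autoregressive latent prior
W_decode = rng.normal(size=(d, compression * vocab)) * 0.1  # stand-in parallel decoder

# Stage 1: autoregressive generation, but only over m << n steps.
latents, state = [], np.zeros(d)
for _ in range(m):
    logits = state @ W_prior
    sym = int(logits.argmax())                              # greedy choice of the next discrete latent
    latents.append(sym)
    state = 0.5 * state + latent_embed[sym]                 # toy recurrence over generated latents

# Stage 2: every latent expands into `compression` output tokens in parallel.
logits = latent_embed[latents] @ W_decode                   # (m, compression * vocab)
tokens = logits.reshape(m, compression, vocab).argmax(-1).reshape(-1)
print(len(latents), tokens.shape)                           # 8 autoregressive steps -> 64 output tokens
```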
    Thermometer Encoding: One Hot Way To Resist Adversarial Examples
    Jacob Buckman
    Colin Raffel
    Ian Goodfellow
    ICLR (2018)
    Abstract: It is well known that for neural networks, it is possible to construct inputs which are misclassified by the network yet indistinguishable from true data points, known as "adversarial examples". We propose a simple modification to standard neural network architectures, thermometer encoding, which significantly increases the robustness of the network to adversarial examples. We demonstrate this robustness with experiments on the MNIST, CIFAR-10, CIFAR-100, and SVHN datasets, and show that models with thermometer-encoded inputs consistently have higher accuracy on adversarial examples, while also maintaining the same accuracy on non-adversarial examples and training more quickly.
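    A minimal sketch of the encoding itself is shown below, using one common convention (cumulative ones up to the quantization level) for pixel values in [0, 1]. The function name and the choice of 16 levels are illustrative assumptions, and the defended models in the paper additionally rely on adversarial training on the encoded inputs.

```python
import numpy as np

def thermometer_encode(x, levels=16):
    """Map values in [0, 1] to cumulative ("thermometer") binary codes of length `levels`."""
    x = np.asarray(x, dtype=np.float64)
    bucket = np.minimum((x * levels).astype(int), levels - 1)      # quantization level per value
    thresholds = np.arange(levels)
    return (bucket[..., None] >= thresholds).astype(np.float32)    # ones up to the bucket index

pixels = np.array([0.0, 0.26, 0.74, 1.0])
codes = thermometer_encode(pixels, levels=4)
print(codes)
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]
```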
    Adversarial Patch
    Tom Brown
    Dandelion Mane
    Justin Gilmer
    NIPS Workshop (2017)
    Abstract: We present a method to create universal, robust, targeted adversarial image patches in the real world. The patches are universal because they can be used to attack any scene, robust because they work under a wide variety of transformations, and targeted because they can cause a classifier to output any target class. These adversarial patches can be printed, added to any scene, photographed, and presented to image classifiers; even when the patches are small, they cause the classifiers to ignore the other items in the scene and report a chosen target class.
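    The core mechanics can be sketched as below: a small patch is pasted into an image at random locations, and an attacker would maximize the classifier's target-class score averaged over such placements (a simplified stand-in for the expectation over scenes and transformations used in the paper). The linear "classifier", image sizes, and patch size here are placeholder assumptions, not a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 32
target_class, num_classes = 7, 10
W_cls = rng.normal(size=(H * W * 3, num_classes)) * 0.01     # stand-in image classifier

def apply_patch(image, patch, top, left):
    """Paste a square patch over the image at the given position."""
    out = image.copy()
    p = patch.shape[0]
    out[top:top + p, left:left + p, :] = patch                # the patch overwrites part of the scene
    return out

patch = rng.uniform(size=(8, 8, 3))                           # the variable an attacker would optimize
image = rng.uniform(size=(H, W, 3))

# Average target-class score over random placements: the quantity the attack would ascend.
scores = []
for _ in range(16):
    top, left = rng.integers(0, H - 8), rng.integers(0, W - 8)
    patched = apply_patch(image, patch, top, left)
    logits = patched.reshape(-1) @ W_cls
    scores.append(logits[target_class])
print(float(np.mean(scores)))
```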
    Reinforcement Learning under Model Mismatch
    Huan Xu
    Sebastian Pokutta
    NIPS (2017)
    Hierarchical Clustering via Spreading Metrics
    Sebastian Pokutta
    JMLR (2017)
    Strong reductions for extended formulations
    Gabor Braun
    Sebastian Pokutta
    IPCO (2016)
    Hierarchical Clustering via Spreading Metrics
    Sebastian Pokutta
    NIPS (2016)
    The matching problem has no small symmetric SDP
    Gabor Braun
    Sebastian Pokutta
    Arefin Huq
    Jonah Brown-Cohen
    Prasad Raghavendra
    Benjamin Weitz
    Daniel Zink
    Mathematical Programming (2016)
    The matching problem has no small symmetric SDP
    Gabor Braun
    Jonah Brown-Cohen
    Arefin Huq
    Sebastian Pokutta
    Prasad Raghavendra
    Benjamin Weitz
    Daniel Zink
    SODA (2016)