Alex Alemi

Authored Publications
    The paper introduces a new method for learning variational approximations to Bayesian posterior predictive distributions that does not require (1) the posterior predictive distribution itself, (2) the posterior distribution, (3) exact samples from the posterior, or (4) any test-time marginalization.
    Does Knowledge Distillation Really Work?
    Samuel Stanton
    Pavel Izmailov
    Polina Kirichenko
    Andrew Gordon Wilson
    NeurIPS (2021)
    Knowledge distillation is a popular technique for training a small student network to match a larger teacher model, such as an ensemble of networks. In this paper, we show that while knowledge distillation has a useful regularizing effect, it does not typically work as it is commonly understood: there often remains a surprisingly large discrepancy between the predictive distributions of the teacher and the student, even in cases when the student has the capacity to perfectly match the teacher. We show that the dataset used for distillation and the amount of temperature scaling applied to the logits play a crucial role in how closely the student matches the teacher, and discuss optimal ways of setting these hyper-parameters in practice.
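As a concrete illustration of the temperature scaling mentioned above, here is a minimal NumPy sketch of a temperature-softened distillation loss. The function names, toy logits, and the conventional T**2 gradient-scale factor are illustrative assumptions, not code or settings from the paper.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_kl(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) between temperature-softened predictive distributions."""
    p = softmax(teacher_logits, T)               # teacher's softened distribution
    log_q = np.log(softmax(student_logits, T))   # student's softened log-probabilities
    # The T**2 factor is the usual convention that keeps gradient scale comparable across T.
    return (T ** 2) * np.mean(np.sum(p * (np.log(p) - log_q), axis=-1))

teacher = np.array([[4.0, 1.0, -2.0]])
student = np.array([[3.0, 1.5, -1.0]])
print(distillation_kl(teacher, student, T=2.0))
```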
    Perhaps surprisingly, recent studies have shown that probabilistic model likelihoods have poor specificity for out-of-distribution (OOD) detection and often assign higher likelihoods to OOD data than to in-distribution data. To ameliorate this issue we propose DoSE, the density of states estimator. Drawing on the statistical physics notion of "density of states," the DoSE decision rule avoids direct comparison of model probabilities and instead utilizes the "probability of the model probability," or indeed the frequency of any reasonable statistic. The frequency is calculated using nonparametric density estimators (e.g., KDE and one-class SVM), which measure the typicality of various model statistics given the training data and from which we can flag test points with low typicality as anomalous. Unlike many other methods, DoSE requires neither labeled data nor OOD examples. DoSE is modular and can be trivially applied to any existing, trained model. We demonstrate DoSE's state-of-the-art performance against other unsupervised OOD detectors on previously established "hard" benchmarks.
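A minimal sketch of the decision rule described above, under simplifying assumptions: the "model" is a unit Gaussian, the single statistic is its log-likelihood, and SciPy's gaussian_kde stands in for the paper's nonparametric estimators (which also include one-class SVMs). The toy data and the 5% threshold are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(0)
x_train = rng.normal(0.0, 1.0, 5000)                   # in-distribution training data
x_test = np.concatenate([rng.normal(0.0, 1.0, 100),    # in-distribution test points
                         rng.normal(6.0, 1.0, 100)])   # OOD test points

# Statistic: the model's log-likelihood at each point (any reasonable statistic works).
stat_train = norm.logpdf(x_train, loc=0.0, scale=1.0)
stat_test = norm.logpdf(x_test, loc=0.0, scale=1.0)

# "Probability of the model probability": density of the statistic under training data.
dose = gaussian_kde(stat_train)
typicality = dose(stat_test)

threshold = np.quantile(dose(stat_train), 0.05)        # flag the ~5% least typical
is_ood = typicality < threshold
print(is_ood[:100].mean(), is_ood[100:].mean())        # fraction flagged per group
```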
    Neural Tangents is a library designed to enable research into infinite-width neural networks. It provides a high-level API for specifying complex and hierarchical neural network architectures. These networks can then be trained and evaluated either at finite width as usual or in their infinite-width limit. Infinite-width networks can be trained analytically using exact Bayesian inference or using gradient descent via the Neural Tangent Kernel. Additionally, Neural Tangents provides tools to study gradient descent training dynamics of wide but finite networks in either function space or weight space. The entire library runs out-of-the-box on CPU, GPU, or TPU. All computations can be automatically distributed over multiple accelerators with near-linear scaling in the number of devices. Neural Tangents is available at https://github.com/google/neural-tangents. We also provide an accompanying interactive Colab notebook at https://colab.sandbox.google.com/github/neural-tangents/neural-tangents/blob/master/notebooks/neural_tangents_cookbook.ipynb
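A brief usage sketch of the API described above; the specific architecture and toy regression data are illustrative, and the canonical examples live in the linked repository and Colab notebook.

```python
import jax.numpy as jnp
import neural_tangents as nt
from neural_tangents import stax

# Define a network: we get a finite-width initializer/apply pair plus the
# corresponding infinite-width kernel function.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

x_train = jnp.linspace(-1.0, 1.0, 20).reshape(-1, 1)
y_train = jnp.sin(jnp.pi * x_train)
x_test = jnp.linspace(-1.0, 1.0, 50).reshape(-1, 1)

# Closed-form predictions of the infinite-width network trained to convergence,
# either as exact Bayesian inference (NNGP) or as gradient descent (NTK).
predict_fn = nt.predict.gradient_descent_mse_ensemble(kernel_fn, x_train, y_train)
y_nngp = predict_fn(x_test=x_test, get='nngp')
y_ntk = predict_fn(x_test=x_test, get='ntk')
print(y_nngp.shape, y_ntk.shape)
```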
    Estimating and optimizing Mutual Information (MI) is core to many problems in machine learning; however, bounding MI in high dimensions is challenging. To establish tractable and scalable objectives, recent work has turned to variational bounds parameterized by neural networks, but the relationships and tradeoffs between these bounds remain unclear. In this work, we unify these recent developments in a single framework. We find that the existing variational lower bounds degrade when the MI is large, exhibiting either high bias or high variance. To address this problem, we introduce a continuum of lower bounds that encompasses previous bounds and flexibly trades off bias and variance. On high-dimensional, controlled problems, we empirically characterize the bias and variance of the bounds and their gradients and demonstrate the effectiveness of our new bounds for estimation and representation learning.
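As one concrete member of the family of variational lower bounds the abstract refers to, here is a NumPy sketch of an InfoNCE-style bound on correlated Gaussian data, where the MI is known analytically. The fixed bilinear critic and the toy data are illustrative assumptions, not the paper's experimental setup (where critics are learned neural networks).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, rho = 512, 4, 0.9
x = rng.normal(size=(n, d))
y = rho * x + np.sqrt(1.0 - rho ** 2) * rng.normal(size=(n, d))  # correlated with x

def infonce_bound(x, y, W):
    """InfoNCE-style lower bound on I(X;Y) with a bilinear critic f(x, y) = x W y^T."""
    scores = x @ W @ y.T                        # scores[i, j] = f(x_i, y_j)
    row_max = scores.max(axis=1, keepdims=True)
    # log of the row-wise mean of exp(scores), computed stably
    log_mean_exp = np.log(np.exp(scores - row_max).mean(axis=1)) + row_max[:, 0]
    return np.mean(np.diag(scores) - log_mean_exp)

W = 2.0 * np.eye(d)                             # a fixed critic; in practice it is learned
true_mi = -0.5 * d * np.log(1.0 - rho ** 2)     # analytic MI for this Gaussian pair
print(f"bound ~= {infonce_bound(x, y, W):.2f}, true MI = {true_mi:.2f}, cap log N = {np.log(n):.2f}")
```

Note the bound saturates at log N regardless of the true MI, which is one instance of the bias/variance tradeoff the paper characterizes.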
    In order to gain insights into the generalization properties of deep neural networks, in this preliminary work we suggest studying the generalization properties of infinite ensembles of infinitely wide neural networks. Amazingly, this model family admits tractable calculations for many information theoretic quantities. Below we both derive these quantities and report some initial empirical investigations in the search for signals that correlate with generalization on both toy and real datasets.
    In classic papers, Zellner [1988, 2002] demonstrated that Bayesian inference could be derived as the solution to an information theoretic functional. Below we derive a generalized form of this functional as a variational lower bound of a predictive information bottleneck objective. This generalized functional encompasses most modern inference procedures and suggests novel ones.
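As a reminder of the classic result cited above (Zellner's functional itself, not this paper's generalization of it), the Bayesian posterior can be written as the minimizer of a variational objective over distributions q(theta); the notation here is a standard rendering rather than a quote from either source.

```latex
\[
  p(\theta \mid \mathcal{D})
  \;=\;
  \operatorname*{arg\,min}_{q(\theta)}
  \left\{
    \mathrm{KL}\!\left[\,q(\theta)\,\middle\|\,p(\theta)\,\right]
    \;-\;
    \mathbb{E}_{q(\theta)}\!\left[\log p(\mathcal{D} \mid \theta)\right]
  \right\}.
\]
```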
    Graph embedding methods represent nodes in a continuous vector space, preserving information from the graph (e.g. by sampling random walks). These methods have many hyper-parameters (such as the random walk length) that must be manually tuned for every graph. In this paper, we replace random walk hyper-parameters with trainable parameters that we automatically learn via backpropagation. In particular, we learn a novel attention model on the power series of the transition matrix, which guides the random walk to optimize an upstream objective. Unlike previous approaches to attention models, the attention parameters in our method are used exclusively on the data (e.g. on the random walk) and are not used by the model for inference. We experiment on link prediction tasks, as we aim to produce embeddings that best preserve the graph structure, generalizing to unseen information. We improve state-of-the-art results on a comprehensive suite of real-world datasets including social, collaboration, and biological networks. Adding attention to random walks can reduce the error by 20% to 45% on the datasets we evaluated. Further, our learned attention parameters are different for every graph, and our automatically found values agree with the optimal choice of hyper-parameter when we manually tune existing methods.
    Without explicitly being designed to do so, VIB (Alemi et al., 2017) gives two natural metrics for handling and quantifying uncertainty in neural networks. In this work we present a simple case study demonstrating that VIB can improve a network's classification calibration as well as its ability to detect out-of-sample data.
    Fixing a Broken ELBO
    Ben Poole
    Josh Dillon
    Proceedings of the 35th International Conference on Machine Learning, PMLR, Stockholmsmässan, Stockholm, Sweden (2018), pp. 159-168
    Recent work in unsupervised representation learning has focused on learning deep directed latent variable models. Fitting these models by maximizing the marginal likelihood or evidence is typically intractable, thus a common approximation is to maximize the evidence lower bound (ELBO) instead. However, maximum likelihood training (whether exact or approximate) does not necessarily result in a good latent representation, as we demonstrate both theoretically and empirically. In particular, we derive variational lower and upper bounds on the mutual information between the input and the latent variable, and use these bounds to derive a rate-distortion curve that characterizes the tradeoff between compression and reconstruction accuracy. Using this framework, we demonstrate that there is a family of models with identical ELBO, but different quantitative and qualitative characteristics. Our framework also suggests a simple new method to ensure that latent variable models with powerful stochastic decoders do not ignore their latent code.
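A compact restatement of the rate-distortion framework sketched in the abstract above, in the notation I recall from the paper (encoder e(z|x), decoder d(x|z), marginal approximation m(z), data entropy H); treat it as a hedged summary rather than a quote.

```latex
\begin{align*}
  D &= -\,\mathbb{E}_{p(x)}\,\mathbb{E}_{e(z \mid x)}\!\left[\log d(x \mid z)\right]
      \quad\text{(distortion: reconstruction term)}\\
  R &= \mathbb{E}_{p(x)}\!\left[\mathrm{KL}\!\left(e(z \mid x)\,\middle\|\,m(z)\right)\right]
      \quad\text{(rate: coding cost under the marginal)}\\
  \mathrm{ELBO} &= -(D + R), \qquad H - D \;\le\; I(X;Z) \;\le\; R.
\end{align*}
```

Sweeping the relative weight of the two terms (a beta-VAE-style objective D + beta*R) traces out the rate-distortion curve described above, and models with the same sum D + R share the same ELBO while sitting at very different points on that curve.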
    In this paper, we investigate the degree to which the encoding of a β-VAE captures label information across multiple architectures on Binary Static MNIST and Omniglot. Even though they are trained in a completely unsupervised manner, we demonstrate that a β-VAE can retain a large amount of label information, even when asked to learn a highly compressed representation.
    We propose a simple, tractable lower bound on the mutual information contained in the joint generative density of any latent variable generative model: the GILBO (Generative Information Lower BOund). It offers a data-independent measure of the complexity of the learned latent variable description, giving the log of the effective description length. It is well-defined for both VAEs and GANs. We compute the GILBO for 800 GANs and VAEs trained on MNIST and discuss the results.
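A toy, self-contained sketch of the quantity described above: draw (z, x) pairs from the generator, fit a tractable encoder e(z|x), and average log e(z|x) - log p(z). The linear-Gaussian "generator" and least-squares encoder are illustrative stand-ins for a GAN/VAE decoder and a learned neural encoder.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 20000
z = rng.normal(size=(n, 1))                    # prior p(z) = N(0, 1)
x = 3.0 * z + 0.5 * rng.normal(size=(n, 1))    # toy "generator" p(x|z)

# Fit a Gaussian encoder e(z|x) = N(w*x + b, s^2) by least squares.
X = np.concatenate([x, np.ones((n, 1))], axis=1)
coef, *_ = np.linalg.lstsq(X, z, rcond=None)
mu = X @ coef
s = np.std(z - mu)

# GILBO estimate: E[ log e(z|x) - log p(z) ] over generated pairs.
gilbo = np.mean(norm.logpdf(z, loc=mu, scale=s) - norm.logpdf(z, loc=0.0, scale=1.0))
true_mi = 0.5 * np.log(1 + 3.0 ** 2 / 0.5 ** 2)  # analytic I(X;Z) for this toy model
print(f"GILBO estimate: {gilbo:.3f}  (analytic I(X;Z) = {true_mi:.3f})")
```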
    In this preliminary and speculative work, we offer a unique perspective and framework to think about a wide class of existing objectives in Machine Learning. We discuss its implications, and identify some formal connections to Thermodynamics.
    TensorFlow Distributions
    Josh Dillon
    Dustin Tran
    Dave Moore
    Workshop on Probabilistic Programming Languages, Semantics, and Systems (PPS 2018) (2017)
    The TensorFlow Distributions library implements a vision of probability theory adapted to the modern deep-learning paradigm of end-to-end differentiable computation. Building on two basic abstractions, it offers flexible building blocks for probabilistic computation. Distributions provide fast, numerically stable methods for generating samples and computing statistics, e.g., log density. Bijectors provide composable volume-tracking transformations with automatic caching. Together these enable modular construction of high dimensional distributions and transformations not possible with previous libraries (e.g., pixelCNNs, autoregressive flows, and reversible residual networks). They are the workhorse behind deep probabilistic programming systems like Edward and empower fast black-box inference in probabilistic models built on deep-network components. TensorFlow Distributions has proven an important part of the TensorFlow toolkit within Google and in the broader deep learning community.
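A brief usage sketch of the two abstractions described above, shown here with TensorFlow Probability, the current home of this code (the original library shipped inside TensorFlow itself); the specific distributions and values are illustrative.

```python
import tensorflow_probability as tfp

tfd = tfp.distributions
tfb = tfp.bijectors

# A Distribution: fast sampling and numerically stable statistics.
normal = tfd.Normal(loc=0., scale=1.)
samples = normal.sample(5)
log_probs = normal.log_prob(samples)

# A Bijector composed with a base distribution: a volume-tracking transformation.
log_normal = tfd.TransformedDistribution(distribution=normal, bijector=tfb.Exp())
print(log_probs)
print(log_normal.log_prob(2.0))  # includes the log|det Jacobian| correction
```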
    We present a variational approximation to the information bottleneck of Tishby et al. (1999). This variational approach allows us to parameterize the information bottleneck model using a neural network and leverage the reparameterization trick for efficient training. We call this method "Deep Variational Information Bottleneck", or Deep VIB. We show that models trained with the VIB objective outperform those that are trained with other forms of regularization, in terms of generalization performance and robustness to adversarial attack.
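A minimal NumPy sketch of a per-batch VIB-style loss as described above: the cross-entropy of a stochastic classifier plus beta times the KL between a Gaussian encoder e(z|x) and a standard-normal prior, with the reparameterization trick for sampling z. The random mu/log_var/logits arrays and the beta value are placeholders standing in for real encoder and classifier networks, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, dz, classes = 8, 16, 10
mu = rng.normal(size=(batch, dz))               # stand-in encoder means
log_var = 0.1 * rng.normal(size=(batch, dz))    # stand-in encoder log-variances
labels = rng.integers(0, classes, size=batch)

# Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
eps = rng.normal(size=(batch, dz))
z = mu + np.exp(0.5 * log_var) * eps

# A stand-in classifier q(y|z); in the paper this is a small decoder network.
W = 0.1 * rng.normal(size=(dz, classes))
logits = z @ W
log_q = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
cross_entropy = -np.mean(log_q[np.arange(batch), labels])

# KL( N(mu, sigma^2) || N(0, I) ) in closed form, averaged over the batch.
kl = 0.5 * np.mean(np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1))

beta = 1e-3                                     # bottleneck weight (illustrative)
vib_loss = cross_entropy + beta * kl
print(vib_loss)
```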
    Improved generator objectives for GANs
    Jascha Sohl-Dickstein
    NIPS Workshop on Adversarial Learning (2016)
    We present a new framework to understand GAN training as alternating density ratio estimation with divergence minimization. This provides a new interpretation for the GAN generator objective used in practice and explains the problem of poor sample diversity. Furthermore, we derive a family of objectives that target arbitrary f-divergences without minimizing a lower bound, and use them to train generative image models that target either improved sample quality or sample diversity.
    We study the effectiveness of neural sequence models for premise selection in automated theorem proving, one of the main bottlenecks in the formalization of mathematics. We propose a two-stage approach for this task that yields good results for the premise selection task on the Mizar corpus while avoiding the hand-engineered features of existing state-of-the-art models. To our knowledge, this is the first time deep learning has been applied to theorem proving.
    Very deep convolutional networks have been central to the largest advances in image recognition performance in recent years. One example is the Inception architecture, which has been shown to achieve very good performance at relatively low computational cost. Recently, the introduction of residual connections in conjunction with a more traditional architecture has yielded state-of-the-art performance in the 2015 ILSVRC challenge; its performance was similar to the latest generation Inception-v3 network. This raises the question of whether there is any benefit in combining the Inception architecture with residual connections. Here we give clear empirical evidence that training with residual connections accelerates the training of Inception networks significantly. There is also some evidence of residual Inception networks outperforming similarly expensive Inception networks without residual connections by a thin margin. We also present several new streamlined architectures for both residual and non-residual Inception networks. These variations improve the single-frame recognition performance on the ILSVRC 2012 classification task significantly. We further demonstrate how proper activation scaling stabilizes the training of very wide residual Inception networks. With an ensemble of three residual networks and one Inception-v4 network, we achieve 3.08 percent top-5 error on the test set of the ImageNet classification (CLS) challenge.
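A tiny sketch of the residual activation scaling mentioned at the end of the abstract: scaling the residual branch by a small constant before adding it to the shortcut. The 0.1 factor and the toy branch function are illustrative assumptions; the real branches are Inception-style convolution blocks.

```python
import numpy as np

def scaled_residual_block(x, branch_fn, scale=0.1):
    """Return x + scale * branch_fn(x), i.e. a residual add with a damped branch."""
    return x + scale * branch_fn(x)

x = np.random.default_rng(0).normal(size=(1, 8))
toy_branch = lambda h: np.tanh(h)          # stand-in for an Inception branch
print(scaled_residual_block(x, toy_branch).shape)
```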