George Dahl

George Dahl received his Ph.D. from the University of Toronto under the supervision of Geoff Hinton, where he worked on deep learning approaches to problems in speech recognition, computational chemistry, and natural language text processing, including some of the first successful deep acoustic models. He has been a research scientist at Google on the Brain team since 2015. His research focuses on highly flexible models that learn their own features, end-to-end, and make efficient use of data and computation for supervised, unsupervised, and reinforcement learning. In particular, he is interested in applications to linguistic and perceptual data as well as chemical, biological, and medical data.
Authored Publications
    A mobile-optimized artificial intelligence system for gestational age and fetal malpresentation assessment
    Ryan Gomes
    Bellington Vwalika
    Chace Lee
    Angelica Willis
    Joan T. Price
    Christina Chen
    Margaret P. Kasaro
    James A. Taylor
    Elizabeth M. Stringer
    Scott Mayer McKinney
    Ntazana Sindano
    William Goodnight, III
    Justin Gilmer
    Benjamin H. Chi
    Charles Lau
    Terry Spitz
    Kris Liu
    Jonny Wong
    Rory Pilgrim
    Akib Uddin
    Lily Hao Yi Peng
    Kat Chou
    Jeffrey S. A. Stringer
    Shravya Ramesh Shetty
    Communications Medicine (2022)
    Background - Fetal ultrasound is an important component of antenatal care, but a shortage of adequately trained healthcare workers has limited its adoption in low-to-middle-income countries. This study investigated the use of artificial intelligence for fetal ultrasound in under-resourced settings. Methods - Blind sweep ultrasounds, consisting of six freehand ultrasound sweeps, were collected by sonographers in the USA and Zambia, and novice operators in Zambia. We developed artificial intelligence (AI) models that used blind sweeps to predict gestational age (GA) and fetal malpresentation. AI GA estimates and standard fetal biometry estimates were compared to a previously established ground truth, and evaluated for difference in absolute error. Fetal malpresentation (non-cephalic vs cephalic) was compared to sonographer assessment. On-device AI model run-times were benchmarked on Android mobile phones. Results - Here we show that GA estimation accuracy of the AI model is non-inferior to standard fetal biometry estimates (error difference -1.4 ± 4.5 days, 95% CI -1.8, -0.9, n=406). Non-inferiority is maintained when blind sweeps are acquired by novice operators performing only two of six sweep motion types. Fetal malpresentation AUC-ROC is 0.977 (95% CI, 0.949, 1.00, n=613); sonographers and novices have similar AUC-ROC. Software run-times on mobile phones for both diagnostic models are less than 3 seconds after completion of a sweep. Conclusions - The gestational age model is non-inferior to the clinical standard and the fetal malpresentation model has high AUC-ROCs across operators and devices. Our AI models are able to run on-device, without internet connectivity, and provide feedback scores to assist in upleveling the capabilities of lightly trained ultrasound operators in low resource settings.
    Adaptive Gradient Methods at the Edge of Stability
    Behrooz Ghorbani
    David Cardoze
    Jeremy Cohen
    Justin Gilmer
    Shankar Krishnan
    NeurIPS 2022 (2022) (to appear)
    Little is known about the training dynamics of adaptive gradient methods like Adam in deep learning. In this paper, we shed light on the behavior of these algorithms in the full-batch and sufficiently large batch settings. Specifically, we show that during full-batch training, the maximum eigenvalue of the preconditioned Hessian typically equilibrates at the stability threshold of a related non-adaptive algorithm. For Adam with step size $\eta$ and $\beta_1 = 0.9$, this stability threshold is $38/\eta$. Similar effects occur during minibatch training, especially as the batch size grows. Yet, even though adaptive methods train at the “Edge of Stability,” their behavior in this regime differs in a crucial way from that of their non-adaptive counterparts. Whereas non-adaptive algorithms are forced to remain in low-curvature regions of the loss landscape, we demonstrate that adaptive gradient methods often advance into high-curvature regions, while adapting the preconditioner to compensate. We believe that our findings will serve as a foundation for the community’s future understanding of adaptive gradient methods in deep learning.
    In this work, we study the evolution of the loss Hessian across many classification tasks in order to understand the effect the curvature of the loss has on the training dynamics. Whereas prior work has focused on how different learning rates affect the loss Hessian observed during training, we also analyze the effects of model initialization, architectural choices, and common training heuristics such as gradient clipping and learning rate warmup. Our results demonstrate that successful model and hyperparameter choices allow the early optimization trajectory to either avoid, or navigate out of, regions of high curvature and into flatter regions that tolerate a higher learning rate. Our results suggest a unifying perspective on how disparate mitigation strategies for training instability ultimately address the same underlying failure mode of neural network optimization, namely poor conditioning. Inspired by the conditioning perspective, we show that learning rate warmup can improve training stability just as much as batch normalization, layer normalization, MetaInit, GradInit, and Fixup initialization.
    Machine learning guided aptamer discovery
    Ali Bashir
    Geoff Davis
    Michelle Therese Dimon
    Qin Yang
    Scott Ferguson
    Zan Armstrong
    Nature Communications (2021)
    Preview abstract Aptamers are discovered by searching a large library for sequences with desirable binding properties. These libraries, however, are physically constrained to a fraction of the theoretical sequence space and limited to sampling strategies that are easy to scale. Integrating machine learning could enable identification of high-performing aptamers across this unexplored fitness landscape. We employed particle display (PD) to partition aptamers by affinity and trained neural network models to improve physically-derived aptamers and predict affinity in silico. These predictions were used to locally improve physically derived aptamers as well as identify completely novel, high-affinity aptamers de novo. We experimentally validated the predictions, improving aptamer candidate designs at a rate 10-fold higher than random perturbation, and generating novel aptamers at a rate 448-fold higher than PD alone. We characterized the explanatory power of the models globally and locally and showed successful sequence truncation while maintaining affinity. This work combines machine learning and physical discovery, uses principles that are widely applicable to other display technologies, and provides a path forward for better diagnostic and therapeutic agents. View details
    Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model
    Guodong Zhang
    James Martens
    Sushant Sachdeva
    Chris Shallue
    Roger Grosse
    2019 Conference on Neural Information Processing Systems (2019)
    Preview abstract Increasing the batch size is a popular way to speed up neural network training, but beyond some critical batch size, larger batch sizes yield diminishing returns. In this work, we study how the critical batch size changes based on properties of the optimization algorithm, including acceleration and preconditioning, through two different lenses: large scale experiments, and analysis of a simple noisy quadratic model (NQM). We experimentally demonstrate that optimization algorithms that employ preconditioning, specifically Adam and K-FAC, result in much larger critical batch sizes than stochastic gradient descent with momentum. We also demonstrate that the NQM captures many of the essential features of real neural network training, despite being drastically simpler to work with. The NQM predicts our results with preconditioned optimizers, previous results with accelerated gradient descent, and other results around optimal learning rates and large batch training, making it a useful tool to generate testable predictions about neural network optimization. View details
    Embedding Text in Hyperbolic Spaces
    Bhuwan Dhingra
    Chris Shallue
    Mohammad Norouzi
    NAACL Workshop (2018)
    Natural language text exhibits implicit hierarchical structure in a variety of respects. Ideally we could incorporate our prior knowledge of the existence of some sort of hierarchy into unsupervised learning algorithms that work on text data. Recent work by Nickel and Kiela (2017) proposed using hyperbolic instead of Euclidean embedding spaces to represent hierarchical data and demonstrated encouraging results on supervised embedding tasks. In this work, we apply their approach to unsupervised learning of word and sentence embeddings. Although we obtain mildly positive results, we describe the challenges we faced in using the hyperbolic metric for these problems both in terms of improving performance in downstream tasks and in understanding the learned hierarchical structures.
    Preview abstract Sequence-to-sequence alignment is a widely-used analysis method in bioinformatics. One common use of sequence alignment is to infer information about an unknown query sequence from the annotations of similar sequences in a database, such as predicting the function of a novel protein sequence by aligning to a database of protein families or predicting the presence/absence of species in a metagenomics sample by aligning reads to a database of reference genomes. In this work we describe a deep learning approach to solve such problems in a single step by training a deep neural network (DNN) to predict the database-derived labels directly from the query sequence. We demonstrate the value of this DNN approach on a hard problem of practical importance: determining the species of origin of next-generation sequencing reads from 16s ribosomal DNA. In particular, we show that when trained on 16s sequences from more than 13,000 distinct species, our DNN can predict the species of origin of individual reads more accurately than existing machine learning baselines and alignment-based methods like BWA or BLAST, achieving absolute performance within 2.0% of perfect memorization of the training inputs. Moreover, the DNN remains accurate and outperforms read alignment approaches when the query sequences are especially noisy or ambiguous. Finally, these DNN models can be used to assess metagenomic community composition on a variety of experimental 16s read datasets. Our results are a first step towards our long-term goal of developing a general-purpose deep learning model that can learn to predict any type of label from short biological sequences. View details
    Large scale distributed neural network training through online distillation
    Rohan Anil
    Gabriel Pereyra
    Alexandre Tachard Passos
    Robert Ormandi
    Geoffrey Hinton
    ICLR (2018)
    Preview abstract While techniques such as ensembling and distillation promise model quality improvements when paired with almost any base model they are seldom used as the multi-stage training setups they require are cumbersome and the extra hyperparameters introduced make the process of tuning even more expensive. In this paper we explore a variant of distillation which is relatively straightforward to use as it does not require a complicated multi-stage setup. We also show that distillation can be used as a meaningful distributed learning algorithm: instead of independent workers exchanging gradients, which requires worrying about delays and synchronization, independent workers can exchange full model checkpoints. This can be done far less frequently than exchanging gradients, breaking one of the scalability barriers of stochastic gradient descent. We have experiments on Criteo clickthrough rate, and the largest to-date dataset used for neural language modeling, based on Common Crawl and containing $6\times 10^{11}$ tokens. In these experiments we show we can scale at least $2\times$ as well as the maximum limit of distributed stochastic gradient descent. Finally, we also show that online distillation can dramatically reduce the churn in the predictions between different versions of a model. View details
    Preview abstract Neural language models are a critical component of state-of-the-art systems for machine translation, summarization, audio transcription, and other tasks. These language models are almost universally autoregressive in nature, generating sentences one token at a time from left to right. This paper studies the influence of token generation order on model quality via a novel two-pass language model that produces partially-filled sentence “templates” and then fills in missing tokens. We compare various strategies for structuring these two passes and observe a surprisingly large variation in model quality. We find the most effective strategy generates function words in the first pass followed by content words in the second. We believe these experimental results justify a more extensive investigation of generation order for neural language models. View details
    Motivating the Rules of the Game for Adversarial Example Research
    Justin Gilmer
    Ryan P. Adams
    Ian Goodfellow
    David Andersen
    arXiv (2018)
    Preview abstract Advances in machine learning have led to broad deployment of systems with impressive performance on important problems. Nonetheless, these systems can be induced to make errors on data that are surprisingly similar to examples the learned system handles correctly. The existence of these errors raises a variety of questions about out-of-sample generalization and whether bad actors might use such examples to abuse deployed systems. As a result of these security concerns, there has been a flurry of recent papers proposing algorithms to defend against such malicious perturbations of correctly handled examples. It is unclear how such misclassifications represent a different kind of security problem than other errors, or even other attacker-produced examples that have no specific relationship to an uncorrupted input. In this paper, we argue that adversarial example defense papers have, to date, mostly considered abstract, toy games that do not relate to any specific security concern. Furthermore, defense papers have not yet precisely described all the abilities and limitations of attackers that would be relevant in practical security. Towards this end, we establish a taxonomy of motivations, constraints, and abilities for more plausible adversaries. Finally, we provide a series of recommendations outlining a path forward for future work to more clearly articulate the threat model and perform more meaningful evaluation. View details
    Peptide-Spectra Matching with Weak Supervision
    Sam Schoenholz
    Sean Hackett
    Laura Deming
    Eugene Melamud
    Navdeep Jaitly
    Fiona McAllister
    Jonathon O'Brien
    Bryson Bennett
    Daphne Koller
    arXiv (2018)
    As in many other scientific domains, we face a fundamental problem when using machine learning to identify proteins from mass spectrometry data: large ground truth datasets mapping inputs to correct outputs are extremely difficult to obtain. Instead, we have access to imperfect hand-coded models crafted by domain experts. In this paper, we apply deep neural networks to an important step of the protein identification problem, the pairing of mass spectra with short sequences of amino acids called peptides. We train our model to differentiate between top scoring results from a state-of-the-art classical system and hard-negative second and third place results. Our resulting model is much better at identifying peptides with spectra than the model used to generate its training data. In particular, we achieve a 43% improvement over standard matching methods and a 10% improvement over a combination of the matching method and an industry standard cross-spectra reranking tool. Importantly, in a more difficult experimental regime that reflects current challenges facing biologists, our advantage over the previous state-of-the-art grows to 15% even after reranking. We believe this approach will generalize to other challenging scientific problems.
    Artificial Intelligence Based Breast Cancer Nodal Metastasis Detection: Insights into the Black Box for Pathologists
    Timo Kohlberger
    Mohammad Norouzi
    Jenny Smith
    Arash Mohtashamian
    Niels Olson
    Lily Peng
    Jason Hipp
    Martin Stumpe
    Archives of Pathology & Laboratory Medicine (2018)
    Preview abstract Context - Nodal metastasis of a primary tumor influences therapy decisions for a variety of cancers. Histologic identification of tumor cells in lymph nodes can be laborious and error-prone, especially for small tumor foci. Objective - To evaluate the application and clinical implementation of a state-of-the-art deep learning–based artificial intelligence algorithm (LYmph Node Assistant or LYNA) for detection of metastatic breast cancer in sentinel lymph node biopsies. Design - Whole slide images were obtained from hematoxylin-eosin–stained lymph nodes from 399 patients (publicly available Camelyon16 challenge dataset). LYNA was developed by using 270 slides and evaluated on the remaining 129 slides. We compared the findings to those obtained from an independent laboratory (108 slides from 20 patients/86 blocks) using a different scanner to measure reproducibility. Results - LYNA achieved a slide-level area under the receiver operating characteristic (AUC) of 99% and a tumor-level sensitivity of 91% at 1 false positive per patient on the Camelyon16 evaluation dataset. We also identified 2 “normal” slides that contained micrometastases. When applied to our second dataset, LYNA achieved an AUC of 99.6%. LYNA was not affected by common histology artifacts such as overfixation, poor staining, and air bubbles. Conclusions - Artificial intelligence algorithms can exhaustively evaluate every tissue patch on a slide, achieving higher tumor-level sensitivity than, and comparable slide-level performance to, pathologists. These techniques may improve the pathologist's productivity and reduce the number of false negatives associated with morphologic detection of tumor cells. We provide a framework to aid practicing pathologists in assessing such algorithms for adoption into their workflow (akin to how a pathologist assesses immunohistochemistry results). View details
    Preview abstract Recent hardware developments have made unprecedented amounts of data parallelism available for accelerating neural network training. Among the simplest ways to harness next-generation accelerators is to increase the batch size in standard mini-batch neural network training algorithms. In this work, we aim to experimentally characterize the effects of increasing the batch size on training time, as measured by the number of steps necessary to reach a goal out-of-sample error. Eventually, increasing the batch size will no longer reduce the number of training steps required, but the exact relationship between the batch size and how many training steps are necessary is of critical importance to practitioners, researchers, and hardware designers alike. We study how this relationship varies with the training algorithm, model, and data set and find extremely large variation between workloads. Along the way, we reconcile disagreements in the literature on whether batch size affects model quality. Finally, we discuss the implications of our results for efforts to train neural networks much faster in the future. View details
    Relational inductive biases, deep learning, and graph networks
    Peter Battaglia
    Jessica Blake Chandler Hamrick
    Victor Bapst
    Alvaro Sanchez
    Vinicius Zambaldi
    Andrea Tacchetti
    David Raposo
    Adam Santoro
    Ryan Faulkner
    Caglar Gulcehre
    Francis Song
    Andy Ballard
    Justin Gilmer
    Ashish Vaswani
    Kelsey Allen
    Charles Nash
    Victoria Jayne Langston
    Chris Dyer
    Nicolas Heess
    Daan Wierstra
    Matt Botvinick
    Yujia Li
    Razvan Pascanu
    arXiv (2018)
    Preview abstract The purpose of this paper is to explore relational inductive biases in modern AI, especially deep learning, describing a rough taxonomy of existing approaches, and introducing a common mathematical framework for expressing and unifying various approaches. The key theme running through this work is structure—how the world is structured, and how the structure of different computational strategies determines their strengths and weaknesses. View details
    Detecting Cancer Metastases on Gigapixel Pathology Images
    Krishna Kumar Gadepalli
    Mohammad Norouzi
    Timo Kohlberger
    Subhashini Venugopalan
    Aleksei Timofeev
    Jason Hipp
    Lily Peng
    Martin Stumpe
    arXiv (2017)
    Preview abstract Each year, the treatment decisions for more than 230,000 breast cancer patients in the U.S. hinge on whether the cancer has metastasized away from the breast. Metastasis detection is currently performed by pathologists reviewing large expanses of biological tissues. This process is labor intensive and error-prone. We present a framework to automatically detect and localize tumors as small as 100 x 100 pixels in gigapixel microscopy images sized 100,000 x 100,000 pixels. Our method leverages a convolutional neural network (CNN) architecture and obtains state-of-the-art results on the Camelyon16 dataset in the challenging lesion-level tumor detection task. At 8 false positives per image, we detect 92.4% of the tumors, relative to 82.7% by the previous best automated approach. For comparison, a human pathologist attempting exhaustive search achieved 73.2% sensitivity. We achieve image-level AUC scores above 97% on both the Camelyon16 test set and an independent set of 110 slides. In addition, we discover that two slides in the Camelyon16 training set were erroneously labeled normal. Our approach could considerably reduce false negative rates in metastasis detection. View details
    Neural Message Passing for Quantum Chemistry
    Justin Gilmer
    Samuel S. Schoenholz
    Patrick F. Riley
    ICML (2017)
    Preview abstract Supervised learning on molecules has incredible potential to be useful in chemistry, drug discovery, and materials science. Luckily, several promising and closely related neural network models invariant to molecular symmetries have already been described in the literature. These models learn a message passing algorithm and aggregation function to compute a function of their entire input graph. At this point, the next step is to find a particularly effective variant of this general approach and apply it to chemical prediction benchmarks until we either solve them or reach the limits of the approach. In this paper, we reformulate existing models into a single common framework we call Message Passing Neural Networks (MPNNs) and explore additional novel variations within this framework. Using MPNNs we demonstrate state of the art results on an important molecular property prediction benchmark, results we believe are strong enough to justify retiring this benchmark. View details
    Prediction errors of molecular machine learning models lower than hybrid DFT error
    Felix Faber
    Luke Hutchinson
    Huang Bing
    Justin Gilmer
    Sam Schoenholz
    Steven Kearnes
    Patrick Riley
    Anatole von Lilienfeld
    Journal of Chemical Theory and Computation (2017)
    Preview abstract We investigate the impact of choosing regressors and molecular representations for the construction of fast machine learning (ML) models of thirteen electronic ground-state properties of organic molecules. The performance of each regressor/representation/property combination is assessed with learning curves which report approximation errors as a function of training set size. Molecular structures and properties at hybrid density functional theory (DFT) level of theory used for training and testing come from the QM9 database [Ramakrishnan et al, Scientific Data 1 140022 (2014)] and include dipole moment, polarizability, HOMO/LUMO energies and gap, electronic spatial extent, zero point vibrational energy, enthalpies and free energies of atomization, heat capacity and the highest fundamental vibrational frequency. Various representations from the literature have been studied (Coulomb matrix, bag of bonds, BAML and ECFP4, molecular graphs (MG)), as well as newly developed distribution based variants including histograms of distances (HD), and angles (HDA/MARAD), and dihedrals (HDAD). Regressors include linear models (Bayesian ridge regression (BR) and linear regression with elastic net regularization (EN)), random forest (RF), kernel ridge regression (KRR) and two types of neural networks, graph convolutions (GC) and gated graph networks (GG). We present numerical evidence that ML model predictions for all properties can reach an approximation error to DFT which is on par with chemical accuracy. These findings indicate that ML models could be more accurate than DFT if explicitly electron correlated quantum (or experimental) data was provided. View details