James Laudon

James Laudon is a member of the Google Brain team, whose mission is to develop deep learning technologies and deploy them throughout Google. His research interests focus on hardware and software co-design for high-performance systems; he is currently working on domain-specific computer architectures for machine learning and on applying machine learning to system design. Before joining the Brain team in 2017, James was the founder and site director of the Google Madison office. Prior to joining Google in 2007, he contributed to the architecture and implementation of multiple computer systems, including the Stanford DASH, SGI Origin 2000, and Sun UltraSPARC T1. James has a B.S. in Electrical Engineering from the University of Wisconsin–Madison and an M.S. and Ph.D. in Electrical Engineering from Stanford University.
Authored Publications
    Graph Transformer: A Generalized Method for Computation Graph Optimizations
    Amirali Abdolrashidi
    Anna Darling Goldie
    Azalia Mirhoseini
    Daniel Wong
    Hanxiao Liu
    Qiumin Xu
    Shen Wang
    Sudip Roy
    (2020)
    Runtime and scalability of neural networks can be significantly affected by computational graph optimization during compilation. Most existing automated graph optimizations are impractical for deployment due to the significant amount of compute required and their inability to generalize to new, previously held-out graphs. To address both limitations, we propose an end-to-end deep reinforcement learning method named Graph Transformer (GTf), based on a scalable sequential attention mechanism over an inductive graph neural network that is transferable to new, unseen graphs. GTf generates decisions on the entire graph in a single-shot fashion, rather than on each individual node progressively, drastically speeding up the search compared to prior methods. Moreover, we propose recurrent attention layers to jointly optimize dependent graph optimization tasks and demonstrate 33%-60% speedups on three graph optimization tasks compared to TensorFlow default optimizations. On a diverse set of representative graphs consisting of 1k-80k nodes, including Inception-v3, Transformer-XL, and WaveNet, GTf achieves an average 21% improvement over human experts and an 18% improvement over prior art, with 15x faster convergence, on a device placement task evaluated in real systems.
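To make the single-shot idea above concrete, here is a minimal, illustrative sketch (not the paper's implementation): node features are embedded with one message-passing step standing in for the inductive graph neural network, and an attention-style scoring layer then emits a device assignment for every node in a single forward pass. All dimensions, weights, and the random graph below are hypothetical.

```python
# Toy single-shot graph policy: embed all nodes, then score all nodes at once.
import numpy as np

rng = np.random.default_rng(0)

num_nodes, feat_dim, hidden_dim, num_devices = 6, 4, 8, 2
features = rng.normal(size=(num_nodes, feat_dim))                      # per-op features (type, shape, ...)
adjacency = (rng.random((num_nodes, num_nodes)) < 0.3).astype(float)   # dataflow edges

# One round of message passing: each node aggregates its neighbors' features.
w_embed = rng.normal(size=(feat_dim, hidden_dim))
node_emb = np.tanh((features + adjacency @ features) @ w_embed)

# Attention-style scoring over all nodes at once (single-shot), producing a
# distribution over devices per node instead of deciding nodes one at a time.
w_query = rng.normal(size=(hidden_dim, num_devices))
scores = node_emb @ w_query
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

placement = probs.argmax(axis=1)   # greedy decode; RL training would sample instead
print("device assignment per node:", placement)
```

In the paper's setting, the per-node distribution would be sampled and trained with reinforcement learning against measured runtime; the greedy argmax here is only for illustration.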
    Apollo: Transferable Architecture Exploration
    The looming end of Moore's Law and the ascending use of deep learning drive the design of custom accelerators that are optimized for specific neural architectures. Accelerator design forms a challenging constrained optimization problem over a complex, high-dimensional, and structured input space with a costly-to-evaluate objective function. Existing approaches for accelerator design are sample-inefficient and do not transfer knowledge between related optimization tasks with different design constraints (e.g., area budget) or neural architecture configurations. In this work, we propose a transferable architecture exploration framework, dubbed Apollo, that leverages recent advances in black-box function optimization for sample-efficient accelerator design. We use Apollo to optimize accelerator configurations for a diverse set of neural architectures with alternative design constraints. We show that Apollo finds optimal design configurations more sample-efficiently than baseline approaches. We further show that transferring knowledge between target architectures with different design constraints helps to find optimal configurations faster. This encouraging outcome portrays a promising path forward in shortening the timeline for accelerator design.
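As a rough picture of the constrained black-box search the abstract describes, the sketch below samples hypothetical accelerator configurations, discards those that exceed an area budget, and keeps the best-scoring design. The design space, area model, and runtime model are made-up stand-ins for a real cost model or simulator, and the random search is only a naive baseline; Apollo's contribution is using far more sample-efficient optimizers and transferring knowledge across such tasks.

```python
# Toy constrained black-box search over a hypothetical accelerator design space.
import random

random.seed(0)

design_space = {
    "pe_rows": [16, 32, 64, 128],       # processing-element array height
    "pe_cols": [16, 32, 64, 128],       # processing-element array width
    "sram_kib": [256, 512, 1024, 2048], # on-chip buffer size
}
AREA_BUDGET = 20.0  # hypothetical mm^2 budget (the "design constraint")

def area(cfg):
    # Toy area model: PEs plus SRAM, in made-up mm^2.
    return cfg["pe_rows"] * cfg["pe_cols"] * 1e-3 + cfg["sram_kib"] * 5e-3

def runtime(cfg):
    # Toy objective standing in for a costly simulator/compiler evaluation.
    return 1e6 / (cfg["pe_rows"] * cfg["pe_cols"]) + 2e3 / cfg["sram_kib"]

best_cfg, best_time = None, float("inf")
for _ in range(200):  # sample-hungry random search; real optimizers need far fewer trials
    cfg = {k: random.choice(v) for k, v in design_space.items()}
    if area(cfg) > AREA_BUDGET:
        continue  # infeasible under the area constraint
    t = runtime(cfg)
    if t < best_time:
        best_cfg, best_time = cfg, t

print("best feasible config:", best_cfg, "estimated runtime:", round(best_time, 2))
```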
    In-Datacenter Performance Analysis of a Tensor Processing Unit
    Norman P. Jouppi
    Nishant Patil
    Gaurav Agrawal
    Raminder Bajwa
    Sarah Bates
    Suresh Bhatia
    Nan Boden
    Al Borchers
    Rick Boyle
    Pierre-luc Cantin
    Clifford Chao
    Chris Clark
    Jeremy Coriell
    Mike Daley
    Matt Dau
    Ben Gelb
    Tara Vazir Ghaemmaghami
    Rajendra Gottipati
    William Gulland
    Robert Hagmann
    C. Richard Ho
    Doug Hogberg
    John Hu
    Dan Hurt
    Julian Ibarz
    Aaron Jaffey
    Alek Jaworski
    Alexander Kaplan
    Harshit Khaitan
    Andy Koch
    Naveen Kumar
    Steve Lacy
    James Law
    Diemthu Le
    Chris Leary
    Zhuyuan Liu
    Kyle Lucke
    Alan Lundin
    Gordon MacKean
    Adriana Maggiore
    Maire Mahony
    Kieran Miller
    Rahul Nagarajan
    Ravi Narayanaswami
    Ray Ni
    Kathy Nix
    Thomas Norrie
    Mark Omernick
    Narayana Penukonda
    Andy Phelps
    Jonathan Ross
    ISCA (2017)
    Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X-30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X-80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
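As a quick sanity check on the numbers quoted in the abstract, the arithmetic below recovers the ~92 TOPS peak figure from the 65,536-MAC matrix unit (a 256x256 systolic array) and the TPU's 700 MHz clock, both reported in the paper, counting each 8-bit multiply-accumulate as two operations.

```python
# Back-of-the-envelope check of the TPU's quoted peak throughput.
macs = 256 * 256    # 65,536 multiply-accumulate units in the matrix unit
ops_per_mac = 2     # one multiply and one add per cycle
clock_hz = 700e6    # TPU clock frequency reported in the paper

peak_ops_per_sec = macs * ops_per_mac * clock_hz
print(f"peak throughput: {peak_ops_per_sec / 1e12:.1f} TeraOps/s")  # ~91.8, i.e. ~92 TOPS
```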
    Throughput-Oriented Multicore Processors
    Robert Golla
    Greg Grohoski
    Multicore Processors and Systems, Springer (2009), pp. 205-230
    Virtual Private Caches
    Kyle J. Nesbit
    James E. Smith
    Proceedings of the 34th Annual International Symposium on Computer Architecture (2007), pp. 57-68
    The Coming Wave of Multithreaded Chip Multiprocessors
    Lawrence Spracklen
    International Journal of Parallel Programming, vol. 35 (2007), pp. 299-330
    Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency
    Kunle Olukotun
    Lance Hammond
    Morgan & Claypool Publishers (2007)
    Fair Queuing Memory Systems
    Kyle J. Nesbit
    Nidhi Aggarwal
    James E. Smith
    Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-39), IEEE, Orlando, FL, USA (2006)
    Maximizing CMP Throughput with Mediocre Cores
    John D. Davis
    Kunle Olukotun
    Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, IEEE Computer Society, Saint Louis, MO, USA (2005), pp. 51-62
    The SGI Origin 2000: A ccNUMA Highly Scalable Server
    Daniel Lenoski
    Proceedings of the 24th Annual International Symposium on Computer Architecture, ACM, Denver, CO, USA (1997), pp. 241-251
    Interleaving: A Multithreading Technique Targeting Multiprocessors and Workstations
    Anoop Gupta
    Mark Horowitz
    Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, ACM, San Jose, CA, USA (1994), pp. 308-318
    The DASH Prototype: Logic Overhead and Performance
    Daniel Lenoski
    Truman Joe
    David Nakahira
    Luis Stevens
    Anoop Gupta
    John Hennessy
    IEEE Transactions on Parallel and Distributed Systems, vol. 4 (1993), pp. 41-61
    The DASH Prototype: Implementation and Performance
    Daniel Lenoski
    Truman Joe
    David Nakahira
    Luis Stevens
    Anoop Gupta
    John Hennessy
    Proceedings of the 19th Annual International Symposium on Computer Architecture, ACM, Queensland, Australia (1992), pp. 92-103
    The Stanford Dash Multiprocessor
    Daniel Lenoski
    Kourosh Gharachorloo
    Anoop Gupta
    John L. Hennessy
    Mark Horowitz
    Monica S. Lam
    IEEE Computer, vol. 25 (1992), pp. 63-79
    Overview and Status of the Stanford DASH Multiprocessor
    Daniel Lenoski
    Kourosh Gharachorloo
    Anoop Gupta
    John Hennessy
    Proceedings of the International Symposium on Shared Memory Multiprocessing, Tokyo, Japan (1991)
    The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor
    Daniel Lenoski
    Kourosh Gharachorloo
    Anoop Gupta
    John Hennessy
    Proceedings of the 17th Annual International Symposium on Computer Architecture, ACM, Seattle, WA, USA (1990), pp. 148-159
    Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors
    Kourosh Gharachorloo
    Daniel Lenoski
    Phillip Gibbons
    Anoop Gupta
    John Hennessy
    Proceedings of the 17th Annual International Symposium on Computer Architecture, ACM, Seattle, WA, USA (1990), pp. 15-26
    The ZS-1 Central Processor
    James E. Smith
    Greg E. Dermer
    Brian D. Vanderwarn
    Steve D. Klinger
    Chris M. Rozewski
    Dan L. Fowler
    Keith R. Scidmore
    Proceedings of the Second International Conference on Architectural Support for Programming Languages and Operating Systems, IEEE, Palo Alto, CA, USA (1987), pp. 199-204