Anna Huang


Anna Huang is a Research Scientist at Google Brain, working on the Magenta project. Her research focuses on designing generative models to make creating music more approachable. She is the creator of Music Transformer and of Coconet, the ML model that powered Google’s first AI Doodle, the Bach Doodle, which harmonized 55 million melodies from users around the world in two days.

She holds a PhD in computer science from Harvard University and was a recipient of the NSF Graduate Research Fellowship. She spent the later part of her PhD as a visiting research student at the Montreal Institute for Learning Algorithms (MILA), where she currently co-advises students. She publishes in machine learning, human-computer interaction, and music, at conferences such as ICLR, IUI, CHI, and ISMIR. She is currently an editor for the TISMIR journal's special issue on AI and Music Creativity.

As a composer, she has written for a cappella, chamber ensembles, and orchestra, as well as for tape and live electronics, performed on the 40-channel HYDRA loudspeaker orchestra. Recently, she was a judge for the AI Song Contest. She holds a master's in media arts and sciences from the MIT Media Lab, and a dual bachelor's degree in computer science and music composition from the University of Southern California. She grew up in Hong Kong, where she learned to play the guzheng.
Authored Publications
    MIDI-DDSP: Hierarchical modeling of music for detailed control
    Yusong Wu
    Yi Deng
    Rigel Jacob Swavely
    Kyle Kastner
    Tim Cooijmans
    Aaron Courville
    ICLR 2022 (2022) (to appear)
    Musical expression requires control of both what notes are played and how they are performed. Conventional audio synthesizers provide detailed expressive controls, but at the cost of realism. Black-box neural audio synthesis and concatenative samplers can produce realistic audio, but have few mechanisms for control. In this work, we introduce MIDI-DDSP, a hierarchical model of musical instruments that enables both realistic neural audio synthesis and detailed user control. Starting from interpretable Differentiable Digital Signal Processing (DDSP) synthesis parameters, we infer musical notes and high-level properties of their expressive performance (such as timbre, vibrato, dynamics, and articulation). This creates a 3-level hierarchy (notes, performance, synthesis) that affords individuals the option to intervene at each level, or utilize trained priors (performance given notes, synthesis given performance) for creative assistance. Through quantitative experiments and listening tests, we demonstrate that this hierarchy can reconstruct high-fidelity audio, accurately predict performance attributes for a note sequence, independently manipulate the attributes of a given performance, and, as a complete system, generate realistic audio from a novel note sequence. By utilizing an interpretable hierarchy with multiple levels of granularity, MIDI-DDSP opens the door to assistive tools to empower individuals across a diverse range of musical experience.
    AI Song Contest: Human-AI Co-Creation in Songwriting
    Hendrik Vincent Koops
    Ed Newton-Rex
    Monica Dinculescu
    Proceedings of the 21st International Society for Music Information Retrieval Conference (ISMIR) (2020)
    Machine learning is challenging the way we make music. Although research in deep generative models has dramatically improved the capability and fluency of music models, recent work has shown that it can be challenging for humans to partner with this new class of algorithms. In this paper, we present findings on what 13 musician/developer teams, a total of 61 users, needed when co-creating a song with AI, the challenges they faced, and how they leveraged and repurposed existing characteristics of AI to overcome some of these challenges. Many teams adopted modular approaches, such as independently running multiple smaller models that align with the musical building blocks of a song, before re-combining their results. As ML models are not easily steerable, teams also generated massive numbers of samples and curated them post-hoc, or used a range of strategies to direct the generation or algorithmically ranked the samples. Ultimately, teams not only had to manage the “flare and focus” aspects of the creative process, but also juggle that with a parallel process of exploring and curating multiple ML models and outputs. These findings reflect a need to design machine learning-powered music interfaces that are more decomposable, steerable, interpretable, and adaptive, which in return will enable artists to more effectively explore how AI can extend their personal expression.
    Bach Doodle: Approachable music composition with machine learning at scale
    Curtis Hawthorne
    Monica Dinculescu
    Leon Hong
    Jacob Howcroft
    Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR) (2019)
    Many of us like music, but composing can feel intimidating when we do not know where to begin. Even when we have a melody, without sufficient skills in harmony we are deterred from developing it into a composition. Machine learning could potentially extend our creative abilities by offering generative models that can fill in the missing parts of our composition. To make music composition more approachable, we designed a composition web-app where users can create their own melody and have it harmonized by a machine learning model. For inputting melodies, we designed a simplified sheet music interface that facilitates easy trial and error, and found that users adapted to it quickly even when they were not familiar with western music notation. Users can rapidly explore different possibilities in harmonizations by tweaking their melody and requesting new harmonizations. The harmonizations are provided by Coconet, a flexible generative model of counterpoint. Several technical challenges had to be overcome to support an interactive experience at scale. First, as most users do not have dedicated hardware to run machine learning models, we re-implemented Coconet in TensorFlow.js so that it could run in the browser. Second, our initial re-implementation took more than 40 seconds to generate two measures of music. By adopting dilated depth-wise separable convolutions and model quantization, we reduced it down to 2 seconds. Third, to prepare for large-scale deployment, we calibrated a speed test to determine if a user’s device is fast enough for running the model in the browser; if not, the harmonization requests were sent to remote TPU servers. In three days, the web-app received more than 50 million queries for harmonization around the world. Users could choose to rate their compositions and contribute them to a public dataset, which we are releasing with this paper. We hope that the community might find this dataset useful for applications ranging from ethnomusicological studies, to music education, to improving machine learning models. We end with a quote from a user: “It's really fun to play with. This might be the first time in my life I feel competent at music.”
    Music relies heavily on repetition to build structure and meaning. Self-reference occurs on multiple timescales, from motifs to phrases to reusing of entire sections of music, such as in pieces with ABA structure. The Transformer (Vaswani et al., 2017), a sequence model based on self-attention, has achieved compelling results in many generation tasks that require maintaining long-range coherence. This suggests that self-attention might also be well-suited to modeling music. In musical composition and performance, however, relative timing is critically important. Existing approaches for representing relative positional information in the Transformer modulate attention based on pairwise distance (Shaw et al., 2018). This is impractical for long sequences such as musical compositions since their memory complexity for intermediate relative information is quadratic in the sequence length. We propose an algorithm that reduces their intermediate memory requirement to linear in the sequence length. This enables us to demonstrate that a Transformer with our modified relative attention mechanism can generate minute-long compositions (thousands of steps, four times the length modeled in Oore et al., 2018) with compelling structure, generate continuations that coherently elaborate on a given motif, and in a seq2seq setup generate accompaniments conditioned on melodies. We evaluate the Transformer with our relative attention mechanism on two datasets, JSB Chorales and Piano-e-Competition, and obtain state-of-the-art results on the latter.
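The abstract above does not spell out the memory-saving algorithm. The following is a minimal sketch of one way to obtain relative-position logits without materializing a per-pair embedding tensor, based on a "skewing" rearrangement (pad, reshape, slice) as we understand it from the published Music Transformer work; treat it as an illustration under that assumption, not as the authors' implementation. The function names (`dot`, `skew`, `naive_relative_logits`) and the layout convention (row `r` of `e` holds the embedding for relative distance `r - (L - 1)`) are hypothetical choices made for this example.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def naive_relative_logits(q, e):
    """Reference version: look up the relative embedding for every (i, j) pair."""
    length = len(q)
    out = [[0] * length for _ in range(length)]
    for i in range(length):
        for j in range(i + 1):  # causal: position i attends to j <= i
            out[i][j] = dot(q[i], e[length - 1 - (i - j)])
    return out


def skew(qe):
    """Rearrange Q.E^T (columns indexed by relative distance) so that
    columns are indexed by absolute key position instead."""
    length = len(qe)
    # 1. Pad one dummy column on the left of every row.
    padded = [[0] + row for row in qe]
    # 2. Flatten, then re-read the same values as (L + 1) rows of L entries.
    flat = [x for row in padded for x in row]
    reshaped = [flat[r * length:(r + 1) * length] for r in range(length + 1)]
    # 3. Drop the first row and zero out the non-causal entries (j > i).
    skewed = reshaped[1:]
    return [[skewed[i][j] if j <= i else 0 for j in range(length)]
            for i in range(length)]


# Tiny example: L = 3 positions, D = 2 dimensions.
q = [[1, 2], [3, 4], [5, 6]]
e = [[1, 0], [0, 1], [1, 1]]  # distances -2, -1, 0
qe = [[dot(qi, er) for er in e] for qi in q]
print(skew(qe))  # matches naive_relative_logits(q, e)
```

The only intermediate here is the L-by-L product `qe`, so the extra memory scales with the number of distinct relative distances times the embedding size rather than with every (query, key) pair times the embedding size.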
    Generating musical audio directly with neural networks is notoriously difficult because it requires coherently modeling both long- and short-term structure. Fortunately, most music is also highly structured and primarily composed of discrete note events played on musical instruments. Herein, we show that by using notes as an intermediate representation, we can train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude (~0.1 ms to ~100 s). This large advance in the state of the art is enabled by our release of the new MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset, composed of over 172 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms. The networks and the dataset together present a promising approach toward creating new expressive and interpretable neural models of music.
    Visualizing Music Self-Attention
    Monica Dinculescu
    Ashish Vaswani
    NIPS Workshop on Interpretability and Robustness in Audio, Speech, and Language (2018)
    Like language, music can be represented as a sequence of discrete symbols that form a hierarchical syntax, with notes being roughly like characters and motifs of notes like words. Unlike text, however, music relies heavily on repetition on multiple timescales to build structure and meaning. The Music Transformer has shown compelling results in generating music with structure (Huang et al., 2018). In this paper, we introduce a tool for visualizing self-attention on polyphonic music with an interactive pianoroll. We use Music Transformer as both a descriptive tool and a generative model. For the former, we use it to analyze existing music to see if the resulting self-attention structure corroborates the musical structure known from music theory. For the latter, we inspect the model's self-attention during generation, in order to understand how past notes affect future ones. We also compare and contrast the attention structure of regular attention to that of relative attention (Shaw et al., 2018; Huang et al., 2018), and examine its impact on the resulting generated music. For example, for the JSB Chorales dataset, a model trained with relative attention is more consistent in attending to all the voices in the preceding timestep and the chords before, and at cadences to the beginning of a phrase, allowing it to create an arc. We hope that our analyses will offer more evidence for relative self-attention as a powerful inductive bias for modeling music. We invite the reader to check out video animations of music attention and interact with the visualizations at https://storage.googleapis.com/nips-workshop-visualization/index.html.
    We argue for the benefit of designing deep generative models through mixed-initiative combinations of deep learning algorithms and human specifications for authoring sequential content, such as stories and music. Sequence models have shown increasingly convincing results in domains such as auto-completion, speech to text, and translation; however, longer-term structure remains a major challenge. Given lengthy inputs and outputs, deep generative systems still lack reliable representations of beginnings, middles, and ends, which are standard aspects of creating content in domains such as music composition. This paper aims to contribute a framework for mixed-initiative learning approaches, specifically for creative deep generative systems, and presents a case study of a deep generative model for music, Counterpoint by Convolutional Neural Network (Coconet).
    Counterpoint by Convolution
    Tim Cooijmans
    Aaron Courville
    Proceedings of ISMIR 2017
    Machine learning models of music typically break down the task of composition into a chronological process, composing a piece of music in a single pass from beginning to end. On the contrary, human composers write music in a nonlinear fashion, scribbling motifs here and there, often revisiting choices previously made. We explore the use of blocked Gibbs sampling as an analogue to the human approach, and introduce COCONET, a convolutional neural network in the NADE family of generative models (Uria et al., 2016). Despite ostensibly sampling from the same distribution as the NADE ancestral sampling procedure, we find that a blocked Gibbs approach significantly improves sample quality. We provide evidence that this is due to some conditional distributions being poorly modeled. Moreover, we show that even the cheap approximate blocked Gibbs procedure from Yao et al. (2014) yields better samples than ancestral sampling. We demonstrate the versatility of our method on unconditioned polyphonic music generation.
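To make the "erase a block, then rewrite it" idea of blocked Gibbs sampling concrete, here is a toy sketch. It is not Coconet: the learned convolutional model is replaced by a hypothetical hand-written conditional (`toy_conditional`) that simply favors pitches close to a voice's temporal neighbors, and the pitch range, block fraction, and function names are all assumptions made for illustration.

```python
import random

PITCHES = list(range(60, 72))  # one octave of MIDI pitches (toy assumption)


def toy_conditional(grid, t, v):
    """Hypothetical stand-in for a learned p(x[t][v] | everything else).
    Favors pitches near the same voice at adjacent time steps; erased
    neighbors (None) are ignored."""
    neighbors = []
    if t > 0 and grid[t - 1][v] is not None:
        neighbors.append(grid[t - 1][v])
    if t < len(grid) - 1 and grid[t + 1][v] is not None:
        neighbors.append(grid[t + 1][v])
    weights = []
    for p in PITCHES:
        w = 1.0
        for n in neighbors:
            w *= 1.0 / (1 + abs(p - n))
        weights.append(w)
    return weights


def blocked_gibbs(grid, steps, block_fraction, rng):
    """Repeatedly erase a random block of cells and resample them one by one,
    each conditioned on the kept cells and the cells already refilled."""
    grid = [row[:] for row in grid]  # work on a copy
    cells = [(t, v) for t in range(len(grid)) for v in range(len(grid[0]))]
    block_size = max(1, int(block_fraction * len(cells)))
    for _ in range(steps):
        block = rng.sample(cells, block_size)
        for t, v in block:  # erase the whole block first
            grid[t][v] = None
        for t, v in block:  # then refill it in order
            weights = toy_conditional(grid, t, v)
            grid[t][v] = rng.choices(PITCHES, weights=weights, k=1)[0]
    return grid


start = [[60, 64, 67, 72 - 1] for _ in range(4)]  # 4 time steps x 4 voices
result = blocked_gibbs(start, steps=10, block_fraction=0.5, rng=random.Random(0))
```

In contrast to a single left-to-right pass, every cell can be revisited many times, which mirrors the nonlinear revising process the abstract attributes to human composers.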