Yang Li

Yang Li is a Staff Research Scientist at Google, and an affiliate faculty member in Computer Science & Engineering at the University of Washington. Yang’s research focuses on the intersection between deep learning and human-computer interaction. See Yang Li's personal website.
Authored Publications
    Abstract: We explore the boundaries of scaling up a multilingual vision and language model, both in terms of the size of its components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. Our model advances the state of the art on most of the vision-and-language benchmarks considered (20+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
    Abstract: Conversational agents show promise for allowing users to interact with mobile devices using language. However, to perform diverse UI tasks with natural language, developers typically need to create separate datasets and models for each specific task, which is expensive and labor-intensive. Recently, pre-trained large language models (LLMs) have been shown to generalize to various downstream tasks when prompted with a handful of examples from the target task. This paper investigates the feasibility of enabling versatile conversational interactions with mobile UIs using a single LLM. We designed prompting techniques to adapt an LLM to mobile UIs and experimented with four important modeling tasks that address various scenarios in conversational interaction. Our method achieved competitive performance on these challenging tasks without requiring dedicated datasets and training, offering a lightweight and generalizable approach to enabling language-based mobile interaction.
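    To make the few-shot prompting idea concrete, here is a minimal sketch of a prompt builder; the screen serialization format, field labels, and exemplar layout are assumptions for illustration, not the paper's actual prompt design.

        def build_few_shot_prompt(exemplars, target_screen, task):
            """Assemble a few-shot prompt from (screen, answer) exemplars.

            `exemplars` pairs a textual serialization of a mobile screen with the
            desired output for the task; the target screen is appended last so the
            LLM completes the final answer.
            """
            parts = []
            for screen, answer in exemplars:
                parts.append(f"Screen: {screen}\nTask: {task}\nAnswer: {answer}\n")
            parts.append(f"Screen: {target_screen}\nTask: {task}\nAnswer:")
            return "\n".join(parts)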
    Towards Semantically-Aware UI Design Tools: Design, Implementation, and Evaluation of Semantic Grouping Guidelines
    Peitong Duan
    Bjoern Hartmann
    Karina Nguyen
    Marti Hearst
    ICML 2023 Workshop on Artificial Intelligence and Human-Computer Interaction (2023)
    Abstract: A coherent semantic structure, where semantically related elements are appropriately grouped, is critical for proper understanding of a UI. Ideally, UI design tools should help designers establish coherent semantic grouping. To work towards this, we contribute five semantic grouping guidelines that capture how human designers think about semantic grouping and are amenable to implementation in design tools. They were obtained from empirical observations of existing UIs, a literature review, and iterative refinement with UI experts’ feedback. We validated our guidelines through an expert review and heuristic evaluation; the results indicate these guidelines capture valuable information about semantic structure. We demonstrate the guidelines’ use for building systems by implementing a set of computational metrics. These metrics detected many of the same severe issues that human design experts marked in a comparative study. Running our metrics on a larger UI dataset suggests many real UIs exhibit grouping violations.
    Abstract: Mobile UI understanding is important for enabling various interaction tasks such as UI automation and accessibility. Previous mobile UI modeling often depends on the view hierarchy information of a screen, which directly provides the structural data of the UI, with the hope of bypassing the challenging task of visual modeling from screen pixels. However, view hierarchies are not always available, and are often corrupted with missing object descriptions or misaligned structure information. As a result, although using view hierarchies can offer short-term gains, it may ultimately hinder the applicability and performance of the model. In this paper, we propose Spotlight, a vision-only approach to mobile UI understanding. Specifically, we enhance a vision-language model that only takes the screenshot of the UI and a region of interest on the screen (the focus) as its input. This general architecture of Spotlight is easily scalable and capable of performing a range of UI modeling tasks. Our experiments show that our model establishes SoTA results on several representative UI tasks and outperforms previous methods that use both screenshots and view hierarchies as inputs. Furthermore, we explore the multi-task learning and few-shot prompting capabilities of the proposed models, demonstrating promising results in the multi-task learning direction.
    Abstract: User interface design is a complex task that involves designers examining a wide range of options. We present Spacewalker, a tool that allows designers to rapidly search a large design space for an optimal web UI with integrated support. Designers first annotate each attribute they want to explore in a typical HTML page, using a simple markup extension we designed. Spacewalker then parses the annotated HTML specification, and intelligently generates and distributes various configurations of the web UI to crowd workers for evaluation. We enhanced a genetic algorithm to accommodate crowd worker responses from pairwise comparisons of UI designs, which is crucial for obtaining reliable feedback. Based on our experiments, Spacewalker allows designers to effectively search a large design space of a UI, using a language they are familiar with, and improve their designs rapidly at minimal cost.
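    As an illustration of how pairwise crowd judgments can drive a genetic algorithm, here is a minimal sketch; the win-rate fitness and fitness-proportional selection are assumptions for illustration, not Spacewalker's actual algorithm.

        import random

        def select_parents(candidates, comparisons, k=2):
            """Pick parent UI configurations from pairwise-comparison results.

            `comparisons` is a list of (a, b, winner) tuples from crowd workers.
            Each candidate's fitness is its win rate over the comparisons it
            appeared in; parents are drawn by fitness-proportional selection.
            """
            def win_rate(c):
                appearances = [cmp for cmp in comparisons if c in (cmp[0], cmp[1])]
                if not appearances:
                    return 0.0
                wins = sum(1 for cmp in appearances if cmp[2] == c)
                return wins / len(appearances)

            weights = [win_rate(c) + 1e-6 for c in candidates]  # avoid all-zero weights
            return random.choices(candidates, weights=weights, k=k)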
    TapNet: The Design, Training, Implementation, and Applications of a Multi-Task Learning CNN for Off-Screen Mobile Input
    Michael Xuelin Huang
    Nazneen Nazneen
    Alex Chao
    ACM CHI Conference on Human Factors in Computing Systems, ACM (2021)
    Abstract: Off-screen interaction offers great potential for one-handed and eyes-free mobile interaction. While a few existing studies have explored using built-in mobile phone sensors to sense off-screen signals, none met practical requirements. This paper discusses the design, training, implementation, and applications of TapNet, a multi-task network that detects tapping on the smartphone using the built-in accelerometer and gyroscope. With sensor location as auxiliary information, TapNet can jointly learn from data across devices and simultaneously recognize multiple tap properties, including tap direction and tap location. We developed four datasets consisting of over 180K training samples, 38K testing samples, and 87 participants in total. Experimental evaluation demonstrated the effectiveness of the TapNet design and its significant improvement over the state of the art. Along with the datasets, codebase, and extensive experiments, TapNet establishes a new technical foundation for off-screen mobile input.
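    A minimal PyTorch sketch of the multi-task idea described above; the layer sizes, input window shape, class counts, and auxiliary encoding are assumptions for illustration and are not taken from the paper.

        import torch
        from torch import nn

        class MultiTaskTapNet(nn.Module):
            """Shared convolutional trunk over accelerometer+gyroscope windows,
            with separate heads for tap direction and tap location, and a
            sensor-location encoding concatenated as auxiliary input."""

            def __init__(self, channels=6, n_directions=4, n_locations=5, aux_dim=8):
                super().__init__()
                self.trunk = nn.Sequential(
                    nn.Conv1d(channels, 32, kernel_size=5, padding=2), nn.ReLU(),
                    nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
                    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                )
                self.direction_head = nn.Linear(64 + aux_dim, n_directions)
                self.location_head = nn.Linear(64 + aux_dim, n_locations)

            def forward(self, imu_window, sensor_location):
                # imu_window: (batch, channels, time); sensor_location: (batch, aux_dim)
                shared = self.trunk(imu_window)
                features = torch.cat([shared, sensor_location], dim=1)
                return self.direction_head(features), self.location_head(features)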
    Abstract: Natural language descriptions of user interface (UI) elements, such as alternative text, are crucial for accessibility and language-based interaction in general. Yet, these descriptions are frequently missing in mobile UIs. We propose widget captioning, a novel task for automatically generating language descriptions for UI elements from multimodal input including both the image and the structural representations of user interfaces. We collected a large-scale dataset for widget captioning with crowdsourcing. Our dataset contains 162,859 language phrases created by human workers for annotating 61,285 UI elements across 21,750 unique UI screens. We thoroughly analyze the dataset, and train and evaluate a set of deep model configurations to investigate how each feature modality, as well as the choice of learning strategies, impacts the quality of predicted captions. The task formulation and the dataset, as well as our benchmark models, contribute a solid basis for this novel multimodal captioning task that connects language and user interfaces.
    Abstract: We present a new problem: grounding natural language instructions to mobile user interface actions, and create three new datasets for it. For full task evaluation, we create PixelHelp, a corpus that pairs English instructions with actions performed by people on a mobile UI emulator. To scale training, we decouple the language and action data by (a) annotating action phrase spans in How-To instructions and (b) synthesizing grounded descriptions of actions for mobile user interfaces. We use a Transformer to extract action phrase tuples from long-range natural language instructions. A grounding Transformer then contextually represents UI objects using both their content and screen position and connects them to object descriptions. Given a starting screen and instruction, our model achieves 70.59% accuracy on predicting complete ground-truth action sequences in PixelHelp.
    Area Attention
    Lukasz Kaiser
    Samy Bengio
    ICML (2019)
    Abstract: Existing attention mechanisms are mostly point-based in that a model is designed to attend to a single item in a collection of items (the memory). Intuitively, an area in the memory that may contain multiple items can be worth attending to as well. Although Softmax, which is typically used for computing attention alignments, assigns non-zero probability to every item in memory, it tends to converge to a single item and cannot efficiently attend to a group of items that matter. We propose area attention: a way to attend to an area of the memory, where each area contains a group of items that are either spatially adjacent when the memory has a 2-dimensional structure, such as images, or temporally adjacent for 1-dimensional memory, such as natural language sentences. Importantly, the size of an area, i.e., the number of items in an area, can vary depending on the learned coherence of the adjacent items. By attending to an area of items instead of a single item, we hope attention mechanisms can better capture the nature of the task. Area attention can work along with multi-head attention for attending to multiple areas in the memory. We evaluate area attention on two tasks, character-level neural machine translation and image captioning, and improve upon strong (state-of-the-art) baselines in both cases. In addition to proposing the novel concept of area attention, we contribute an efficient way of computing it by leveraging the technique of summed area tables.
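    A minimal NumPy sketch of 1-dimensional area attention for a single query: areas are contiguous spans of up to a fixed number of items, span sums come from a prefix sum (the 1-D analogue of a summed area table), and standard softmax attention runs over the areas. The mean-pooled area keys and values and the fixed maximum area size are simplifying assumptions for illustration; the paper also covers 2-D areas.

        import numpy as np

        def area_attention_1d(query, keys, values, max_area=3):
            """Attend over contiguous spans ("areas") of up to `max_area` items."""
            n, d = keys.shape
            # Prefix sums give the sum over any span [start, start+size) in O(1).
            key_prefix = np.vstack([np.zeros((1, d)), np.cumsum(keys, axis=0)])
            val_prefix = np.vstack([np.zeros((1, values.shape[1])), np.cumsum(values, axis=0)])

            area_keys, area_values = [], []
            for size in range(1, max_area + 1):
                for start in range(n - size + 1):
                    area_keys.append((key_prefix[start + size] - key_prefix[start]) / size)
                    area_values.append((val_prefix[start + size] - val_prefix[start]) / size)
            area_keys = np.stack(area_keys)
            area_values = np.stack(area_values)

            scores = area_keys @ query / np.sqrt(d)   # scaled dot-product over areas
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()
            return weights @ area_values              # weighted mix of area values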
    Abstract: Neural language models have been widely used in various NLP tasks, including machine translation, next-word prediction and conversational agents. However, it is challenging to deploy these models on mobile devices due to their slow prediction speed, where the bottleneck is computing the top candidates in the softmax layer. In this paper, we introduce a novel softmax layer approximation algorithm that exploits the clustering structure of context vectors. Our algorithm uses a light-weight screening model to predict a much smaller set of candidate words based on the given context, and then conducts an exact softmax only within that subset. Training such a procedure end-to-end is challenging, as traditional clustering methods are discrete and non-differentiable, and thus cannot be used with back-propagation during training. Using the Gumbel softmax, we are able to train the screening model end-to-end on the training set to exploit the data distribution. The algorithm achieves an order of magnitude faster inference than the original softmax layer for predicting top-k words in various tasks such as beam search in machine translation or next-word prediction. For example, for a German-to-English machine translation task with a vocabulary of around 25K words, we achieve a 20.4x speedup with 98.9% precision@1 and 99.3% precision@5 relative to the original softmax layer prediction, while the state of the art (Zhang et al., 2018) only achieves a 6.7x speedup with 98.7% precision@1 and 98.1% precision@5 on the same task.
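    A minimal NumPy sketch of the inference-time procedure described above; the linear screening scorer over cluster centroids and the precomputed word-cluster assignment are assumptions for illustration (in the paper the screening model is trained end-to-end with the Gumbel softmax).

        import numpy as np

        def screened_top_k(context, cluster_centroids, cluster_words, softmax_W, softmax_b, k=5):
            """Predict top-k words with a screening step followed by an exact softmax.

            A lightweight screening model picks the most likely word cluster for the
            context vector, and the exact softmax is computed only over that
            cluster's candidate word ids instead of the full vocabulary.
            """
            cluster_id = int(np.argmax(cluster_centroids @ context))   # screening step
            candidates = cluster_words[cluster_id]                     # candidate word ids

            logits = softmax_W[candidates] @ context + softmax_b[candidates]
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()

            top = np.argsort(-probs)[:k]
            return [(int(candidates[i]), float(probs[i])) for i in top]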
    Investigating Cursor-based Interactions to Support Non-Visual Exploration in the Real World
    Anhong Guo
    Xu Wang
    Patrick Clary
    Ken Goldman
    Jeffrey Bigham
    Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility (2018)
    Abstract: The human visual system processes complex scenes to focus attention on relevant items. However, blind people cannot visually skim for an area of interest. Instead, they use a combination of contextual information, knowledge of the spatial layout of their environment, and interactive scanning to find and attend to specific items. In this paper, we define and compare three cursor-based interactions to help blind people attend to items in a complex visual scene: window cursor (move the phone to scan), finger cursor (point a finger to read), and touch cursor (drag a finger on the touchscreen to explore). We conducted a user study with 12 participants to evaluate the three techniques on four tasks, and found that window cursor worked well for locating objects on large surfaces, finger cursor worked well for accessing control panels, and touch cursor worked well for helping users understand spatial layouts. A combination of multiple techniques will likely be best for supporting a variety of everyday tasks for blind users.
    Doppio: Tracking UI Flows and Code Changes for App Development
    Senpo Hu
    CHI 2018: ACM Conference on Human Factors in Computing Systems
    Abstract: Developing interactive systems often involves a large set of callback functions for handling user interaction, which makes it challenging to manage UI behaviors, create descriptive documentation, and track revisions. We developed Doppio, a tool that automatically tracks and visualizes UI flows and their changes based on source code elements and their revisions. For each input event listener of a widget, e.g., onClick of an Android View class, Doppio captures and associates its UI output from an execution of the program with its code snippet from the source code. It automatically generates a screenflow diagram that is organized by the callback methods and interaction flow, where developers can review the code and UI revisions interactively. Doppio, implemented as an IDE plugin, is seamlessly integrated into a common development workflow. Our experiments show that Doppio was able to generate quality visual documentation and helped participants understand unfamiliar source code and track changes.
    M3 Gesture Menu: Design and Experimental Analyses of Marking Menus for Touchscreen Mobile Interaction
    Kun Li
    Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, ACM, New York, NY, USA, 249:1-249:14
    Abstract: Despite their learning advantages in theory, marking menus have faced adoption challenges in practice, even on today's touchscreen-based mobile devices. We address these challenges by designing, implementing, and evaluating multiple versions of M3 Gesture Menu (M3), a reimagination of marking menus targeted at mobile interfaces. M3 is defined on a grid rather than in a radial space, relies on gestural shapes rather than directional marks, and has constant and stationary space use. Our first controlled experiment on expert performance showed M3 was faster and less error-prone by a factor of two than traditional marking menus. A second experiment on learning demonstrated for the first time that users could successfully transition to recall-based execution of a dozen commands after three ten-minute practice sessions with both M3 and Multi-Stroke Marking Menu. Together, M3, with its demonstrated resolution, learning, and space use benefits, contributes to the design and understanding of menu selection in the mobile-first era of end-user computing.
    Abstract: Existing sequence prediction methods are mostly concerned with time-independent sequences, in which the actual time span between events is irrelevant and the distance between events is simply the difference between their order positions in the sequence. While this time-independent view of sequences is applicable for data such as natural language, e.g., dealing with words in a sentence, it is inappropriate and inefficient for many real-world events that are observed and collected at unequally spaced points of time as they naturally arise, e.g., when a person goes to a grocery store or makes a phone call. The time span between events can carry important information about the sequence dependence of human behaviors. To leverage continuous time in sequence prediction, we propose two methods for integrating time into event representation, based on intuitions about how time is tokenized in everyday life and on previous work on embedding contextualization. We particularly focus on using these methods in recurrent neural networks, which have gained popularity in many sequence prediction tasks. We evaluated these methods as well as baseline models on two learning tasks: mobile app usage prediction and music recommendation. The experiments revealed that the proposed methods for time-dependent representation offer consistent accuracy gains compared to baseline models that either directly use the continuous time value in a recurrent neural network or do not use time at all.
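    To illustrate the time-tokenization intuition, here is a minimal sketch in which a continuous time gap is discretized into coarse everyday buckets whose learned embedding is concatenated with the event embedding at each RNN step; the bucket boundaries and the concatenation scheme are assumptions for illustration, not the paper's exact methods.

        import numpy as np

        # Assumed bucket boundaries (seconds): within a minute, 5 minutes, an hour,
        # 6 hours, a day, a week, and everything beyond.
        TIME_BUCKETS = [60, 300, 3600, 6 * 3600, 24 * 3600, 7 * 24 * 3600]

        def time_bucket(delta_seconds):
            """Map a continuous time gap to a discrete bucket id."""
            for i, bound in enumerate(TIME_BUCKETS):
                if delta_seconds < bound:
                    return i
            return len(TIME_BUCKETS)

        def rnn_step_input(event_embedding, delta_seconds, time_embedding_table):
            """Concatenate the event embedding with the embedding of its time bucket."""
            time_emb = time_embedding_table[time_bucket(delta_seconds)]
            return np.concatenate([event_embedding, time_emb])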
    Abstract: Model compression is essential for serving large deep neural nets on devices with limited resources or in applications that require real-time responses. For advanced NLP problems, a neural language model usually consists of recurrent layers (e.g., using LSTM cells), an embedding matrix for representing input tokens, and a softmax layer for generating output tokens. For problems with a very large vocabulary size, the embedding and softmax matrices can account for more than half of the model size. For instance, the bigLSTM model achieves state-of-the-art performance on the One-Billion-Word (OBW) dataset with a vocabulary of around 800K words; its word embedding and softmax matrices use more than 6 GB of space and are responsible for over 90% of the model parameters. In this paper, we propose GroupReduce, a novel compression method for neural language models based on vocabulary-partition (block) based low-rank matrix approximation and the inherent frequency distribution of tokens (the power-law distribution of words). We start by grouping words into c blocks based on their frequency, and then refine the clustering iteratively by constructing a weighted low-rank approximation for each block, where the weights are based on the frequencies of the words in the block. The experimental results show our method can significantly outperform traditional compression methods such as low-rank approximation and pruning. On the OBW dataset, our method achieved a 6.6x compression rate for the embedding and softmax matrices, and when combined with quantization, it can achieve a 26x compression rate without losing prediction accuracy.
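    A minimal NumPy sketch of the block-wise, frequency-weighted low-rank step; the fixed number of blocks, the single shared rank, and the omission of the paper's iterative clustering refinement are simplifying assumptions for illustration.

        import numpy as np

        def frequency_weighted_lowrank(block, freqs, rank):
            """Row-weighted low-rank approximation of one vocabulary block.

            Minimizing sum_i f_i * ||E_i - (U V)_i||^2 reduces to an ordinary SVD
            of the row-rescaled matrix diag(sqrt(f)) @ E, followed by unscaling U.
            """
            w = np.sqrt(freqs)[:, None]
            U, s, Vt = np.linalg.svd(w * block, full_matrices=False)
            U_r = (U[:, :rank] * s[:rank]) / w     # per-word factors (unscaled)
            V_r = Vt[:rank]                        # shared basis for the block
            return U_r, V_r

        def groupreduce_sketch(embedding, word_freq, num_blocks=4, rank=32):
            """Sort words by frequency, split into blocks, and factor each block."""
            order = np.argsort(-word_freq)          # most frequent words first
            factors = []
            for idx in np.array_split(order, num_blocks):
                U_r, V_r = frequency_weighted_lowrank(embedding[idx], word_freq[idx], rank)
                factors.append((idx, U_r, V_r))     # per-block compressed factors
            return factors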
    Abstract: Touchscreen mobile devices can afford rich interaction behaviors but they are complex to model. Scrollable two-dimensional grids are a common user interface on mobile devices that allow users to access a large number of items on a small screen by direct touch. By analyzing touch input and eye gaze of users during grid interaction, we reveal how multiple performance components come into play in such a task, including navigation, visual search and pointing. These findings inspired us to design a novel predictive model that combines these components for modeling grid tasks. We realized these model components by employing both traditional analytical methods and data-driven machine learning approaches. In addition to showing high accuracy achieved by our model in predicting human performance on a test dataset, we demonstrate how such a model can lead to a significant reduction in interaction time when used in a predictive user interface.
    Abstract: Predicting human performance in interaction tasks allows designers or developers to understand the expected performance of a target interface without actually testing it with real users. In this work, we present a deep neural net to model and predict human performance in performing a sequence of UI tasks. In particular, we focus on a dominant class of tasks, i.e., target selection from a vertical list or menu, which resembles many interaction scenarios in modern user interfaces. We experimented with our deep neural net using a public dataset collected in a desktop laboratory environment and a dataset collected from hundreds of touchscreen smartphone users via crowdsourcing. Our model significantly outperformed previous methods in various settings. Importantly, our method, as a deep model, can easily incorporate additional UI attributes such as visual appearance and content semantics without changing the model architecture; these attributes were hard to capture with previous methods. We discuss our insights into the behavior of our model.
    Enhancing Cross-Device Interaction Scripting with Interactive Illustrations
    Bjorn Hartmann
    CHI 2016: ACM Conference on Human Factors in Computing Systems
    Abstract: Cross-device interactions involve input and output on multiple computing devices. Implementing and reasoning about interactions that cover multiple devices with a diversity of form factors and capabilities can be complex. To assist developers in programming cross-device interactions, we created DemoScript, a technique that automatically analyzes a cross-device interaction program while it is being written. DemoScript visually illustrates the step-by-step execution of a selected portion or of the entire program with a novel, automatically generated cross-device storyboard visualization. In addition to helping developers understand the behavior of the program, DemoScript also allows developers to revise their program by interactively manipulating the cross-device storyboard. We evaluated DemoScript with 8 professional programmers and found that it significantly improved development efficiency by helping developers interpret and manage cross-device interaction; it also encouraged them to test and think through their scripts during development.
    Abstract: To address the increasing functionality (or information) overload of smartphones, prior research has explored a variety of methods to extend the input vocabulary of mobile devices. In particular, body tapping has been proposed as a technique that allows the user to quickly access a target functionality by simply tapping at a specific location on the body with a smartphone. Though compelling, prior work often fell short of supporting unconstrained tapping locations or behaviors. To address this problem, we developed a novel recognition method that combines offline learning (before the system sees any user-defined gestures) with online learning to reliably recognize arbitrary, user-defined body tapping gestures, using only a smartphone's built-in sensors. Our experiments indicate that our method significantly outperforms baseline approaches in several usage conditions. In particular, provided with only a single sample per location, our accuracy is 30.8% above an SVM baseline and 24.8% above a template matching method. Based on these findings, we discuss how our approach can be generalized to other user-defined gesture problems.
    Gesture On: Always-On Touch Gestures for Fast Mobile Access from Device Standby Mode
    Hao Lu
    CHI 2015: ACM Conference on Human Factors in Computing Systems, ACM, pp. 3355-3364
    Abstract: Contributes a system that overrides the mobile platform kernel behavior to enable touchscreen gesture shortcuts in standby mode. A user can issue a gesture on the touchscreen before the screen is even turned on.
    Weave: Scripting Cross-Device Wearable Interaction
    CHI 2015: ACM Conference on Human Factors in Computing Systems, ACM, pp. 3923-3932
    Abstract: Provides a set of high-level APIs, based on JavaScript, and integrated tool support for developers to easily distribute UI output and combine user input and sensing events across devices for cross-device interaction.
    Hierarchical Route Maps for Efficient Navigation
    Noah Wang
    Daisuke Sakamoto
    Takeo Igarashi
    IUI 2014: International Conference on Intelligent User Interfaces
    Detecting Tapping Motion on the Side of Mobile Devices By Probabilistically Combining Hand Postures
    William McGrath
    UIST 2014: ACM Symposium on User Interface Software and Technology, ACM
    Gesturemote: interacting with remote displays through touch gestures
    Hao Lü
    Matei Negulescu
    AVI 2014: International Working Conference on Advanced Visual Interfaces
    InkAnchor: Enhancing Informal Ink-Based Note Taking on Touchscreen Mobile Phones
    Yi Ren
    Edward Lank
    CHI 2014: ACM Conference on Human Factors in Computing Systems
    Teaching Motion Gestures via Recognizer Feedback
    Ankit Kamal
    Edward Lank
    IUI 2014: International Conference on Intelligent User Interfaces
    GestKeyboard: Enabling Gesture-Based Interaction on Ordinary Physical Keyboard
    Haimo Zhang
    CHI 2014: ACM Conference on Human Factors in Computing Systems
    Optimistic Programming of Touch Interaction
    Hao Lu
    Haimo Zhang
    TOCHI: ACM Transactions on Computer-Human Interaction (2014)
    Open project: a lightweight framework for remote sharing of mobile applications
    Matei Negulescu
    UIST '13: Proceedings of the 26th annual ACM symposium on User interface software and technology (2013), pp. 281-290
    CrowdLearner: Rapidly Creating Mobile Recognizers Using Crowdsourcing
    Shahriyar Amini
    UIST'13: Proceedings of the 26th annual ACM symposium on User interface software and technology (2013), pp. 163-172
    FFitts Law: Modeling Finger Touch with Fitts’ Law
    Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 2013), ACM, New York, NY, USA, pp. 1363-1372
    Abstract: Fitts’ law has proven to be a strong predictor of pointing performance under a wide range of conditions. However, it has been insufficient for modeling small-target acquisition with finger-touch input on screens. We propose a dual-distribution hypothesis to interpret the distribution of endpoints in finger touch input. We hypothesize that the movement endpoint distribution is a sum of two independent normal distributions: one distribution reflects the relative precision governed by the speed-accuracy tradeoff rule in the human motor system, and the other captures the absolute precision of finger touch, independent of the speed-accuracy tradeoff effect. Based on this hypothesis, we derived the FFitts model, an expansion of Fitts’ law for finger touch input. We present three experiments covering 1D target acquisition, 2D target acquisition, and touchscreen keyboard typing tasks. The results showed that FFitts law is more accurate than Fitts’ law in modeling finger input on touchscreens. With R² values of 0.91 or greater, FFitts’ index of difficulty accounts for significantly more variance than the conventional Fitts’ index of difficulty based on either a nominal target width or an effective target width in all three experiments.
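    In equation form, the dual-distribution hypothesis and one effective-width reading of the resulting index of difficulty can be sketched as follows; the exact derivation and constants in the paper may differ, so this is illustrative only.

        \sigma^2 = \sigma_r^2 + \sigma_a^2
        % observed endpoint variance = relative (speed-accuracy-governed) component
        %                              + absolute finger-touch component

        ID_{FFitts} = \log_2\!\left( \frac{A}{\sqrt{2\pi e\,(\sigma^2 - \sigma_a^2)}} + 1 \right)
        % the effective-width form of Fitts' law, \log_2(A/W_e + 1) with
        % W_e = \sqrt{2\pi e}\,\sigma, after removing the absolute-precision
        % component \sigma_a^2 from the observed endpoint variance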
    Tap, swipe, or move: attentional demands for distracted smartphone input
    Matei Negulescu
    Jaime Ruiz
    Edward Lank
    Proceedings of the International Working Conference on Advanced Visual Interfaces, ACM, New York, NY, USA (2012), pp. 173-180
    Gesture-based interaction: a new dimension for mobile user interfaces
    Proceedings of the International Working Conference on Advanced Visual Interfaces, ACM, New York, NY, USA (2012), pp. 6-6
    Gesture Search: Random Access to Smartphone Content
    IEEE Computer: Pervasive Computing, vol. 11 (2012), pp. 10-13
    Gesture coder: a tool for programming multi-touch gestures by demonstration
    Hao Lu
    Proceedings of the 2012 ACM annual conference on Human Factors in Computing Systems, ACM, New York, NY, USA, pp. 2875-2884
    Gesture Avatar: A Technique for Operating Mobile User Interfaces Using Gestures
    Hao Lü
    CHI 2011: ACM Conference on Human Factors in Computing Systems, pp. 207-216
    Deep Shot: A Framework for Migrating Tasks Across Devices Using Mobile Phone Cameras
    Sean Chang
    CHI 2011: ACM Conference on Human Factors in Computing Systems, pp. 2163-2172
    DoubleFlip: A Motion Gesture Delimiter for Mobile Interaction
    Jaime Ruiz
    CHI 2011: ACM Conference on Human Factors in Computing Systems, pp. 2717-2720
    Experimental Analysis of Touch-Screen Gesture Designs in Mobile Environments
    Andrew Bragdon
    Eugene Nelson
    Ken Hinckley
    CHI 2011: ACM Conference on Human Factors in Computing Systems, pp. 403-412
    User-Defined Motion Gestures for Mobile Interaction
    Jaime Ruiz
    Edward Lank
    CHI 2011: ACM Conference on Human Factors in Computing Systems, pp. 197-206
    Protractor: A Fast and Accurate Gesture Recognizer
    CHI 2010: ACM Conference on Human Factors in Computing Systems, ACM
    Gesture Search: A Tool for Fast Mobile Data Access
    UIST'10: Symposium on User Interface Software and Technology, ACM (2010), pp. 87-96
    Abstract: Modern mobile phones can store a large amount of data, such as contacts, applications and music. However, it is difficult to access specific data items via existing mobile user interfaces. In this paper, we present Gesture Search, a tool that allows a user to quickly access various data items on a mobile phone by drawing gestures on its touch screen. Gesture Search contributes a unique way of combining gesture-based interaction and search for fast mobile data access. It also demonstrates a novel approach for coupling gestures with standard GUI interaction. A real-world deployment with mobile phone users showed that Gesture Search enabled fast, easy access to mobile data in their day-to-day lives. Gesture Search has been released to the public and is currently in use by hundreds of thousands of mobile users. It was rated positively by users, with a mean rating of 4.5 out of 5 across over 5,000 ratings.
    FrameWire: A Tool for Automatically Extracting Interaction Logic from Paper Prototyping Tests
    Xiang Cao
    Katherine Everitt
    Morgan Dixon
    James Landay
    CHI 2010: ACM Conference on Human Factors in Computing Systems, pp. 503-512