Yang Li
Yang Li is a Staff Research Scientist at Google, and an affiliate faculty member in Computer Science & Engineering at the University of Washington. Yang’s research focuses on the intersection between deep learning and human computer interaction.
See Yang Li's personal website.
Authored Publications
PaLI-X: On Scaling up a Multilingual Vision and Language Model
Josip Djolonga
Piotr Padlewski
Basil Mustafa
Carlos Riquelme
Yi Tay
Siamak Shakeri
Daniel Salz
Michael Tschannen
Mandar Joshi
Filip Pavetić
Anurag Arnab
Yuanzhong Xu
Keran Rong
Computer Vision and Pattern Recognition Conference (CVPR) (2024)
We explore the boundaries of scaling up a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. Our model advances the state-of-the-art on most vision-and-language benchmarks considered (20+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
Towards Semantically-Aware UI Design Tools: Design, Implementation, and Evaluation of Semantic Grouping Guidelines
Peitong Duan
Bjoern Hartmann
Karina Nguyen
Marti Hearst
ICML 2023 Workshop on Artificial Intelligence and Human-Computer Interaction (2023)
A coherent semantic structure, where semantically-related elements are appropriately grouped, is critical for proper understanding of a UI. Ideally, UI design tools should help designers establish coherent semantic grouping. To work towards this, we contribute five semantic grouping guidelines that capture how human designers think about semantic grouping and are amenable to implementation in design tools. They were obtained from empirical observations on existing UIs, a literature review, and iterative refinement with UI experts’ feedback. We validated our guidelines through an expert review and heuristic evaluation; results indicate these guidelines capture valuable information about semantic structure. We demonstrate the guidelines’ use for building systems by implementing a set of computational metrics. These metrics detected many of the same severe issues that human design experts marked in a comparative study. Running our metrics on a larger UI dataset suggests many real UIs exhibit grouping violations.
Mobile UI understanding is important for enabling various interaction tasks such as UI automation and accessibility. Previous mobile UI modeling often depends on the view hierarchy information of a screen, which directly provides the structural data of the UI, in the hope of bypassing the challenging task of visual modeling from screen pixels. However, view hierarchies are not always available, and are often corrupted with missing object descriptions or misaligned structure information. As a result, although the use of view hierarchies could offer short-term gains, it may ultimately hinder the applicability and performance of the model. In this paper, we propose Spotlight, a vision-only approach for mobile UI understanding. Specifically, we enhance a vision-language model that only takes the screenshot of the UI and a region of interest on the screen (the focus) as the input. This general architecture of Spotlight is easily scalable and capable of performing a range of UI modeling tasks. Our experiments show that our model establishes SoTA results on several representative UI tasks and outperforms previous methods that use both screenshots and view hierarchies as inputs. Furthermore, we explore multi-task learning and few-shot prompting capacities of the proposed models, demonstrating promising results in the multi-task learning direction.
Conversational agents show the promise to allow users to interact with mobile devices using language. However, to perform diverse UI tasks with natural language, developers typically need to create separate datasets and models for each specific task, which is expensive and effort-consuming. Recently, pre-trained large language models (LLMs) have been shown capable of generalizing to various downstream tasks when prompted with a handful of examples from the target task. This paper investigates the feasibility of enabling versatile conversational interactions with mobile UIs using a single LLM. We designed prompting techniques to adapt an LLM to mobile UIs. We experimented with four important modeling tasks that address various scenarios in conversational interaction. Our method achieved competitive performance on these challenging tasks without requiring dedicated datasets and training, offering a lightweight and generalizable approach to enable language-based mobile interaction.
User interface design is a complex task that involves designers examining a wide range of options. We present Spacewalker, a tool that allows designers to rapidly search a large design space for an optimal web UI with integrated support. Designers first annotate each attribute they want to explore in a typical HTML page, using a simple markup extension we designed. Spacewalker then parses the annotated HTML specification, and intelligently generates and distributes various configurations of the web UI to crowd workers for evaluation. We enhanced a genetic algorithm to accommodate crowd worker responses from pairwise comparison of UI designs, which is crucial for obtaining reliable feedback. Based on our experiments, Spacewalker allows designers to effectively search a large design space of a UI, using the language they are familiar with, and improve their design rapidly at a minimal cost.
TapNet: The Design, Training, Implementation, and Applications of a Multi-Task Learning CNN for Off-Screen Mobile Input
Michael Xuelin Huang
Nazneen Nazneen
Alex Chao
ACM CHI Conference on Human Factors in Computing Systems, ACM (2021)
Off-screen interaction offers great potential for one-handed and eyes-free mobile interaction. While a few existing studies have explored the built-in mobile phone sensors to sense off-screen signals, none has met practical requirements. This paper discusses the design, training, implementation and applications of TapNet, a multi-task network that detects tapping on the smartphone using the built-in accelerometer and gyroscope. With sensor location as auxiliary information, TapNet can jointly learn from data across devices and simultaneously recognize multiple tap properties, including tap direction and tap location. We developed four datasets consisting of over 180K training samples, 38K testing samples, and 87 participants in total. Experimental evaluation demonstrated the effectiveness of the TapNet design and its significant improvement over the state of the art. Along with the datasets, codebase, and extensive experiments, TapNet establishes a new technical foundation for off-screen mobile input.
Natural language descriptions of user interface (UI) elements such as alternative text are crucial for accessibility and language-based interaction in general. Yet, these descriptions are constantly missing in mobile UIs. We propose widget captioning, a novel task for automatically generating language descriptions for UI elements from multimodal input including both the image and the structural representations of user interfaces. We collected a large-scale dataset for widget captioning with crowdsourcing. Our dataset contains 162,859 language phrases created by human workers for annotating 61,285 UI elements across 21,750 unique UI screens. We thoroughly analyze the dataset, and train and evaluate a set of deep model configurations to investigate how each feature modality as well as the choice of learning strategies impact the quality of predicted captions. The task formulation and the dataset as well as our benchmark models contribute a solid basis for this novel multimodal captioning task that connects language and user interfaces.
We present a new problem: grounding natural language instructions to mobile user interface actions, and create three new datasets for it. For full task evaluation, we create PixelHelp, a corpus that pairs English instructions with actions performed by people on a mobile UI emulator. To scale training, we decouple the language and action data by (a) annotating action phrase spans in How-To instructions and (b) synthesizing grounded descriptions of actions for mobile user interfaces. We use a Transformer to extract action phrase tuples from long-range natural language instructions. A grounding Transformer then contextually represents UI objects using both their content and screen position and connects them to object descriptions. Given a starting screen and instruction, our model achieves 70.59% accuracy on predicting complete ground-truth action sequences in PixelHelp.
Neural language models have been widely used in various NLP tasks, including machine translation, next word prediction and conversational agents. However, it is challenging to deploy these models on mobile devices due to their slow prediction speed, where the bottleneck is to compute top candidates in the softmax layer. In this paper, we introduce a novel softmax layer approximation algorithm by exploiting the clustering structure of context vectors. Our algorithm uses a light-weight screening model to predict a much smaller set of candidate words based on the given context, and then conducts an exact softmax only within that subset. Training such a procedure end-to-end is challenging as traditional clustering methods are discrete and non-differentiable, and thus unable to be used with back-propagation in the training process. Using the Gumbel softmax, we are able to train the screening model end-to-end on the training set to exploit data distribution. The algorithm achieves an order of magnitude faster inference than the original softmax layer for predicting top-k words in various tasks such as beam search in machine translation or next words prediction. For example, for machine translation task on German to English dataset with around 25K vocabulary, we can achieve 20.4 times speed up with 98.9% precision@1 and 99.3% precision@5 with the original softmax layer prediction, while state-of-the-art (Zhang et al., 2018) only achieves 6.7x speedup with 98.7% precision@1 and 98.1% precision@5 for the same task.
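The two-stage inference described in this abstract (screen first, then run an exact softmax over the surviving candidates) can be illustrated with a minimal Python sketch. Everything named here is a hypothetical stand-in, not the paper's implementation: the screening model is assumed to be a simple dot-product score against per-cluster vectors (`cluster_centers`), each cluster is assumed to come with a precomputed candidate word list (`candidate_sets`), and the Gumbel-softmax training of the screening model is not shown.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def exact_softmax(logits):
    """Numerically stable softmax over a small list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def screened_top_k(context, cluster_centers, candidate_sets, word_vectors, k):
    """Return the top-k (word_id, probability) pairs within the screened subset.

    Assumption: screening is a linear score per cluster; the real screening
    model may differ.
    """
    # Stage 1: the light-weight screening step picks one cluster for this context.
    cluster = max(range(len(cluster_centers)),
                  key=lambda c: dot(cluster_centers[c], context))
    candidates = candidate_sets[cluster]

    # Stage 2: exact softmax, but only over the candidate words of that cluster.
    logits = [dot(word_vectors[w], context) for w in candidates]
    probs = exact_softmax(logits)
    ranked = sorted(zip(candidates, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```

The probabilities returned are normalized over the candidate subset only, so they approximate the full-vocabulary softmax rather than reproduce it; the saving comes from computing logits for the candidate words instead of the whole vocabulary.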
Existing attention mechanisms are mostly point-based in that a model is designed to attend to a single item in a collection of items (the memory). Intuitively, an area in the memory that may contain multiple items can be worth attending to as well. Although Softmax, which is typically used for computing attention alignments, assigns non-zero probability for every item in memory, it tends to converge to a single item and cannot efficiently attend to a group of items that matter. We propose area attention: a way to attend to an area of the memory, where each area contains a group of items that are either spatially adjacent when the memory has a 2-dimensional structure, such as images, or temporally adjacent for 1-dimensional memory, such as natural language sentences. Importantly, the size of an area, i.e., the number of items in an area, can vary depending on the learned coherence of the adjacent items. Using an area of items, instead of a single item, we hope attention mechanisms can better capture the nature of the task. Area attention can work alongside multi-head attention for attending to multiple areas in the memory. We evaluate area attention on two tasks: character-level neural machine translation and image captioning, and improve upon strong (state-of-the-art) baselines in both cases. In addition to proposing the novel concept of area attention, we contribute an efficient way for computing it by leveraging the technique of summed area tables.
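As a rough illustration of the summed-area-table idea for 1-dimensional memory, the following Python sketch enumerates every contiguous area up to a maximum width and obtains each area's aggregate from prefix sums in constant time per area. The specific choices here are assumptions for illustration rather than details taken from the abstract: an area's key is taken as the mean of its item keys, its value as the sum of its item values, and scoring is a plain dot product with the query.

```python
import math


def prefix_sums(vectors):
    """1-D summed area table: table[i] is the element-wise sum of vectors[:i]."""
    table = [[0.0] * len(vectors[0])]
    for vec in vectors:
        table.append([acc + x for acc, x in zip(table[-1], vec)])
    return table


def span_sum(table, start, end):
    """Sum of the original vectors in [start, end), read off the table in O(dim)."""
    return [hi - lo for hi, lo in zip(table[end], table[start])]


def area_attention_1d(query, keys, values, max_width):
    """Attend over all contiguous areas of width 1..max_width.

    Assumption: area key = mean of item keys, area value = sum of item values.
    Returns the (start, end) spans, their attention weights, and the output.
    """
    key_table = prefix_sums(keys)
    value_table = prefix_sums(values)
    areas, scores, area_values = [], [], []
    for width in range(1, max_width + 1):
        for start in range(len(keys) - width + 1):
            end = start + width
            mean_key = [s / width for s in span_sum(key_table, start, end)]
            areas.append((start, end))
            scores.append(sum(q * k for q, k in zip(query, mean_key)))
            area_values.append(span_sum(value_table, start, end))

    # Softmax over areas instead of over individual items.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    output = [sum(w * v[d] for w, v in zip(weights, area_values))
              for d in range(len(values[0]))]
    return areas, weights, output
```

With `max_width=1` this reduces to ordinary item-level attention, which is a convenient sanity check for the sketch.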
Model compression is essential for serving large deep neural nets on devices with limited resources or applications that require real-time responses. For advanced NLP problems, a neural language model usually consists of recurrent layers (e.g., using LSTM cells), an embedding matrix for representing input tokens, and a softmax layer for generating output tokens. For problems with a very large vocabulary size, the embedding and the softmax matrices can account for more than half of the model size. For instance, the bigLSTM model achieves state-of-the-art performance on the One-Billion-Word (OBW) dataset with around 800k vocabulary, and its word embedding and softmax matrices use more than 6 GB of space, and are responsible for over 90% of the model parameters. In this paper, we propose GroupReduce, a novel compression method for neural language models, based on vocabulary-partition (block) based low-rank matrix approximation and the inherent frequency distribution of tokens (the power-law distribution of words). We start by grouping words into c blocks based on their frequency, and then refine the clustering iteratively by constructing a weighted low-rank approximation for each block, where the weights are based on the frequencies of the words in the block. The experimental results show our method can significantly outperform traditional compression methods such as low-rank approximation and pruning. On the OBW dataset, our method achieved 6.6x compression rate for the embedding and softmax matrices, and when combined with quantization, our method can achieve 26x compression rate without losing prediction accuracy.
M3 Gesture Menu: Design and Experimental Analyses of Marking Menus for Touchscreen Mobile Interaction
Kun Li
Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, ACM, New York, NY, USA, 249:1-249:14
Despite their learning advantages in theory, marking menus have faced adoption challenges in practice, even on today's touchscreen-based mobile devices. We address these challenges by designing, implementing, and evaluating multiple versions of M3 Gesture Menu (M3), a reimagination of marking menus targeted at mobile interfaces. M3 is defined on a grid rather than in a radial space, relies on gestural shapes rather than directional marks, and has constant and stationary space use. Our first controlled experiment on expert performance showed M3 was faster and less error-prone by a factor of two than traditional marking menus. A second experiment on learning demonstrated for the first time that users could successfully transition to recall-based execution of a dozen commands after three ten-minute practice sessions with both M3 and Multi-Stroke Marking Menu. Together, M3, with its demonstrated resolution, learning, and space use benefits, contributes to the design and understanding of menu selection in the mobile-first era of end-user computing.
Touchscreen mobile devices can afford rich interaction behaviors but they are complex to model. Scrollable two-dimensional grids are a common user interface on mobile devices that allow users to access a large number of items on a small screen by direct touch. By analyzing touch input and eye gaze of users during grid interaction, we reveal how multiple performance components come into play in such a task, including navigation, visual search and pointing. These findings inspired us to design a novel predictive model that combines these components for modeling grid tasks. We realized these model components by employing both traditional analytical methods and data-driven machine learning approaches. In addition to showing high accuracy achieved by our model in predicting human performance on a test dataset, we demonstrate how such a model can lead to a significant reduction in interaction time when used in a predictive user interface.
Predicting human performance in interaction tasks allows designers or developers to understand the expected performance of a target interface without actually testing it with real users. In this work, we present a deep neural net to model and predict human performance in performing a sequence of UI tasks. In particular, we focus on a dominant class of tasks, i.e., target selection from a vertical list or menu, which resembles many interaction scenarios in modern user interfaces. We experimented with our deep neural net using a public dataset collected from a desktop laboratory environment and a dataset collected from hundreds of touchscreen smartphone users via crowdsourcing. Our model significantly outperformed previous methods in various settings. Importantly, our method, as a deep model, can easily incorporate additional UI attributes such as visual appearance and content semantics without changing model architectures—these attributes were hard to capture with previous methods. We discussed our insights into the behaviors of our model.
Developing interactive systems often involves a large set of callback functions for handling user interaction, which makes it challenging to manage UI behaviors, create descriptive documentation, and track revisions. We developed Doppio, a tool that automatically tracks and visualizes UI flows and their changes based on source code elements and their revisions. For each input event listener of a widget, e.g., onClick of an Android View class, Doppio captures and associates its UI output from an execution of the program with its code snippet from the source code. It automatically generates a screenflow diagram that is organized by the callback methods and interaction flow, where developers can review the code and UI revisions interactively. Doppio, implemented as an IDE plugin, is seamlessly integrated into a common development workflow. Our experiments show that Doppio was able to generate quality visual documentation and helped participants understand unfamiliar source code and track changes.
Existing sequence prediction methods are mostly concerned with time-independent sequences, in which the actual time span between events is irrelevant and the distance between events is simply the difference between their order positions in the sequence. While this time-independent view of sequences is applicable for data such as natural languages, e.g., dealing with words in a sentence, it is inappropriate and inefficient for many real world events that are observed and collected at unequally spaced points of time as they naturally arise, e.g., when a person goes to a grocery store or makes a phone call. The time span between events can carry important information about the sequence dependence of human behaviors. To leverage continuous time in sequence prediction, we propose two methods for integrating time into event representation, based on the intuition on how time is tokenized in everyday life and previous work on embedding contextualization. We particularly focus on using these methods in recurrent neural networks, which have gained popularity in many sequence prediction tasks. We evaluated these methods as well as baseline models on two learning tasks: mobile app usage prediction and music recommendation. The experiments revealed that the proposed methods for time-dependent representation offer consistent gain on accuracy compared to baseline models that either directly use continuous time value in a recurrent neural network or do not use time.
Investigating Cursor-based Interactions to Support Non-Visual Exploration in the Real World
Anhong Guo
Xu Wang
Patrick Clary
Ken Goldman
Jeffrey Bigham
Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility (2018)
The human visual system processes complex scenes to focus attention on relevant items. However, blind people cannot visually skim for an area of interest. Instead, they use a combination of contextual information, knowledge of the spatial layout of their environment, and interactive scanning to find and attend to specific items. In this paper, we define and compare three cursor-based interactions to help blind people attend to items in a complex visual scene: window cursor (move their phone to scan), finger cursor (point their finger to read), and touch cursor (drag their finger on the touchscreen to explore). We conducted a user study with 12 participants to evaluate the three techniques on four tasks, and found that: window cursor worked well for locating objects on large surfaces, finger cursor worked well for accessing control panels, and touch cursor worked well for helping users understand spatial layouts. A combination of multiple techniques will likely be best for supporting a variety of everyday tasks for blind users.
To address the increasing functionality (or information) overload of smartphones, prior research has explored a variety of methods to extend the input vocabulary of mobile devices. In particular, body tapping has been previously proposed as a technique that allows the user to quickly access a target functionality by simply tapping at a specific location of the body with a smartphone. Though compelling, prior work often fell short in enabling users’ unconstrained tapping locations or behaviors. To address this problem, we developed a novel recognition method that combines both offline learning (before the system sees any user-defined gestures) and online learning to reliably recognize arbitrary, user-defined body tapping gestures, using only a smartphone’s built-in sensors. Our experiment indicates that our method significantly outperforms baseline approaches in several usage conditions. In particular, provided with only a single sample per location, our accuracy exceeds that of an SVM baseline by 30.8% and that of a template matching method by 24.8%. Based on these findings, we discuss how our approach can be generalized to other user-defined gesture problems.
Enhancing Cross-Device Interaction Scripting with Interactive Illustrations
Bjorn Hartmann
CHI 2016: ACM Conference on Human Factors in Computing Systems
Cross-device interactions involve input and output on multiple computing devices. Implementing and reasoning about interactions that cover multiple devices with a diversity of form factors and capabilities can be complex. To assist developers in programming cross-device interactions, we created DemoScript, a technique that automatically analyzes a cross-device interaction program while it is being written. DemoScript visually illustrates the step-by-step execution of a selected portion or the entire program with a novel, automatically generated cross-device storyboard visualization. In addition to helping developers understand the behavior of the program, DemoScript also allows developers to revise their program by interactively manipulating the cross-device storyboard. We evaluated DemoScript with 8 professional programmers and found that DemoScript significantly improved development efficiency by helping developers interpret and manage cross-device interaction; it also encourages testing to think through the script in a development process.
Contributes a system that overrides the mobile platform kernel behavior to enable touchscreen gesture shortcuts in standby mode. A user can issue a gesture on the touchscreen before the screen is even turned on.
Weave: Scripting Cross-Device Wearable Interaction
CHI 2015: ACM Conference on Human Factors in Computing Systems, ACM, pp. 3923-3932
Provides a set of high-level APIs, based on JavaScript, and integrated tool support for developers to easily distribute UI output and combine user input and sensing events across devices for cross-device interaction.
Reflection: Enabling Event Prediction As an On-Device Service for Mobile Interaction
UIST 2014: ACM Symposium on User Interface Software and Technology
Hierarchical Route Maps for Efficient Navigation
Noah Wang
Daisuke Sakamoto
Takeo Igarashi
IUI 2014: International Conference on Intelligent User Interfaces
Gesturemote: interacting with remote displays through touch gestures
Hao Lü
Matei Negulescu
AVI 2014: International Working Conference on Advanced Visual Interfaces
CrowdLearner: Rapidly Creating Mobile Recognizers Using Crowdsourcing
Shahriyar Amini
UIST'13: Proceedings of the 26th annual ACM symposium on User interface software and technology (2013), pp. 163-172
FFitts Law: Modeling Finger Touch with Fitts’ Law
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 2013), ACM, New York, NY, USA, pp. 1363-1372
Fitts’ law has proven to be a strong predictor of pointing performance under a wide range of conditions. However, it has been insufficient in modeling small-target acquisition with finger-touch based input on screens. We propose a dual-distribution hypothesis to interpret the distribution of the endpoints in finger touch input. We hypothesize the movement endpoint distribution as a sum of two independent normal distributions. One distribution reflects the relative precision governed by the speed-accuracy tradeoff rule in the human motor system, and the other captures the absolute precision of finger touch independent of the speed-accuracy tradeoff effect. Based on this hypothesis, we derived the FFitts model, an expansion of Fitts’ law for finger touch input. We present three experiments in 1D target acquisition, 2D target acquisition and touchscreen keyboard typing tasks respectively. The results showed that FFitts law is more accurate than Fitts’ law in modeling finger input on touchscreens. At 0.91 or a greater R² value, FFitts’ index of difficulty is able to account for significantly more variance than conventional Fitts’ index of difficulty based on either a nominal target width or an effective target width in all the three experiments.
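The dual-distribution hypothesis can be written out as a short worked step. The symbols are assumptions introduced here for illustration: σ is the standard deviation of the observed touch endpoints, σ_r that of the relative-precision component, σ_a that of the absolute-precision component, and A the movement amplitude. Because the two components are independent normals, their variances add. The second line is a sketch of how an index of difficulty could then be built on the relative component alone, assuming the usual effective-width convention W_e = √(2πe)·σ; it should be read as an illustration of the idea rather than a quotation of the published model.

```latex
\[
\sigma^2 = \sigma_r^2 + \sigma_a^2
\quad\Longrightarrow\quad
\sigma_r = \sqrt{\sigma^2 - \sigma_a^2}
\]
\[
ID_{\mathrm{FFitts}} = \log_2\!\left(\frac{A}{\sqrt{2\pi e\,\left(\sigma^2 - \sigma_a^2\right)}} + 1\right)
\]
```

Under this reading, when σ_a is negligible relative to σ the expression reduces to the conventional effective-width index of difficulty.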
Open project: a lightweight framework for remote sharing of mobile applications
Matei Negulescu
UIST '13: Proceedings of the 26th annual ACM symposium on User interface software and technology (2013), pp. 281-290
Tap, swipe, or move: attentional demands for distracted smartphone input
Matei Negulescu
Jaime Ruiz
Edward Lank
Proceedings of the International Working Conference on Advanced Visual Interfaces, ACM, New York, NY, USA (2012), pp. 173-180
Gesture coder: a tool for programming multi-touch gestures by demonstration
Hao Lu
Proceedings of the 2012 ACM annual conference on Human Factors in Computing Systems, ACM, New York, NY, USA, pp. 2875-2884
Gesture Search: Random Access to Smartphone Content
IEEE Computer: Pervasive Computing, vol. 11 (2012), pp. 10-13
Gesture-based interaction: a new dimension for mobile user interfaces
Proceedings of the International Working Conference on Advanced Visual Interfaces, ACM, New York, NY, USA (2012), pp. 6-6
Experimental Analysis of Touch-Screen Gesture Designs in Mobile Environments
Andrew Bragdon
Eugene Nelson
Ken Hinckley
CHI 2011: ACM Conference on Human Factors in Computing Systems, pp. 403-412
User-Defined Motion Gestures for Mobile Interaction
Jaime Ruiz
Edward Lank
CHI 2011: ACM Conference on Human Factors in Computing Systems, pp. 197-206
Protractor: A Fast and Accurate Gesture Recognizer
CHI 2010: ACM Conference on Human Factors in Computing Systems, ACM
Gesture Search: A Tool for Fast Mobile Data Access
UIST'10: Symposium on User Interface Software and Technology, ACM (2010), pp. 87-96
Modern mobile phones can store a large amount of data, such as contacts, applications and music. However, it is difficult to access specific data items via existing mobile user interfaces. In this paper, we present Gesture Search, a tool that allows a user to quickly access various data items on a mobile phone by drawing gestures on its touch screen. Gesture Search contributes a unique way of combining gesture-based interaction and search for fast mobile data access. It also demonstrates a novel approach for coupling gestures with standard GUI interaction. A real world deployment with mobile phone users showed that Gesture Search enabled fast, easy access to mobile data in their day-to-day lives. Gesture Search has been released to the public and is currently in use by hundreds of thousands of mobile users. It was rated positively by users, with a mean of 4.5 out of 5 for over 5000 ratings.
Beyond Pinch and Flick: Enriching Mobile Gesture Interaction
IEEE Computer, vol. 42 (2009), pp. 87-89
FrameWire: A Tool for Automatically Extracting Interaction Logic from Paper Prototyping Tests
Xiang Cao
Katherine Everitt
Morgan Dixon
James Landay
CHI 2010: ACM Conference on Human Factors in Computing Systems, pp. 503-512