David A. Ross
Authored Publications
SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs
Lijun Yu
Zhiruo Wang
Yonatan Bisk
Alex Hauptmann
Lu Jiang
NeurIPS (2023)
Abstract
In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos. SPAE converts between raw pixels and interpretable lexical tokens (or words) extracted from the LLM's vocabulary. The resulting tokens capture both the semantic meaning and the fine-grained details needed for visual reconstruction, effectively translating the visual content into a language comprehensible to the LLM, and empowering it to perform a wide array of multimodal tasks. Our approach is validated through in-context learning experiments with frozen PaLM 2 and GPT 3.5 on a diverse set of image understanding and generation tasks. Our method marks the first successful attempt to enable a frozen LLM to generate image content while surpassing state-of-the-art performance in image understanding tasks, under the same setting, by over 25%.
VideoPoet: A Large Language Model for Zero-Shot Video Generation
Lijun Yu
Xiuye Gu
Rachel Hornung
Hassan Akbari
Ming-Chang Chiu
Josh Dillon
Agrim Gupta
Meera Hahn
Anja Hauth
David Hendon
Alonso Martinez
Grant Schindler
Huisheng Wang
Jimmy Yan
Xuan Yang
Lu Jiang
arXiv preprint (2023)
Abstract
We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
UnLoc: a unified framework for video localization tasks
Xuehan Xiong
Anurag Arnab
Zhonghao Wang
Weina Ge
International Conference on Computer Vision (2023)
Abstract
We adapt large-scale image-text pretrained models such as CLIP for temporal localization in untrimmed videos, a task that remains relatively unexplored. We do so by designing a new approach called UnLoc, which uses pretrained image and text towers and feeds tokens to a video-text fusion model. The outputs of the fusion module are then used to construct a feature pyramid in which each level connects to a head that predicts a per-frame relevancy score and start/end time displacements. Unlike previous works, our architecture enables zero-shot moment retrieval, temporal action localization (TAL), and action segmentation with a single-stage model, without the need for action proposals or representation masking. Unlike specialised models, we achieve state-of-the-art results on three different localization tasks with a unified approach, in some cases outperforming previous works by large margins.
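As an illustration only (not the paper's code), the pyramid head described in the abstract can be sketched as follows; the function name, the pooling strides, and the single shared linear head are assumptions made for the sketch:

```python
import numpy as np

def unloc_style_head(frame_feats, strides=(1, 2, 4), seed=0):
    """Illustrative pyramid head: each level is a temporally downsampled
    copy of the fused features, and a linear head predicts a per-frame
    relevancy score plus start/end time displacements at every level."""
    rng = np.random.default_rng(seed)
    T, D = frame_feats.shape
    W = rng.normal(scale=0.01, size=(D, 3))  # outputs: score, d_start, d_end
    outputs = []
    for stride in strides:
        n = T // stride
        # Strided average pooling builds the coarser pyramid levels.
        pooled = frame_feats[: n * stride].reshape(n, stride, D).mean(axis=1)
        preds = pooled @ W
        score = 1.0 / (1.0 + np.exp(-preds[:, 0]))  # per-frame relevancy in [0, 1]
        disp = preds[:, 1:]                          # start/end displacements
        outputs.append((stride, score, disp))
    return outputs

feats = np.random.default_rng(1).normal(size=(16, 8))  # 16 frames of fused features
for stride, score, disp in unloc_style_head(feats):
    print(stride, score.shape, disp.shape)
```

In the actual model the head weights are learned and each level has its own head; the sketch only shows the data flow from fused per-frame tokens to per-level predictions.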
Abstract
We present AIST++, a new multi-modal dataset of 3D dance motion and music, along with FACT, a Full-Attention Cross-modal Transformer network for generating 3D dance motion conditioned on music. The proposed AIST++ dataset contains 1.1M frames of 3D dance motion in 1,408 sequences, covering 10 dance genres, with multi-view videos with known camera poses: the largest dataset of this kind to our knowledge. We show that naively applying sequence models such as transformers to this dataset for the task of music-conditioned 3D motion generation does not produce satisfactory 3D motion that is well correlated with the input music. We overcome these shortcomings by introducing key changes in architecture design and supervision: the FACT model involves a deep cross-modal transformer block with full attention that is trained to predict N future motions. We empirically show that these changes are key factors in generating long sequences of realistic dance motion that are well-attuned to the input music. We conduct extensive experiments on AIST++ with user studies, where our method outperforms recent state-of-the-art methods both qualitatively and quantitatively.
DOPS: Learning to Detect 3D Objects and Predict their 3D Shapes
Mahyar Najibi
Zhichao Lu
Vivek Mansing Rathod
Larry S. Davis
CVPR (2020)
Abstract
We propose DOPS, a fast single-stage 3D object detection method for LIDAR data. Previous methods often make domain-specific design decisions, for example projecting points into a bird's-eye view image in autonomous driving scenarios. In contrast, we propose a general-purpose method that works on both indoor and outdoor scenes. The core novelty of our method is a fast, single-pass architecture that both detects objects in 3D and estimates their shapes. 3D bounding box parameters are estimated in one pass for every point, aggregated through graph convolutions, and fed into a branch of the network that predicts latent codes representing the shape of each detected object. The latent shape space and shape decoder are learned on a synthetic dataset and then used as supervision for the end-to-end training of the 3D object detection pipeline. Thus our model is able to extract shapes without access to ground-truth shape information in the target dataset. In experiments, we find that our proposed method achieves state-of-the-art results, outperforming prior work on object detection in ScanNet scenes by ∼5% and achieving top results on the Waymo Open Dataset by 3.4%, while reproducing the shapes of detected cars.
Virtual Multi-view Fusion for 3D Semantic Segmentation
Xiaoqi(Michael) Yin
Brian Brewington
European Conference on Computer Vision (2020)
Abstract
Semantic segmentation of 3D meshes is an important problem for 3D scene understanding. In this paper we revisit the classic multiview representation of 3D meshes and study several techniques that make it effective for 3D semantic segmentation of meshes. Given a 3D mesh reconstructed from RGBD sensors, our method effectively chooses different virtual views of the 3D mesh and renders multiple 2D channels for training an effective 2D semantic segmentation model. Features from multiple per-view predictions are finally fused on 3D mesh vertices to predict mesh semantic segmentation labels. Using the large-scale indoor 3D semantic segmentation benchmark of ScanNet, we show that our virtual views enable more effective training of 2D semantic segmentation networks than previous multiview approaches. When the 2D per-pixel predictions are aggregated on 3D surfaces, our virtual multiview fusion method achieves significantly better 3D semantic segmentation results than all prior multiview approaches, and results competitive with recent 3D convolution approaches.
Pillar-based Object Detection for Autonomous Driving
Yue Wang
Justin Solomon
ECCV (2020)
Abstract
We present a simple and flexible object detection framework optimized for autonomous driving. Building on the observation that point clouds in this application are extremely sparse, we propose a practical pillar-based approach to fix the imbalance issue caused by anchors. In particular, our algorithm incorporates a cylindrical projection into multi-view feature learning, predicts bounding box parameters per pillar rather than per point or per anchor, and includes an aligned pillar-to-point projection module to improve the final prediction. Our anchor-free approach avoids hyperparameter search associated with past methods, simplifying 3D object detection while significantly improving upon state-of-the-art.
Abstract
Can we guess human action from dialogue alone? In this work we investigate the link between spoken words and actions in movies. We note that movie scripts describe actions, as well as contain the speech of characters, and hence can be used to learn this correlation with no additional supervision. We train a speech-to-action classifier on 1k movie scripts downloaded from IMSDb and show that such a classifier performs well for certain classes, and when applied to the speech segments of a large unlabelled movie corpus (288k videos, 188M speech segments), provides weak labels for over 800k video clips. By training on these video clips, we demonstrate superior action recognition performance on standard action recognition benchmarks, without using a single labelled action example.
An LSTM Approach to Temporal 3D Object Detection in LiDAR Point Clouds
Rui Huang
Wanyue Zhang
ECCV (2020)
Abstract
Detecting objects in 3D LiDAR data is a core technology for autonomous driving and other robotics applications. Although LiDAR data is acquired over time, most 3D object detection algorithms propose object bounding boxes independently for each frame and neglect the useful information available in the temporal domain. To address this problem, in this paper we propose a sparse LSTM-based multi-frame 3D object detection algorithm. We use a U-Net style 3D sparse convolution network to extract features for each frame's LiDAR point cloud. These features are fed to the LSTM module together with the hidden and memory features from the last frame to predict the 3D objects in the current frame, as well as the hidden and memory features that are passed on to the next frame. Experiments on the Waymo Open Dataset show that our algorithm outperforms the traditional frame-by-frame approach by 7.5% mAP@0.7 and other multi-frame approaches by 1.2%, while using less memory and computation per frame. To the best of our knowledge, this is the first work to use an LSTM for 3D object detection in sparse point clouds.
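The per-frame recurrence described above (features in, hidden and memory state carried to the next frame) can be sketched with a standard LSTM cell. This is a schematic illustration, not the paper's network: the dense NumPy cell stands in for the sparse LSTM, and the random vectors stand in for the U-Net sparse-convolution features.

```python
import numpy as np

def lstm_step(x, h, c, params):
    """One LSTM step: per-frame features x update hidden h and memory c,
    which are then carried over to the next frame."""
    Wx, Wh, b = params
    i, f, g, o = np.split(x @ Wx + h @ Wh + b, 4)
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    c = sig(f) * c + sig(i) * np.tanh(g)   # memory features
    h = sig(o) * np.tanh(c)                # hidden features
    return h, c

rng = np.random.default_rng(0)
D, H, T = 8, 4, 5  # per-frame feature dim, hidden dim, number of frames
params = (rng.normal(scale=0.1, size=(D, 4 * H)),
          rng.normal(scale=0.1, size=(H, 4 * H)),
          np.zeros(4 * H))
h, c = np.zeros(H), np.zeros(H)
for t in range(T):
    frame_feat = rng.normal(size=D)  # stand-in for the sparse-conv U-Net features
    h, c = lstm_step(frame_feat, h, c, params)
    # A detection head would decode this frame's 3D boxes from h here.
```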
AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
Carl Martin Vondrick
Jitendra Malik
CVPR (2018)
Abstract
This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) using movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips. We will release the dataset publicly.
AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon the current state-of-the-art methods, and demonstrates better performance on JHMDB and UCF101-24 categories. While setting a new state of the art on existing datasets, the overall results on AVA are low at 15.6% mAP, underscoring the need for developing new approaches for video understanding.
Rethinking the Faster R-CNN Architecture for Temporal Action Localization
Jia Deng
Yu-Wei Chao
CVPR (2018)
Abstract
We propose TAL-Net, an improved approach to temporal action localization in video that is inspired by the Faster R-CNN object detection framework. TAL-Net addresses three key shortcomings of existing approaches: (1) we improve receptive field alignment using a multi-scale architecture that can accommodate extreme variation in action durations; (2) we better exploit the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields; and (3) we explicitly consider multi-stream feature fusion and demonstrate that fusing motion late is important. We achieve state-of-the-art performance for both action proposal and localization on the THUMOS'14 detection benchmark and competitive performance on the ActivityNet challenge.
The Intervalgram: An Audio Feature for Large-Scale Cover-Song Recognition
Thomas C. Walters
From Sounds to Music and Emotions: 9th International Symposium, CMMR 2012, London, UK, June 19-22, 2012, Revised Selected Papers, Springer Berlin Heidelberg (2013), pp. 197-213
Abstract
We present a system for representing the musical content of short pieces of audio using a novel chroma-based representation known as the ‘intervalgram’, which is a summary of the local pattern of musical intervals in a segment of music. The intervalgram is based on a chroma representation derived from the temporal profile of the stabilized auditory image [10] and is made locally pitch invariant by means of a ‘soft’ pitch transposition to a local reference. Intervalgrams are generated for a piece of music using multiple overlapping windows. These sets of intervalgrams are used as the basis of a system for detection of identical melodic and harmonic progressions in a database of music. Using a dynamic-programming approach for comparisons between a reference and the song database, performance is evaluated on the ‘covers80’ dataset [4]. A first test of an intervalgram-based system on this dataset yields a precision at top-1 of 53.8%, with an ROC curve that shows very high precision up to moderate recall, suggesting that the intervalgram is adept at identifying the easier-to-match cover songs in the dataset with high robustness. The intervalgram is designed to support locality-sensitive hashing, such that an index lookup from each single intervalgram feature has a moderate probability of retrieving a match, with few false matches. With this indexing approach, a large reference database can be quickly pruned before more detailed matching, as in previous content-identification systems.
On Using Nearly-Independent Feature Families for High Precision and Confidence
Omid Madani
Manfred Georg
Fourth Asian Machine Learning Conference, JMLR workshop and conference proceedings (2012), pp. 269-284
Abstract
Often we require classification at a very high precision level, such as 99%. We report that when very different sources of evidence such as text, audio, and video features are available, combining the outputs of base classifiers trained on each feature type separately, a.k.a. late fusion, can substantially increase the recall of the combination at high precisions, compared to the performance of a single classifier trained on all the feature types (i.e., early fusion), or compared to the individual base classifiers. We show how the probability of a joint false-positive mistake can be upper bounded by the product of individual probabilities of conditional false-positive mistakes, by identifying a simple key criterion that needs to hold. This provides an explanation for the high precision phenomenon, and motivates referring to such feature families as (nearly) independent. We assess the relevant factors for achieving high precision empirically, and explore combination techniques informed by the analysis. We compare a number of early and late fusion methods, and observe that classifier combination via late fusion can more than double the recall at high precision.
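The product bound on joint false positives can be illustrated with a toy simulation (not the paper's experiments; the false-positive rates below are made-up numbers). When the two base classifiers' mistakes are independent, an AND-rule combination's false-positive rate concentrates near the product of the individual rates:

```python
import random

random.seed(0)
p_fp_audio, p_fp_video = 0.05, 0.04  # hypothetical per-classifier FP rates
trials = 200_000

joint_fp = 0
for _ in range(trials):
    # Each base classifier independently fires a false positive on a negative.
    fp_audio = random.random() < p_fp_audio
    fp_video = random.random() < p_fp_video
    # AND-rule late fusion fires only when both base classifiers fire.
    joint_fp += fp_audio and fp_video

empirical = joint_fp / trials
bound = p_fp_audio * p_fp_video  # product bound under (near) independence
print(f"empirical joint FP rate {empirical:.4f}, product bound {bound:.4f}")
```

Correlated mistakes (e.g. two classifiers fooled by the same content) would break the independence assumption, which is why the paper's key criterion matters.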
Automatic Language Identification in Music Videos with Low Level Audio and Visual Features
Vijay Chandrasekhar
Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2011)
Survey and Evaluation of Audio Fingerprinting Schemes for Mobile Query-By-Example Applications
Vijay Chandrasekhar
Matt Sharifi
12th International Society for Music Information Retrieval Conference (ISMIR) (2011)
Abstract
We survey and evaluate popular audio fingerprinting schemes in a common framework with short query probes captured from cell phones. We report and discuss results important for mobile applications: Receiver Operating Characteristic (ROC) performance, size of fingerprints generated compared to size of audio probe, and transmission delay if the fingerprint data were to be transmitted over a wireless link. We hope that the evaluation in this work will guide work towards reducing latency in practical mobile audio retrieval applications.
The Power of Comparative Reasoning
Dennis Strelow
Ruei-Sung Lin
International Conference on Computer Vision, IEEE (2011)
Abstract
Rank correlation measures are known for their resilience to perturbations in numeric values and are widely used in many evaluation metrics. Such ordinal measures have rarely been applied to the treatment of numeric features as a representational transformation. We emphasize the benefits of ordinal representations of input features both theoretically and empirically. We present a family of algorithms for computing ordinal embeddings based on partial order statistics. Apart from having the stability benefits of ordinal measures, these embeddings are highly nonlinear, giving rise to sparse feature spaces highly favored by several machine learning methods. These embeddings are deterministic, data independent and, by virtue of being based on partial order statistics, add another degree of resilience to noise. These machine-learning-free methods, when applied to the task of fast similarity search, outperform state-of-the-art machine learning methods with complex optimization setups. For solving classification problems, the embeddings provide a nonlinear transformation resulting in sparse binary codes that are well-suited for a large class of machine learning algorithms. These methods show significant improvement on VOC 2010 using simple linear classifiers which can be trained quickly. Our method can be extended to the case of polynomial kernels, while permitting very efficient computation. Further, since the popular MinHash algorithm is a special case of our method, we demonstrate an efficient scheme for computing MinHash on conjunctions of binary features. The actual method can be implemented in about 10 lines of code in most languages (2 lines in MATLAB), and does not require any data-driven optimization.
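The abstract notes the method fits in about 10 lines of code. A winner-take-all style sketch of an ordinal embedding based on partial order statistics (an illustration of the idea, not the paper's exact algorithm; the permutation count and window size are arbitrary choices) looks like this:

```python
import numpy as np

def wta_hash(x, perms, k):
    """Ordinal code: for each stored permutation, look at the first k
    permuted entries of x and record the index (0..k-1) of the largest.
    The code depends only on orderings, so any monotonic rescaling of
    the feature values leaves it unchanged."""
    return np.array([int(np.argmax(x[p[:k]])) for p in perms])

rng = np.random.default_rng(0)
dim, n_perms, k = 6, 4, 3
perms = [rng.permutation(dim) for _ in range(n_perms)]  # fixed, data independent

x = np.array([0.3, 1.2, -0.5, 0.9, 2.0, 0.1])
y = 10.0 * x + 5.0  # monotonic perturbation of every feature value
code_x = wta_hash(x, perms, k)
code_y = wta_hash(y, perms, k)
print(code_x.tolist() == code_y.tolist())  # True: the ordinal code is invariant
```

The resulting codes can be one-hot encoded into the sparse binary representation the abstract describes, and each code component can serve as a locality-sensitive hash key for fast similarity search.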
SPEC Hashing: Similarity Preserving algorithm for Entropy-based Coding
Ruei-Sung Lin
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2010)