Jonathan T. Barron

Authored Publications
    RegNeRF: Regularizing Neural Radiance Fields for View Synthesis from Sparse Inputs
    Michael Niemeyer
    Ben Mildenhall
    Andreas Geiger
    Noha Radwan
    Computer Vision and Pattern Recognition (CVPR) (2022)
    Preview abstract Neural Radiance Fields (NeRF) have emerged as a powerful representation for the task of novel view synthesis due to their simplicity and state-of-the-art performance. Though NeRF can produce photorealistic renderings of unseen viewpoints when many input views are available, its performance drops significantly when this number is reduced. We observe that the majority of artifacts in sparse input scenarios are caused by errors in the estimated scene geometry, and by divergent behavior at the start of training. We address this by regularizing the geometry and appearance of patches rendered from unobserved viewpoints, and annealing the ray sampling space during training. We additionally use a normalizing flow model to regularize the color of unobserved viewpoints. Our model outperforms not only other methods that optimize over a single scene, but in many cases also conditional models that are extensively pre-trained on large multi-view datasets. View details
    HumanNeRF: Free-viewpoint Rendering of Moving People from Monocular Video
    Chung-Yi Weng
    Pratul Srinivasan
    CVPR (Computer Vision and Pattern Recognition), IEEE and the Computer Vision Foundation (2022) (to appear)
    Preview abstract We introduce a free-viewpoint rendering method -- HumanNeRF -- that works on a given monocular video of a human performing complex body motions, e.g. a video from YouTube. Our method enables pausing the video at any frame and rendering the subject from arbitrary new camera viewpoints or even a full 360-degree camera path for that particular frame and body pose. This task is particularly challenging, as it requires synthesizing photorealistic details of the body, as seen from various camera angles that may not exist in the input video, as well as synthesizing fine details such as cloth folds and facial appearance. Our method optimizes for a volumetric representation of the person in a canonical T-pose, in concert with a motion field that maps the estimated canonical representation to every frame of the video via backward warps. The motion field is decomposed into skeletal rigid and non-rigid motions, produced by deep networks. We show significant performance improvements over prior work, and compelling examples of free-viewpoint renderings from monocular video of moving humans in challenging uncontrolled capture scenarios. View details
    Defocus Map Estimation and Blur Removal from a Single Dual-Pixel Image
    Ioannis Gkioulekas
    Jiawen Chen
    Neal Wadhwa
    Pratul Srinivasan
    Shumian Xin
    Tianfan Xue
    International Conference on Computer Vision (2021)
    Preview abstract We present a method to simultaneously estimate an image's defocus map, i.e., the amount of defocus blur at each pixel, and remove the blur to recover a sharp all-in-focus image using only a single camera capture. Our method leverages data from dual-pixel sensors that are common on many consumer cameras. Though originally designed to assist camera autofocus, dual-pixel sensors have been used to separately recover both defocus maps and all-in-focus images. Past approaches have solved these two problems in isolation and often require large labeled datasets for supervised training. In contrast with those prior works, we show that the two problems are connected, model the optics of dual-pixel images, and set up an optimization problem to jointly solve for both. We use data captured with a consumer smartphone camera to demonstrate that after a one time calibration step, our approach improves upon past approaches for both defocus map estimation and blur removal, without any supervised training. View details
    IBRNet: Learning Multi-View Image-Based Rendering
    Kyle Genova
    Pratul Srinivasan
    Qianqian Wang
    Ricardo Martin-Brualla
    Zhicheng Wang
    Conference on Computer Vision and Pattern Recognition (CVPR), IEEE (2021) (to appear)
    Preview abstract We present a method that synthesizes novel views of complex scenes by interpolating a sparse set of nearby views. The core of our method is a multilayer perceptron (MLP) that generates RGBA at each 5D coordinate from multi-view image features. Unlike neural scene representation work that optimizes per-scene functions for rendering, we learn a generic view interpolation function that naturally generalizes to novel scene types and camera setups. Compared to previous generic image-based rendering (IBR) methods like multiplane images (MPIs) that use discrete volume representations, our method instead produces RGBAs at continuous 5D locations (3D spatial locations and 2D viewing directions), enabling high-resolution imagery rendering. Our rendering pipeline is fully differentiable, and the only inputs required to train our method are multi-view posed images. Experiments show that our method outperforms previous IBR methods, and achieves state-of-the-art performance when fine-tuned on each test scene. View details
    NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections
    Ricardo Martin-Brualla*
    Noha Radwan*
    Alexey Dosovitskiy
    Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
    Preview abstract We present a learning-based method for synthesizing novel views of complex scenes using only unstructured collections of in-the-wild photographs. We build on Neural Radiance Fields (NeRF), which uses the weights of a multilayer perceptron to model the density and color of a scene as a function of 3D coordinates. While NeRF works well on images of static subjects captured under controlled settings, it is incapable of modeling many ubiquitous, real-world phenomena in uncontrolled images, such as variable illumination or transient occluders. We introduce a series of extensions to NeRF to address these issues, thereby enabling accurate reconstructions from unstructured image collections taken from the internet. We apply our system, dubbed NeRF-W, to internet photo collections of famous landmarks, and demonstrate temporally consistent novel view renderings that are significantly closer to photorealism than the prior state of the art. View details
    iNeRF: Inverting Neural Radiance Fields for Pose Estimation
    Yen-Chen Lin
    Pete Florence
    Phillip Isola
    Alberto Rodriguez
    Tsung-Yi Lin
    IROS 2021 (to appear)
    Preview abstract We present iNeRF, a framework that performs mesh-free pose estimation by “inverting” a Neural Radiance Field (NeRF). NeRFs have been shown to be remarkably effective for the task of view synthesis — synthesizing photorealistic novel views of real-world scenes or objects. In this work, we investigate whether we can apply analysis-by-synthesis via NeRF for mesh-free, RGB-only 6DoF pose estimation – given an image, find the translation and rotation of a camera relative to a 3D object or scene. Our method assumes that no object mesh models are available during either training or test time. Starting from an initial pose estimate, we use gradient descent to minimize the residual between pixels rendered from a NeRF and pixels in an observed image. In our experiments, we first study 1) how to sample rays during pose refinement for iNeRF to collect informative gradients and 2) how different batch sizes of rays affect iNeRF on a synthetic dataset. We then show that for complex real-world scenes from the LLFF dataset, iNeRF can improve NeRF by estimating the camera poses of novel images and using these images as additional training data for NeRF. Finally, we show iNeRF can perform category-level object pose estimation, including object instances not seen during training, with RGB images by inverting a NeRF model inferred from a single view. View details
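The pose-refinement loop described in the iNeRF abstract (start from an initial pose, then use gradient descent to shrink the residual between rendered and observed pixels) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `render` is a hypothetical 1-D stand-in for a NeRF, the "pose" is a single translation rather than a 6DoF pose, and the gradient comes from central differences rather than backpropagation through a network.

```python
import math


def render(pose, num_pixels=32, sigma=3.0):
    """Toy renderer (hypothetical): a 1-D image of a Gaussian blob centred at `pose`."""
    return [math.exp(-0.5 * ((x - pose) / sigma) ** 2) for x in range(num_pixels)]


def photometric_loss(pose, observed):
    """Sum of squared residuals between rendered and observed pixels."""
    return sum((r - o) ** 2 for r, o in zip(render(pose), observed))


def refine_pose(observed, initial_pose, lr=0.5, steps=100, eps=1e-4):
    """Gradient descent on the pose using a central-difference gradient estimate."""
    pose = initial_pose
    for _ in range(steps):
        grad = (photometric_loss(pose + eps, observed)
                - photometric_loss(pose - eps, observed)) / (2.0 * eps)
        pose -= lr * grad
    return pose


# The observed image is rendered at the true pose; refinement starts 4 pixels away.
observed = render(18.0)
estimate = refine_pose(observed, initial_pose=14.0)
```

In this toy setting the loss has a single basin around the true pose, so a fixed step size suffices; the abstract's questions about ray sampling and batch size concern the much harder real case.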
    How to train neural networks for flare removal
    Yicheng Wu
    Tianfan Xue
    Jiawen Chen
    Ashok Veeraraghavan
    ICCV (2021)
    Preview abstract When a camera is pointed at a strong light source, the resulting photograph may contain lens flare artifacts. Flares appear in a wide variety of patterns (halos, streaks, color bleeding, haze, etc.) and this diversity in appearance makes flare removal challenging. Existing analytical solutions make strong assumptions about the artifact’s geometry or brightness, and therefore only work well on a small subset of flares. Machine learning techniques have shown success in removing other types of artifacts, like reflections, but have not been widely applied to flare removal due to the lack of training data. To solve this problem, we explicitly model the optical causes of flare either empirically or using wave optics, and generate semi-synthetic pairs of flare-corrupted and clean images. This enables us to train neural networks to remove lens flare for the first time. Experiments show our data synthesis approach is critical for accurate flare removal, and that models trained with our technique generalize well to real lens flares across different scenes, lighting conditions, and cameras. View details
    Neural Light Transport for Relighting and View Synthesis
    Xiuming Zhang
    Yun-Ta Tsai
    Tiancheng Sun
    Tianfan Xue
    Philip Davidson
    Christoph Rhemann
    Paul Debevec
    Ravi Ramamoorthi
    ACM Transactions on Graphics, vol. 40 (2021)
    Preview abstract The light transport (LT) of a scene describes how it appears under different lighting and viewing directions, and complete knowledge of a scene's LT enables the synthesis of novel views under arbitrary lighting. In this paper, we focus on image-based LT acquisition, primarily for human bodies within a light stage setup. We propose a semi-parametric approach to learn a neural representation of LT that is embedded in the space of a texture atlas of known geometric properties, and model all non-diffuse and global LT as residuals added to a physically-accurate diffuse base rendering. In particular, we show how to fuse previously seen observations of illuminants and views to synthesize a new image of the same scene under a desired lighting condition from a chosen viewpoint. This strategy allows the network to learn complex material effects (such as subsurface scattering) and global illumination, while guaranteeing the physical correctness of the diffuse LT (such as hard shadows). With this learned LT, one can relight the scene photorealistically with a directional light or an HDRI map, synthesize novel views with view-dependent effects, or do both simultaneously, all in a unified framework using a set of sparse, previously seen observations. Qualitative and quantitative experiments demonstrate that our neural LT (NLT) outperforms state-of-the-art solutions for relighting and view synthesis, without separate treatment for both problems that prior work requires. View details
    Learning to Autofocus
    Charles Herrmann
    Richard Strong Bowen
    Neal Wadhwa
    Ramin Zabih
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
    Preview abstract Autofocus is an important task for digital cameras, yet current approaches often exhibit poor performance. We propose a learning-based approach to this problem, and provide a realistic dataset of sufficient size for effective learning. Our dataset is labeled with per-pixel depths obtained from multi-view stereo, following [9]. Using this dataset, we apply modern deep classification models and an ordinal regression loss to obtain an efficient learning-based autofocus technique. We demonstrate that our approach provides a significant improvement compared with previous learned and non-learned methods: our model reduces the mean absolute error by a factor of 3.6 over the best comparable baseline algorithm. Our dataset and code are publicly available. View details
    Light Stage Super-Resolution: Continuous High-Frequency Relighting
    Tiancheng Sun
    Zexiang Xu
    Xiuming Zhang
    Christoph Rhemann
    Paul Debevec
    Yun-Ta Tsai
    Ravi Ramamoorthi
    SIGGRAPH Asia and TOG (2020)
    Preview abstract The light stage has been widely used in computer graphics for the past two decades, primarily to enable the relighting of human faces. By capturing the appearance of the human subject under different light sources, one obtains the light transport matrix of that subject, which enables image-based relighting in novel environments. However, due to the finite number of lights in the stage, the light transport matrix only represents a sparse sampling on the entire sphere. As a consequence, relighting the subject with a point light or a directional source that does not coincide exactly with one of the lights in the stage requires interpolation and resampling the images corresponding to nearby lights, and this leads to ghosting shadows, aliased specularities, and other artifacts. To ameliorate these artifacts and produce better results under arbitrary high-frequency lighting, this paper proposes a learning-based solution for the "super-resolution" of scans of human faces taken from a light stage. Given an arbitrary "query" light direction, our method aggregates the captured images corresponding to neighboring lights in the stage, and uses a neural network to synthesize a rendering of the face that appears to be illuminated by a "virtual" light source at the query location. This neural network must circumvent the inherent aliasing and regularity of the light stage data that was used for training, which we accomplish through the use of regularized traditional interpolation methods within our network. Our learned model is able to produce renderings for arbitrary light directions that exhibit realistic shadows and specular highlights, and is able to generalize across a wide variety of subjects. Our super-resolution approach enables more accurate renderings of human subjects under detailed environment maps, or the construction of simpler light stages that contain fewer light sources while still yielding comparable quality renderings as light stages with more densely sampled lights. View details
    Preview abstract The sky is a major component of the appearance of a photograph, and its color and tone can strongly influence the mood of a picture. In nighttime photography, the sky can also suffer from noise and color artifacts. For this reason, there is a strong desire to process the sky in isolation from the rest of the scene to achieve an optimal look. In this work, we propose an automated method, which can run as a part of a camera pipeline, for creating accurate sky alpha-masks and using them to improve the appearance of the sky. Our method performs end-to-end sky optimization in less than half a second per image on a mobile device. We introduce a method for creating an accurate sky-mask dataset that is based on partially annotated images that are inpainted and refined by our modified weighted guided filter. We use this dataset to train a neural network for semantic sky segmentation. Due to the compute and power constraints of mobile devices, sky segmentation is performed at a low image resolution. Our modified weighted guided filter is used for edge-aware upsampling to resize the alpha-mask to a higher resolution. With this detailed mask we automatically apply post-processing steps to the sky in isolation, such as automatic spatially varying white-balance, brightness adjustments, contrast enhancement, and noise reduction. View details
    What Matters in Unsupervised Optical Flow
    Rico Jonschkowski
    Austin Stone
    Ariel Gordon
    Kurt Konolige
    ECCV (2020)
    Preview abstract We systematically compare and analyze a set of key components in unsupervised optical flow to identify which photometric loss, occlusion handling, and smoothness regularization is most effective. Alongside this investigation we construct a number of novel improvements to unsupervised flow models, such as cost volume normalization, stopping the gradient at the occlusion mask, encouraging smoothness before upsampling the flow field, and continual self-supervision with image resizing. By combining the results of our investigation with our improved model components, we are able to present a new unsupervised flow technique that significantly outperforms the previous unsupervised state-of-the-art and performs on par with supervised FlowNet2 on the KITTI 2015 dataset, while also being significantly simpler than related approaches. View details
    Preview abstract We present a deep learning solution for estimating the incident illumination at any 3D location within a scene from an input narrow-baseline stereo image pair. Previous approaches for predicting global illumination from images either predict just a single illumination for the entire scene, or separately estimate the illumination at each 3D location without enforcing that the predictions are consistent with the same 3D scene. Instead, we propose a deep learning model that estimates a 3D volumetric RGBA model of a scene, including content outside the observed field of view, and then uses standard volume rendering to estimate the incident illumination at any 3D location within that volume. Our model is trained without any ground truth 3D data and only requires a held-out perspective view near the input stereo pair and a spherical panorama taken within each scene as supervision, as opposed to prior methods for spatially-varying lighting estimation, which require ground truth scene geometry for training. We demonstrate that our method can predict consistent spatially-varying lighting that is convincing enough to plausibly relight and insert highly specular virtual objects into real images. View details
    Pushing the Boundaries of View Extrapolation with Multiplane Images
    Pratul Srinivasan
    Ravi Ramamoorthi
    Ren Ng
    Computer Vision and Pattern Recognition (CVPR) (2019)
    Preview abstract We explore the problem of view synthesis from a narrow baseline pair of images, and focus on generating high-quality view extrapolations with plausible disocclusions. Our method builds upon prior work in predicting a multiplane image (MPI), which represents scene content as a set of RGBA planes within a reference view frustum and renders novel views by projecting this content into the target viewpoints. We present a theoretical analysis showing how the range of views that can be rendered from an MPI increases linearly with the MPI disparity sampling frequency, as well as a novel MPI prediction procedure that theoretically enables view extrapolations of up to 4 times the lateral viewpoint movement allowed by prior work. Our method ameliorates two specific issues that limit the range of views renderable by prior methods: 1) We expand the range of novel views that can be rendered without depth discretization artifacts by using a 3D convolutional network architecture along with a randomized-resolution training procedure to allow our model to predict MPIs with increased disparity sampling frequency. 2) We reduce the repeated texture artifacts seen in disocclusions by enforcing a constraint that the appearance of hidden content at any depth must be drawn from visible content at or behind that depth. View details
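As a rough illustration of how a set of RGBA planes turns into a rendered pixel, the sketch below applies the standard back-to-front "over" compositing operator to the samples that a single pixel's ray collects from each plane. It is a minimal sketch under stated assumptions: the function name `composite_mpi` is hypothetical, colors are not premultiplied by alpha, and the step that projects each plane into the target viewpoint is omitted entirely.

```python
def composite_mpi(planes):
    """Composite one pixel's RGBA samples with the 'over' operator.

    `planes` is a list of (r, g, b, a) tuples ordered back to front
    (farthest plane first), with non-premultiplied color in [0, 1].
    """
    r = g = b = 0.0
    for pr, pg, pb, pa in planes:
        # Each nearer plane covers what is behind it in proportion to its alpha.
        r = pr * pa + r * (1.0 - pa)
        g = pg * pa + g * (1.0 - pa)
        b = pb * pa + b * (1.0 - pa)
    return (r, g, b)


# An opaque red plane seen through a half-transparent blue plane in front of it.
pixel = composite_mpi([(1.0, 0.0, 0.0, 1.0), (0.0, 0.0, 1.0, 0.5)])
```

Because nearer planes hide farther ones only in proportion to their alpha, content on the far planes becomes visible again when the viewpoint moves, which is where the disocclusions discussed in the abstract arise.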
    Preview abstract Deep learning techniques have enabled rapid progress in monocular depth estimation, but their quality is limited by the ill-posed nature of the problem and the scarcity of high quality datasets. We estimate depth from a single camera by leveraging the dual-pixel auto-focus hardware that is increasingly common on modern camera sensors. Classic stereo algorithms and prior learning-based depth estimation techniques underperform when applied on this dual-pixel data, the former due to too-strong assumptions about RGB image matching, and the latter due to a lack of understanding of the optics of dual-pixel image formation. To allow learning-based methods to work well on dual-pixel imagery, we identify an inherent ambiguity in the depth estimated from dual-pixel cues, and develop an approach to estimate depth up to this ambiguity. Using our approach, existing monocular depth estimation techniques can be effectively applied to dual-pixel data, and much smaller models can be constructed that still infer high quality depth. To demonstrate this, we capture a large dataset of in-the-wild 5-viewpoint RGB images paired with corresponding dual-pixel data, and show how view supervision with this data can be used to learn depth up to the unknown ambiguities. On our new task, our model is 30% more accurate than any prior work on learning-based monocular or stereoscopic depth estimation. View details
    Handheld Mobile Photography in Very Low Light
    Kiran Murthy
    Yun-Ta Tsai
    Tim Brooks
    Tianfan Xue
    Nikhil Karnad
    Dillon Sharlet
    Ryan Geiss
    Marc Levoy
    ACM Transactions on Graphics, vol. 38 (2019), pp. 16
    Preview abstract Taking photographs in low light using a mobile phone is challenging and rarely produces pleasing results. Aside from the physical limits imposed by read noise and photon shot noise, these cameras are typically handheld, have small apertures and sensors, use mass-produced analog electronics that cannot easily be cooled, and are commonly used to photograph subjects that move, like children and pets. In this paper we describe a system for capturing clean, sharp, colorful photographs in light as low as 0.3 lux, where human vision becomes monochromatic and indistinct. To permit handheld photography without flash illumination, we capture, align, and combine multiple frames. Our system employs “motion metering”, which uses an estimate of motion magnitudes (whether due to handshake or moving objects) to identify the number of frames and the per-frame exposure times that together minimize both noise and motion blur in a captured burst. We combine these frames using robust alignment and merging techniques that are specialized for high-noise imagery. To ensure accurate colors in such low light, we employ a learning-based auto white balancing algorithm. To prevent the photographs from looking like they were shot in daylight, we use tone mapping techniques inspired by illusionistic painting: increasing contrast, crushing shadows to black, and surrounding the scene with darkness. All of these processes are performed using the limited computational resources of a mobile device. Our system can be used by novice photographers to produce shareable pictures in a few seconds based on a single shutter press, even in environments so dim that humans cannot see clearly. View details
    Single Image Portrait Relighting
    Christoph Rhemann
    Graham Fyffe
    Paul Debevec
    Ravi Ramamoorthi
    Tiancheng Sun
    Xueming Yu
    Yun-Ta Tsai
    Zexiang Xu
    SIGGRAPH (2019)
    Preview abstract Lighting plays a central role in conveying the essence and depth of the subject in a 2D portrait photograph. Professional photographers will carefully control the lighting in their studio to manipulate the appearance of their subject, while consumer photographers are usually constrained to the illumination of their environment. Though prior works have explored techniques for relighting an image, their utility is usually limited due to requirements of specialized hardware, multiple images of the subject under controlled or known illuminations, or accurate models of geometry and reflectance. Our technique takes as input a single RGB image of a portrait taken with a standard cellphone camera in an unconstrained environment, and from that image produces a relit image of that subject as though it were illuminated according to any provided environment map. Our proposed technique produces quantitatively superior results on our dataset's validation set compared to prior work, and produces convincing qualitative relighting results on a dataset of hundreds of real-world cellphone portraits. Because our technique can produce a 640 x 640 image in only 160 milliseconds, it may enable interactive user-facing photographic applications in the future. View details
    Depth from motion for smartphone AR
    Julien Valentin
    Neal Wadhwa
    Max Dzitsiuk
    Michael John Schoenberg
    Vivek Verma
    Ambrus Csaszar
    Ivan Dryanovski
    Joao Afonso
    Jose Pascoal
    Konstantine Nicholas John Tsotsos
    Mira Angela Leung
    Mirko Schmidt
    Sameh Khamis
    Vladimir Tankovich
    Shahram Izadi
    Christoph Rhemann
    ACM Transactions on Graphics (2018)
    Preview abstract Augmented reality (AR) for smartphones has matured from a technology for early adopters, available only on select high-end phones, to one that is truly available to the general public. One of the key breakthroughs has been in low-compute methods for six-degree-of-freedom (6DoF) tracking on phones using only the existing hardware (camera and inertial sensors). 6DoF tracking is the cornerstone of smartphone AR, allowing virtual content to be precisely locked on top of the real world. However, to really give users the impression of believable AR, one requires mobile depth. Without depth, even simple effects such as a virtual object being correctly occluded by the real world are impossible. However, requiring a mobile depth sensor would severely restrict the access to such features. In this article, we provide a novel pipeline for mobile depth that supports a wide array of mobile phones, and uses only the existing monocular color sensor. Through several technical contributions, we provide the ability to compute low latency dense depth maps using only a single CPU core of a wide range of (medium-high) mobile phones. We demonstrate the capabilities of our approach on high-level AR applications including real-time navigation and shopping. View details
    Synthetic Depth-of-Field with a Single-Camera Mobile Phone
    Neal Wadhwa
    David E. Jacobs
    Bryan E. Feldman
    Nori Kanazawa
    Robert Carroll
    Marc Levoy
    SIGGRAPH (2018) (to appear)
    Preview abstract Shallow depth-of-field is commonly used by photographers to isolate a subject from a distracting background. However, standard cell phone cameras cannot produce such images optically, as their short focal lengths and small apertures capture nearly all-in-focus images. We present a system to computationally synthesize shallow depth-of-field images with a single mobile camera and a single button press. If the image is of a person, we use a person segmentation network to separate the person and their accessories from the background. If available, we also use dense dual-pixel auto-focus hardware, effectively a 2-sample light field with an approximately 1 millimeter baseline, to compute a dense depth map. These two signals are combined and used to render a defocused image. Our system can process a 5.4 megapixel image in 4 seconds on a mobile phone, is fully automatic, and is robust enough to be used by non-experts. The modular nature of our system allows it to degrade naturally in the absence of a dual-pixel sensor or a human subject. View details
    Aperture Supervision for Monocular Depth Estimation
    Pratul Srinivasan
    Neal Wadhwa
    Ren Ng
    CVPR (2018) (to appear)
    Preview abstract We present a novel method to train machine learning algorithms to estimate scene depths from a single image, by using the information provided by a camera's aperture as supervision. Prior works use a depth sensor's outputs or images of the same scene from alternate viewpoints as supervision, while our method instead uses images from the same viewpoint taken with a varying camera aperture. To enable learning algorithms to use aperture effects as supervision, we introduce two differentiable aperture rendering functions that use the input image and predicted depths to simulate the depth-of-field effects caused by real camera apertures. We train a monocular depth estimation network end-to-end to predict the scene depths that best explain these finite aperture images as defocus-blurred renderings of the input all-in-focus image. View details
    Burst Denoising with Kernel Prediction Networks
    Ben Mildenhall
    Jiawen Chen
    Dillon Sharlet
    Ren Ng
    Rob Carroll
    CVPR (2018) (to appear)
    Preview abstract We present a technique for jointly denoising bursts of images taken from a handheld camera. In particular, we propose a convolutional neural network architecture for predicting spatially varying kernels that can both align and denoise frames, a synthetic data generation approach based on a realistic noise formation model, and an optimization guided by an annealed loss function to avoid undesirable local minima. Our model matches or outperforms the state-of-the-art across a wide range of noise levels on both real and synthetic data. View details
    Preview abstract Performance is a critical challenge in mobile image processing. Given a reference imaging pipeline, or even human-adjusted pairs of images, we seek to reproduce the enhancements and enable real-time evaluation. For this, we introduce a new neural network architecture inspired by bilateral grid processing and local affine color transforms. Using pairs of input/output images, we train a convolutional neural network to predict the coefficients of a locally-affine model in bilateral space. Our architecture learns to make local, global, and content-dependent decisions to approximate the desired image transformation. At runtime, the neural network consumes a low-resolution version of the input image, produces a set of affine transformations in bilateral space, upsamples those transformations in an edge-preserving fashion using a new slicing node, and then applies those upsampled transformations to the full-resolution image. Our algorithm processes high-resolution images on a smartphone in milliseconds, provides a real-time viewfinder at 1080p resolution, and matches the quality of state-of-the-art approximation techniques on a large class of image operators. Unlike previous work, our model is trained off-line from data and therefore does not require access to the original operator at runtime. This allows our model to learn complex, scene-dependent transformations for which no reference implementation is available, such as the photographic edits of a human retoucher. View details
    Preview abstract We present Fast Fourier Color Constancy (FFCC), a novel color constancy algorithm which works by reformulating the problem of illuminant estimation into a spatial localization task on a torus. On standard benchmarks, our model produces lower error rates than the previous state-of-the-art by 13 - 20%, while also being 250 - 3000 times faster. This speed and accuracy are primarily due to FFCC operating in the frequency domain, though this approach also introduces a set of new difficulties regarding aliasing, directional statistics and preconditioning, which we address. Unlike past work, our model produces a complete posterior distribution over illuminants instead of a single illuminant estimate, which allows for a richer analysis and enables a novel temporal smoothing technique. FFCC is capable of running at ~ 700 frames per second on a mobile phone, making it a viable solution to the problem of constructing an effective, real-time, temporally-coherent automatic white balance algorithm. View details
    Preview abstract We present the bilateral solver, a novel algorithm for edge-aware smoothing that combines the flexibility and speed of simple filtering approaches with the accuracy of specialized domain-specific optimization algorithms. Our single technique is capable of matching or improving upon state-of-the-art results on several different computer vision tasks (stereo, depth superresolution, colorization, and semantic segmentation) while being 10-1000 times faster than competing approaches. The bilateral solver is fast, robust, straightforward to generalize to new domains, and capable of being easily integrated into deep learning pipelines. View details
    Burst photography for high dynamic range and low-light imaging on mobile cameras
    Dillon Sharlet
    Ryan Geiss
    Andrew Adams
    Florian Kainz
    Jiawen Chen
    Marc Levoy
    SIGGRAPH Asia (2016)
    Preview abstract Cell phone cameras have small apertures, which limits the number of photons they can gather, leading to noisy images in low light. They also have small sensor pixels, which limits the number of electrons each pixel can store, leading to limited dynamic range. We describe a computational photography pipeline that captures, aligns, and merges a burst of frames to reduce noise and increase dynamic range. Our system has several key features that help make it robust and efficient. First, we do not use bracketed exposures. Instead, we capture frames of constant exposure, which makes alignment more robust, and we set this exposure low enough to avoid blowing out highlights. The resulting merged image has clean shadows and high bit depth, allowing us to apply standard HDR tone mapping methods. Second, we begin from Bayer raw frames rather than the demosaicked RGB (or YUV) frames produced by hardware Image Signal Processors (ISPs) common on mobile platforms. This gives us more bits per pixel and allows us to circumvent the ISP's unwanted tone mapping and spatial denoising. Third, we use a novel FFT-based alignment algorithm and a hybrid 2D/3D Wiener filter to denoise and merge the frames in a burst. Our implementation is built atop Android's Camera2 API, which provides per-frame camera control and access to raw imagery, and is written in the Halide domain-specific language (DSL). It runs in 4 seconds on device (for a 12 Mpix image), requires no user intervention, and ships on several mass-produced cell phones. View details
    Preview abstract We present Jump, a practical system for capturing high-resolution, omnidirectional stereo (ODS) video suitable for wide-scale consumption in currently available virtual reality (VR) headsets. Our system consists of a video camera built using off-the-shelf components and a fully automatic stitching pipeline capable of capturing video content in the ODS format. We have discovered and analyzed the distortions inherent to ODS when used for VR display, as well as those introduced by our capture method, and show that they are small enough to make this approach suitable for capturing a wide variety of scenes. Our stitching algorithm produces robust results by reducing the problem to one of pairwise image interpolation followed by compositing. We introduce novel optical flow and compositing methods designed specifically for this task. Our algorithm is temporally coherent and efficient, is currently running at scale on a distributed computing platform, and is capable of processing hours of footage each day. View details
    Preview abstract Color constancy is the problem of inferring the color of the light that illuminated a scene, usually so that the illumination color can be removed. Because this problem is underconstrained, it is often solved by modeling the statistical regularities of the colors of natural objects and illumination. In contrast, in this paper we reformulate the problem of color constancy as a 2D spatial localization task in a log-chrominance space, thereby allowing us to apply techniques from object detection and structured prediction to the color constancy problem. By directly learning how to discriminate between correctly white-balanced images and poorly white-balanced images, our model is able to improve performance on standard benchmarks by nearly 40%. View details
    Fast Bilateral-Space Stereo for Synthetic Defocus
    Andrew Adams
    YiChang Shih
    Carlos Hernández
    CVPR (2015)
    Preview abstract Given a stereo pair it is possible to recover a depth map and use that depth to render a synthetically defocused image. Though stereo algorithms are well-studied, rarely are those algorithms considered solely in the context of producing these defocused renderings. In this paper we present a technique for efficiently producing disparity maps using a novel optimization framework in which inference is performed in "bilateral-space". Our approach produces higher-quality "defocus" results than other stereo algorithms while also being 10-100 times faster than comparable techniques. View details