Richard F. Lyon
Dick Lyon, author of the 2017 book Human and Machine Hearing: Extracting Meaning from Sound, has a long history of research and invention, including the optical mouse, speech and handwriting recognition, computational models of hearing, and color photographic imaging. At Google he worked on Street View camera systems, and is now focused on machine hearing technology and applications.
Authored Publications
Tight bounds for the median of a gamma distribution
PLOS ONE (2023)
The median of a standard gamma distribution, as a function of its shape parameter $k$, has no known representation in terms of elementary functions. In this work we prove the tightest upper and lower bounds of the form $2^{-1/k} (A + k)$: an upper bound with $A = e^{-\gamma}$ that is tight for low $k$ and a lower bound with $A = \log(2) - \frac{1}{3}$ that is tight for high $k$. These bounds are valid over the entire domain of $k > 0$, staying between the 48th and 55th percentiles. We derive and prove several other new tight bounds in support of the proofs.
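As a numerical sanity check (a sketch of ours, not code from the paper), the median can be computed by bisecting the regularized lower incomplete gamma function, implemented here as a power series, and compared against both bounds:

```python
import math

EULER_GAMMA = 0.5772156649015329

def gammainc_lower_reg(k, x, terms=200):
    # P(k, x) = x^k e^{-x} / Gamma(k) * sum_{n>=0} x^n / (k (k+1) ... (k+n))
    total, term = 0.0, 1.0 / k
    for n in range(1, terms):
        total += term
        term *= x / (k + n)
    total += term
    return total * math.exp(k * math.log(x) - x - math.lgamma(k))

def gamma_median(k):
    # Bisect P(k, x) = 1/2; the median is well below k + 40 for moderate k.
    lo, hi = 0.0, k + 40.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gammainc_lower_reg(k, mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def median_bounds(k):
    # Lower and upper bounds of the form 2^{-1/k} (A + k) from the abstract.
    lower = 2.0 ** (-1.0 / k) * (math.log(2.0) - 1.0 / 3.0 + k)
    upper = 2.0 ** (-1.0 / k) * (math.exp(-EULER_GAMMA) + k)
    return lower, upper
```

For $k = 1$ (the exponential distribution) the median is exactly $\log 2$, which falls between the two bounds as expected.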
Understanding speech in the presence of noise with hearing aids can be challenging. Here we describe our entry, submission E003, to the 2021 Clarity Enhancement Challenge Round 1 (CEC1), a machine learning challenge for improving hearing aid processing. We apply and evaluate a deep neural network speech enhancement model with a low-latency recursive least squares (RLS) adaptive beamformer and a linear equalizer to improve speech intelligibility in the presence of speech or noise interferers. The enhancement network is trained only on the CEC1 data, and all processing obeys the 5 ms latency requirement. We quantify the improvement using the CEC1-provided hearing loss model and the Modified Binaural Short-Time Objective Intelligibility (MBSTOI) score (ranging from 0 to 1, higher being better). On the CEC1 test set, we achieve a mean of 0.644 and a median of 0.652, compared to the 0.310 mean and 0.314 median for the baseline. In the CEC1 subjective listener intelligibility assessment, for scenes with noise interferers, we achieve the second-highest improvement in intelligibility, from 33.2% to 85.5%; for speech interferers, we see more mixed results, potentially due to listener confusion.
VHP: Vibrotactile Haptics Platform for On-body Applications
Dimitri Kanevsky
Malcolm Slaney
UIST, ACM, https://dl.acm.org/doi/10.1145/3472749.3474772 (2021)
Wearable vibrotactile devices have many potential applications, including novel interfaces and sensory substitution for accessibility. Currently, vibrotactile experimentation is done using large lab setups. However, most practical applications require standalone on-body devices and integration into small form factors. Such integration is time-consuming and requires expertise.
To democratize wearable haptics we introduce VHP, a vibrotactile haptics platform. It comprises a low-power, miniature electronics board that can drive up to 12 independent channels of haptic signals with arbitrary waveforms at 2 kHz. The platform can drive vibrotactile actuators including LRAs and voice coils. Each vibrotactile channel has current-based load sensing, thus allowing for self-testing and auto-adjustment. The hardware is battery-powered and programmable, with multiple input options, including serial and Bluetooth, as well as the ability to synthesize haptic signals internally. We conduct technical evaluations to determine the power consumption, the latency, and how many actuators can run simultaneously.
We demonstrate applications where we integrate the platform into a bracelet and a sleeve to provide an audio-to-tactile wearable interface. To facilitate more use of this platform, we open-source our design and partner with a distributor to make the hardware widely available. We hope this work will motivate the use and study of vibrotactile all-day wearable devices.
Ecological Auditory Measures for the Next Billion Users
Brian Kemler
Chet Gnegy
Dimitri Kanevsky
Malcolm Slaney
Ear and Hearing (2020)
A range of new technologies have the potential to help people, whether traditionally considered hearing impaired or not. These technologies include more sophisticated personal sound amplification products, as well as real-time speech enhancement and speech recognition. They can improve users' communication abilities, but these new approaches require new ways to describe their success and allow engineers to optimize their properties. Speech recognition systems are often optimized using the word-error rate, but when the results are presented in real time, user interface issues become far more important than conventional measures of auditory performance. For example, there is a tradeoff between minimizing recognition time (latency) by quickly displaying results versus disturbing the user's cognitive flow by rewriting the results on the screen when the recognizer later needs to change its decisions. This article describes current, new, and future directions for helping billions of people with their hearing. These new technologies bring auditory assistance to new users, especially to those in areas of the world without access to professional medical expertise. In the short term, audio enhancement technologies in inexpensive mobile forms, devices that are quickly becoming necessary to navigate all aspects of our lives, can bring better audio signals to many people. Alternatively, current speech recognition technology may obviate the need for audio amplification or enhancement at all, and could be useful for listeners with normal hearing or with hearing loss. With new and dramatically better technology based on deep neural networks, speech enhancement improves the signal-to-noise ratio, and audio classifiers can recognize sounds in the user's environment. Both use deep neural networks to improve a user's experiences.
Longer term, auditory attention decoding is expected to allow our devices to understand where a user is directing their attention and thus allow our devices to respond better to their needs. In all these cases, the technologies turn the hearing assistance problem on its head, and thus require new ways to measure their performance.
Haptics with Input: Back-EMF in Linear Resonant Actuators to Enable Touch, Pressure and Environmental Awareness
Proceedings of UIST 2020 (ACM Symposium on User Interface Software and Technology), ACM, New York, NY
Today’s wearable and mobile devices typically use separate hardware components for sensing and actuation. In this work, we introduce new opportunities for the Linear Resonant Actuator (LRA), which is ubiquitous in such devices due to its capability for providing rich haptic feedback. By leveraging strategies to enable active and passive sensing capabilities with LRAs, we demonstrate their benefits and potential as self-contained I/O devices. Specifically, we use the back-EMF voltage to classify whether the LRA is tapped or touched, as well as how much pressure is being applied. Back-EMF sensing is already integrated into many motor and LRA drivers. We developed a passive low-power tap sensing method that uses just 37.7 µA. Furthermore, we developed active touch and pressure sensing, which is low-power, quiet (2 dB), and minimizes vibration. The sensing method works with many types of LRAs. We show applications, such as pressure-sensing side buttons on a mobile phone. We have also implemented our technique directly on an existing mobile phone’s LRA to detect whether the phone is handheld or placed on a soft or hard surface. Finally, we show that this method can be used for haptic devices to determine if the LRA makes good contact with the skin. Our approach can add rich sensing capabilities to the ubiquitous LRA actuators without requiring additional sensors or hardware.
The cascade of asymmetric resonators with fast-acting compression (CARFAC) is a cascade filterbank model that performed well in a comparative study of cochlear models, but exhibited two anomalies in its frequency response and excitation pattern. It is shown here that the underlying reason is CARFAC's inclusion of quadratic distortion, which generates DC and low-frequency components that in a real cochlea would be canceled by reflections at the helicotrema, but since cascade filterbanks lack the reflection mechanism, these low-frequency components cause the observed anomalies. The simulations demonstrate that the anomalies disappear when the model's quadratic distortion parameter is zeroed, while other successful features of the model remain intact.
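The mechanism at issue can be illustrated with a toy memoryless nonlinearity (a stand-in of ours, not the CARFAC code): a pure tone $A\sin(\omega t)$ passed through $y = x + c\,x^2$ acquires a DC component $c A^2/2$, since $\sin^2$ averages to $1/2$; in a cascade filterbank there is no helicotrema reflection to cancel that component.

```python
import math

def quadratic_distortion_dc(amplitude=0.1, c=0.5, n=4800, cycles=10):
    """Mean (DC) of a pure tone before and after y = x + c*x^2.

    Averaging over whole cycles, the input mean is 0 while the output
    mean is c * amplitude^2 / 2, the DC term from quadratic distortion.
    """
    dc_in, dc_out = 0.0, 0.0
    for i in range(n):
        x = amplitude * math.sin(2.0 * math.pi * cycles * i / n)
        dc_in += x / n
        dc_out += (x + c * x * x) / n
    return dc_in, dc_out
```

With the defaults, the output DC is $0.5 \cdot 0.1^2 / 2 = 0.0025$ while the input DC is zero, which is the kind of low-frequency component the abstract says is removed by zeroing the quadratic-distortion parameter.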
EXPLORING TRADEOFFS IN MODELS FOR LOW-LATENCY SPEECH ENHANCEMENT
Jeremy Thorpe
Michael Chinen
Proceedings of the 16th International Workshop on Acoustic Signal Enhancement (2018)
We explore a variety of configurations of neural networks for one- and two-channel spectrogram-mask-based speech enhancement. Our best model improves on state-of-the-art performance on the CHiME2 speech enhancement task. We examine trade-offs among non-causal lookahead, compute work, and parameter count versus enhancement performance, and find that zero-lookahead models can achieve, on average, only 0.5 dB worse performance than our best bidirectional model. Further, we find that 200 milliseconds of lookahead is sufficient to achieve performance within about 0.2 dB of our best bidirectional model.
Human and Machine Hearing: Extracting Meaning from Sound
Cambridge University Press (2017)
Human and Machine Hearing is the first book to comprehensively describe how human hearing works and how to build machines to analyze sounds in the same way that people do. Drawing on over thirty-five years of experience in analyzing hearing and building systems, Richard F. Lyon explains how we can now build machines with close-to-human abilities in speech, music, and other sound-understanding domains. He explains human hearing in terms of engineering concepts, and describes how to incorporate those concepts into machines for a wide range of modern applications. The details of this approach are presented at an accessible level, to bring a diverse range of readers, from neuroscience to engineering, to a common technical understanding. The description of hearing as signal-processing algorithms is supported by corresponding open-source code, for which the book serves as motivating documentation.
Trainable Frontend For Robust and Far-Field Keyword Spotting
Yuxuan Wang
Thad Hughes
Proc. IEEE ICASSP 2017, New Orleans, LA
Robust and far-field speech recognition is critical to enable true hands-free communication. In far-field conditions, signals are attenuated due to distance. To improve robustness to loudness variation, we introduce a novel frontend called per-channel energy normalization (PCEN). The key ingredient of PCEN is the use of an automatic gain control based dynamic compression to replace the widely used static (such as log or root) compression. We evaluate PCEN on the keyword spotting task. On our large rerecorded noisy and far-field eval sets, we show that PCEN significantly improves recognition performance. Furthermore, we model PCEN as neural network layers and optimize high-dimensional PCEN parameters jointly with the keyword spotting acoustic model. The trained PCEN frontend demonstrates significant further improvements without increasing model complexity or inference-time cost.
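The PCEN frontend admits a compact sketch. The smoother M is a per-channel first-order IIR, and the AGC-style division by (ε + M)^α replaces static log or root compression; the parameter values below are illustrative defaults, not the paper's trained settings:

```python
def pcen(frames, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization over spectrogram frames.

    frames: list of lists, frames[t][f] = filterbank energy at time t,
    channel f. Returns the PCEN-compressed frames:
        (E / (eps + M)^alpha + delta)^r - delta^r,
    where M is a first-order IIR smoother of E per channel.
    """
    n_channels = len(frames[0])
    m = list(frames[0])  # initialize the smoother with the first frame
    out = []
    for frame in frames:
        row = []
        for f in range(n_channels):
            m[f] = (1.0 - s) * m[f] + s * frame[f]
            row.append((frame[f] / (eps + m[f]) ** alpha + delta) ** r
                       - delta ** r)
        out.append(row)
    return out
```

Because each channel is divided by its own smoothed history, a steady tone at 100x the energy produces nearly the same output level as the quiet version, which is the loudness robustness the abstract targets.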
A 6 µW per Channel Analog Biomimetic Cochlear Implant Processor Filterbank Architecture With Across Channels AGC
Guang Wang
Emmanuel M. Drakakis
IEEE Transactions on Biomedical Circuits and Systems, vol. 9 (2015), pp. 72-86
A new analog cochlear implant processor filterbank architecture of increased biofidelity, enhanced across-channel contrast and very low power consumption has been designed and prototyped. Each channel implements a biomimetic, asymmetric bandpass-like One-Zero-Gammatone-Filter (OZGF) transfer function, using class-AB log-domain techniques. Each channel's quality factor and suppression are controlled by means of a new low power Automatic Gain Control (AGC) scheme which is coupled across the neighboring channels and emulates lateral inhibition (LI) phenomena in the auditory system. Detailed measurements from a five-channel silicon IC prototype fabricated in a 0.35 µm AMS technology confirm the operation of the coupled AGC scheme and its ability to enhance contrast among channel outputs. The prototype is characterized by an input dynamic range of 92 dB while consuming only 28 µW of power in total (~6 µW per channel) under a 1.8 V power supply. The architecture is well-suited for fully-implantable cochlear implants.
The Optical Mouse: Early Biomimetic Embedded Vision
Advances in Embedded Computer Vision, Springer (2014), pp. 3-22
The 1980 Xerox optical mouse invention, and subsequent product, was a successful deployment of embedded vision, as well as of the Mead–Conway VLSI design methodology that we developed at Xerox PARC in the late 1970s. The design incorporated an interpretation of visual lateral inhibition, essentially mimicking biology to achieve a wide dynamic range, or light-level-independent operation. Conceived in the context of a research group developing VLSI design methodologies, the optical mouse chip represented an approach to self-timed semi-digital design, with the analog image-sensing nodes connecting directly to otherwise digital logic using a switch-network methodology. Using only a few hundred gates and pass transistors in 5-micron nMOS technology, the optical mouse chip tracked the motion of light dots in its field of view, and reported motion with a pair of 2-bit Gray codes for x and y relative position—just like the mechanical mice of the time. Besides the chip, the only other electronic components in the mouse were the LED illuminators.
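The 2-bit Gray-code reporting mentioned above can be sketched as follows (an illustration of the encoding convention, not the chip's logic): adjacent positions differ in exactly one bit, so a single step changes one line of each axis's 2-bit output.

```python
# 2-bit quadrature (Gray) position codes: 00 -> 01 -> 11 -> 10 -> 00 ...
# Adjacent codes differ in exactly one bit, so single-bit glitches cannot
# be mistaken for motion.
GRAY = [0b00, 0b01, 0b11, 0b10]

def encode(position):
    """2-bit Gray code for a relative position along one axis."""
    return GRAY[position % 4]

def step_direction(prev_code, curr_code):
    """+1 or -1 for a single step between successive codes, else 0."""
    p, c = GRAY.index(prev_code), GRAY.index(curr_code)
    diff = (c - p) % 4
    if diff == 1:
        return 1
    if diff == 3:
        return -1
    return 0  # no movement, or an invalid two-step jump
```

A host counting these transitions for x and y recovers relative motion, which is all a mouse needs to report.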
The Intervalgram: An Audio Feature for Large-Scale Cover-Song Recognition
Thomas C. Walters
From Sounds to Music and Emotions: 9th International Symposium, CMMR 2012, London, UK, June 19-22, 2012, Revised Selected Papers, Springer Berlin Heidelberg (2013), pp. 197-213
We present a system for representing the musical content of short pieces of audio using a novel chroma-based representation known as the ‘intervalgram’, which is a summary of the local pattern of musical intervals in a segment of music. The intervalgram is based on a chroma representation derived from the temporal profile of the stabilized auditory image [10] and is made locally pitch invariant by means of a ‘soft’ pitch transposition to a local reference. Intervalgrams are generated for a piece of music using multiple overlapping windows. These sets of intervalgrams are used as the basis of a system for detection of identical melodic and harmonic progressions in a database of music. Using a dynamic-programming approach for comparisons between a reference and the song database, performance is evaluated on the ‘covers80’ dataset [4]. A first test of an intervalgram-based system on this dataset yields a precision at top-1 of 53.8%, with an ROC curve that shows very high precision up to moderate recall, suggesting that the intervalgram is adept at identifying the easier-to-match cover songs in the dataset with high robustness. The intervalgram is designed to support locality-sensitive hashing, such that an index lookup from each single intervalgram feature has a moderate probability of retrieving a match, with few false matches. With this indexing approach, a large reference database can be quickly pruned before more detailed matching, as in previous content-identification systems.
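The pitch-transposition idea can be made concrete with a hard (rather than 'soft') version, a sketch of ours rather than the paper's method: compare two 12-bin chroma vectors by taking the best score over all 12 circular shifts, so a melody transposed by any number of semitones still matches.

```python
def similarity_transposed(chroma_a, chroma_b):
    """Max dot product of two chroma vectors over all circular shifts
    of chroma_b -- a hard version of transposition to a local reference."""
    n = len(chroma_a)
    best = float("-inf")
    for shift in range(n):
        score = sum(chroma_a[i] * chroma_b[(i + shift) % n]
                    for i in range(n))
        best = max(best, score)
    return best
```

A chroma vector compared against a transposed copy of itself scores exactly as well as against itself, which is the invariance the intervalgram's transposition step provides.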
Modelling the Distortion Produced by Cochlear Compression
Roy D. Patterson
Timothy Ives
Thomas C. Walters
Basic Aspects of Hearing, Springer (2013), pp. 81-88
Automatically Discovering Talented Musicians with Acoustic Analysis of YouTube Videos
Eric Nichols
Charles DuHadway
Proceedings of the 2012 IEEE 12th International Conference on Data Mining (ICDM), IEEE Computer Society, Washington, DC, USA, pp. 559-565
Online video presents a great opportunity for up-and-coming singers and artists to be visible to a worldwide audience. However, the sheer quantity of video makes it difficult to discover promising musicians. We present a novel algorithm to automatically identify talented musicians using machine learning and acoustic analysis on a large set of "home singing" videos. We describe how candidate musician videos are identified and ranked by singing quality. To this end, we present new audio features specifically designed to directly capture singing quality. We evaluate these vis-a-vis a large set of generic audio features and demonstrate that the proposed features have good predictive performance. We also show that this algorithm performs well when videos are normalized for production quality.
Cascades of two-pole–two-zero asymmetric resonators are good models of peripheral auditory function
Journal of the Acoustical Society of America, vol. 130 (2011), pp. 3893-3904
A cascade of two-pole–two-zero filter stages is a good model of the auditory periphery in two distinct ways. First, in the form of the pole–zero filter cascade, it acts as an auditory filter model that provides an excellent fit to data on human detection of tones in masking noise, with fewer fitting parameters than previously reported filter models such as the roex and gammachirp models. Second, when extended to the form of the cascade of asymmetric resonators with fast-acting compression, it serves as an efficient front-end filterbank for machine-hearing applications, including dynamic nonlinear effects such as fast wide-dynamic-range compression. In their underlying linear approximations, these filters are described by their poles and zeros, that is, by rational transfer functions, which makes them simple to implement in analog or digital domains. Other advantages in these models derive from the close connection of the filter-cascade architecture to wave propagation in the cochlea. These models also reflect the automatic-gain-control function of the auditory system and can maintain approximately constant impulse-response zero-crossing times as the level-dependent parameters change.
Copyright (2011) Acoustical Society of America. This article may be downloaded for personal use only. Any other use requires prior permission of the author and the Acoustical Society of America. The article appeared in J. Acoust. Soc. Am. vol. 130 and may be found via http://asadl.org/jasa/resource/1/jasman/v130/i6/p3893_s1.
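Concretely, the linear part of each two-pole–two-zero stage is a biquad ratio; a generic parameterization consistent with the description above (the paper's exact form may differ) is

```latex
H(s) \;=\; \frac{s^2 + 2\,\zeta_z\,\omega_z\, s + \omega_z^2}{s^2 + 2\,\zeta_p\,\omega_p\, s + \omega_p^2},
```

where the natural frequencies $\omega_p, \omega_z$ set each stage's place along the cascade and the dampings $\zeta_p, \zeta_z$ are the level-dependent parameters that implement compression.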
A Pole-Zero Filter Cascade Provides Good Fits to Human Masking Data and to Basilar Membrane and Neural Data
Mechanics of Hearing (2011)
A cascade of two-pole–two-zero filters with level-dependent pole and zero dampings, with few parameters, can provide a good match to human psychophysical and physiological data. The model has been fitted to data on detection threshold for tones in notched-noise masking, including bandwidth and filter shape changes over a wide range of levels, and has been shown to provide better fits with fewer parameters compared to other auditory filter models such as gammachirps. Originally motivated as an efficient machine implementation of auditory filtering related to the WKB analysis method of cochlear wave propagation, such filter cascades also provide good fits to mechanical basilar membrane data, and to auditory nerve data, including linear low-frequency tail response, level-dependent peak gain, sharp tuning curves, nonlinear compression curves, level-independent zero-crossing times in the impulse response, realistic instantaneous frequency glides, and appropriate level-dependent group delay even with minimum-phase response. As part of exploring different level-dependent parameterizations of such filter cascades, we have identified a simple sufficient condition for stable zero-crossing times, based on the shifting property of the Laplace transform: simply move all the $s$-domain poles and zeros by equal amounts in the real-$s$ direction. Such pole-zero filter cascades are efficient front ends for machine hearing applications, such as music information retrieval, content identification, speech recognition, and sound indexing.
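The shifting-property argument is one line of Laplace algebra: for any transfer function $H(s)$ with impulse response $h(t)$,

```latex
\mathcal{L}\!\left\{ e^{-at}\, h(t) \right\} \;=\; H(s + a),
```

so replacing $H(s)$ by $H(s+a)$, i.e., moving every pole and zero by the same amount in the real-$s$ direction, multiplies the impulse response by the strictly positive envelope $e^{-at}$, which changes its decay rate but leaves its zero-crossing times exactly where they were.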
The concept of sparsity has attracted considerable interest in the field of machine learning in the past few years. Sparse feature vectors contain mostly zeros, with only one or a few non-zero values. Although these feature vectors can be classified by traditional machine learning algorithms, such as SVM, there are various recently developed algorithms that explicitly take advantage of the sparse nature of the data, leading to massive speedups in time, as well as improved performance. Some fields that have benefited from the use of sparse algorithms are finance, bioinformatics, text mining, and image classification. Because of their speed, these algorithms perform well on very large collections of data; large collections are becoming increasingly relevant given the huge amounts of data collected and warehoused by Internet businesses. We discuss the application of sparse feature vectors in the field of audio analysis, and specifically their use in conjunction with preprocessing systems that model the human auditory system. We present results that demonstrate the applicability of the combination of auditory-based processing and sparse coding to content-based audio analysis tasks: a search task in which ranked lists of sound effects are retrieved from text queries, and a music information retrieval (MIR) task dealing with the classification of music into genres.
Using a Cascade of Asymmetric Resonators with Fast-Acting Compression as a Cochlear Model for Machine-Hearing Applications
Autumn Meeting of the Acoustical Society of Japan (2011), pp. 509-512
Every day, machines process many thousands of hours of audio signals through a realistic cochlear model. They extract features, inform classifiers and recommenders, and identify copyrighted material. The machine-hearing approach to such tasks has taken root in recent years, because hearing-based approaches perform better than more conventional sound-analysis approaches. We use a bio-mimetic "cascade of asymmetric resonators with fast-acting compression" (CAR-FAC)—an efficient sound analyzer that incorporates the hearing research community's findings on nonlinear auditory filter models and cochlear wave mechanics. The CAR-FAC is based on a pole–zero filter cascade (PZFC) model of auditory filtering, in combination with a multi-time-scale coupled automatic-gain-control (AGC) network. It uses simple nonlinear extensions of conventional digital filter stages, and runs fast due to its low complexity. The PZFC plus AGC network, the CAR-FAC, mimics features of auditory physiology, such as masking, compressive traveling-wave response, and the stability of zero-crossing times with signal level. Its output "neural activity pattern" is converted to a "stabilized auditory image" to capture pitch, melody, and other temporal and spectral features of the sound.
A key problem in using the output of an auditory model as the input to a machine-learning system in a machine-hearing application is to find a good feature-extraction layer. For systems such as PAMIR (passive-aggressive model for image retrieval) that work well with a large sparse feature vector, a conversion from auditory images to sparse features is needed. For audio-file ranking and retrieval from text queries, based on stabilized auditory images, we took a multi-scale approach, using vector quantization to choose one sparse feature in each of many overlapping regions of different scales, with the hope that in some regions the features for a sound would be stable even when other interfering sounds were present and affecting other regions. We recently extended our testing of this approach using sound mixtures, and found that the sparse-coded auditory-image features degrade less in interference than vector-quantized MFCC sparse features do. This initial success suggests that our hope of robustness in interference may indeed be realizable, via the general idea of sparse features that are localized in a domain where signal components tend to be localized or stable.
Sound Retrieval and Ranking Using Sparse Auditory Representations
Martin Rehn
Samy Bengio
Thomas C. Walters
Gal Chechik
Neural Computation, vol. 22 (2010), pp. 2390-2416
To create systems that understand the sounds that humans are exposed to in everyday life, we need to represent sounds with features that can discriminate among many different sound classes. Here, we use a sound-ranking framework to quantitatively evaluate such representations in a large-scale task. We have adapted a machine-vision method, the "passive-aggressive model for image retrieval" (PAMIR), which efficiently learns a linear mapping from a very large sparse feature space to a large query-term space. Using this approach we compare different auditory front ends and different ways of extracting sparse features from high-dimensional auditory images. We tested auditory models that use an adaptive pole–zero filter cascade (PZFC) auditory filterbank and sparse-code feature extraction from stabilized auditory images via multiple vector quantizers. In addition to auditory image models, we also compare a family of more conventional Mel-Frequency Cepstral Coefficient (MFCC) front ends. The experimental results show a significant advantage for the auditory models over vector-quantized MFCCs. Ranking thousands of sound files with a query vocabulary of thousands of words, the best precision at top-1 was 73% and the average precision was 35%, reflecting an 18% improvement over the best competing MFCC.
Google Street View: Capturing the World at Street Level
Dragomir Anguelov
Carole Dulong
Daniel Filip
Christian Frueh
Abhijit Ogale
Luc Vincent
Josh Weaver
Computer, vol. 43 (2010)
Street View serves millions of Google users daily with panoramic imagery captured in hundreds of cities in 20 countries across four continents. A team of Google researchers describes the technical challenges involved in capturing, processing, and serving street-level imagery on a global scale.
Auditory filter models have a history of over a hundred years, with explicit bio-mimetic inspiration at many stages along the way. From passive analogue electric delay line models, through digital filter models, active analogue VLSI models, and abstract filter shape models, these filters have both represented and driven the state of progress in auditory research. Today, we are able to represent a wide range of linear and nonlinear aspects of the psychophysics and physiology of hearing with a rather simple and elegant set of circuits or computations that have a clear connection to underlying hydrodynamics and with parameters calibrated to human performance data. A key part of the progress in getting to this stage has been the experimental clarification of the nature of cochlear nonlinearities, and the modelling work to map these experimental results into the domain of circuits and systems. No matter how these models are built into machine-hearing systems, their bio-mimetic roots will remain key to their performance. In this paper we review some of these models, explain their advantages and disadvantages, and present possible ways of implementing them. As an example, a continuous-time analogue CMOS implementation of the One-Zero Gammatone Filter (OZGF) is presented together with its automatic gain control that models its level-dependent nonlinear behaviour.
Machine Hearing: An Emerging Field
IEEE Signal Processing Magazine, vol. 27 (2010), pp. 131-139
(intro paragraph in lieu of abstract) If we had machines that could hear as humans do, we would expect them to be able to easily distinguish speech from music and background noises, to pull out the speech and music parts for special treatment, to know what direction sounds are coming from, to learn which noises are typical and which are noteworthy. Hearing machines should be able to organize what they hear; learn names for recognizable objects, actions, events, places, musical styles, instruments, and speakers; and retrieve sounds by reference to those names. These machines should be able to listen and react in real time, to take appropriate action on hearing noteworthy events, to participate in ongoing activities, whether in factories, in musical performances, or in phone conversations.
Sound Ranking Using Auditory Sparse-Code Representations
Martin Rehn
Samy Bengio
Thomas C. Walters
Gal Chechik
ICML 2009 Workshop on Sparse Method for Music Audio
The task of ranking sounds from text queries is a good test application for machine-hearing techniques, and particularly for comparison and evaluation of alternative sound representations in a large-scale setting. We have adapted a machine-vision system, the "passive-aggressive model for image retrieval" (PAMIR), which efficiently learns, using a ranking-based cost function, a linear mapping from a very large sparse feature space to a large query-term space. Using this system allows us to focus on comparison of different auditory front ends and different ways of extracting sparse features from high-dimensional auditory images. In addition to two main auditory-image models, we also include and compare a family of more conventional MFCC front ends. The experimental results show a significant advantage for the auditory models over vector-quantized MFCCs. The two auditory models tested use the adaptive pole-zero filter cascade (PZFC) auditory filterbank and sparse-code feature extraction from stabilized auditory images via multiple vector quantizers. The models differ in their implementation of the strobed temporal integration used to generate the stabilized image. Using ranking precision-at-top-k performance measures, the best results are about 70% top-1 precision and 35% average precision, using a test corpus of thousands of sound files and a query vocabulary of hundreds of words.
A Biomimetic, 4.5 µW, 120+dB, Log-domain Cochlea Channel with AGC
Andreas G. Katsiamis
Emmanuel M. Drakakis
IEEE JSSC (Journal of Solid-State Circuits), vol. 44 (2009), pp. 1006-1022
This paper deals with the design and performance evaluation of a new analog CMOS cochlea channel of increased biorealism. The design implements a recently proposed transfer function, namely the One-Zero Gammatone filter (or OZGF), which provides a robust foundation for modeling a variety of auditory data such as realistic passband asymmetry, linear low-frequency tail and level-dependent gain. Moreover, the OZGF is attractive because it can be implemented efficiently in any technological medium-analog or digital-using standard building blocks. The channel was synthesized using novel, low-power, class-AB, log-domain, biquadratic filters employing MOS transistors operating in their weak inversion regime. Furthermore, the paper details the design of a new low-power automatic gain control circuit that adapts the gain of the channel according to the input signal strength, thereby significantly extending its input dynamic range. We evaluate the performance of a fourth-order OZGF channel (equivalent to an 8th-order cascaded filter structure) through both detailed simulations and measurements from a fabricated chip using the commercially available 0.35 µm AMS CMOS process. The whole system is tuned at 3 kHz, dissipates a mere 4.46 µW of static power, accommodates 124 dB (at < 5% THD) of input dynamic range at the center frequency and is set to provide up to 70 dB of amplification for small signals.
Large Scale Content-Based Audio Retrieval from Text Queries
Gal Chechik
Martin Rehn
Samy Bengio
ACM International Conference on Multimedia Information Retrieval (MIR), ACM (2008)
In content-based audio retrieval, the goal is to find sound recordings (audio documents) based on their acoustic features. This content-based approach differs from retrieval approaches that index media files using metadata such as file names and user tags.
In this paper, we propose a machine learning approach for retrieving sounds that is novel in that it (1) uses free-form text queries rather than sound-sample-based queries, (2) searches by audio content rather than via textual metadata, and (3) can scale to a very large number of audio documents and a very rich query vocabulary. We handle generic sounds, including a wide variety of sound effects, animal vocalizations and natural scenes. We test a scalable approach based on a passive-aggressive model for image retrieval (PAMIR), and compare it to two state-of-the-art approaches: Gaussian mixture models (GMM) and support vector machines (SVM).
We test our approach on two large real-world datasets: a collection of short sound effects, and a noisier and larger collection of user-contributed, user-labeled recordings (25K files, 2,000-term vocabulary). We find that all three methods achieved very good retrieval performance. For instance, a positive document is retrieved in the first position of the ranking more than half the time, and on average there are more than 4 positive documents in the first 10 retrieved, for both datasets. PAMIR completed both training and retrieval of all data in less than 6 hours for both datasets, on a single machine. It was one to three orders of magnitude faster than the competing approaches. This approach should therefore scale to much larger datasets in the future.
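PAMIR's speed comes from cheap online updates of a single linear map. A minimal sketch of the passive-aggressive ranking update (our notation and variable names, not the production code, which also exploits sparsity) is:

```python
def pamir_update(W, q, d_pos, d_neg, C=1.0):
    """One passive-aggressive update of the linear query->document map W.

    W: list of lists (query terms x doc features); score(q, d) = q^T W d.
    Enforces a margin: the relevant doc d_pos should outscore the
    irrelevant doc d_neg by at least 1; otherwise W takes the smallest
    step (capped by aggressiveness C) that satisfies the margin.
    """
    def score(d):
        return sum(q[i] * sum(W[i][j] * d[j] for j in range(len(d)))
                   for i in range(len(q)))

    loss = max(0.0, 1.0 - score(d_pos) + score(d_neg))
    if loss == 0.0:
        return W  # margin already satisfied: passive
    diff = [p - n for p, n in zip(d_pos, d_neg)]
    grad_sq = sum(x * x for x in q) * sum(x * x for x in diff)
    tau = min(C, loss / grad_sq) if grad_sq > 0.0 else 0.0
    for i in range(len(q)):
        for j in range(len(diff)):
            W[i][j] += tau * q[i] * diff[j]  # rank-one aggressive step
    return W
```

Each update touches only the rows and columns where the query and documents are non-zero, which is why training on tens of thousands of files finishes in hours on one machine.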
Practical Gammatone-Like Filters for Auditory Modeling
Andreas G. Katsiamis
Emmanuel M. Drakakis
EURASIP Journal on Audio, Speech, and Music Processing, vol. 2007 (2007), pp. 12
This paper deals with continuous-time filter transfer functions that resemble tuning curves at a particular set of places on the basilar membrane of the biological cochlea and that are suitable for practical VLSI implementations. The resulting filters can be used in a filterbank architecture to realize cochlear implants or auditory processors of increased biorealism. To put the reader into context, the paper starts with a short review of the gammatone filter and then exposes two of its variants, namely, the differentiated all-pole gammatone filter (DAPGF) and the one-zero gammatone filter (OZGF), filter responses that provide a robust foundation for modeling cochlear transfer functions. The DAPGF and OZGF responses are attractive because they exhibit certain characteristics suitable for modeling a variety of auditory data: level-dependent gain, linear tail for frequencies well below the center frequency, asymmetry, and so forth. In addition, their form suggests their implementation by means of cascades of N identical two-pole systems, which renders them excellent candidates for efficient analog or digital VLSI realizations. We provide results that shed light on their characteristics and attributes and which can also serve as “design curves” for fitting these responses to frequency-domain physiological data. The DAPGF and OZGF responses are essentially a “missing link” between physiological, electrical, and mechanical models for auditory filtering.
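Both variants derive from the all-pole gammatone filter (APGF), a cascade of $N$ identical two-pole sections; in a generic notation (a sketch consistent with the descriptions above, not necessarily the paper's exact symbols):

```latex
H_{\mathrm{APGF}}(s) = \frac{K}{\left(s^2 + \frac{\omega_0}{Q}\, s + \omega_0^2\right)^{N}}, \qquad
H_{\mathrm{DAPGF}}(s) = s\, H_{\mathrm{APGF}}(s), \qquad
H_{\mathrm{OZGF}}(s) = (s + \omega_z)\, H_{\mathrm{APGF}}(s),
```

so the OZGF's single real zero at $-\omega_z$ is what produces the linear low-frequency tail and passband asymmetry, while the shared two-pole section makes the cascade implementation natural.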