The authors have declared that no competing interests exist.
Conceived and designed the experiments: RCM TL FET. Performed the experiments: RCM TL FET. Analyzed the data: RCM TL FET. Contributed reagents/materials/analysis tools: RCM TL FET. Wrote the paper: RCM TL FET.
Given the extraordinary ability of humans and animals to recognize communication signals over a background of noise, describing noise invariant neural responses is critical not only to pinpoint the brain regions that are mediating our robust perceptions but also to understand the neural computations that are performing these tasks and the underlying circuitry. Although invariant neural responses, such as rotation-invariant face cells, are well described in the visual system, high-level auditory neurons that can represent the same behaviorally relevant signal in a range of listening conditions have yet to be discovered. Here we found neurons in a secondary area of the avian auditory cortex that exhibit noise-invariant responses in the sense that they responded with similar spike patterns to song stimuli presented in silence and over a background of naturalistic noise. By characterizing the neurons' tuning in terms of their responses to modulations in the temporal and spectral envelope of the sound, we then show that noise invariance is partly achieved by selectively responding to long sounds with sharp spectral structure. Finally, to demonstrate that such computations could explain noise invariance, we designed a biologically inspired noise-filtering algorithm that can be used to separate song or speech from noise. This novel noise-filtering method performs as well as other state-of-the-art de-noising algorithms and could be used in clinical or consumer oriented applications. Our biologically inspired model also shows how high-level noise-invariant responses could be created from neural responses typically found in primary auditory cortex.
Birds and humans excel at the task of detecting important sounds, such as song and speech, in difficult listening environments such as in a large bird colony or in a crowded bar. How our brains achieve such a feat remains a mystery to both neuroscientists and audio engineers. In our research, we found a population of neurons in the brain of songbirds that are able to extract a song signal from a background of noise. We explain how the neurons are able to perform this task and show how a biologically inspired algorithm could outperform the best noise-reduction methods proposed by engineers.
Invariant neural representations of behaviorally relevant objects are a hallmark of high-level sensory regions and are interpreted as the outcome of a series of computations that would allow us to recognize and categorize objects in
In this study, we examined how neurons in the secondary avian auditory cortical area NCM (
We recorded neural responses from single neurons in NCM of anesthetized adult male Zebra Finches. We obtained responses to 40 different unfamiliar conspecific songs and to the same songs embedded in naturalistic synthetic noise also called modulation-limited noise (ml-noise from here on). Ml-noise is broadband white-noise that has been filtered in the modulation domain to mimic the structure that is found in environmental sounds by restricting the power of modulations in the envelope to low spectral-temporal frequencies
As illustrated on the left panels in
Responses of two neurons (Cell A and Cell B) to song presented alone and over noise. The top row shows the spectrogram of the same zebra finch song used in the two recordings. Song starts at 0 s. Below the spectrogram are raster plots and corresponding smoothed PSTHs. The first raster and PSTH correspond to the response of each neuron to the song alone presented at 70 dB SPL. Clear temporal synchrony across the four trials can be seen for both neurons, illustrative of equally robust responses to song stimuli. The second raster and PSTH correspond to the responses to song + modulation-limited noise (ml-noise) presented at 3 dB signal to noise ratio. Ml-noise is synthesized by low-pass filtering white noise in the space of temporal and spectral modulations (see
To quantify the degree of noise robustness, we calculated two measures of noise-invariance: a de-biased correlation coefficient between the PSTHs obtained for the song alone and song + ml-noise stimuli (called ICC) and the ratio of the SNR estimated for the song + ml-noise response and the SNR estimated for the song alone response (ISNR invariance). The ICC metric is a normalized measure that ranges in value between −1 and 1. It is 1 when the response pattern observed to song + ml-noise is identical to the one observed to song, irrespective of the relative magnitude of the two responses. For ISNR, we defined the response SNR as follows. For the response to song alone, the signal power was defined as the variance in the PSTH across time and the noise was defined as the mean firing rate. For the response to song plus noise, the signal was taken to be the time-varying response that could be predicted linearly from the response to song alone and the noise was the mean of this predicted response (see
We found neurons with different degrees of noise invariance throughout NCM but the neurons in the ventral region tended to have the highest ICC (
To better understand how noise invariance was achieved in this system, we examined how the neurons' selectivity for particular joint spectral-temporal patterns that are unique to song could have contributed to the robust coding of song in noisy conditions. To do so, we estimated the STRF of each neuron and examined the predicted response to song and to song plus noise. The STRF describes how acoustical patterns in time and frequency correlate with the neuron's response
To determine whether a neuron's tuning for particular spectral-temporal features characteristic of song and less common in noise could explain the observed invariance, we used the STRF to obtain estimated responses to song and noise. We then regressed the ICC values that we measured directly from the neuron's response against the ICC values obtained from the predictions of the STRF model (
Vertical axis in A–C shows the noise invariance in the neural response. Each neuron (each point on the scatterplots) is represented by its STRF (0.25–8 kHz on the vertical axis, 0–60 ms on the horizontal).
Therefore, for the task of extracting the song from noise, the most effective non-linearities appear to be the simple thresholding non-linearity (i.e. for neurons with STRFs closest to the x = y line in
Since the STRF could partially explain the observed noise-invariance, we asked what feature of the neurons' spectral-temporal tuning was important for this computation. By estimating the modulation gain from the neurons' STRFs, we found that tuning for high spectral modulations and low temporal modulations correlates with neural invariance (
The generation of the observed modulation tuning properties of the more noise invariant neurons described in this study is not a trivial task: most neurons in lower auditory areas have much shorter integration times and lack the sharp excitation and inhibition along the spectral dimension that we observed here. From comprehensive surveys of tuning properties in the avian primary auditory cortex (Field L)
Since the tuning of noise invariant neurons described by their STRF and the threshold non-linearity only describes a fraction of the invariance, we were interested in assessing whether noise invariant neurons were selective for longer sound segments such as those that might be useful to distinguish one song from another. To begin to investigate this idea, we examined the invariance of all the neurons for each song and calculated the standard deviation and the coefficient of variation (CV) of the invariance metric for each neuron. These results are shown as a two-dimensional heat map on
Two-dimensional heat plot that shows the value of the invariance metric obtained for each neuron (n = 32) and each song stimulus (n = 36). The neurons are sorted from low mean invariance (bottom row) to high mean invariance (top row). The columns on the left show the standard deviation of the invariance and the coefficient of variation for each neuron. The color bar is placed at the bottom of the graph and is the same for the invariance, the standard deviation and the coefficient of variation. The grey cells in the matrix correspond to (neuron, stimulus) pairs for which we were not able to calculate the invariance, either because of missing data or because of very low response rates.
Inspired by our discovery of noise invariant neurons in NCM, we engineered a noise filtering algorithm based on a decomposition of the sound by an ensemble of “artificial” neurons described by realistic STRFs. We developed this algorithm both for biological and engineering purposes. Our biological goal was to demonstrate that an ensemble of noise-invariant responses such as the one observed here could indeed be used to recover a signal from noise. We also wanted to show whether an optimization process designed to extract signals from noise would rely on responses of particular artificial neurons with properties that are similar to those found in the biology. Finally, we also wanted to explore to what extent the invariance of signal in noise depends on the exact statistics of the signal and noise stimuli. Our engineering goal was to develop a real-time algorithm inspired by the biology that could potentially be used in clinical applications such as hearing aids and cochlear implants or in commercial applications involving automatic speech recognition
Our ensemble of artificial neurons can be thought of as a modulation filter bank because the response of each neuron quantifies the presence and absence of particular spectral-temporal patterns as observed in a spectrogram and, contrary to a frequency filter bank, not solely the presence or absence of energy at a particular frequency band. In other words, the STRFs can be thought of as “higher-level” sound filters: if lower-level sound filters operate in the frequency domain (for example removing low frequency noise such as the hum of airplane engines), these high-level filters operate in the spectral-temporal modulation domain. In this joint modulation domain, sounds that have structure in time (such as beats) or structure in frequency (such as in a musical note composed of a fundamental tone and its harmonically related overtones) are characterized by specific temporal and spectral modulations. A spectral-temporal modulation filter could then be used to detect sounds that contain particular time-frequency patterns while filtering out other sounds that might have similar frequency content but lack this spectral-temporal structure. Similar decompositions have also been proposed and used by others for the efficient processing of speech and other complex signals
Noise filtering with such a modulation filter bank can be described as a series of signal processing steps: i) decompose the signal into frequency channels using a frequency filter bank; ii) represent the sound as the envelope in each of the frequency channels, as it is done in a spectrogram; iii) filter this time-frequency amplitude representation by a modulation filter bank to effectively obtain a filtered spectrogram; iv) invert this filtered spectrogram to recover the desired signal. Although each of these steps involves relatively simple signal processing, two significant issues remain. First, one has to choose the appropriate gain on the modulation filters in order to detect behaviorally relevant signals over noise. Second, the spectrogram inversion step requires a computationally intensive iterative procedure
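Step iii above can be illustrated with a minimal sketch. It assumes that the spectrogram is already available as a 2-D array and applies a simple binary low-pass gain in the joint modulation domain. The function name `modulation_filter`, its parameters and the binary gain are illustrative assumptions; in the algorithm described here the gains are learned rather than fixed, and steps i, ii and iv are not shown.

```python
import numpy as np

def modulation_filter(log_spec, dt, df, max_temporal_hz, max_spectral_cyc_per_khz):
    """Low-pass a log spectrogram in the joint spectral-temporal modulation domain.

    log_spec: 2-D array of shape (n_freq, n_time)
    dt: spectrogram time step in seconds
    df: spectrogram frequency step in kHz
    """
    n_freq, n_time = log_spec.shape
    # Modulation axes: spectral (cycles/kHz) along rows, temporal (Hz) along columns.
    w_spec = np.fft.fftfreq(n_freq, d=df)
    w_temp = np.fft.fftfreq(n_time, d=dt)
    # Hypothetical binary gain: keep slow temporal and coarse spectral modulations only.
    keep_spec = np.abs(w_spec)[:, None] <= max_spectral_cyc_per_khz
    keep_temp = np.abs(w_temp)[None, :] <= max_temporal_hz
    gain = keep_spec & keep_temp
    # Filter in the 2-D Fourier domain and return to the spectrogram domain.
    filtered = np.fft.ifft2(np.fft.fft2(log_spec) * gain)
    return filtered.real

# Example: a 10 Hz envelope modulation survives, a 100 Hz modulation is removed.
t = np.arange(200) * 0.001
slow = np.sin(2 * np.pi * 10 * t)
fast = np.sin(2 * np.pi * 100 * t)
spec = np.tile(slow + fast, (16, 1))
out = modulation_filter(spec, dt=0.001, df=0.25,
                        max_temporal_hz=50, max_spectral_cyc_per_khz=2.0)
```

Because the gain is symmetric in positive and negative modulation frequencies, the filtered spectrogram stays real-valued.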
The various steps in our algorithm are illustrated on
We implemented a biologically inspired noise-filtering algorithm using an analysis/synthesis paradigm (top row) where the synthesis step is based on a STRF filter bank decomposition. The bottom row shows the model neural responses obtained from a sound (spectrogram of noise-corrupted song) using the filter bank of biologically realistic STRFs. These responses are then weighted optimally with weights d1,..,dM to select the combination of responses that are most noise-invariant. The weighted responses are then transformed into frequency space by multiplying the weighted responses by the frequency marginal of the corresponding STRF (color-matched on the figure) to obtain gains as a function of frequency. The top row illustrates how these time-varying frequency gains can then be applied to a decomposition of the sound into frequency channels allowing for the synthesis step and an estimate of the clean signal. This technology is available for licensing via UC Berkeley's Office of Technology Licensing (Technology: Modulation-Domain Speech Filtering For Noise Reduction; Tech ID: 22197; Lead Case: 2012-034-0).
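The mapping from weighted model responses to frequency gains described in this caption can be sketched as a single matrix product. This is a sketch under stated assumptions, not the published implementation: the frequency marginal is taken to be the STRF summed over time lags, no additional non-linearity is applied to the gain, and the name `time_varying_gain` and the array layouts are hypothetical.

```python
import numpy as np

def time_varying_gain(responses, weights, strfs):
    """Combine weighted model responses into a frequency-by-time gain.

    responses: array (M, T), response of each of M model neurons over time
    weights:   array (M,), importance weights d1..dM
    strfs:     array (M, F, L), one STRF per neuron (F frequency bands, L time lags)
    """
    responses = np.asarray(responses, dtype=float)
    weights = np.asarray(weights, dtype=float)
    strfs = np.asarray(strfs, dtype=float)
    # Assumption: the frequency marginal is the STRF summed over time lags.
    marginals = strfs.sum(axis=2)              # shape (M, F)
    weighted = weights[:, None] * responses    # shape (M, T)
    # Sum over neurons: gain[f, t] = sum_i d_i * r_i(t) * marginal_i(f)
    return marginals.T @ weighted              # shape (F, T)

# Two toy neurons, three frequency bands, two lags, four time points.
strfs = np.array([[[1.0, 1.0], [0.0, 0.0], [0.0, 0.0]],
                  [[0.0, 0.0], [0.0, 0.0], [0.5, 0.5]]])
responses = np.array([[1.0, 2.0, 3.0, 4.0],
                      [1.0, 1.0, 1.0, 1.0]])
weights = np.array([0.5, 2.0])
gain = time_varying_gain(responses, weights, strfs)
```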
The analysis step in the algorithm involves generating an additional representation of the sounds based on an ensemble of model neurons fully characterized by their STRF. These STRFs are designed to efficiently encode the structure of the signal and the noise, allowing them to be useful indicators of the time-course of signal in a noisy sound. For this study, we used a bank of STRFs that were designed to model the STRFs found throughout the auditory pallium, including STRFs not only from neurons in NCM but also the field L complex
To assess the quality of our algorithm, we compared it to 3 other noise reduction schemes: the optimal classical frequency Wiener filter for stationary Gaussian signals (OWF), a state-of-the-art spectral subtraction algorithm (SINR) used by a hearing aid company, and the upper bound obtained by an ideal binary mask (IBM). The optimal Wiener filter is a frequency filter whose static gain depends solely on the ratio of the power spectrum of the signal and signal + noise. The state-of-the-art spectral subtraction algorithm uses a time variable gain just as in our algorithm but based on a running estimate of noise and signal spectrum. This algorithm was patented by Sonic Innovations (US Patent 6,757,395 B1) and is currently used in hearing aids. The IBM procedure used a zero-one mask applied to the sounds in the spectrogram domain. The mask is adapted to specific signals by setting an amplitude threshold. Ideal binary masks require prior knowledge of the desired signal and thus can be considered as an approximate upper bound on the potential performance of general noise reduction algorithms
As shown on
We are now able to answer our questions. First, as quantified above, using an ensemble of physiologically realistic noise-invariant responses, we show that one is able to recover the distorted signal with remarkable accuracy. Second, we were also able to compare the properties of the STRFs in the model that had the biggest importance gains (
Both in the model and in the biological system, given a complete modulation filter bank, the importance weights for a given signal and noise could be learned quickly through supervised learning. Moreover, after learning, the algorithm can easily be implemented in real-time with minimal delay. Thus, the algorithm is particularly useful with adaptive weights or if the statistics of the noise and signal are known, both of which are true in the biological system. Finally, given its performance and the advantages described above, we also believe that this noise filtering approach could be useful in clinical applications, such as hearing aids or cochlear implants, or in consumer applications such as noise canceling preprocessing for automatic speech recognition.
In summary, we have shown the presence of noise-invariant neurons in a secondary auditory cortical area. We show that a fraction of the noise-rejecting property can be explained by the spectral-temporal tuning of the neurons. However, tuning properties that are not well captured by the STRF can also either increase or decrease noise-invariance and these properties will have to be examined in future work. We have also described a novel noise reduction algorithm that uses a modulation filter-bank akin to the STRFs found in the avian auditory system. The performance of this algorithm in noise reduction was excellent and similar to or better than that of the current state-of-the-art algorithms used in hearing aids. The model also illustrates some fundamental principles and allowed us to make stronger statements on the scope of our biological findings. The fundamental principles are, first, that signal and noise can have distinct signatures in the modulation space while overlapping in the frequency space, and that filtering in this domain can therefore be advantageous. Second, although modulation filtering is a linear operation in the spectrogram domain, both the generation of a spectrogram and the re-synthesis of a clean signal require non-linear computations. We argue that the spectral-temporal properties that are found in higher auditory areas and that are particularly efficient at distinguishing noise modulations from signal modulations are the result of a series of non-linear computations that occurred in the ascending auditory processing stream. The model also shows that a real-time re-synthesis of a cleaned signal could be obtained with additional non-linear operations or, in other words, that a real-time spectrographic inversion is possible.
Finally, our modeling efforts show that the noise-invariant findings described here for a song as a chosen prototypical signal and a modulation-limited noise as the chosen prototypical noise would also apply to other signals and noise. However, the involvement of neurons with slightly different tuning or adaptive properties would be needed to obtain optimal signal detection. Given the behavioral experiments that have shown that birds excel at auditory scene analysis tasks both in the wild
All animal procedures were approved by our institutional Animal Care and Use Committee. Neurophysiological recordings were performed in four urethane-anesthetized adult zebra finches to obtain 50 single unit recordings in areas NCM and potentially field L (see below). We have used similar neurophysiological and histological methods to characterize other regions of the avian auditory processing stream and detailed descriptions can be found there
To obtain recordings from NCM, we used more medial coordinates than our previous experiments. With the bird's beak fixed at a 55° angle to the vertical, electrodes were inserted roughly 1.2 mm rostral and 0.5 mm lateral to the Y-sinus. We made extracellular recordings from tungsten-parylene electrodes having impedance between 1 and 3 MΩ (A-M Systems). Electrodes were advanced in 0.5 µm steps with a microdrive (Newport), and extracellular voltages were recorded with a system from Tucker-Davis Technologies (TDT).
In all cases, the extracellular voltages were thresholded to collect candidate spikes. Each time the voltage crossed the threshold, the timestamp was saved along with a high-resolution waveform of the voltage around that time (0.29 ms before and 0.86 ms after for a total of 1.15 ms). After the experiment, these waveforms were sorted using SpikePak (TDT) to assess unit quality. We sorted spike waveforms using a combination of PCA and waveform features (maximum and minimum voltage, maximum slope, area). We assessed clustering qualitatively and verified afterwards that the resulting units had Inter-Spike-Interval distributions where no more than 0.5% of the intervals were less than 1.5 ms.
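The inter-spike-interval criterion used to verify unit quality can be written as a short check. The sketch below assumes a single, continuous spike train with times in milliseconds; the function names are hypothetical and this is not the sorting software used in the study.

```python
import numpy as np

def isi_violation_fraction(spike_times_ms, refractory_ms=1.5):
    """Fraction of inter-spike intervals shorter than refractory_ms."""
    times = np.sort(np.asarray(spike_times_ms, dtype=float))
    isis = np.diff(times)
    if isis.size == 0:
        return 0.0  # fewer than two spikes: no intervals to violate
    return float(np.mean(isis < refractory_ms))

def passes_isi_criterion(spike_times_ms, refractory_ms=1.5, max_fraction=0.005):
    """True if no more than max_fraction (0.5%) of intervals are below refractory_ms."""
    return isi_violation_fraction(spike_times_ms, refractory_ms) <= max_fraction

# A regular 200 spikes/s train has no short intervals and passes.
clean_unit = np.arange(0.0, 1000.0, 5.0)
# One of these three intervals is 1 ms, so this train fails.
contaminated_unit = [0.0, 10.0, 11.0, 30.0]
```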
In each bird, we advanced the electrode in 50 µm steps until we found auditory responses. At that point we recorded activity in 100 µm steps. When we no longer found auditory responses, we moved the electrode 300 µm further, made an electrolytic lesion (2 µA × 10 s), advanced another 300 µm, and made a second identical lesion. These lesions were used to find the electrode track post-mortem and to calibrate the depth measurements.
At the end of the recording session, the bird was euthanized with an overdose of Equithesin and transcardially perfused with 0.9% saline, followed by 3.7% formalin in 0.025 M phosphate buffer. The skullcap was removed and the brain was post-fixed in 30% sucrose and 3.7% formalin to prepare it for histological procedures. The brain was sliced parasagittally in 40 µm thick sections using a freezing microtome. Alternating brain sections were stained with both cresyl violet and silver stain, which were then used to visualize electrode tracks, electrolytic lesions and brain regions.
All of our electrode tracks sampled NCM from dorsal to ventral regions. Some of the more dorsal recordings (shallower depths) could have been in subregions L or L2b of the Field L complex as the boundary between either of these two regions and NCM proper is difficult to establish
Stimuli consisted of zebra-finch songs, roughly 1.6–2.6 seconds in length, recorded from 40 unfamiliar adult male zebra finches played either in isolation or in combination with a background of synthetic noise (song+ml-noise stimuli in main text).
The masking noise in the neurophysiological experiments was synthetic and obtained by low-pass filtering white noise in the modulation domain following the procedure described in
We have also shown that such ml-noise is an effective stimulus for midbrain and cortical avian auditory neurons in the sense that it drives neurons with high response rates and high information rates
All song and ml-noise stimuli were processed to be band limited between 250 Hz and 8 kHz and to have equal loudness using custom code in Matlab. The sounds were presented using software and electronics from TDT. Stimuli were played over a speaker at 72 dB C-weighted average SPL in a double-walled anechoic chamber (Acoustic Systems). The bird was positioned 20 cm in front of the speaker for free-field binaural stimulation.
Each of the combined stimuli consisted of a different ml-noise sound sample, randomly paired with one of the songs. The noise stimulus began five to seven seconds after the previous stimulus, and the song began after a random delay of 0.5 to 1.5 seconds after the onset of the noise. Thus for each trial the same song is paired with a different noise sample and at a different delay. In the combined presentations, the noise stimuli were attenuated by 3 dB to obtain a signal to noise ratio (SNR) of 3 dB.
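The construction of a combined stimulus can be sketched as follows. This is an illustration only: it uses RMS amplitude as a stand-in for the equal-loudness normalization, and the function name `mix_at_snr` and the delay expressed in samples are assumptions, since the sampling rate is not stated here.

```python
import numpy as np

def mix_at_snr(song, noise, snr_db, delay_samples):
    """Embed a song in a longer noise sample at a target signal to noise ratio.

    The noise starts at sample 0 and the song starts delay_samples later.
    Returns the mixture and the gain that was applied to the noise.
    """
    song = np.asarray(song, dtype=float)
    noise = np.asarray(noise, dtype=float)
    if delay_samples + song.size > noise.size:
        raise ValueError("noise sample is too short for this song and delay")
    # Assumption: RMS amplitude is used as a proxy for loudness.
    rms_song = np.sqrt(np.mean(song ** 2))
    rms_noise = np.sqrt(np.mean(noise ** 2))
    # Scale the noise so that 20*log10(rms_song / rms_scaled_noise) equals snr_db.
    gain = rms_song / (rms_noise * 10.0 ** (snr_db / 20.0))
    mixed = gain * noise
    mixed[delay_samples:delay_samples + song.size] += song
    return mixed, gain

rng = np.random.default_rng(0)
song = rng.standard_normal(1000)    # placeholder for a song waveform
noise = rng.standard_normal(5000)   # placeholder for an ml-noise sample
mixed, gain = mix_at_snr(song, noise, snr_db=3.0, delay_samples=500)
```

When the song and noise already have equal RMS, the gain reduces to a 3 dB attenuation of the noise, which matches the procedure described above.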
We played four trials at each recording location, each consisting of a randomized sequence of 40 songs, 40 masking noise stimuli, and 40 combined stimuli. Stimuli were separated by a period of silence with a length uniformly and randomly distributed between five and seven seconds.
We used custom code written in MATLAB, Python and R for all of our analyses.
We assessed responsiveness using an average z-score metric for each stimulus class. The z-score is calculated as follows:
To measure invariance, we evaluated the similarity between the responses to song and song + ml-noise by computing two measures: 1) the correlation coefficient between the PSTH for each corresponding response and 2) the ratio of the SNR in the neural response to song+noise and the SNR in the response to song alone.
If the PSTH for song is called
The signal to noise ratio for the response to song+noise is then:
In the calculations above, the PSTH was obtained by smoothing spike arrival times using a 31 ms Hanning window. The bias introduced by the small number of trials used to compute each PSTH was corrected by jackknifing. The single-stimulus results indicate a small but consistent negative bias in the four-trial estimates. We then computed the invariance as the mean of the individual bias-corrected correlations obtained for each of the 40 stimuli.
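A minimal sketch of the PSTH smoothing and of a bias-corrected correlation is given below. It assumes 1 ms bins, spike times in milliseconds, and a standard delete-one jackknife in which the same trial index is removed from both conditions. The exact jackknife scheme used in the study is not specified here, so this choice and all function names are assumptions.

```python
import numpy as np

def smoothed_psth(trials_ms, duration_ms, window_ms=31):
    """Trial-averaged PSTH in 1 ms bins, smoothed with a Hanning window."""
    n_bins = int(duration_ms)
    counts = np.zeros(n_bins)
    for spikes in trials_ms:
        idx = np.floor(np.asarray(spikes, dtype=float)).astype(int)
        idx = idx[(idx >= 0) & (idx < n_bins)]
        np.add.at(counts, idx, 1.0)   # accumulate spikes, allowing repeated bins
    window = np.hanning(window_ms)
    window /= window.sum()            # unit area, so the spike count is preserved
    return np.convolve(counts / len(trials_ms), window, mode="same")

def psth_correlation(psth_a, psth_b):
    """Pearson correlation coefficient between two PSTHs."""
    return float(np.corrcoef(psth_a, psth_b)[0, 1])

def jackknife_correlation(trials_a, trials_b, duration_ms):
    """Delete-one jackknife bias-corrected correlation (assumed scheme)."""
    trials_a, trials_b = list(trials_a), list(trials_b)
    n = len(trials_a)
    full = psth_correlation(smoothed_psth(trials_a, duration_ms),
                            smoothed_psth(trials_b, duration_ms))
    leave_one_out = [
        psth_correlation(smoothed_psth(trials_a[:i] + trials_a[i + 1:], duration_ms),
                         smoothed_psth(trials_b[:i] + trials_b[i + 1:], duration_ms))
        for i in range(n)
    ]
    return n * full - (n - 1) * float(np.mean(leave_one_out))

# Four toy trials with spikes near 50 ms and 120 ms.
trials = [[50.2, 120.7], [51.0, 119.3], [49.5, 121.1], [50.9, 120.0]]
psth = smoothed_psth(trials, 200)
```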
For each responsive single unit, we estimated the neuron's STRF from their responses to song alone. The STRF were obtained using the
We assessed the performance of each STRF using coherence and the normal mutual information as described in
To further examine the gain of the neuronal response as a function of temporal and spectral modulations, we also represented each STRF in terms of its Modulation Transfer Function (MTF). The MTF is obtained by taking the amplitude of the two-dimensional Fourier transform of the STRF
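The MTF computation can be sketched in a few lines. The sketch assumes that the STRF is stored as a (frequency × time lag) array with uniform sampling; the function name and the axis units (kHz and seconds) are illustrative choices.

```python
import numpy as np

def modulation_transfer_function(strf, dt_s, df_khz):
    """Amplitude of the 2-D Fourier transform of an STRF.

    strf:   array (n_freq, n_lag)
    dt_s:   time-lag step in seconds
    df_khz: frequency step in kHz
    Returns the MTF with its spectral (cycles/kHz) and temporal (Hz) modulation axes,
    all shifted so that zero modulation sits in the centre.
    """
    n_freq, n_lag = strf.shape
    mtf = np.abs(np.fft.fftshift(np.fft.fft2(strf)))
    spectral_mod = np.fft.fftshift(np.fft.fftfreq(n_freq, d=df_khz))
    temporal_mod = np.fft.fftshift(np.fft.fftfreq(n_lag, d=dt_s))
    return mtf, spectral_mod, temporal_mod

# Toy STRF: a 1 cycle/kHz spectral ripple that is constant across time lags.
freqs_khz = np.arange(32) * 0.25
toy_strf = np.tile(np.cos(2 * np.pi * 1.0 * freqs_khz)[:, None], (1, 60))
mtf, spectral_mod, temporal_mod = modulation_transfer_function(toy_strf, 0.001, 0.25)
row, col = np.unravel_index(np.argmax(mtf), mtf.shape)
```

For this toy STRF the peak of the MTF falls at a spectral modulation of magnitude 1 cycle/kHz and a temporal modulation of 0 Hz.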
To calculate the invariance metrics for the STRF model, we first obtained the predicted response to the song+ml-noise stimulus for each trial. Using these in place of the actual responses, we then computed an invariance metric for the STRF model by comparing the predicted responses to the actual response obtained for song alone. In this manner, we were able to directly compare the STRF model invariance with the invariance calculated for the actual neuron. We used a two-tailed t-test to compare the distribution of similarity values for the 40 four-trial linear predictions to the 40 actual four-trial responses.
Following directly from the premise that neurons in area NCM selectively respond to spectral-temporal modulations present in zebra finch songs, even in the presence of corrupting background noise, we developed a noise reduction scheme that would exploit this property. Our algorithm falls in the general class of single microphone noise reduction (SMNR) algorithms using spectral subtraction. The core idea in spectral subtraction is to estimate the frequency components of the signal from the short time Fourier components of the corrupted signal. The estimated signal frequency components are obtained by multiplying the Fourier components of signal+noise by a gain function. This is the synthesis part of the algorithm. The gain function can vary both in frequency and time. The form and estimation of the
Both the analysis and synthesis step in our algorithm used a complete (amplitude and phase) time-frequency decomposition of the sound stimuli (
The analysis step in the algorithm involved generating an additional representation of the sounds based on an ensemble of M model neurons fully characterized by their STRF. The model STRFs were parameterized as the product of two Gabor functions describing the temporal and spectral response of the neuron:
The parameters of these Gabor functions (e.g. for time:
The optimal set of weights,
Training was performed on all instances of the signal + noise samples. Weights were determined by averaging across values obtained through jack-knifing across this data set ten times with 10% of the data held out as an early stopping set. Noise reduction was then validated and quantified on a novel song in novel noise. Examples of noise corrupted signals and filtered signals that correspond to the spectrograms shown in
To assess the performance of our model, we computed the cross-correlation between the estimate and the clean signal in the log spectrogram domain. We then took the ratio of this cross-correlation and the value obtained prior to attempting to de-noise the stimulus to obtain a performance ratio. As summarized in the text, we then compared our algorithm to other noise reduction schemes. For this purpose, we also estimated the performance ratio for three other spectral subtraction noise algorithms: the optimal Wiener filter (OWF), a variable gain algorithm patented by Sonic Innovations (SINR) and the ideal binary mask (IBM). The optimal Wiener filter is a frequency filter whose static gain depends solely on the ratio of the power spectrum of the signal and signal + noise. In our implementation, the Wiener filter was constructed using the frequency power spectrum of signal and noise from the training set and then applied to a stimulus from the testing set (of the same class). The spectral subtraction algorithm from Sonic Innovations used a time variable gain just as in our implementation. Also, as in our implementation, the analysis step for estimating this gain was based on the log of the amplitude of the Fourier components. However, the gain function itself was estimated not from a modulation filter bank but by estimating the statistical properties of the envelope of the signal and noise in each frequency band (US Patent 6,757,395 B1). We used a Matlab implementation of the SINR algorithm provided to us by Dr. William Woods of Starkey Hearing Research Center, Berkeley, CA. Optimal parameters for the level of noise reduction and the estimation of the noise envelope for that algorithm were also obtained on the training signal and noise stimuli and the performance was cross-validated with the test stimuli. The IBM procedure used a zero-one mask applied to the sounds in the spectrogram domain. The mask is adapted to specific signals by setting an amplitude threshold.
Binary masks require prior knowledge of the desired signal and thus should be seen as an approximate upper bound on the potential performance of general noise reduction algorithms. Although these simulations are far from comprehensive, they allowed us to compare our algorithm to optimal classical approaches for Gaussian distributed signals (OWF), to a very recent state-of-the-art algorithm (SINR) and to an upper bound (IBM). For commercial applications, our noise-reduction algorithm is available for licensing via UC Berkeley's Office of Technology Licensing (Technology: Modulation-Domain Speech Filtering For Noise Reduction; Tech ID: 22197; Lead Case: 2012-034-0).
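The performance ratio and two of the comparison methods can be sketched as follows. The sketch makes several assumptions that are not taken from the text: the cross-correlation is computed as a zero-lag Pearson correlation between log amplitude spectrograms, the Wiener gain assumes uncorrelated signal and noise so that the power of signal + noise is the sum of the two powers, and the binary mask is obtained by thresholding the clean amplitude spectrogram. All function names are hypothetical.

```python
import numpy as np

def wiener_gain(signal_power, noise_power):
    """Static per-frequency gain: signal power over signal + noise power."""
    signal_power = np.asarray(signal_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    return signal_power / (signal_power + noise_power)

def ideal_binary_mask(clean_spec, threshold):
    """Zero-one mask from the clean spectrogram (requires knowing the clean signal)."""
    return (np.asarray(clean_spec, dtype=float) > threshold).astype(float)

def log_spec_correlation(spec_a, spec_b, floor=1e-6):
    """Zero-lag correlation between two amplitude spectrograms in the log domain."""
    log_a = np.log(np.maximum(spec_a, floor)).ravel()
    log_b = np.log(np.maximum(spec_b, floor)).ravel()
    return float(np.corrcoef(log_a, log_b)[0, 1])

def performance_ratio(clean_spec, noisy_spec, estimate_spec):
    """Correlation with the clean signal after de-noising, relative to before."""
    return (log_spec_correlation(estimate_spec, clean_spec)
            / log_spec_correlation(noisy_spec, clean_spec))

# Toy amplitude spectrograms (8 frequency bands by 20 time frames).
rng = np.random.default_rng(1)
clean = rng.uniform(0.1, 1.0, size=(8, 20))
noisy = clean + rng.uniform(0.1, 1.0, size=(8, 20))
mask = ideal_binary_mask(clean, threshold=0.5)
masked_estimate = noisy * mask
```

A ratio above 1 indicates that the estimate is closer to the clean signal than the unprocessed mixture was; returning the mixture unchanged gives a ratio of exactly 1.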
We thank Julie Elie and Wendy de Heer for constructive comments on a previous version of the manuscript. We also thank William S. Woods of Starkey Hearing Research Center, Berkeley, CA, for expert advice on current state-of-the-art noise reduction algorithms.