Research Work

My research focuses on developing neural methods for representation learning of speech and audio signals, with the goal of improving the downstream applications that rely on these representations. The key highlights are:

Details

In the first part of the work, we propose methods for learning representations of speech data in an unsupervised manner. With modulation representation learning as the goal, we explore several neural architectures for unsupervised learning, such as restricted Boltzmann machines (RBM), variational auto-encoders (VAE) and generative adversarial networks (GAN). To learn modulation representations that are distinct and non-redundant, we propose different learning frameworks: an external residual approach, a skip-connection-based approach, and a cost-function-based approach. The methods developed for rate and scale representation learning are benchmarked on an automatic speech recognition (ASR) task under noisy and reverberant conditions. We also show that unsupervised representation learning can be extended to the first stage of processing, namely learning time-frequency representations from raw waveforms.

The second part of the research work deals with supervised representation learning. Here, we propose a two-stage approach that operates on the raw waveform: acoustic filterbank learning (time-frequency representation learning) followed by modulation representation learning. The key novelty of the proposed framework is a relevance weighting mechanism that acts as a feature selection module. It is inspired by gating networks and provides a way to weight the relevance of the acoustic and modulation representations in predicting the target class. For tasks like ASR, the relevance weighting network can also use feedback from the model's previous predictions. The proposed relevance weighting scheme is shown to provide significant performance improvements on an ASR task and on the UrbanSound audio classification task. A detailed analysis yields insights into the properties of the relevance weights that the model captures at the acoustic and modulation stages for speech and audio signals.
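
The gating idea behind relevance weighting can be illustrated with a minimal NumPy sketch. This is not the model from the work itself: the function name `relevance_weight`, the single tanh hidden layer, the sigmoid gate, the random parameters, and all dimensions are assumptions chosen only to show how a small network might assign one weight per sub-band and rescale the features before they reach a classifier.

```python
import numpy as np

def sigmoid(x):
    """Logistic function, mapping scores to weights in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relevance_weight(features, w_hidden, b_hidden, w_out, b_out):
    """Hypothetical gating-style relevance weighting.

    features: array of shape (num_bands, num_frames), e.g. one row per
              learned filterbank channel (an assumed layout).
    Returns the reweighted features and one relevance weight per band.
    """
    # Each band's trajectory over time is mapped to a small hidden vector.
    hidden = np.tanh(features @ w_hidden + b_hidden)   # (num_bands, hidden_dim)
    # A scalar score per band is squashed into a weight between 0 and 1.
    scores = hidden @ w_out + b_out                    # (num_bands,)
    weights = sigmoid(scores)
    # Broadcasting rescales every frame of a band by that band's weight.
    return features * weights[:, None], weights

# Toy input and randomly initialised parameters (illustrative values only).
rng = np.random.default_rng(0)
num_bands, num_frames, hidden_dim = 8, 20, 4
features = rng.standard_normal((num_bands, num_frames))
w_hidden = 0.1 * rng.standard_normal((num_frames, hidden_dim))
b_hidden = np.zeros(hidden_dim)
w_out = rng.standard_normal(hidden_dim)
b_out = 0.0

weighted, weights = relevance_weight(features, w_hidden, b_hidden, w_out, b_out)
print(weighted.shape, weights.shape)
```

In a trained system the gate parameters would be learned jointly with the classifier, so that bands that help predict the target class receive weights close to 1 and uninformative bands are suppressed. The same pattern could in principle be applied a second time to the modulation-filtered features.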