Audio-visual speech processing using deep learning techniques

Resource type
Thesis type
(Thesis) M.A.Sc.
Date created
2019-10-02
Authors/Contributors
Author (aut): Ideli, Elham
Abstract
Speech separation is the task of segregating a target speech signal from background interference. Two terms are commonly used to distinguish its variants: Speaker Separation for separating multiple speech sources, and Speech Enhancement (or Denoising) for separating speech from non-speech noise. Speech separation can benefit from exploiting multiple modalities, i.e., audio and visual, as well as directional information when multiple microphones are available. A new approach is developed for subject-independent speaker separation by incorporating spectral, spatial and visual features. Separation is performed by modifying the magnitude and phase of the audio signals in the frequency domain. The key idea is to estimate the target magnitudes from the audio with the Permutation Invariant Training (PIT) technique and then refine these estimates using both visual and spatial audio features. Specifically, visual features are matched to the corresponding audio, while spatial audio features are used as side information. The spatial features are shown to drastically improve magnitude and phase estimation in terms of output speech intelligibility, quality, and separation performance. Visual information also provides improvements, so both visual and spatial features are shown to be useful for speaker separation. A monaural speech enhancement model that incorporates both audio and visual information is also developed. In contrast to the audio-visual speaker separation model, the audio-visual speech enhancement model operates in the time domain, so it needs neither a transformation to the frequency domain nor separate models for estimating the magnitude and phase spectra. Objective evaluations show that exploiting visual information for enhancement improves both output quality and intelligibility.
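The abstract names Permutation Invariant Training (PIT) without describing it. As a rough illustration only, the sketch below shows the general idea behind a PIT-style loss: the error is computed for every assignment of estimated sources to reference sources, and the smallest one is kept, so the order in which a model emits the speakers does not matter. This is not the thesis's implementation. The mean-squared-error criterion on magnitude spectrograms, the exhaustive search over permutations, the function names `mse` and `pit_loss`, and the toy values are all assumptions made for this example.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equally sized 2-D magnitude arrays."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def pit_loss(estimates, targets):
    """Return (minimum mean loss, best permutation).

    best_perm[i] is the index of the target assigned to estimates[i].
    Hypothetical helper: assumes an MSE criterion and an exhaustive search.
    """
    n = len(estimates)
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        # Average the per-source error under this estimate-to-target assignment.
        loss = sum(mse(estimates[i], targets[perm[i]]) for i in range(n)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy data (made-up values): two speakers, 2 frames x 3 frequency bins each.
targets = [
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]],
]
# The estimates contain the same two speakers, but in swapped order.
estimates = [
    [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]],
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
]

loss, perm = pit_loss(estimates, targets)
```

The exhaustive search grows factorially with the number of speakers, which is usually acceptable for two or three sources; whether the thesis uses this or another assignment strategy is not stated in the abstract.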
Document
Identifier
etd20557
Copyright statement
Copyright is held by the author.
Permissions
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Scholarly level
Supervisor or Senior Supervisor
Thesis advisor (ths): Bajic, Ivan
Member of collection
Download file Size
etd20557.pdf 3.56 MB

Views & downloads - as of June 2023

Views: 34
Downloads: 2