Ideli, Elham

Resource type

Thesis

Thesis type

(Thesis) M.A.Sc.

Date created

2019-10-02

Authors/Contributors

Author (aut): Ideli, Elham

Abstract

Speech separation is the task of segregating a target speech signal from background interference. Separating multiple speech sources from one another is commonly called Speaker Separation, while separating speech from non-speech noise is called Speech Enhancement (or Denoising). Speech separation can benefit from exploiting multiple modalities, i.e., audio and visual, and from directional information when multiple microphones are available. A new approach is developed for subject-independent speaker separation that incorporates spectral, spatial and visual features. The magnitude and phase of the audio signals are modified in the frequency domain to perform the separation. The key idea is to estimate the target magnitudes from the audio with the Permutation Invariant Training (PIT) technique and then refine these estimates using both visual and spatial audio features. Specifically, visual features are matched to the corresponding audio, and spatial audio features are used as side information and shown to provide a drastic improvement in magnitude and phase estimation in terms of output speech intelligibility, quality, and separation performance. Visual information also provides improvements. Hence, both visual and spatial features are shown to be useful for speaker separation. A monaural speech enhancement model that incorporates both audio and visual information is also developed. In contrast to the audio-visual speaker separation model, the audio-visual speech enhancement model operates in the time domain. Hence, there is no need for a transformation or for separate models to estimate the magnitude and phase spectra. According to the results of the objective evaluations, exploiting visual information for enhancement applications improves performance in terms of both output quality and intelligibility.

Keywords

Identifier

etd20557

Copyright statement

Copyright is held by the author.

Permissions

This thesis may be printed or downloaded for non-commercial research and scholarly purposes.

Scholarly level

Graduate student (Masters)

Supervisor or Senior Supervisor

Thesis advisor (ths): Bajic, Ivan

Member of collection

Engineering Science Theses

Download file	Size
etd20557.pdf	3.56 MB

Audio-visual speech processing using deep learning techniques
