Deep learning-based multimedia content processing

Date created: 
Multimedia content processing
Machine and deep learning
Music information retrieval
Learned image compression
Entropy coding

In the last few years, deep learning has revolutionized many applications in multimedia content processing, including music information retrieval (MIR) and image compression, both of which are addressed in this thesis. To handle challenges in acoustic-based MIR tasks such as automatic music transcription, video of the musical performance can be utilized. In Chapter 2, a new learning-based system for visually transcribing piano music using convolutional neural networks and support vector machines is presented; it achieves an average improvement of ~0.37 in F1 score over previous works. Another significant problem in MIR is music generation. In Chapter 3, a semi-recurrent hybrid model combining a variational auto-encoder and a generative adversarial network for sequential generation of piano music is introduced; it achieves better results than previous methods. Auto-encoders are also a natural candidate for learned image compression, which has recently shown the potential to outperform standard codecs. Some efforts have also been made to integrate other computer vision tasks with image compression to improve compression performance. In Chapter 4, a semantic segmentation-based layered image compression method is presented in which the segmentation map of the input is used in the compression procedure. Most learned image compression methods train multiple models for multiple bit rates, which increases implementation complexity. In Chapter 5, we propose a variable-rate image compression model employing two novel loss functions and residual sub-networks in the auto-encoder. The proposed method outperforms standard codecs as well as previous learned variable-rate methods on the Kodak image set. State-of-the-art image compression has been achieved by utilizing joint hyper-prior and auto-regressive models. However, these models suffer from spatial redundancy of the low-frequency information in the latents.
In Chapter 6, we propose the first learned multi-frequency image compression approach, which uses the recently developed octave convolutions to factorize the latents into high- and low-frequency components. Because the low-frequency component is represented at a lower resolution, its spatial redundancy is reduced, which improves the compression rate. Our experiments show that the proposed scheme outperforms all standard codecs and learning-based methods in both PSNR and MS-SSIM metrics, and establishes a new state of the art for learned image compression on the Kodak image set.
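
As a rough illustration of why a lower-resolution low-frequency component reduces redundancy, the sketch below counts the latent elements that would need to be coded. It assumes the low-frequency channels are stored at half the spatial resolution in each dimension, as octave convolutions are usually described. The function name `latent_elements`, the channel split ratio `alpha`, and the example sizes are hypothetical and are not taken from the thesis.

```python
def latent_elements(channels, height, width, alpha):
    """Count latent elements when a fraction `alpha` of the channels is low-frequency.

    Assumption: low-frequency channels are kept at half resolution
    (height // 2 by width // 2); high-frequency channels stay at full resolution.
    """
    low_channels = int(round(alpha * channels))
    high_channels = channels - low_channels
    high = high_channels * height * width
    low = low_channels * (height // 2) * (width // 2)
    return high + low


# Hypothetical latent size: 192 channels on a 16 x 16 grid.
full = latent_elements(192, 16, 16, alpha=0.0)   # no frequency split
split = latent_elements(192, 16, 16, alpha=0.5)  # half the channels are low-frequency
print(full, split, split / full)
```

Under these assumed sizes, an even channel split leaves 62.5% of the original number of latent elements. The actual bit-rate saving depends on the learned entropy model, which this count does not capture.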

Document type: 
This thesis may be printed or downloaded for non-commercial research and scholarly purposes. Copyright remains with the author.
Jie Liang
Applied Sciences: School of Engineering Science
Thesis type: 
(Thesis) Ph.D.