Video coding for humans and machines

Resource type
Thesis type: (Thesis) Ph.D.
Date created
Author: Choi, Hyomin
Due to the remarkable evolution of machine learning models in recent years, computers are now "looking into" images and video to understand scenes on behalf of humans. Most systems that handle video signals require compression because of the tremendous volume of data that must be transmitted or stored. However, compression algorithms have traditionally been tailored only to the human visual system. This thesis aims to establish a novel video coding framework optimized for both human and machine vision in a scalable manner. Our work thus makes it possible to code compressed representations of visual signals selectively and progressively, as needed for human or machine vision. Specifically, our intra coding, built upon a deep neural network (DNN), transforms an input frame into a latent space in such a way that task-specific information is steered into designated portions of the latent space during training. Hence, part of the latent space supports machine vision, while the remainder is needed only for more demanding tasks such as input reconstruction. Furthermore, as is usual in video coding, we exploit temporal redundancy for inter-frame coding. In this regard, we propose a DNN-based frame prediction method that uses two previously decoded frames as input to generate an intermediate virtual reference frame. Specifically, we introduce affine transformation-based spatially-varying filters to apply the learned interpolation filters at adaptively chosen locations. The resulting coding system can reduce the bitrate by ignoring the parts of the compressed representation that are irrelevant to a given machine vision task, while supporting multiple tasks with marginal performance degradation. Still, human viewing is enabled by decoding the entire bitstream. Our multi-task DNN for intra coding supports object detection, segmentation, and input reconstruction, so we evaluate task-specific performance against relevant benchmarks. The DNN for frame prediction is also examined by integrating it with conventional inter-frame coding. Lastly, we integrate all of the above ideas into the HEVC video coding pipeline to evaluate the overall system performance on various tasks in terms of rate-distortion and rate-accuracy.
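The scalable use of the latent space described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the two-way split into a base and an enhancement portion, the function names `split_latent` and `channels_for_task`, the parameter `num_base`, and the representation of the latent as a plain list of channels are all assumptions made for illustration. It shows only the idea that machine vision tasks read a designated subset of the latent channels, while input reconstruction reads all of them.

```python
from typing import List, Tuple

Channel = List[float]  # hypothetical stand-in for one latent feature map


def split_latent(latent: List[Channel], num_base: int) -> Tuple[List[Channel], List[Channel]]:
    """Split latent channels into a base part and an enhancement part.

    Assumption: the first `num_base` channels carry the information needed
    for machine vision, and the remaining channels are needed only for
    input reconstruction.
    """
    if not 0 <= num_base <= len(latent):
        raise ValueError("num_base must lie between 0 and the channel count")
    return latent[:num_base], latent[num_base:]


def channels_for_task(latent: List[Channel], task: str, num_base: int) -> List[Channel]:
    """Return the latent channels that would be decoded for a given task."""
    base, enhancement = split_latent(latent, num_base)
    if task in ("detection", "segmentation"):
        # Machine vision: only the base portion is decoded.
        return base
    if task == "reconstruction":
        # Human viewing: the entire latent (whole bitstream) is decoded.
        return base + enhancement
    raise ValueError(f"unknown task: {task}")


# Toy latent with four single-value channels.
latent = [[0.1], [0.2], [0.3], [0.4]]
machine_part = channels_for_task(latent, "detection", num_base=2)
full_part = channels_for_task(latent, "reconstruction", num_base=2)
```

In this toy setting a detection decoder would read two of the four channels, which is where the bitrate saving for machine vision would come from; the actual channel allocation and number of layers are determined by the trained DNN and are not specified in the abstract.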
109 pages.
Copyright statement
Copyright is held by the author(s).
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Supervisor or Senior Supervisor
Thesis advisor: Bajic, Ivan
Member of collection
Attachment Size
etd21915.pdf 19.24 MB