Resource type
Thesis type
(Thesis) M.Sc.
Date created
2022-04-13
Authors/Contributors
Author: Gholami, Ali
Abstract
This thesis tackles the problem of dense captioning of objects in 3D environments. In this task, we are given a 3D environment, and our aim is to first detect objects and then describe them in natural language. Many prior works have addressed image captioning as well as dense captioning in 3D environments. However, no prior work has thoroughly investigated and compared the quality of generated captions with respect to factors such as the choice of visual input. In this thesis, we first introduce a 3D dense captioning pipeline and then show how it compares against prior work. Our investigations show that captioning objects in 3D leads to higher-quality captions than captioning with 2D visual inputs. We further show that simple modifications to the type of visual input (e.g., adding depth to 2D single-view images) and a careful choice of optimization settings (e.g., optimizer, learning rate, and end-to-end training) can drastically improve the performance of 2D captioning, to the point that it outperforms 3D captioning on some evaluation metrics.
Document
Extent
62 pages.
Identifier
etd21920
Copyright statement
Copyright is held by the author(s).
Supervisor or Senior Supervisor
Thesis advisor: Chang, Angel
Language
English
Member of collection
Download file | Size |
---|---|
etd21920.pdf | 12.72 MB |