Dense captioning objects in 3D environments using natural language

Resource type
Thesis
Thesis type
M.Sc.
Date created
2022-04-13
Authors/Contributors
Author: Gholami, Ali
Abstract
This thesis tackles the problem of dense captioning of objects in 3D environments: given a 3D environment, the aim is to first detect objects and then describe them in natural language. Many prior works have addressed image captioning as well as dense captioning in 3D environments, but none has thoroughly investigated and compared the quality of the generated captions along aspects such as the choice of visual input. In this thesis, we first introduce a 3D dense captioning pipeline and then show how it compares against prior work. Our investigations show that captioning objects in 3D yields higher-quality captions than captioning with 2D visual inputs. We further show that simple modifications to the type of visual input (e.g., adding depth to 2D single-view images) and a careful choice of optimization settings (e.g., optimizer, learning rate, and end-to-end training) can drastically improve 2D captioning performance, even outperforming 3D captioning on some evaluation metrics.
Document
Extent
62 pages.
Identifier
etd21920
Copyright statement
Copyright is held by the author(s).
Permissions
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Supervisor or Senior Supervisor
Thesis advisor: Chang, Angel
Language
English
Download file
etd21920.pdf (12.72 MB)