Zhang, Yiming

Resource type

Thesis

Thesis type

(Thesis) M.Sc.

Date created

2023-12-07

Authors/Contributors

Author: Zhang, Yiming

Abstract

We introduce the task of localizing a flexible number of objects in real-world 3D scenes using natural language descriptions. Existing 3D visual grounding tasks focus on localizing a unique object given a text description. However, such a strict setting is unnatural, as localizing potentially multiple objects is a common need in real-world scenarios and robotic tasks (e.g., visual navigation and object rearrangement). To address this setting, we propose Multi3DRefer, which generalizes the ScanRefer dataset and task. Our dataset contains 61,926 descriptions of 11,609 objects, where each description references zero, one, or multiple target objects. We also introduce a new evaluation metric and benchmark methods from prior work to enable further investigation of multi-modal 3D scene understanding. Furthermore, we develop a better baseline that leverages 2D features from CLIP by rendering object proposals online and using contrastive learning; this baseline outperforms the state of the art on the ScanRefer benchmark.

Extent

41 pages.

Identifier

etd22858

Copyright statement

Copyright is held by the author(s).

Permissions

This thesis may be printed or downloaded for non-commercial research and scholarly purposes.

Supervisor or Senior Supervisor

Thesis advisor: Chang, Angel

Language

English

Member of collection

Computing Science Theses

Download file	Size
etd22858.pdf	20.65 MB

Multi3DRefer: Grounding text description to multiple 3D objects
