Visual localization is a fundamental problem in 3D computer vision: determining the orientation and position of a camera within a known 3D scene. It underpins applications such as autonomous driving and robotic navigation. A typical localization framework starts from a collection of images and builds a prior 3D map via Structure-from-Motion (SfM). The map is then processed to encode visual cues from the images, yielding a scene representation that can take different forms. Finally, a query image taken in the same scene is localized using the scene representation as a reference. This thesis focuses on developing a practical system that can be used in any localization scenario.
To obtain a high-quality 3D map, we propose a learning-based rotation averaging method for SfM. Because of the constraints imposed by the rotation manifold, traditional approaches commonly start from a spanning-tree-based initialization and then refine it through nonlinear optimization. An initialization corrupted by outliers, however, leads to poorer results. To address this challenge, we propose a novel end-to-end network that combines initialization and refinement via differentiable multi-source propagation. The network produces its initialization by learning to balance the impact of outliers using appearance and geometry cues.
To make the scene representation more widely applicable, we propose SANet, a scene-agnostic network for online systems. Although many learning-based methods achieve high performance, they must be retrained whenever they are adapted to a new scene. Instead of using trainable parameters to represent the scene, SANet efficiently encodes a scene into a representation independent of the network parameters, making it well suited to real-time applications. In particular, we construct a scene pyramid against which any query frame can be registered in a coarse-to-fine manner; the output is a dense scene coordinate map from which the camera pose is subsequently estimated.
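As a point of reference for the rotation averaging discussion above, the traditional spanning-tree initialization can be sketched as follows. This is a minimal illustration of the classical baseline only, not the learned network proposed in the thesis. The function names, the breadth-first traversal, and the convention R_j = R_ij R_i for relative rotations are assumptions made for this example.

```python
import math
from collections import deque

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(a):
    """Transpose of a 3x3 matrix (the inverse, for a rotation)."""
    return [[a[j][i] for j in range(3)] for i in range(3)]


def rot_z(theta):
    """Rotation by theta radians about the z axis (helper for the demo)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def spanning_tree_init(num_cams, rel_rots, root=0):
    """Propagate absolute rotations from the root along a BFS spanning tree.

    rel_rots maps (i, j) to R_ij under the assumed convention R_j = R_ij R_i.
    Cameras not connected to the root are left out of the result.
    """
    adj = {i: [] for i in range(num_cams)}
    for (i, j), r_ij in rel_rots.items():
        adj[i].append((j, r_ij))
        adj[j].append((i, transpose(r_ij)))  # R_ji is the inverse of R_ij
    abs_rots = {root: IDENTITY}
    queue = deque([root])
    while queue:
        i = queue.popleft()
        for j, r_ij in adj[i]:
            if j not in abs_rots:  # the first edge to reach j fixes its rotation
                abs_rots[j] = matmul(r_ij, abs_rots[i])
                queue.append(j)
    return abs_rots


# Three cameras rotated by 0, 0.3 and 0.8 rad about z; edge (2, 1) is stored
# in the reverse direction to exercise the inverse.
edges = {(0, 1): rot_z(0.3), (2, 1): rot_z(-0.5)}
init = spanning_tree_init(3, edges)
```

Because each camera's rotation is fixed by the single tree edge that first reaches it, one outlier edge contaminates every camera downstream of it, which is the sensitivity to incorrect initialization noted above.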
To compress a scene for mobile devices, we present SceneSqueezer, a learning-based compression framework. Compression proceeds hierarchically: frames are first clustered by co-visibility to eliminate redundant ones. Then, a learnable module selects distinctive points in each cluster while taking the final pose estimation accuracy into account. Finally, the selected points are further compressed by quantizing their features.
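The first and last stages of this hierarchy can be illustrated with a small sketch. It is not the SceneSqueezer implementation: the greedy Jaccard-based clustering, the 0.5 threshold, and the uniform 8-bit scalar quantizer are simple stand-ins chosen for this example, and the learned point-selection stage is omitted because it requires a trained model.

```python
def covisibility(points_a, points_b):
    """Jaccard overlap between the sets of 3D point ids seen by two frames."""
    union = points_a | points_b
    return len(points_a & points_b) / len(union) if union else 0.0


def cluster_frames(frame_points, threshold=0.5):
    """Greedy clustering (assumed heuristic): a frame joins the first cluster
    whose seed frame it overlaps by at least `threshold`, else starts a new one."""
    clusters = []  # each cluster is a list of frame ids; element 0 is the seed
    for frame_id, points in frame_points.items():
        for cluster in clusters:
            if covisibility(frame_points[cluster[0]], points) >= threshold:
                cluster.append(frame_id)
                break
        else:
            clusters.append([frame_id])
    return clusters


def quantize(descriptor, bits=8):
    """Uniform scalar quantization of descriptor values clamped to [-1, 1]."""
    levels = (1 << bits) - 1
    return [round((min(max(v, -1.0), 1.0) + 1.0) / 2.0 * levels)
            for v in descriptor]


def dequantize(codes, bits=8):
    """Map integer codes back to approximate values in [-1, 1]."""
    levels = (1 << bits) - 1
    return [c / levels * 2.0 - 1.0 for c in codes]


# Hypothetical toy scene: frames "a" and "b" share most points, "c" shares none.
frames = {"a": {1, 2, 3, 4}, "b": {1, 2, 3, 5}, "c": {10, 11}}
clusters = cluster_frames(frames)

descriptor = [-1.0, 0.25, 1.0, 0.0]
codes = quantize(descriptor)
restored = dequantize(codes)
```

With 8 bits per value, the reconstruction error of each descriptor entry is bounded by half a quantization step, 1/255, while storage drops from a 32-bit float to a single byte.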
Copyright is held by the author(s).
This thesis may be printed or downloaded for non-commercial research and scholarly purposes.
Supervisor or Senior Supervisor
Thesis advisor: Tan, Ping