Bibliographic information
Title / Author: Deep learning-based solutions for empowering visual localization and other vision tasks = 강력한 시각적 위치 파악 및 기타 컴퓨터 비전 문제를 지원하는 딥 러닝 기반 솔루션 / Praveen Kumar Rajendran.
Publication: [Daejeon : Korea Advanced Institute of Science and Technology (KAIST), 2023].

Holdings

Registration number: 8040585
Location: Academic Cultural Complex, B1 preservation stacks
Call number: MPD 23009
Status: Available (not for loan)

Abstract

Visual localization is essential for many applications, including AR/VR, robotics, and self-driving cars. Traditional methods require large amounts of memory and processing resources to estimate the camera position in absolute and relative terms. This has given rise to a new approach of estimating the pose with learning-based methods, i.e., pose regressors. Existing relative camera pose estimation techniques rely solely on balancing the terms of the loss function with manually or automatically tuned hyperparameters. Current absolute pose regressors, on the other hand, generally lack the ability to adapt to different domains of the same scene. This work primarily addresses these two issues. First, estimation of the relative camera position between a pair of images is formulated with a two-stage training strategy that eliminates the need for balancing hyperparameters in the loss function. The proposed training strategy improved translation vector estimation by 16.11%, 28.88%, and 52.27% on the KingsCollege, OldHospital, and StMarysChurch scenes, respectively. To demonstrate texture invariance, we explore the generalization of the proposed method by extending the datasets to different scene styles, generated with Generative Adversarial Networks (GANs), for ablation and qualitative studies. Second, we propose a novel lightweight domain-adaptive training framework for retraining any existing absolute pose regressor (APR) to improve its generalization capability. Our lightweight network outperforms the transformer in translation vector estimation on the visual localization benchmark dataset. The results show that, despite using about 24 times fewer FLOPs, 12 times fewer activations, and five times fewer parameters than the state-of-the-art MS-Transformer, our approach outperforms all CNN-based architectures and achieves performance comparable to transformer-based architectures. Our method ranks 2nd and 4th on the Cambridge Landmarks and 7Scenes datasets, respectively. Moreover, our approach outperforms the MS-Transformer and ranks 1st on unseen domains. Finally, this work explores inverting an APR to synthesize views, similar to NeRF.


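Additional bibliographic information

Call number: MPD 23009
Physical description: vii, 71 p. : illustrations ; 30 cm
Language: English
General notes: Author name in Korean notation: Rajendran Praveen Kumar
Advisor name in English: Dongsoo Har
Advisor name in Korean: 하동수
Including appendix
Thesis note: Thesis (Master's) - Korea Advanced Institute of Science and Technology (KAIST) : Division of Future Vehicle (미래자동차학제전공)
Bibliography note: References: p. 59-69
Subjects: Visual localization
Camera pose
Relative pose estimation
Absolute pose estimation
Domain adaptation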

Table of contents

(A) Relative pose estimation with deep learning (B) Absolute pose estimation with deep learning (C) Forward-pass view synthesis with deep learning

Dense 3-dimensional reconstruction for different scenes in the Cambridge Landmarks dataset, (a) KingsCollege (seq2), (b) OldHospital (seq3), (c) ShopFacade (seq3), and (d) StMarysChurch (seq3), using the COLMAP [81] algorithm.

Qualitative evaluation of epipolar lines for corresponding keypoints of the reference images; corresponding lines are drawn in the same colour. The first column shows the reference images with keypoints. The second, third, and fourth columns show epipolar lines based on the ground-truth pose, SIFT+LMedS, and the proposed RelMobNet, respectively.
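
For readers unfamiliar with the quantity being visualized, the following is a minimal sketch of how an epipolar line can be derived from a relative pose. It assumes normalized image coordinates (camera intrinsics already removed) and the convention X2 = R X1 + t; the helper names are hypothetical and this is not the thesis implementation.

```python
def cross_matrix(t):
    """Skew-symmetric matrix [t]_x such that [t]_x v equals the cross product t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]


def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(a, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def essential_matrix(R, t):
    """E = [t]_x R, assuming points map between cameras as X2 = R X1 + t."""
    return matmul(cross_matrix(t), R)


def epipolar_line(E, x1):
    """Line (a, b, c) in image 2, with a*u + b*v + c = 0, for the
    normalized keypoint x1 = (u, v) observed in image 1."""
    return matvec(E, [x1[0], x1[1], 1.0])


def project(X):
    """Pinhole projection of a 3D point to normalized image coordinates."""
    return (X[0] / X[2], X[1] / X[2])
```

With an accurate pose, the matching keypoint in the second image lies on the returned line, so the distance between keypoint and line gives a visual cue of pose quality.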

RelMobNet: a Siamese convolutional neural network architecture using pre-trained MobileNetV3-Large backbones with adaptive average pooling layers. The outputs of the parallel branches in the Siamese network are concatenated and passed to a pose regressor to estimate a translation and a rotation vector. The adaptive pooling layers are added to handle variable input image sizes.

Cumulative histogram of test errors in rotation and translation for the individual models and the shared model. The larger the area under the curve, the better the estimation. (a) Rotation error using individual models (b) Translation error using individual models (c) Rotation error using the shared model (d) Translation error using the shared model
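
As an illustration of the area-under-the-curve reading, here is a minimal sketch of a cumulative error histogram. The function names and the choice of trapezoidal integration are assumptions made for this example, not details taken from the thesis.

```python
def cumulative_histogram(errors, thresholds):
    """Fraction of test samples whose error is at or below each threshold."""
    n = len(errors)
    if n == 0:
        raise ValueError("errors must not be empty")
    return [sum(1 for e in errors if e <= th) / n for th in thresholds]


def area_under_curve(thresholds, fractions):
    """Trapezoidal area under the cumulative curve; larger means more
    test samples fall under small error thresholds."""
    area = 0.0
    for i in range(1, len(thresholds)):
        width = thresholds[i] - thresholds[i - 1]
        area += 0.5 * (fractions[i] + fractions[i - 1]) * width
    return area
```

Two models evaluated on the same thresholds can then be compared by their areas: the model whose errors concentrate near zero reaches a fraction of 1.0 earlier and accumulates a larger area.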

Box plot depicting the rotation error distribution on the test set. A smaller spread of the box indicates lower variance, and a lower central point of the box (i.e., centroid) indicates lower bias. Low bias and low variance are desired for better results. (a) Shared model's rotation estimation with real data (b) Individual model's rotation estimation with real data (c, e, g) Shared model's rotation estimation with mosaic, udnie and starry sty…

Box plot depicting the translation error distribution on the test set. A smaller spread of the box indicates lower variance, and a lower central point of the box (i.e., centroid) indicates lower bias. Low bias and low variance are desired for better results. (a) Shared model's translation estimation with real data (b) Individual model's translation estimation with real data (c, e, g) Shared model's translation estimation with mosaic, udnie and…

Sample of real data along with style-transferred images generated by a generative adversarial network [53].

Performance comparison of one-stage and two-stage training models, with inference results for the translation and rotation vectors. For "Real Data", both models (one-stage and two-stage) are trained and then tested on the respective real scene images. For the other three styled image sets, however, the real-data models are used for inference. Bold numbers represent occurrences where rotation prediction is better than…

(a, b) Dense and sparse reconstruction of the training sequence; (c, d) dense and sparse reconstruction of the testing sequence, for data collected with a mobile camera and visualized with COLMAP [81].

Overview of the proposed framework. (a) Domain-adaptive training framework: three parallel branches with shared weights are trained on images related to the same pose under different domains. (b) Inference framework.

Left: The D.A-model learns domain-invariant features using a training objective composed of an L2 loss and a Barlow Twins loss L_BT. Additionally, a pose loss L_p is applied to optimize pose predictions. Right: The inference stage is performed with a single-branch model, since the three parallel branches share network weights during training.
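
To make the two feature-level terms concrete, the sketch below implements the standard Barlow Twins redundancy-reduction loss and a plain L2 embedding distance for two batches of embeddings of the same poses under different domains. The off-diagonal weight of 5e-3, the function names, and the pairwise (two-branch) form are assumptions for illustration; the record does not state how the thesis weights these terms or combines them across its three branches with the pose loss.

```python
import math


def standardize(batch):
    """Standardize each feature dimension over the batch (zero mean, unit variance)."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        # Fall back to 1.0 for a constant feature to avoid division by zero.
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n) or 1.0
        for b in range(n):
            out[b][j] = (col[b] - mean) / std
    return out


def barlow_twins_loss(z_a, z_b, off_diag_weight=5e-3):
    """Push the cross-correlation matrix of the two embedding batches
    towards the identity (off_diag_weight is an assumed value)."""
    za, zb = standardize(z_a), standardize(z_b)
    n, d = len(za), len(za[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c_ij = sum(za[b][i] * zb[b][j] for b in range(n)) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2          # invariance term
            else:
                loss += off_diag_weight * c_ij ** 2  # redundancy-reduction term
    return loss


def l2_loss(z_a, z_b):
    """Mean squared Euclidean distance between paired embeddings."""
    n = len(z_a)
    return sum(sum((a - b) ** 2 for a, b in zip(ra, rb))
               for ra, rb in zip(z_a, z_b)) / n
```

Both terms are zero when the two domains yield identical, decorrelated embeddings, which is the sense in which such an objective encourages domain-invariant features.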

Results for the outdoor Cambridge Landmarks dataset. The last column represents the overall rank of the different methods.

Results for the indoor 7Scenes dataset. The last column represents the overall rank of the different methods.

Camera trajectory visualization [19] on the real Cambridge Landmarks and 7Scenes datasets [22, 35]. Each plot shows the camera trajectory, green for the ground truth and red for the prediction. From left to right, the testing trajectories are for the scenes KingsCollege-seq-02, StMarysChurch-seq-13, Office-seq-06, and Heads-seq-01. From the 1st row, it is clear that the D.A-model inherently leads to fewer…

Computational complexity analysis in terms of FLOPs, activations, parameters, and memory. Bold numbers represent the highest efficiency.

Impact of the individual components L_BT, L2, and MHA.

Average of median errors for multiple indoor and outdoor scenes on the domains used for training: real, foggy, and night. Bold numbers represent the best performance. See more details in Appendix 3.7.5.
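
The "average of median errors" statistic can be sketched as follows: take the median error within each scene, then average those medians across scenes. The function name and the dictionary layout are assumptions for illustration, and any numbers passed in would be placeholders rather than results from the thesis.

```python
from statistics import mean, median


def average_of_medians(errors_by_scene):
    """Median error per scene, averaged over all scenes.

    errors_by_scene maps a scene name to a list of per-image errors,
    e.g. translation errors in metres or rotation errors in degrees.
    """
    return mean(median(errs) for errs in errors_by_scene.values())
```

Using the per-scene median first keeps a few badly localized images from dominating a scene's score, and the outer mean then weights every scene equally regardless of its number of test images.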

Average of median errors for multiple indoor and outdoor scenes on domains not used for training: mosaic, udnie, and starry. Bold numbers represent the best performance. See more details in Appendix 3.7.6.

Comparison with AtLoc [100]. Average of median errors for unseen domains with the Stairs scene. Bold numbers represent the best performance. Methods compared: AtLoc, SB-model, and D.A-model.

Visualization of the embeddings produced by the SB-model and D.A-model for the test split of the StMarysChurch scene in 2 dimensions using t-SNE [95, 31] plots. Each subplot represents the latent embedding space of different domains. From the 1st row, it can be seen that the D.A-model produces similar embeddings for seen and unseen domains, while the 2nd row shows that the SB-model produces dissimilar embeddings. From left to right…

Cumulative histogram of prediction errors on the test splits for rotation (deg) and translation (m). The 1st, 2nd, and 3rd columns depict the performance of the D.A-model, SB-model, and MS-Transformer [84], respectively. The accuracy of the estimation increases with the area under the curve.

SB-model camera trajectory visualization on the Cambridge Landmarks and 7Scenes datasets [22, 35]. Each plot shows the camera trajectory (green for the ground truth and red for the prediction). From left to right, the testing sequences are KingsCollege-seq-02, StMarysChurch-seq-13, Office-seq-06, and Heads-seq-01.

Median errors for multiple indoor and outdoor scenes on the domains used for training: Real, Foggy, and Night. Full version of Table 5 (main text).

Median errors for multiple indoor and outdoor scenes on domains not used for training: Mosaic, Starry, and Udnie. Full version of Table 6 (main text). Columns: Mosaic-style, Udnie-style, Starry-style, and Average, each reporting T (m) and R (deg) per method and scene.

Impact of various lightweight CNN backbones on the performance of the proposed domain-adaptive framework. MobileNetV3-Large demonstrates a better trade-off than the other lightweight CNN backbones considered.

Visualization of the embeddings produced by the SB-model and D.A-model for the test splits of the Chess and RedKitchen scenes in 2 dimensions using t-SNE [95, 31] plots. Each subplot represents the latent embedding space of different domains. From the 1st row, it can be seen that the D.A-model produces similar embeddings for seen and unseen domains, while the 2nd row shows that the SB-model produces dissimilar embeddings. From left…

Samples of GAN-generated images. (a) Real image (b) Foggy image (c) Night image (d) Mosaic-style image (e) Udnie-style image (f) Starry-style image

The figure shows a neural network of x-y mapping for view synthesis with a NeRF structure using a simple forward pass.

View synthesis results of our simple forward pass on the training set of the ShopFacade scene.

View synthesis results of our simple forward pass on the test set of the ShopFacade scene.

View synthesis results of our simple forward pass on the training set of the OldHospital scene.

View synthesis results of our simple forward pass on the test set of the OldHospital scene.