Bibliographic Information
Towards human-level domain adaptation for scene understanding = 장면이해를 위한 인간 수준의 도메인 적응 방법론
Title / Author: Towards human-level domain adaptation for scene understanding = 장면이해를 위한 인간 수준의 도메인 적응 방법론 / Inkyu Shin.
Publication: [Daejeon : Korea Advanced Institute of Science and Technology (KAIST), 2024].

Holdings Information

Registration No.: 8042498

Location / Call Number: Academic Cultural Complex (Library), 2F, Theses: DPD 24002

Status: Available (not for loan)

Abstract

The human visual system analyzes vision data to create meaningful representations, enabling the performance of various tasks. Remarkably, it can autonomously discern and learn from unseen data by analyzing their patterns and distribution (unsupervised offline adaptation). Furthermore, it demonstrates robust adaptability to data arriving in real time during inference (online adaptation). This adaptability significantly enhances the generalizability and effectiveness of the human visual system in diverse scenarios. In this thesis, we propose applying these two data-centric adaptation methods to machine vision systems, which are currently vulnerable to changes in data distribution, with the aim of achieving domain-adaptive, cost-effective, human-level computer vision. Below is a summary of how this approach is developed. First, in Chapter 2, we present our pursuit of data-centric unsupervised domain adaptation (UDA) in machine vision. Our research identifies the crucial role of effectively acquiring and utilizing model outputs, such as pseudo-labels, from unseen target data to enhance adaptation. To this end, we propose a methodology that scales up the data pseudo-labels by meticulously analyzing the patterns and relationships within the pixel outputs of the data. Furthermore, we demonstrate that our approach significantly improves adaptability at both the image and video levels. This is achieved by implementing spatial and temporal scaling strategies, respectively, allowing for more nuanced and effective adaptation across diverse visual contexts. In Chapter 3, our empirical studies reveal that unsupervised adaptation, conducted without any real target-data labels as in Chapter 2, is inherently limited and cannot match the performance of a fully supervised model.
While cost-effective, this adaptation approach yields a model whose performance gap relative to its supervised counterpart prevents practical deployment. Addressing this challenge, we introduce a novel human-in-the-loop active domain adaptation method (Active DA). This method strategically determines areas for labeling within the target data, guided by the model's analysis of the target data. Our findings indicate that labeling a mere 2% of pixels in each image can approximate the performance of a supervised model. Additionally, we propose a technique for selecting representative points within this 2% threshold (e.g., 40 points per image), demonstrating that this selective approach still yields results comparable to the supervised models without severe performance degradation. In Chapter 4, we delve into the realm of online adaptation, a pivotal element in our pursuit of human-level adaptability in machine learning models. Online adaptation is characterized by the model's capacity for bidirectional inference and learning, utilizing target test data in real time (Test-time DA). This approach necessitates more meticulous analysis of each data sample, as the model aims to adapt by observing only the current batch or even a single sample. To enhance the model's self-supervision on an individual-sample basis, we propose two innovative methods. The first method focuses on generating improved pseudo-labels through the integration and aggregation of multi-modal sensor data. Our findings reveal that the bidirectional interplay between modalities significantly enhances the quality of pseudo-labels, thereby bolstering the model's adaptability at test time. For scenarios lacking multi-modal data, and consequently accurate pseudo-labels, we introduce a second method. This approach involves a straightforward yet effective self-supervision technique, which we term 'masking and reconstruction'.
This method adeptly exploits the inherent structure and correlations within the data, leading to a substantial improvement in the model's performance during test-time adaptation. These methodologies underscore our commitment to advancing the frontiers of online adaptation, ensuring our models remain robust and effective across various tasks. In Chapter 5, we culminate our exploration with a comprehensive framework for unified domain adaptation (UnDA), aimed at attaining human-level adaptability in machine learning. This chapter commences with a series of supplementary experiments designed to extend the UDA methodology introduced in Chapter 2 to test-time training and, conversely, to incorporate the proposed test-time adaptation (TTA) strategies into the offline training phase. Our empirical evaluations reveal a notable compatibility and synergy between our UDA and TTA approaches. Further, this chapter ventures into integrating active adaptation strategies to augment the efficacy of our unified domain adaptation framework. A critical challenge emerges in incorporating a human-in-the-loop active adaptation system within this unified framework, since we assume human labeling is infeasible in online scenarios. To navigate this obstacle, we leverage a pre-trained, domain-generalized foundation model. This model serves as a surrogate for human-guided labeling, offering robust masking capabilities that are invariant to domain shifts. We demonstrate that pseudo-labels, meticulously refined through both training and test phases under the guidance of the mask from the foundation model, exhibit marked improvements. This innovative approach to pseudo-label generation and refinement facilitates a more potent and effective unified adaptation, seamlessly bridging the gap between training and test phases.
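The chapters summarized above repeatedly build on confidence-filtered pseudo-labels for self-training on unlabeled target data. A minimal NumPy sketch of that common primitive follows; the function name, the 0.9 threshold, and the -1 ignore index are illustrative assumptions, not the thesis's implementation:

```python
import numpy as np

def generate_pseudo_labels(probs, threshold=0.9):
    """Confidence-thresholded pseudo-labels from per-pixel class probabilities.

    probs: (H, W, C) softmax outputs on an unlabeled target image.
    Returns an (H, W) label map where low-confidence pixels are marked -1
    (ignored during self-training).
    """
    conf = probs.max(axis=-1)          # per-pixel confidence
    labels = probs.argmax(axis=-1)     # per-pixel predicted class
    labels[conf < threshold] = -1      # keep only confident predictions
    return labels
```

Raising the threshold trades pseudo-label coverage for accuracy; Chapter 2's densification strategies address exactly the sparsity this filtering introduces.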


Additional Bibliographic Information
Call Number: DPD 24002
Physical Description: xi, 100 p. : illustrations ; 30 cm
Language: English
General Note: Author's name in Korean: 신인규
Advisor (English): Kuk-Jin Yoon
Advisor (Korean): 윤국진
Co-advisor (English): In-So Kweon
Co-advisor (Korean): 권인소
Includes appendix
Thesis Note: Thesis (Ph.D.) - Korea Advanced Institute of Science and Technology (KAIST): Interdisciplinary Program in Future Vehicle
Bibliography Note: References: p. 90-99
Subjects: Unsupervised Domain Adaptation
Active Domain Adaptation
Test-time Adaptation
Unified Domain Adaptation
Human-level Adaptation

Contents

Generalizability of human vision in scene understanding.

Overview of Thesis: Towards Human-level Adaptation.

Overview of the Chapter 2 framework. We explore how to provide data-centric adaptation by scaling up the pseudo labels from model output on new unseen target data.

The overview of the proposed two-phase pseudo-label densification framework. (a) The first phase utilizes sliding-window-based voting, which propagates confident neighbor predictions to fill in the unlabeled pixels. We use Csti to train the model in the first phase. (b) The second phase employs confidence-based easy-hard classification (EH class.) along with the hard-to-easy adversarial l

The overall procedure of the voting-based densification. We describe the process in three steps: 1) we find the top two competing classes at the unlabeled pixel; 2) we pool neighboring confident values for these classes; 3) we combine the original prediction values and the pooled values (weighted sum with hyperparameter a). We pick the bigger one and assign the corresponding class if it passes the th
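The three steps in this caption can be sketched for a single unlabeled pixel as follows. This is a schematic reading of the caption only: the names (`vote_pixel`, `alpha`, `thresh`, `win`), the mean pooling, and the default values are illustrative assumptions, not the thesis's exact pooling or thresholds:

```python
import numpy as np

def vote_pixel(probs, y, x, alpha=0.5, thresh=0.6, win=1):
    """Schematic voting-based densification for one unlabeled pixel.

    probs: (H, W, C) softmax map. Returns the voted class, or -1 if the
    fused score does not pass the threshold.
    """
    H, W, C = probs.shape
    p = probs[y, x]
    c1, c2 = np.argsort(p)[-2:][::-1]              # 1) top-two competing classes
    y0, y1 = max(0, y - win), min(H, y + win + 1)  # 2) pool neighboring values
    x0, x1 = max(0, x - win), min(W, x + win + 1)
    pooled = probs[y0:y1, x0:x1].reshape(-1, C).mean(axis=0)
    fused = {c: alpha * p[c] + (1 - alpha) * pooled[c] for c in (c1, c2)}
    best = max(fused, key=fused.get)               # 3) weighted sum, pick bigger
    return int(best) if fused[best] >= thresh else -1
```

In this sketch, a pixel whose own prediction is ambiguous can still be labeled when its neighborhood is confidently consistent, which is the densifying effect the figure illustrates.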

Voting-based densification results by iteration. The initially sparse pseudo label becomes dense as the iteration number increases, though it may bring noisy predictions. We set the total iteration number to 3 after conducting the parameter analysis in Table 2.5.

Qualitative easy and hard samples. For the illustration, we randomly selected three samples from each. Note that easy samples are close to the ground truth with low entropy values, whereas hard samples are far from the ground truth and have high entropy values. Therefore, in the second phase, we train easy samples with their full pseudo labels and make hard samples easy using the adversarial loss.

Illustration of clip-kMaX. The proposed clip-kMaX seamlessly converts the image-level segmentation model kMaX-DeepLab to clip-level without adding an extra module. Motivated by the k-means perspective, clip-kMaX considers one object query as one cluster center, which learns to group together pixels of the same object within a clip. Specifically, each object query, when multiplied with the clip featu

Step-by-step overview of the Location-Aware Memory Buffer (LA-MB). The LA-MB approach consists of two phases: encoding and decoding. In the encoding phase, appearance and location features of detected objects are stored in the memory buffer. In the decoding phase, LA-MB performs hierarchical matching, beginning with video stitching for short-term association in overlapping frames between clips, fol

KITTI-STEP val set

KITTI-STEP test set

Visualization results on the KITTI-STEP val set. The proposed within-clip segmenter, clip-kMaX, segments objects in a clip better than the state-of-the-art TubeFormer ((a) vs. (b)). In (c), the proposed cross-clip associater, Video-kMaX (Location-Aware Memory Buffer), associates occluded objects better than the baseline naive-MB, which exploits only appearance features.

Experiments on unseen video. We trained our Video-kMaX on the GTA5 video dataset and tested the model on Cityscapes video. As shown on the right, even this strong model cannot handle the domain shift, as we observe both pixel-level inaccuracy and temporal inconsistency.

The overview of the proposed unsupervised domain adaptation for video semantic segmentation (Video DA). Our Video DA consists of a two-phase, video-specific domain adaptation training: Video Adversarial Training (VAT) and Video Self-Training (VST). In the first phase, VAT, we stack two neighboring outputs from the two domains (U_s(t, t-τ) for source and U_r(t, t-τ) for target) and utilize a sequence discriminator to a

Visualization of pseudo-labels w/o and w/ the online refinement. The proposed online refinement successfully eliminates noise in the pseudo labels by checking temporal consensus.

Experimental results on GTA5 → Cityscapes. 'V' and 'R' denote VGG-16 and ResNet-101, respectively. We highlight the rare classes [80] and compute rare-class mIoU (R-mIoU) as well.

Experimental results on SYNTHIA → Cityscapes. mIoU* is computed with 13 classes out of the total 16 classes, excluding the classes marked with *.

Qualitative results on GTA5 → Cityscapes. We can clearly see that our full model generates the most visually pleasing results.

Performance improvements in mIoU of integrating our TPLD with existing self-training adaptation approaches. We use the Deeplabv2-R segmentation model.

Framework design choices

Voting field / number

Weighted-sum hyperparameter a

Results of ablation studies.

A contrastive analysis with and without hard-sample training (Eq. (2.8) + Eq. (2.9)). (a): target image, (b): ground truth, (c): prediction result without hard-sample training, (d): prediction result with hard-sample training. We map the high-dimensional features of (c) and (d) to the 2-D space features of (e) and (f), respectively, using t-SNE [133].

mIoU value is plotted per round on SYNTHIA → Cityscapes.

Confidence score ablation. We compare ours with entropy [136].

Additional qualitative results of pseudo-label densification.

Additional qualitative results of the two-phase densification applied to CRST (MRKLD).

Qualitative results of TPLD on CBST / LRENT / MRKLD.

Qualitative comparison with the baselines. We indicate yellow and red boxes for inaccurate and inconsistent prediction, respectively. We can see that previous approaches suffer from both wrong and temporally inconsistent prediction. Instead, our framework successfully resolves both issues. Best viewed in color.

Image semantic segmentation results (mIoU).

Video semantic segmentation results (VPQ).

Ablation study on video adversarial training. We empirically verify the effectiveness of proposed tube matching loss and sequence discriminator.

Ablation study on video self-training. 'Agg.', 'Reg.' and 'Ref.' denote temporally aggregated prediction, regularization and online pseudo-label refinement, respectively.

Importance of video adversarial training. We run our full VST phase on the different adversarial models.


Visual comparisons on pseudo labels. We can clearly observe that the proposed method generates more accurate and consistent pseudo labels than the baselines.

Pseudo labels generated from different adversarial models. It is obvious that the quality of the generated pseudo labels depends on the pre-trained adversarial model, since ours with the VAT model shows better visual quality on pseudo labels than ours with the IAT model.

Comparative performance on online refinement. We experiment with different online refinement methods on top of the proposed VAT and VST.

Visualization of pseudo-labels w/o and w/ the online refinements. The proposed cut-out refinement successfully eliminates noise in the pseudo labels by checking temporal consensus. However, the fill-in-based method introduces additional noise in the pseudo labels.

Visualization of cause and effect in our failure case. Our model is comparatively weak on certain classes, which could originate from the pseudo-label generation process. A detailed analysis is in Sec. 2.4.8.

Despite considerable efforts in the development of unsupervised adaptation techniques, there remains a notable disparity in performance when compared to supervised models. This performance gap significantly impedes the practical application and industry-wide adoption of unsupervised adaptation methods.

Overview of Chapter 3. We propose human-aligned adaptation for better data-centric adaptation with an active labeling strategy.

Average pixel labels per image vs. performance. Our novel human-in-the-loop framework, LabOR (PPL and SPL), significantly outperforms not only previous UDA state-of-the-art models (e.g., IAST [83]) but also the DA model with few labels (e.g., WDA [96]). Note that our PPL requires a negligible number of labels to achieve such performance improvements (25 labeled points per image), and our SPL shows the performanc

The overview of the proposed adaptive pixel-basis labeling, LabOR. This framework is made up of two models: a UDA model and a pixel selector model. The UDA model, initially trained with conventional adversarial learning, forwards the target image to generate a pseudo label. Different from the normal self-training scheme [83] that utilizes the generated label to retrain the model directly, we instead trai

Experimental results on GTA5 → Cityscapes. While our PPL method already surpasses previous UDA state-of-the-art models (e.g., IAST [83]) and the DA model with few labels (e.g., WDA [96]) by leveraging only around 40 labeled points per image, our SPL method shows performance comparable to fully supervised learning (only a 0.1% mIoU gap).
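As a rough illustration of point-based labeling at this budget, a naive uncertainty criterion can be sketched as below. This is in the spirit of the ENT/SCONF-style baselines these figures compare against, not LabOR's actual selector; the function name and k=40 default are illustrative:

```python
import numpy as np

def topk_uncertain_points(probs, k=40):
    """Select the k most uncertain pixels of one image by prediction entropy.

    probs: (H, W, C) per-pixel class probabilities.
    Returns a (k, 2) array of (row, col) coordinates to send for labeling.
    """
    eps = 1e-12
    ent = -(probs * np.log(probs + eps)).sum(axis=-1)   # per-pixel entropy
    flat = np.argsort(ent.ravel())[::-1][:k]            # k most uncertain pixels
    return np.stack(np.unravel_index(flat, ent.shape), axis=1)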

Qualitative results of our SPL. While the state-of-the-art UDA method, i.e., IAST [83], and a naive region-labeling baseline, SCONF, show erroneous segmentation results, the proposed method, SPL, shows correct segmentation similar to the fully supervised approach.

Qualitative results of our PPL. While the state-of-the-art UDA method, i.e., IAST [83], and a naive region-labeling baseline, SCONF, show erroneous segmentation results, the proposed method, PPL, shows correct segmentation similar to the fully supervised approach.

The performance of (a) segment-based and (b) point-based pixel labeling strategies. (a) Our method, SPL, significantly outperforms all the methods among the uncertainty metrics, and shows performance comparable to that of the fully supervised training method at the final stage. (b) Among the point-based strategies, our final model, PPL-Sim (best), shows the best performance.

Visualization of the generated regions to label. Compared to the simple ENT baseline, our SPL and PPL select more diverse points to label.

Effect of self-training entropy regularization [83] on SPL and PPL. While the entropy regularization does not improve the performance of our SPL, adding entropy regularization to our PPL slightly improves the performance.

Effect of pseudo label generation on SPL accuracy and label ratio. The existing pseudo label thresholding techniques from CBST [165] and IAST [83] do not improve the performance of our SPL.

Effect of different ensemble-based methods for generating pseudo labels. We show in this table that our chosen method, MCD, is a better choice than the temporal ensemble method for the purposes of LabOR.

Time taken to label the entire Cityscapes dataset by labeling type. The table shows the drastic differences in time required between full-image labeling and our PPL and SPL methods. SPL already reduces the time taken by over a fourth, but PPL reduces it even further.

Experimental results on Synthia → Cityscapes. We show that even on Synthia → Cityscapes, both of our methods, SPL and PPL, outperform all previous UDA state-of-the-art models in addition to the WDA method. The second-to-last column shows the average mIoU of all 16 classes, while the last column with the asterisk denotes the average mIoU of 13 classes, which excludes 'Wall' and 'Pole'.

Performance of both of our models, SPL and PPL, on Synthia → Cityscapes at each stage. We include in the table the label ratio or points used at each stage.

Diversity of pixel classes selected by ours (SPL) and entropy. We show in (a) that SPL has a much more even distribution over many of the classes, while entropy (b) has many classes that are rarely selected. We mark in red boxes the 'rare' classes, i.e., classes with mIoU under 60%. SPL is shown to pick the rare classes more consistently, as indicated by the higher mean and lower standard deviation.

The labeling cost graph with respect to the number of pixels. This graph shows the time required in relation to the number of pixels labeled.

Overview of Test-time adaptation

We propose a Multi-Modal Test-Time Adaptation (MM-TTA) framework that enables a model to be quickly adapted to multi-modal test data without access to the source-domain training data. We introduce two modules: 1) Intra-PG to produce reliable pseudo labels within each modality via updating two models (batch-norm statistics) at different paces, i.e., slow and fast updating schemes with a momentum, a

Overview of our Test-Time Training methodology. We adapt the encoder to a single out-of-distribution (OOD) test sample online by updating its weights using a self-supervised reconstruction task. We then use the updated weights to make a prediction on the test sample. To enable this approach, the encoder, decoder, and the classifier are co-trained in the classification and reconstruction tasks [93

Overview of the proposed Multi-Modal Test-Time Adaptation (MM-TTA) framework. Our MM-TTA consists of two modules: Intra-modal Pseudo-label Generation (Intra-PG) and Inter-modal Pseudo-label Refinement (Inter-PR). For Intra-PG, we adopt a slowly-updated model S that is gradually updated by a fast-updated model S with a momentum. Note that statistics in the fast-updated model S are directly update

Overview of our 3D Test-Time Training methodology. We build on top of PointMAE. The input point cloud is first tokenized and then randomly masked. For our setup, we mask 90% of the point cloud. For joint training, the visible tokens from the training data are fed to the encoder to get the latent embeddings from the visible tokens. These embeddings are fed to the classification head for the classificati
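The 90% random-masking step mentioned here can be sketched as follows; `random_mask` is an illustrative helper and the actual PointMAE tokenization and reconstruction heads are omitted:

```python
import numpy as np

def random_mask(tokens, ratio=0.9, rng=None):
    """Randomly hide a fraction of tokens for masking-and-reconstruction.

    tokens: (N, D) array of token embeddings. Returns (visible, mask), where
    mask is True for hidden tokens that the decoder must reconstruct.
    """
    rng = np.random.default_rng(rng)
    n = tokens.shape[0]
    n_mask = int(round(n * ratio))     # e.g., 90% of tokens are hidden
    idx = rng.permutation(n)
    mask = np.zeros(n, dtype=bool)
    mask[idx[:n_mask]] = True
    return tokens[~mask], mask
```

At test time, one adaptation step would reconstruct the masked tokens from the visible ones and backpropagate the reconstruction loss into the encoder before predicting.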

Quantitative comparisons with UDA methods and TTA baselines for multi-modal 3D semantic segmentation.

Example results of our MM-TTA during test-time adaptation for gradual improvement. While TENT [138] shows little improvement during adaptation, our method can effectively suppress the noise and achieve visually similar results to the ground truth, especially within the area of the dotted white boxes.

Ablation study on the effects of Intra-PG and Inter-PR in the A2D2 → SemanticKITTI benchmark. We provide two variants with different fusion: 1) Consensus: using pseudo-labels that are consistent between 2D and 3D, and 2) Merge: taking the mean of the two output probabilities. For the selection process, 'Entropy' calculates and compares the entropy of the 2D and 3D predictions.
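The 'Entropy' selection variant described in this caption can be sketched per point as follows. This is a simplified reading with hard selection and illustrative names; the thesis's Inter-PR additionally supports soft selection:

```python
import numpy as np

def fuse_2d3d_pseudo_labels(p2d, p3d):
    """Per-point selection between 2D and 3D branch predictions by entropy.

    p2d, p3d: (N, C) class probabilities from the 2D and 3D branches.
    Returns (N,) pseudo-labels, trusting the lower-entropy branch per point.
    """
    eps = 1e-12
    h2d = -(p2d * np.log(p2d + eps)).sum(axis=1)   # 2D branch uncertainty
    h3d = -(p3d * np.log(p3d + eps)).sum(axis=1)   # 3D branch uncertainty
    pick_2d = h2d <= h3d                           # lower entropy wins
    return np.where(pick_2d, p2d.argmax(axis=1), p3d.argmax(axis=1))
```

The Consensus and Merge variants in the table replace this per-point argmin-entropy rule with agreement filtering and probability averaging, respectively.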

Pseudo-label threshold ratio θ(k)

Momentum factor

Stability of using different learning rates in A2D2 → SemanticKITTI. For the 2D/3D branches, we use four sets of learning rates: [1] 1.0×10^-5 / 2.4×10^-5, [2] 1.0×10^-5 / 2.4×10^-4, [3] 1.0×10^-4 / 2.4×10^-4, [4] 1.0×10^-4 / 2×10^-3.

Pseudo-label accuracy during adaptation in A2D2 → SemanticKITTI.

Construction of the Synthia dataset to generate point clouds (15k points). In that sense, we can simulate the multi-modal dataset.

Quantitative results using real target labels as oracles. Depending on whether we only finetune the batch-norm parameters or update all layers, the oracles are 'Oracle TTA' or 'Oracle Full'.

Qualitative t-SNE results on TENT and MM-TTA. Each color represents one category; our MM-TTA produces more compact clusters for each category.

Qualitative results of pseudo labels on xMUDAPL and MM-TTA.

Qualitative 3D segmentation result of TENT, xMUDA, MM-TTA (Hard Select), MM-TTA (Soft Select) on three adaptation benchmarks.

Top-1 Classification Accuracy (%) for all distribution shifts in the ModelNet-40C dataset. All results are for the PointMAE backbone trained on the clean train set and adapted to the OOD test set with a batch size of 1 (copied 48 times through random masking). Source-Only denotes its performance on the corrupted test data without any adaptation. Highest accuracy is in bold, while second best is underl

Mean Top-1 Classification Accuracy (%) for ModelNet-40C by using a larger batch size (BS) of 128 for baselines and MATE-Online.

Top-1 Classification Accuracy (%) for all distribution shifts in the ShapeNet-C dataset. All results are for the PointMAE backbone trained on the clean train set and adapted to the OOD test set with a batch size of

MATE can achieve real-time adaptation performance with only a minor performance penalty. Here, we report the Mean Top-1 Accuracy (%) over the 15 corruptions in the ShapeNet-C dataset for different adaptation strides. Strides represent the number of samples after which an adaptation step is performed.

MATE can achieve real-time adaptation performance by sacrificing only a few percentage points. Here, we report the Mean Top-1 Accuracy (%) over the 15 corruptions in the ModelNet-40C dataset for different adaptation strides. Strides represent the number of samples after which an adaptation step is performed.

Accuracy (top) and reconstruction loss (bottom) for all corruptions in ModelNet-40C at each adaptation step for MATE-Standard. To avoid clutter, we split the different corruptions into two plots (left and right).

Reconstruction results for MATE-Standard at the 20th gradient step for adaptation at test time. We plot the out-of-distribution test sample for adaptation (left), the 10% input visible tokens (center) and the corresponding reconstruction output (right) for four corruptions in the ModelNet-40C dataset.

Overview of Unified Domain Adaptation: it provides the unified framework for unsupervised adaptation and test-time adaptation.

Extension of UDA method (TPLD) to Test-time Adaptation

Extension of TTA method (MM-TTA) to Unsupervised Adaptation

Extension of TTA method (MATE) to Unsupervised Adaptation

Adapting the original single-scenario model (UDA or TTA) to multiple scenarios (UDA and TTA) is proven to work in various settings and datasets.

Overview of how to adopt a domain-generalized foundation model into the UnDA framework

Overall framework of SAMUnDA

Visualization on corrected Pseudo Label#1

Visualization on corrected Pseudo Label #2

Visualization on corrected Pseudo Label #3

The figure shows that our SAMUnDA needs further refinement to approach human-level adaptation performance.

Concept of how to achieve human-interactive UnDA