Korea Advanced Institute of Science and Technology (KAIST) Library

Main bibliographic information
머신러닝에서의 데이터 드리프트 문제 해결을 위한 데이터 보정 및 선택 자동화 기법 = Automatic data calibration and selection techniques for addressing drifting data in machine learning
Title / Author	머신러닝에서의 데이터 드리프트 문제 해결을 위한 데이터 보정 및 선택 자동화 기법 = Automatic data calibration and selection techniques for addressing drifting data in machine learning / 김민수 (Min Su Kim).
Publication	[Daejeon : Korea Advanced Institute of Science and Technology, 2022].

Holdings information

Registration no.	8038767
Location / Call no.	Academic Cultural Complex (Library), 2F, Theses	MEE 22022
Status	Available (not for loan)

Abstract

As machine learning becomes widely used in industry, one serious bottleneck is coping with data drift, where training data is continuously generated but its distribution changes over time. In this setting, machine learning models may need to be updated frequently in order to maintain their performance. Why data drift occurs depends on the application. For example, in semiconductor manufacturing, periodic equipment inspections may lead to drifts in sensor data. In meteorology, the temperature on earth may steadily increase due to global warming. Our goal is to keep models up to date against such drifting data. Although there is an extensive literature on handling data drift, most techniques focus on detecting drifts in the data and then updating the model on the new data only. In other words, the drifted old data is simply discarded. This approach is problematic because the new data may be relatively small, and a model trained on a small amount of data may not achieve high accuracy. Instead, we contend that the old data can be useful if we properly calibrate it and select the "useful" data subsets that augment the new data. Our proposed method first divides the old data into segments based on where the drifts occur. Next, our method selects which segments to use with the new data, how each segment is calibrated, and which subset of features to use for training. We extend existing wrapper-based feature selection algorithms and propose MixSelection algorithms that select among multiple choices of data segments, calibration methods, and features simultaneously. Our method is agnostic to the data drift patterns and the model being trained. In our experiments, we compare our method with various baselines on synthetic and real datasets with data drifts. Our method outperforms the baselines in terms of model performance by effectively selecting calibrated segments and features.

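Data-driven machine learning is now widely used across industry and is delivering good results. However, data drift, in which the data distribution changes over time, occurs in many settings. Data drift can be caused by external factors, such as equipment inspections in semiconductor manufacturing, or it can arise naturally, as when temperatures in weather data gradually rise with global warming. Because most machine learning techniques to date have been studied on datasets whose distribution stays fixed, applying them as-is means that a model trained on past data cannot properly predict future, changed data. This can render previously trained models meaningless, so data drift is an important problem that must be solved for the sustainable operation of machine learning models. Many techniques related to data drift have been studied, but most focus on the model side, updating the model after detecting drift, rather than on the underlying data problem. This amounts to discarding past data that is no longer needed and continuing training on newly arriving data. However, discarding what has been learned and retraining or updating the model using only a small amount of newly obtained data can result in low accuracy. This thesis therefore aims to address the underlying drift problem not by discarding all past data but through appropriate data calibration and selection. That is, when data with a new distribution arrives, we gather from the past data the training data likely to help predict the new data. This preprocessing can be broken down into selecting data segments, features, and calibration techniques, and we extend existing wrapper-based selection algorithms to automate all three selections efficiently at once. We applied the proposed technique to synthetic and real datasets with drift, constructing a suitable data subset at each drift point and evaluating the predictive performance of the model trained on it. The proposed technique achieved performance improvements over the baselines on all datasets. These results indicate that a training dataset suited to the new data distribution can be constructed from past data through appropriate calibration and selection. In addition, because the proposed technique is a preprocessing method that can be applied regardless of drift type or learning model, it can be extended and applied in a variety of ways.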
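
The wrapper-style idea described in the abstract (choosing old-data segments, a calibration for each, and a feature subset by repeatedly training a model and checking validation error on the new data) can be illustrated with a minimal sketch. This is not the thesis's MixSelection algorithm, whose details are not given in this record. Everything here is an assumption made for illustration: the function names, the two calibration options (`none` and `mean_shift`), the least-squares model, the mean-squared-error criterion, and the simple greedy search.

```python
import numpy as np

def fit_predict(X_train, y_train, X_test):
    """Ordinary least squares with an intercept (stand-in for any model)."""
    A = np.column_stack([X_train, np.ones(len(X_train))])
    w, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([X_test, np.ones(len(X_test))]) @ w

def no_calibration(X_old, y_old, X_new, y_new):
    return X_old, y_old

def mean_shift(X_old, y_old, X_new, y_new):
    """Hypothetical calibration: match the old segment's means to the new data."""
    return (X_old - X_old.mean(0) + X_new.mean(0),
            y_old - y_old.mean() + y_new.mean())

CALIBRATIONS = {"none": no_calibration, "mean_shift": mean_shift}

def validation_error(choice, features, segments, new_train, new_val):
    """Train on new data plus the chosen calibrated segments; return MSE on new_val.

    choice maps a segment index to a calibration name; features is a set of columns.
    """
    X_new, y_new = new_train
    Xs, ys = [X_new], [y_new]
    for seg_idx, calib in choice.items():
        X_c, y_c = CALIBRATIONS[calib](*segments[seg_idx], X_new, y_new)
        Xs.append(X_c)
        ys.append(y_c)
    cols = sorted(features)
    pred = fit_predict(np.vstack(Xs)[:, cols], np.concatenate(ys),
                       new_val[0][:, cols])
    return float(np.mean((pred - new_val[1]) ** 2))

def greedy_mixed_selection(segments, new_train, new_val):
    """Greedy wrapper search: each round applies the single best move, either
    adding one (segment, calibration) pair or dropping one feature."""
    choice = {}                                   # start with new data only
    features = set(range(new_train[0].shape[1]))  # start with all features
    best = validation_error(choice, features, segments, new_train, new_val)
    improved = True
    while improved:
        improved = False
        candidates = []
        for s in range(len(segments)):
            if s not in choice:
                for c in CALIBRATIONS:
                    candidates.append(({**choice, s: c}, features))
        if len(features) > 1:
            for f in features:
                candidates.append((choice, features - {f}))
        for cand_choice, cand_features in candidates:
            err = validation_error(cand_choice, cand_features,
                                   segments, new_train, new_val)
            if err < best:
                best, best_move, improved = err, (cand_choice, cand_features), True
        if improved:
            choice, features = best_move
    return choice, features, best
```

Here `segments` is a list of `(X, y)` pairs for the old-data segments, and `new_train` and `new_val` are `(X, y)` pairs drawn from the new data. Because the search only accepts moves that lower the validation error, the returned error is never worse than training on the new data alone with all features, at least on that validation split.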

Other bibliographic information
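Call number	MEE 22022
Physical description	iv, 31 p. : illustrations ; 30 cm
Language	Korean
General note	Author's name in English: Min Su Kim; advisor's name in Korean: 황의종; advisor's name in English: Steven Euijong Whang
Thesis	Thesis (Master's) - Korea Advanced Institute of Science and Technology : School of Electrical Engineering,
Bibliography note	References: p. 29-30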
