Topic models are a family of Bayesian statistical models that discover the latent structure of documents in the form of human-readable topics. They have been widely researched and applied in industry for tasks such as document classification, sentiment analysis, and review prediction. Recently, we have witnessed a major explosion of text data as well as new trends in statistical language models, and these changes call for more advanced models. The goal of this thesis is to present novel Bayesian generative models that gain a deeper understanding of such data. First, much of human knowledge and information has been digitized through efforts such as Wikipedia and Project Gutenberg. Topic models are often used to extract topics and obtain insights from a corpus, yet it is much harder to learn the relations and structure among the topics themselves. I build the recursive Chinese Restaurant Process (rCRP), a novel nonparametric generative model that discovers the hierarchical structure of topics. Second, traces of our daily lives are moving online with the proliferation of mobile devices. This has given rise to text data coupled with other types of data, such as text-photo (Facebook, Instagram), text-video (YouTube), and text-review score (Amazon). Exploiting this flexibility, researchers have extended topic models beyond the original LDA to account for such additional data. However, text-click data, which underlies most online user activity, has not been studied. I build the Headline Click-based Topic Model (HCTM), a novel generative model that learns the click value of words in their relevant semantic contexts. Finally, the development of competing models such as word embeddings has made it necessary to model the specific contexts of individual words. Topic models excel at extracting the general topics of an entire document, whereas word embeddings excel at capturing the local context at specific positions in a document. I build the Dual Context Topic Model (DCTM), a novel generative model that accounts for both the document context and the local context of individual words.
Topic modeling is among the most effective machine learning methodologies for understanding large-scale text data. As Bayesian probabilistic models, topic modeling techniques allow strong modeling assumptions about the detailed structure of the data. Through this doctoral research, I developed a variety of topic models for discovering the latent relations that exist within text data. These models accurately capture relations among variables that had previously been assumed independent. First, I created the recursive Chinese Restaurant Process (rCRP), a new nonparametric probability distribution, and built a model that automatically extracts the hierarchical structure of human knowledge contained in large-scale text. Second, I built the Headline Click-based Topic Model (HCTM), which extracts the per-topic click value of words from click-stream data; it can reveal what content users click on and why. Third, I built the Dual Context Topic Model (DCTM), which explicitly captures the relations among nearby words in a document and thereby identifies the detailed context of individual words.