Traditionally, the content of a document has been represented using multiple pre-defined categories. Most previous models for text categorization did not consider the similarity between a compound category, which consists of multiple categories, and a document. Instead, the models computed the similarity between each individual category and a document to rank candidate categories, and then assigned one or more top-ranking categories to the document using experimental and statistical criteria.
This thesis presents a new model for text categorization that associates a compound category with a document to enhance both recall and precision. A compound category is treated as a "meta category". In the model, documents, categories, and meta categories are represented by probability vectors composed of terms and their weights. The probability vector of a category or meta category can be learned incrementally from sample documents.
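The probability-vector representation and its incremental learning can be sketched as follows. This is a minimal illustration, not the thesis's exact estimator: the category names and sample term counts are hypothetical, and the update rule shown (accumulating raw term counts and re-normalizing) is one plausible way to learn a vector incrementally.

```python
from collections import Counter

def prob_vector(term_counts):
    """Normalize raw term counts into a probability vector over terms."""
    total = sum(term_counts.values())
    return {term: count / total for term, count in term_counts.items()}

def update_category(category_counts, doc_counts):
    """Incrementally fold one sample document's term counts into a
    category's running counts, then re-normalize.  A simple sketch of
    incremental learning; the thesis may use a different scheme."""
    category_counts.update(doc_counts)
    return prob_vector(category_counts)

# Hypothetical sample documents for a single category.
category_counts = Counter()
vec = update_category(category_counts, Counter({"wheat": 3, "export": 1}))
vec = update_category(category_counts, Counter({"wheat": 2, "corn": 2}))
# After both updates, "wheat" has weight 5 out of 8 total term occurrences.
```

Each new sample document only touches the running counts, so the vector can be refined one document at a time without re-reading the whole training set.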
The model selects the meta category most relevant to a document using the cross entropy between the meta category and the document. The model was implemented using simulated annealing, and an experiment was carried out on the Reuters-22173 corpus. The experimental results show a micro-averaged recall of 69% and a micro-averaged precision of 72% on a sparse training set, and a micro-averaged recall of 89% and a micro-averaged precision of 84% on a sufficient training set. These recall and precision figures are higher than those of Lewis's model, which used Bayesian probability with experimental and statistical criteria.
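The cross-entropy selection criterion can be sketched as below. The smoothing constant `eps` and the two hypothetical meta categories are assumptions for illustration; the thesis's exact smoothing and search procedure (simulated annealing over candidate compound categories) are not reproduced here, only the scoring step.

```python
import math

def cross_entropy(doc_probs, meta_probs, eps=1e-9):
    """Cross entropy H(d, m) = -sum_t d(t) * log m(t).
    Lower values mean the meta category's term distribution better
    predicts the document's.  eps smooths terms the meta category has
    never seen (an assumed smoothing scheme, not the thesis's)."""
    return -sum(p * math.log(meta_probs.get(term, eps))
                for term, p in doc_probs.items())

doc = {"wheat": 0.5, "export": 0.5}
meta_a = {"wheat": 0.6, "export": 0.3, "corn": 0.1}  # hypothetical meta category
meta_b = {"oil": 0.7, "price": 0.3}                  # unrelated meta category

# Select the meta category with minimum cross entropy to the document.
best = min([("A", meta_a), ("B", meta_b)],
           key=lambda kv: cross_entropy(doc, kv[1]))[0]
# → best == "A"
```

In the full model this scoring would be evaluated inside a simulated-annealing search over compound categories, since the number of possible category combinations grows exponentially with the number of base categories.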