Although automated categorization of text into topical categories has a long history, dating back to the 1960s, its target documents have been confined to short texts such as abstracts and newswire articles. However, the increasing length of documents in full-text collections and on the World Wide Web has renewed interest in classifying long documents into proper categories.
This thesis proposes a new text categorization model, passage-based automated text categorization. By contrast, traditional text categorization systems can be called document-based, since past research on automated text categorization used the whole document as the categorization unit. The passage-based model instead divides the test document into passages and uses them as categorization units. By merging the categories assigned to the passages, the test document's categories can be reconstructed.
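The merging step can be illustrated with a minimal Python sketch. The abstract does not state which merging rule the thesis uses, so the vote threshold here is an assumption, and the names `merge_passage_categories`, `toy_classify`, and `KEYWORDS` are hypothetical stand-ins rather than parts of the thesis system.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set


def merge_passage_categories(
    passages: Iterable[str],
    classify: Callable[[str], Set[str]],
    min_votes: int = 1,
) -> List[str]:
    """Classify each passage, then merge the passage-level categories.

    Assumption: a category is assigned to the document when at least
    `min_votes` passages received it. min_votes=1 is a plain union.
    """
    votes: Counter = Counter()
    for passage in passages:
        votes.update(classify(passage))
    return sorted(cat for cat, n in votes.items() if n >= min_votes)


# Hypothetical stand-in classifier (keyword lookup), for illustration only.
KEYWORDS = {"grain": {"wheat", "corn"}, "trade": {"tariff", "export"}}


def toy_classify(passage: str) -> Set[str]:
    words = set(passage.lower().split())
    return {cat for cat, keys in KEYWORDS.items() if words & keys}


passages = [
    "wheat prices rose sharply",
    "the export tariff was lifted",
    "corn and wheat harvests improved",
]
union_result = merge_passage_categories(passages, toy_classify)
strict_result = merge_passage_categories(passages, toy_classify, min_votes=2)
print(union_result, strict_result)
```

A union keeps every category that any passage supports, which favors recall; raising `min_votes` trades recall for precision. Either choice is only one plausible reading of "merging".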
Experiments were conducted with passages based on overlapping fixed-length windows. When the passage-based text categorization model was applied, on top of a kNN (k-nearest neighbor) classifier, to the longer documents in subsets of the Reuters-21578 text categorization test collection, there were significant increases in categorization efficiency. This implies that passage-based text categorization can be used as a categorization method for full-text collections.
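As a rough illustration of the experimental setup, the following Python sketch cuts a token sequence into overlapping fixed-length windows and labels each window with a small kNN classifier. The window size, the step, the cosine similarity over raw term counts, the single-label majority vote, and the two training examples are all assumptions made for this sketch; the abstract does not give the settings actually used.

```python
import math
from collections import Counter
from typing import List, Tuple


def overlapping_windows(tokens: List[str], size: int, step: int) -> List[List[str]]:
    """Cut tokens into fixed-length windows starting every `step` tokens.

    With step < size, consecutive windows overlap. The last window may be
    shorter than `size`.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return windows


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def knn_category(
    passage: List[str], training: List[Tuple[List[str], str]], k: int = 1
) -> str:
    """Return the majority label among the k most similar training examples."""
    query = Counter(passage)
    ranked = sorted(
        training, key=lambda ex: cosine(query, Counter(ex[0])), reverse=True
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training data, not drawn from Reuters-21578.
training = [
    ("wheat corn harvest".split(), "grain"),
    ("tariff export quota".split(), "trade"),
]
tokens = "wheat harvest rose while export tariff fell".split()
windows = overlapping_windows(tokens, size=4, step=2)
labels = [knn_category(w, training, k=1) for w in windows]
print(labels)
```

Overlap means a topic that straddles a window boundary still appears whole in at least one window, at the cost of classifying more passages per document.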