The way texts are represented strongly influences the effectiveness of text categorization systems, which classify documents with respect to a set of predefined categories, but most attempts to produce better text representations have been unsuccessful.
This lack of success arises in part because most previous feature sets for representing texts, such as single terms and syntactic phrases, take no account of the predefined categories.
This thesis presents a text representation method using collocations, which are recurrent combinations of words that co-occur more often than expected by chance and that correspond to arbitrary word usages. Collocations are cohesive lexical clusters and are category-dependent, that is, different collocations are extracted from different categories. However, using pure collocations as a feature set produces too many features. To resolve this problem without losing the good properties of collocations, we propose clustered collocations, which merge very similar collocations into one feature.
The method using clusters of collocations as a feature set for text representation showed better results than using single terms. It was especially effective when the categories were difficult to discriminate.
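
As an illustration of the idea that collocations are word combinations co-occurring more often than expected by chance, and that they differ from category to category, the following is a minimal sketch. It assumes pointwise mutual information (PMI) over adjacent word pairs as the association measure; this measure, the function name `collocations_by_pmi`, the `min_count` threshold, and the toy corpora are all assumptions made for illustration and are not taken from the thesis. The clustering step is not sketched here.

```python
import math
from collections import Counter


def collocations_by_pmi(docs, min_count=2):
    """Score adjacent word pairs in tokenized documents by PMI.

    PMI is an assumed association measure, used here only for illustration.
    Pairs seen fewer than min_count times are skipped.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI > 0 means the pair occurs more often than chance would predict.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Hypothetical toy corpora, one list of tokenized documents per category.
finance_docs = [
    "the interest rate rose as the central bank met".split(),
    "the bank raised the interest rate again".split(),
]
sports_docs = [
    "the home run won the game".split(),
    "a late home run decided the game".split(),
]

# Collocations are extracted separately for each category.
finance = collocations_by_pmi(finance_docs)
sports = collocations_by_pmi(sports_docs)
```

In this toy setting, `("interest", "rate")` is scored only in the finance category and `("home", "run")` only in the sports category, which mirrors the category-dependent extraction described above. A real system would need a much larger corpus and some filtering of function words.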