Classifying images into object or scene categories according to their content is an important topic in computer vision with many applications. In the real world, an image or an object is usually associated with rich contexts, which are important to categorization in human vision. In this thesis, we explore modeling such contexts for effective image categorization, and address the issues of defining, representing, and learning contexts in three categorization scenarios: single-label categorization, multi-label categorization, and pixel-level categorization, \textit{i.e.}, scene parsing.
We define two typical contextual relations between local features, \textit{i.e.}, a semantic conceptual relation and a spatial neighboring relation, and propose a local-feature-based Contextual Bag-of-Words (CBoW) model for single-label image categorization in the popular Bag-of-Words (BoW) representation style. The conceptual relation is learned from the similarity of the class distributions induced by the visual words corresponding to local features, and the spatial neighboring relation is learned as the confidence that neighboring visual words are relevant. Classification is performed with a support vector machine (SVM) using a kernel designed to incorporate the relational information.
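As an illustrative sketch rather than the exact formulation of the thesis, such a relational kernel can be written over BoW histograms $\mathbf{h}(x)$ with a relation matrix $\mathbf{R}$, where the entry $R_{ij}$ (a hypothetical symbol introduced here) encodes the learned conceptual or spatial affinity between visual words $i$ and $j$:
\[
K(x, y) \;=\; \mathbf{h}(x)^{\top} \mathbf{R}\, \mathbf{h}(y),
\qquad R_{ij} \ge 0,\quad \mathbf{R} \succeq 0 .
\]
Requiring $\mathbf{R}$ to be positive semidefinite guarantees that $K$ is a valid Mercer kernel, so a standard SVM solver can be used unchanged; the identity $\mathbf{R} = \mathbf{I}$ recovers the plain BoW linear kernel.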
Multi-label image categorization is more challenging than the single-label case, yet closer to real-world applications, since real-world images are usually associated with multiple labels. Conventional algorithms for multi-label image data predominantly rely on holistic image similarities, ignoring that each label essentially characterizes only a local region. Guided by the multi-label contexts provided by a collection of multi-label images, we propose Contextual Image Decomposition (CID) to obtain an optimal representation for each label of a set of multi-labeled images without explicit segmentation. The multi-label context is defined such that local label representations of the same category are similar across different images, while those of different categories are dissimilar. We formulate the decomposition as an optimization problem that minimizes the intra-label difference and simultaneously maximizes the inter-label difference of the target label representations, and propose two mathematical solutions.
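A minimal sketch of such an objective, under the assumption that each image $n$ with label set $\mathcal{Y}_n$ is decomposed into one representation $\mathbf{z}_n^c$ per label $c \in \mathcal{Y}_n$ (the notation here is illustrative, not the thesis's own), is
\[
\min_{\{\mathbf{z}_n^c\}}
\;\sum_{c}\;\sum_{\substack{m,n \,:\, c \in \mathcal{Y}_m \cap \mathcal{Y}_n}}
\big\| \mathbf{z}_m^c - \mathbf{z}_n^c \big\|^2
\;-\;
\lambda \sum_{c \neq c'}\;\sum_{m,n}
\big\| \mathbf{z}_m^c - \mathbf{z}_n^{c'} \big\|^2 ,
\]
where the first term penalizes the intra-label difference, the second rewards the inter-label difference, and $\lambda > 0$ trades off the two.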
Scene parsing, which assigns category labels to an image at the pixel level, is a core problem in computer vision. Guided by the multi-label context across images, namely that closely related segments usually have similar labels, we propose a weakly supervised scene parsing algorithm that semantically parses a collection of multi-label images. Images are segmented into patches at multiple levels, and the contextual relations among patches are discovered via sparse representation by $\ell^1$ minimization. The contextual patch labeling process is formulated as an optimization problem over a graph representation and solved by a convergent iterative method. For better performance, category models are also learned from the image set using CID and applied to the segments. The final labeling is obtained by combining all the information at the pixel level.
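As a sketch of the sparse-representation step, assuming each patch is described by a feature vector $\mathbf{x}_i$ and $\mathbf{D}_{-i}$ stacks the features of all other patches as a dictionary (again illustrative notation, not the thesis's own), the $\ell^1$ minimization takes the standard form
\[
\min_{\boldsymbol{\alpha}_i} \; \| \boldsymbol{\alpha}_i \|_1
\quad \text{s.t.} \quad
\big\| \mathbf{x}_i - \mathbf{D}_{-i}\, \boldsymbol{\alpha}_i \big\|_2 \le \varepsilon ,
\]
where the nonzero entries of $\boldsymbol{\alpha}_i$ identify the patches contextually related to patch $i$, and their magnitudes can serve as edge weights in the subsequent graph-based labeling.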
The proposed contextual modeling algorithms are extensively evaluated on different image categorization tasks with benchmark datasets. Experimental results demonstrate the importance of contexts for image categorization, and the proposed algorithms achieve state-of-the-art performance in comparison with previous methods. Furthermore, a typical application to labeling and label ranking of KAIST campus images is demonstrated.