Searching for relevant documents faces three problems: the noisiness of descriptors, the vocabulary gap between documents and a given query, and the differing importance of query descriptors. Previous probabilistic retrieval models rank documents considering only the differing importance of query descriptors. They ignore the other problems because it is difficult to obtain knowledge appropriate to a particular application and to use that knowledge correctly to reduce all three problems. This thesis first proposes a general ranking function that can handle the three problems correctly. However, the function is too complex for a practical information retrieval system to use for effective and efficient document retrieval.
The general ranking function is simplified substantially under the assumption of certainty indexing, i.e., binary indexing. The complexity of the simplified ranking function is reduced further by the Faithful User Assumption (FUA), which states that a relevant document contains all the concepts represented by a query. A learning method that reduces the three problems is derived formally from FUA. Each time retrieval results become available, it updates knowledge of the importance of query descriptors and of the relationships between query descriptors and other descriptors. Noise descriptors are also defined in this thesis. Retrieval by the simplified ranking function together with the proposed learning method is called Faithful User Retrieval (FUR) under certainty indexing.
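The incremental learning step can be pictured as follows. This is a minimal sketch under assumed simplifications, not the thesis's actual formulation: importance is approximated by how often a query descriptor occurs in documents judged relevant, and relationships by co-occurrence counts within those documents; all names are illustrative.

```python
from collections import defaultdict

class FeedbackLearner:
    """Illustrative sketch of learning from retrieval results: after each
    round of relevance feedback, update (a) how often each query descriptor
    occurs in relevant documents (its importance) and (b) how often other
    descriptors co-occur with it there (its relationships)."""

    def __init__(self):
        # descriptor -> number of relevant documents containing it
        self.relevant_count = defaultdict(int)
        # (query descriptor, other descriptor) -> co-occurrence count
        self.co_occurrence = defaultdict(int)
        # total number of relevant documents seen so far
        self.feedback_docs = 0

    def update(self, query_descriptors, relevant_docs):
        """query_descriptors: set of descriptors in the query.
        relevant_docs: iterable of descriptor sets judged relevant."""
        for doc in relevant_docs:
            self.feedback_docs += 1
            for q in query_descriptors:
                if q in doc:
                    self.relevant_count[q] += 1
                    for d in doc - {q}:
                        self.co_occurrence[(q, d)] += 1

    def importance(self, descriptor):
        """Estimated probability that a relevant document contains the descriptor."""
        if self.feedback_docs == 0:
            return 0.0
        return self.relevant_count[descriptor] / self.feedback_docs
```

A descriptor whose importance stays near zero across many feedback rounds would, in this picture, be a candidate noise descriptor.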
The effect of the incrementally constructed knowledge and of noise query descriptors is investigated through experiments on FUR under certainty indexing and on the previous probabilistic ranking model BIR. When the distributions of query descriptors in relevant documents for past queries are unavailable, the retrieval effectiveness of FUR is comparable to that of BIR. Once these distributions become available, both models improve; the improvement of FUR is about 10% greater than that of BIR on most document collections. The experimental results also show that many query descriptors extracted automatically from natural-language queries are independent of retrieval quality. In other words, they are noise for retrieval purposes: they only increase query processing time without supporting effective document ranking.
To obtain further improvements in retrieval effectiveness, the vocabulary gap problem is then examined. Query descriptors are expanded with their analogues to alleviate the problem. The analogues of a descriptor, identified by the proposed learning method, reflect the contexts in which the descriptor has appeared. Since a broad query descriptor relates to many contexts, expansion with its analogues may make the query cover contexts different from the user's intention. In contrast, the analogues of narrow query descriptors may clarify the contexts of the query, because a narrow query descriptor occurs in highly correlated contexts. Hence, expanding only narrow query descriptors with their analogues can further improve retrieval effectiveness, which is demonstrated experimentally.
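The selective expansion described above can be sketched as follows. This is a hedged illustration only: the thesis does not specify how narrowness is measured, so the sketch approximates it by a low document frequency, and the function and parameter names are assumptions.

```python
def expand_query(query_descriptors, analogues, doc_frequency, narrow_threshold=5):
    """Expand only narrow query descriptors with their learned analogues.

    query_descriptors: set of descriptors in the query.
    analogues: mapping descriptor -> list of analogues found by learning.
    doc_frequency: mapping descriptor -> number of documents containing it
        (used here as a stand-in measure of broadness).
    """
    expanded = set(query_descriptors)
    for q in query_descriptors:
        # Broad descriptors (high document frequency) are left unexpanded,
        # since their analogues span many unrelated contexts.
        if doc_frequency.get(q, 0) <= narrow_threshold:
            expanded.update(analogues.get(q, []))
    return expanded
```

Under this scheme, a rare technical term pulls in its analogues, while a common term contributes only itself to the expanded query.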
Uncertainty indexing estimates, for each descriptor in a document, the probability of correct indexing, i.e., the probability that a human indexer would attach the descriptor to the document. A ranking function and a learning method suitable for uncertainty indexing are also developed: the ranking function is another simple version of the general ranking function, and the learning method developed for certainty indexing is adapted to uncertainty indexing. Retrieval based on both of them is called FUR under uncertainty indexing. The experimental results show the superiority of FUR under uncertainty indexing over both BIR and FUR under certainty indexing.
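The role of indexing probabilities in ranking can be illustrated with a minimal sketch. This is not the thesis's actual ranking function; it simply assumes that each document carries, per descriptor, a probability of correct indexing, and that a document's score sums query-descriptor weights scaled by those probabilities.

```python
def rank_uncertain(query_weights, documents):
    """Rank documents under uncertainty indexing (illustrative).

    query_weights: mapping query descriptor -> importance weight.
    documents: list of mappings descriptor -> probability that a human
        indexer would assign the descriptor to the document.
    Returns the documents sorted by descending score.
    """
    def score(doc):
        # Each matching descriptor contributes its query weight,
        # discounted by the probability that the indexing is correct.
        return sum(w * doc.get(q, 0.0) for q, w in query_weights.items())
    return sorted(documents, key=score, reverse=True)
```

With certainty (binary) indexing the probabilities collapse to 0 or 1 and the score reduces to a weighted count of matching descriptors, which is how this sketch relates the two settings.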