Information retrieval is the process of selecting pieces of information related to the information need expressed in a query. However, a major role of an information retrieval system is not just to generate a set of relevant documents, but to help determine which documents are most likely to be relevant to the given requirements. The similarity between a query and each document can be computed so that the retrieved documents are ranked in descending order of query-document similarity. Users can then minimize the time spent finding useful information by reading the top-ranked documents first. Ranking techniques are thus used to find the documents in a collection that are most likely to be relevant to the user's query. Occasionally, however, some retrieved documents have contexts that are not consistent with the query.
It is common practice in linguistics to classify words not only on the basis of their meanings but also on the basis of their co-occurrence with other words. Generally speaking, subjects respond more quickly than normal to the word 'nurse' when it follows a highly associated word such as 'doctor'. The word 'doctor' is associated with 'nurse', 'sick', 'health', 'medicine', 'hospital', 'man', 'sickness', and so forth. Mutual information is a measure of the strength of association between one word and another. We therefore use it to re-evaluate the relation between the terms in each retrieved document and the terms in the query.
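To make the notion of word association concrete, the following minimal sketch computes pointwise mutual information, PMI(x, y) = log2(P(x, y) / (P(x) P(y))), from document-level co-occurrence counts. The exact definition and co-occurrence window used in this paper are not given in this section, so the common pointwise form, the function name `pointwise_mutual_information`, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def pointwise_mutual_information(documents):
    """Return PMI scores for every word pair that co-occurs in a document.

    Probabilities are estimated as document frequencies divided by the
    number of documents (an assumed estimator, not the paper's).
    Pairs are keyed as alphabetically sorted tuples.
    """
    n_docs = len(documents)
    word_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        words = sorted(set(doc))  # count each word once per document
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))

    pmi = {}
    for (x, y), n_xy in pair_counts.items():
        p_xy = n_xy / n_docs
        p_x = word_counts[x] / n_docs
        p_y = word_counts[y] / n_docs
        pmi[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return pmi


# Hypothetical toy corpus; each document is a list of terms.
docs = [
    ["doctor", "nurse", "hospital"],
    ["doctor", "nurse"],
    ["doctor", "medicine"],
    ["river", "bank"],
]
scores = pointwise_mutual_information(docs)
for pair, value in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(value, 3))
```

Pairs that never co-occur, such as 'doctor' and 'river' here, receive no score at all. A real system would need a policy for such pairs and for unreliable low-frequency counts.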
In this paper, we discuss a model of a natural language information retrieval system based on a two-level document ranking method using mutual information. At the first level, we retrieve documents using automatically constructed index terms, for which we build an inverted file. For term weighting, we follow a typical complex term-weighting scheme, the best fully weighted system, which uses a cosine-normalized tf × idf weight (term frequency times inverse document frequency) for document terms and an enhanced but unnormalized tf × idf factor for query terms. The similarity between a document and a query is calculated as the inner product of their weight vectors, and documents are ranked by that similarity. At the second level, we reorder the retrieved documents using mutual information. To support this reordering, we construct an inverted file of mutual information values and an inverted file of document terms. The documents retrieved at the first level are then reordered using our newly developed formulas, which are based on the mutual information value, co-occurrence term normalization, document normalization, or a combination of these normalizations.
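The two levels described above can be sketched as follows. This is an illustration under stated assumptions, not the formulas developed in this paper: the query uses plain term frequency in place of the "enhanced" factor, whose exact form is not given here, and the second-level score (summed association divided by the number of distinct document terms) is a hypothetical stand-in for the reordering formulas. The function names, the toy documents, and the values in `mi` are likewise invented for the example.

```python
import math
from collections import Counter


def build_index(docs):
    """Return cosine-normalized tf x idf document vectors and the idf table."""
    n = len(docs)
    df = Counter(term for doc in docs.values() for term in set(doc))
    idf = {t: math.log(n / d) for t, d in df.items()}
    vectors = {}
    for doc_id, doc in docs.items():
        weights = {t: f * idf[t] for t, f in Counter(doc).items()}
        norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
        vectors[doc_id] = {t: v / norm for t, v in weights.items()}
    return vectors, idf


def first_level(query, vectors, idf):
    """Rank documents by inner product with an unnormalized tf x idf query."""
    q = {t: f * idf.get(t, 0.0) for t, f in Counter(query).items()}
    scores = {
        d: sum(q.get(t, 0.0) * w for t, w in vec.items())
        for d, vec in vectors.items()
    }
    hits = [(d, s) for d, s in scores.items() if s > 0]
    return sorted(hits, key=lambda pair: -pair[1])


def second_level(query, ranked, docs, mi):
    """Reorder retrieved documents by a placeholder association score."""
    def association(doc_id):
        terms = set(docs[doc_id])
        total = sum(mi.get((q, t), 0.0) for q in set(query) for t in terms)
        return total / len(terms)  # assumed document normalization
    return sorted((d for d, _ in ranked), key=association, reverse=True)


# Hypothetical toy data.
docs = {
    "d1": ["doctor", "hospital", "medicine"],
    "d2": ["doctor", "river", "bank"],
    "d3": ["river", "bank", "water"],
}
query = ["doctor"]
mi = {("doctor", "hospital"): 1.5, ("doctor", "medicine"): 1.2}

vectors, idf = build_index(docs)
ranked = first_level(query, vectors, idf)
first_order = [d for d, _ in ranked]
second_order = second_level(query, ranked, docs, mi)
print(first_order, second_order)
```

In this toy run the first level ranks d2 above d1, because d2's other terms are common in the collection and therefore give 'doctor' a larger share of its normalized vector. The second level promotes d1, whose remaining terms are associated with the query term. The set of retrieved documents is unchanged; only their order differs.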
Our basic aim is to improve retrieval effectiveness by reordering the document ranking with the two-level document ranking method using mutual information. An empirical study was conducted using a Korean encyclopedia with 23,113 entries (10 MB of text), 45 natural language queries collected from end users, and the relevant information selected by experts. We show that our method achieves a considerable improvement in retrieval effectiveness over a traditional linear searching method. We also analyse the seven newly developed formulas for reordering the retrieved documents, and we recommend the one formula that dominates the others in terms of retrieval effectiveness.
Since our method can improve precision while preserving recall, we believe that the two-level document ranking method using mutual information is a good candidate for post-enhancement after traditional linear search ranking, relevance feedback, data fusion, or query expansion.
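The reason reordering can raise precision without affecting recall is that it changes only the order of a fixed retrieved set. The short sketch below illustrates this with hypothetical document identifiers and relevance labels; it does not reproduce any result from the empirical study.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k


def recall(ranking, relevant):
    """Fraction of all relevant documents that appear in the ranking."""
    return sum(1 for d in ranking if d in relevant) / len(relevant)


# Hypothetical relevance labels and rankings of the same retrieved set.
relevant = {"d1", "d4"}
first_pass = ["d2", "d1", "d3", "d4"]
reordered = ["d1", "d4", "d2", "d3"]

print(precision_at_k(first_pass, relevant, 2), recall(first_pass, relevant))
print(precision_at_k(reordered, relevant, 2), recall(reordered, relevant))
```

Both rankings contain the same documents, so recall over the retrieved set is identical, while precision among the top two documents differs.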