The problem of merging collections in multilingual and distributed information retrieval: Calculation of documentary relevance in two steps

Fernando Martínez Santiago. October 2004

This thesis proposes a new approach, two-step calculation of document relevance, to address the well-known problem of collection fusion, also called results merging. An Information Retrieval system analyzes a user's information need and responds with a list of documents relevant to the given query. Sometimes that list is obtained by fusing or merging several lists that were retrieved independently of one another. This thesis focuses on that situation and illustrates the effectiveness of the proposed method in two scenarios: multilingual Information Retrieval and distributed Information Retrieval.

One hypothesis defended in this thesis is that, for a given information need, the scores assigned to two documents from two different collections are not comparable, primarily because the relevance score assigned to a document is not an absolute value but depends strongly on the collection to which the document belongs. Moreover, the union of all the documents returned by each search engine can be viewed as a new collection of small size and small vocabulary, since only the terms in the user's query are of interest in this new collection. Under these simplifications, this collection can be re-indexed and matched against the user's query, yielding a new list of documents scored only in relation to this new collection.
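
The two-step idea described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function name `two_step_merge`, the input layout, and the smoothed TF-IDF weighting are all assumptions chosen for illustration, and the abstract does not say which weighting scheme is used. The sketch also assumes that documents and query are already tokenized into a shared vocabulary; in a multilingual setting, that would require aligning each query term with its translations first.

```python
import math
from collections import Counter


def two_step_merge(query_terms, result_lists):
    """Re-rank the union of independently retrieved result lists.

    query_terms:  list of query terms (hypothetical shared vocabulary).
    result_lists: one list per collection, each holding (doc_id, tokens)
                  pairs returned by that collection's search engine.
    Returns a list of (doc_id, score) pairs, best first.
    """
    # Step 1 is assumed to have happened already: each engine produced
    # its own list. Here the union of those lists becomes one small collection.
    docs = {}
    for results in result_lists:
        for doc_id, tokens in results:
            docs[doc_id] = tokens

    # Re-index the small collection, keeping only the query terms.
    vocab = set(query_terms)
    term_freqs = {
        doc_id: Counter(t for t in tokens if t in vocab)
        for doc_id, tokens in docs.items()
    }

    # Document frequencies are computed over the new collection only,
    # so every score is relative to the same set of documents.
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in term_freqs.values():
        doc_freq.update(counts.keys())

    # Smoothed IDF (an assumption; any weighting scheme could be used here).
    idf = {t: math.log((n_docs + 1) / (doc_freq[t] + 1)) + 1.0 for t in vocab}

    # Step 2: score each document against the query in the new collection.
    scores = {
        doc_id: sum((1.0 + math.log(c)) * idf[t] for t, c in counts.items())
        for doc_id, counts in term_freqs.items()
    }
    # Sort by descending score, breaking ties by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical result lists from two independent collections.
list_a = [("a1", ["fusion", "retrieval", "fusion"]),
          ("a2", ["retrieval", "weather"])]
list_b = [("b1", ["fusion", "collection", "fusion", "retrieval", "fusion"]),
          ("b2", ["sports"])]

ranked = two_step_merge(["fusion", "retrieval"], [list_a, list_b])
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

Because the term statistics come from the merged collection alone, the final scores are directly comparable across the original collections, which is the property the hypothesis above says the per-collection scores lack.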

The results show that the proposed method is stable, always obtaining an improvement of between 20% and 40% over other approaches, regardless of the language.

(Link TESEO)