Lexical Ambiguity Resolution in Automatic Document Classification Tasks

L. Alfonso Ureña López. November 2000

This study describes content analysis tasks and examines lexical ambiguity resolution and document classification, drawing parallels between the two fields. We analyze the existing linguistic resources and investigate how they can improve the effectiveness of disambiguation techniques.

The main contribution of this thesis is a new approach to resolving lexical ambiguity based on the integration of linguistic resources, combining information from a text corpus (SemCor) and a lexical database (WordNet). A direct, automatic evaluation of disambiguation on a large test collection shows experimentally that the approach is effective.

Lexical ambiguity resolution is applied to two specific document classification tasks: information retrieval and text categorization. In information retrieval, the query terms are expanded with WordNet information once they have been disambiguated by feedback. In text categorization, we propose automatic lexical ambiguity resolution as an approach that is likewise based on the integration of linguistic resources, here the Reuters corpus and the WordNet lexical database.

This approach is novel in that it incorporates automatic disambiguation into the integration of linguistic resources for the text categorization task.
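
The query expansion step described for information retrieval can be illustrated with a minimal sketch. It is not the thesis's implementation. The sense inventory `TOY_SENSES` is a hypothetical stand-in for WordNet, the function names `pick_sense` and `expand_query` are invented for illustration, and the disambiguation rule (a simple word-overlap count between each sense's signature and the feedback terms) is an assumed simplification of disambiguation by feedback.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy sense inventory standing in for WordNet (assumption):
# each word maps to a list of senses with signature words and synonyms.
TOY_SENSES: Dict[str, List[dict]] = {
    "bank": [
        {"signature": {"money", "loan", "deposit", "account"},
         "synonyms": ["depository"]},
        {"signature": {"river", "water", "shore", "slope"},
         "synonyms": ["riverside"]},
    ],
}


def pick_sense(word: str, context: Set[str]) -> Optional[dict]:
    """Pick the sense whose signature overlaps most with the context.

    Returns None when the word is unknown or no sense overlaps at all,
    so that ambiguous terms are left unexpanded.
    """
    senses = TOY_SENSES.get(word, [])
    if not senses:
        return None
    best = max(senses, key=lambda s: len(s["signature"] & context))
    if not (best["signature"] & context):
        return None
    return best


def expand_query(query_terms: List[str], feedback_terms: List[str]) -> List[str]:
    """Append synonyms of the disambiguated sense of each query term.

    feedback_terms is assumed to come from documents judged relevant.
    """
    context = set(feedback_terms)
    expanded = list(query_terms)
    for term in query_terms:
        sense = pick_sense(term, context)
        if sense is not None:
            for synonym in sense["synonyms"]:
                if synonym not in expanded:
                    expanded.append(synonym)
    return expanded
```

In this sketch a term is expanded only when the feedback terms support one of its senses. With the query `["bank"]` and feedback terms `["river"]`, only the synonym of the river sense is added, and with no feedback the query is returned unchanged.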

Finally, we discuss and evaluate both tasks using a systematic method that lets us compare the system's effectiveness in document classification, for both information retrieval and text categorization.

(Link TESEO)
(Published as a SEPLN monograph and available in PDF here)