» Multimodal Information Retrieval based on knowledge integration


Manuel Carlos Díaz Galiano. April 2011

Abstract:
This study aims to integrate knowledge and filtering techniques in order to improve Multimodal Information Retrieval systems. Traditional Information Retrieval (IR) systems are primarily concerned with textual information. However, the electronic information available today is not only textual but multimodal. By multimodal we mean any format that includes text, images, video or audio; in most cases the information is mixed.

There are specialized systems dealing with the extraction of textual information in different formats. Examples include Content-Based Image Retrieval (CBIR) systems, systems that extract video features, and systems that transcribe conversations to text. In most of these, the information obtained is ultimately expressed as text, so in the end traditional text processing techniques are often used. A multimodal system is one that retrieves information from large collections in various formats.

Such a system can exploit the advantages of the various specialized systems. This multimodality allows, for example, CBIR systems to improve by using the textual information that appears alongside images. These systems are useful for professionals who need to work with formats other than text. Within this area we can consider medical work, which generates large volumes of information on each clinical case, including text and images from the various tests.

This study examines how a multimodal system is affected by filtering and by including knowledge specific to the available textual information. For this purpose, multimodal corpora are used, made available by the various evaluation forums for such systems. We focus on the corpus provided by ImageCLEFmed, as it concerns a specialized environment, healthcare.

(Link TESEO)

» Geographic Information Retrieval based on multiple formulations and search engines


Jose Manuel Perea Ortega. October 2010

Abstract:

This thesis proposes the combined use of various techniques to address the problem of Multimodal Information Retrieval. Traditional text-based Information Retrieval systems are amply tested and analyzed, and the techniques used in them have proven their effectiveness. However, in systems where the goal of the search is not a text, or where the document corpus is not formed only of text, the technologies currently employed do not reach the same performance as textual techniques.

That is why this study focuses on enhancing and improving text retrieval as part of a multimodal retrieval system, applying proven methodologies and tools together. Among the techniques used and studied are the use of external knowledge to improve users’ queries, the filtering of textual collections to remove irrelevant data, and the fusion of the results obtained by the different retrieval systems within a multimodal system.
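
To make the fusion step concrete, the following is a minimal sketch in Python of late fusion by weighted combination of normalized scores (CombSUM-style). The min-max normalization and the weight `alpha` are illustrative assumptions, not the exact method developed in the thesis.

```python
# Minimal late-fusion sketch: combine the ranked lists of a textual
# engine and a CBIR engine by a weighted sum of normalized scores.
# The normalization and the weight alpha are illustrative assumptions.

def min_max_normalize(run):
    """Map the raw scores of one system into [0, 1] so runs are comparable."""
    lo, hi = min(run.values()), max(run.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in run.items()}

def fuse(text_run, image_run, alpha=0.7):
    """Weighted sum of normalized textual and visual scores per document."""
    text_run, image_run = min_max_normalize(text_run), min_max_normalize(image_run)
    docs = set(text_run) | set(image_run)
    fused = {d: alpha * text_run.get(d, 0.0) + (1 - alpha) * image_run.get(d, 0.0)
             for d in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Example: hypothetical scores from a text engine and a CBIR engine.
text_run = {"doc1": 12.3, "doc2": 8.1, "doc3": 5.0}
image_run = {"doc2": 0.91, "doc3": 0.87, "doc4": 0.40}
print(fuse(text_run, image_run))
```

A higher `alpha` trusts the textual engine more, which is the usual choice when, as the thesis notes, textual techniques outperform the other modalities.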

(Link TESEO)

» BRUJA: A System for Multilingual Question Answering


Miguel Ángel García Cumbreras. May 2009

Abstract:
Within natural language processing and information retrieval we find Question Answering systems. Question Answering can be defined as the automated process performed by computers to find concrete answers to specific questions asked by users.

Question Answering systems not only locate relevant documents or passages (within a document collection or unstructured information), but also find, extract and show the answer to the end user, saving them the time of searching or reading through the relevant information to find the final answer manually.

The main components of a Question Answering system, sketched as a simple pipeline after the list, are:

- Analysis of the question
- Retrieval of documents or relevant passages
- Extraction of answers
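
As a rough illustration of how these three components fit together, the following Python skeleton mirrors the pipeline; the question-classification rules and the retrieval and extraction functions are simplified placeholders, not BRUJA’s actual modules.

```python
# Generic Question Answering pipeline skeleton mirroring the three
# components above. All rules and helpers are illustrative placeholders.

def analyze_question(question):
    """Guess the expected answer type from the question word."""
    q = question.lower()
    if q.startswith(("who", "whom")):
        return "PERSON"
    if q.startswith("when"):
        return "DATE"
    if q.startswith("where"):
        return "LOCATION"
    return "OTHER"

def retrieve_passages(question, collection, k=3):
    """Rank passages by naive term overlap with the question."""
    q_terms = set(question.lower().split())
    scored = [(len(q_terms & set(p.lower().split())), p) for p in collection]
    return [p for score, p in sorted(scored, reverse=True)[:k] if score > 0]

def extract_answer(passages, answer_type):
    """Placeholder: real systems apply NER / pattern matching here."""
    return passages[0] if passages else None

def answer(question, collection):
    a_type = analyze_question(question)
    passages = retrieve_passages(question, collection)
    return extract_answer(passages, a_type)
```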

Today there are systems designed to find answers to user questions using a single language for the collections and any language for the question; it is then only necessary to apply one translation, of the question into the language of the collections, in order to work in monolingual mode.

In this study a multilingual Question Answering system, called BRUJA (Búsqueda de Respuestas en la Universidad de Jaén, “search for answers at the University of Jaén”), has been researched and developed. The term “multilingual” is used here in its full, CLIR (Cross-Language Information Retrieval) sense: the system accepts questions in any of the supported languages, uses collections in several languages, and returns the final answer in the same language as the question.

Several possible solutions for the various modules have been investigated, developed and tested and then integrated into a final solution. The final version of the system works in three languages: English, Spanish and French, with possible expansion to other languages.

This research work and PhD thesis was awarded the rating of Excellent Cum Laude and, in 2010, received the prize for the best doctoral thesis in the field of Natural Language Processing and Information Retrieval from the Spanish Society for Natural Language Processing (SEPLN); it was later published in full as a monograph.

(Link TESEO)

(Published as a SEPLN monograph and available here in PDF)

» Resolution of lexical ambiguity by learning vector quantization


Manuel García Vega. December 2006

Abstract:
Word Sense Disambiguation is the problem of assigning a specific meaning to a polysemous word using its context. The problem has been of interest almost since the beginning of computing in the 1950s. Disambiguation is an intermediate task, not an end in itself. In particular, it is very useful, and sometimes necessary, for many NLP problems such as information retrieval, text categorization and machine translation.

The goal of this thesis is to implement a word-sense tagger based on the Vector Space Model, optimizing the weights of the training vectors with an LVQ (Learning Vector Quantization) network, the supervised variant of Kohonen’s neural model, and to propose a uniform method for integrating the resources used to train the network. The LVQ network parameters have been optimized for the disambiguation problem.
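
As an illustration of the underlying mechanism, the following is a minimal LVQ1 sketch in Python: codebook vectors labeled with word senses are pulled toward training context vectors of the same sense and pushed away from those of a different sense. The learning rate, its decay schedule and the number of epochs are illustrative assumptions, not the optimized settings of the thesis.

```python
import numpy as np

# Minimal LVQ1 sketch for word-sense tagging: prototypes (codebook
# vectors) labeled with senses are attracted to same-sense training
# vectors and repelled from different-sense ones. Hyperparameters are
# illustrative assumptions, not the thesis's tuned settings.

def train_lvq1(X, y, prototypes, proto_labels, lr=0.1, epochs=20):
    P = prototypes.copy()
    for epoch in range(epochs):
        rate = lr * (1 - epoch / epochs)                  # decaying learning rate
        for x, label in zip(X, y):
            w = np.argmin(np.linalg.norm(P - x, axis=1))  # winning prototype
            if proto_labels[w] == label:
                P[w] += rate * (x - P[w])                 # attract: same sense
            else:
                P[w] -= rate * (x - P[w])                 # repel: different sense
    return P

def classify(x, prototypes, proto_labels):
    """Assign the sense of the nearest prototype to a context vector."""
    return proto_labels[np.argmin(np.linalg.norm(prototypes - x, axis=1))]
```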

This work has shown that neural networks, and specifically Kohonen models, solve the problem of lexical ambiguity resolution brilliantly. They provide robustness, because the LVQ network is insensitive to small changes and consistent results were observed regardless of the training; flexibility, because they are easily applicable to any NLP task; scalability, because many different training texts can be introduced to suit any domain; and effectiveness, because the results obtained are comparable to, and in many cases outperform, those of the traditional methods used to solve the same problems.

The SemCor corpus and the WordNet lexical database have been integrated, and a method for the automatic integration of any corpus is also provided. Experiments show that the network performs well on the specific problem of disambiguation.

(Link TESEO)

» The problem of merging collections in multilingual and distributed information retrieval: Calculation of documentary relevance in two steps


Fernando Martínez Santiago. October 2004

Abstract:
In this thesis a new approach, calculating documentary relevance in two steps, is proposed to address the widely known problem of merging collections or, simply, result merging. In short, collection fusion arises in Information Retrieval when a system, given a user's information need, must respond with a single list of documents relevant to the query. Sometimes that list is obtained by fusing several lists obtained independently of each other, and this thesis focuses on that aspect, illustrating the effectiveness of the proposed method in two scenarios: Multilingual Information Retrieval and Distributed Information Retrieval.

One hypothesis advocated in this text is that, given a particular information need, the scores given to two documents from two different collections are not comparable, primarily because the score assigned to a document is not an absolute value but depends strongly on the collection to which the document belongs. Moreover, it is possible to view the union of all the documents returned by each search engine as a new collection of small size and small vocabulary, since only the terms in the user's query are of interest in this new collection. Under these simplifications, this collection can be re-indexed and matched against the user's query, thus obtaining a new list of documents scored only in relation to this new collection.
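
A minimal sketch of this two-step idea in Python, assuming a simple TF-IDF weighting for the second step (the thesis's exact weighting scheme may differ): pool the documents returned by each engine, index only the query terms, and re-score the pooled documents against the query.

```python
import math
from collections import Counter

# Two-step relevance sketch: the union of documents returned by several
# engines becomes a new small collection, indexed over the query terms
# only, and every pooled document is re-scored against the query.
# The TF-IDF weighting here is an illustrative assumption.

def two_step_merge(query_terms, result_lists):
    # Step 1: union of documents retrieved by every engine (doc_id -> text).
    pool = {doc_id: text for run in result_lists for doc_id, text in run}
    n = len(pool)
    # Index only the query terms: document frequency per term in the pool.
    df = {t: sum(t in text.lower().split() for text in pool.values())
          for t in query_terms}
    # Step 2: re-score each pooled document relative to this new collection.
    scores = {}
    for doc_id, text in pool.items():
        tf = Counter(text.lower().split())
        scores[doc_id] = sum(
            tf[t] * math.log(1 + n / df[t]) for t in query_terms if df[t])
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Because every document is scored against the same pooled collection, the scores are comparable regardless of which engine or language collection each document came from.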

The results show that the proposed method is stable, always obtaining an improvement of between 20% and 40% over other approaches, regardless of the language.

(Link TESEO)

» Automatic document classification in the domain of High Energy Physics


Arturo Montejo Raez

Abstract:
This study is a proposed solution to the problem of massive multi-tagging of documents in general, and documents in the domain of high energy physics in particular.

This problem is called Text Categorization: predefined keywords are treated as categories to be assigned to documents based on their textual content. During this research, conducted mainly at CERN, the European Organization for Nuclear Research, the document collection revealed problems not previously covered in the literature. The pressing need for a solution to managing such data, one going beyond mere scientific analysis and prototyping, has shaped the hypotheses throughout the study.
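
As a rough sketch of the multi-label categorization task described here, assigning predefined keywords can be framed as one binary classifier per keyword over TF-IDF features. The scikit-learn example below is purely illustrative, with made-up toy data, and is not the system actually built at CERN.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.pipeline import make_pipeline

# Multi-label categorization sketch: each predefined keyword becomes one
# binary classifier (one-vs-rest). Toy data for illustration only.
docs = ["measurement of the higgs boson mass",
        "neutrino oscillation detector calibration",
        "higgs boson decay channels in the detector"]
keywords = [{"higgs", "mass"}, {"neutrino", "detector"}, {"higgs", "detector"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(keywords)          # keyword sets -> indicator matrix

model = make_pipeline(TfidfVectorizer(),
                      OneVsRestClassifier(LogisticRegression(max_iter=1000)))
model.fit(docs, Y)

pred = model.predict(["higgs boson detector upgrade"])
print(mlb.inverse_transform(pred))       # keywords assigned to the new document
```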

The final solution implemented as a result of this research has opened up a wide range of applications, giving me the pleasant feeling of usefulness normally neglected in pure research. The reader will find out how exciting this task was; what cannot be conveyed here is the personal enrichment gained from working for four years in an international environment, with a team that brought the most advanced computational techniques to the community of CERN library users, the largest in the world of physics.

» LVQ algorithm applied to natural language processing tasks


María Teresa Martín Valdivia. May 2004

Abstract:
Natural Language Processing (NLP) and Artificial Neural Networks (ANN) are two key areas of Artificial Intelligence. However, despite the large amount of work in both disciplines, attempts to combine them have been scarce.

On the one hand, studies that include machine learning in NLP systems are numerous; on the other, ANNs have been applied to a number of problems with features similar to those of NLP. Interestingly, however, the number of studies that use ANNs in NLP systems is very small. This is all the more surprising given that the results of the few existing studies show the neural approach to be a good alternative for building learning-based NLP systems.

The main purpose of this thesis is to demonstrate that it is possible to take advantage of features of ANNs to successfully address the development and implementation of systems that deal with language automatically.

To do this, a common formalism based on a neural model for solving various NLP tasks is proposed. Specifically, three tasks will be discussed:
• Text categorization
• Resolution of lexical ambiguity
• Information retrieval

While for the first two tasks complete systems will be developed, for information retrieval two specific issues related to these types of systems will be addressed:
• The recognition of multiword terms
• Collection fusion

The first problem is examined from a monolingual perspective, while the second is addressed in a multilingual environment.

The neural scheme used is based on the Kohonen model, and more specifically on its supervised variant: the Learning Vector Quantization (LVQ) algorithm. It will be shown that this algorithm can be adapted to solve real natural language processing applications, presenting it as a robust, flexible and effective method. Experiments show that the LVQ algorithm adapts easily to the different scenarios used, and the results obtained are comparable to, and in many cases outperform, those of the traditional methods used to solve each of the problems studied.

(Link TESEO)

» Lexical Ambiguity Resolution in Automatic Document Classification Tasks


L. Alfonso Ureña López. November 2000

Abstract:
In this study content analysis tasks are described, and the resolution of lexical ambiguity and document classification are studied, drawing parallels between the two fields. We analyze the existing linguistic resources and investigate the ways in which they can improve the effectiveness of disambiguation techniques.

The main contribution of this thesis is a new approach to resolving lexical ambiguity based on the integration of linguistic resources, using information from a text corpus (SemCor) and a lexical database (WordNet). We perform a direct assessment of disambiguation, which experimentally demonstrates, on a wide test collection, the effectiveness of the resource-integration approach using automatic evaluation.

Resolution of lexical ambiguity is applied to two specific document classification tasks: information retrieval and text categorization. In information retrieval, the terms of the query are expanded with WordNet information once they have been disambiguated by feedback. In text categorization, automatic resolution of lexical ambiguity is likewise approached through the integration of the Reuters corpus and the WordNet lexical database.

This is a novel approach because it incorporates automatic disambiguation into the integration of linguistic resources in the text categorization task.
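
The query expansion step described above can be sketched as follows, using NLTK's WordNet interface for illustration; selecting the first synset is a crude stand-in for the feedback-based disambiguation the thesis actually performs.

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)     # fetch WordNet data the first time

# Query-expansion sketch: each query term is expanded with the synonyms
# of one WordNet sense. Taking the first synset is a placeholder for the
# feedback-based disambiguation the thesis actually performs.

def expand_query(terms):
    expanded = set(terms)
    for term in terms:
        synsets = wn.synsets(term)
        if synsets:                      # sense choice here is a placeholder
            for lemma in synsets[0].lemmas():
                expanded.add(lemma.name().replace("_", " "))
    return sorted(expanded)

print(expand_query(["car", "engine"]))
```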

Finally, we discuss and evaluate both tasks through a systematic method that allows us to compare the effectiveness of the system in the field of document classification, both for information retrieval and text categorization.

(Link TESEO)
(Published as a SEPLN monograph and available in PDF here)