<html>
<head>
<title>
DL94: On the Use of Linguistic Ontologies for Accessing and Indexing Distributed Digital Libraries
</title>
</head>

<body>

<!--#include virtual="/DL94/header.ihtml" -->

<h1> On the Use of Linguistic Ontologies for Accessing and Indexing Distributed Digital Libraries </h1>
<p>
Sujata Banerjee[1] and Vibhu O. Mittal[2]<p>
<p>
<i>
[1] Department of Information Science, <p>
[2] Department of Computer Science, University of Pittsburgh, Pittsburgh, PA 15260<p>
</i>
<p>
<p>
<p>
<b><p>
Abstract</b><p>
This paper  advocates the use  of large lexical  resources to generate synonyms
of terms supplied  by the user  in keyword based matches. The use of one such
resource -- WordNet  [2,10]  -- allows the system  to identify the role of
terms in a phrase (adjective/noun) and attempt to find further matches by
either (i)  generalizing appropriate terms, or (ii)  generating  a  set of
more  specific terms  that  can be used instead. Such an  approach can be
extended  easily to  new terms, and also used in a highly distributed system.
<p>
<b><p>
Keywords:</b> Ontologies, WordNet, information retrieval.  <b><p>
<p>
<p>
<p>
Introduction</b><p>
Digital libraries containing vast amounts of information are expected to soon
be readily available over wide area distributed networks. As the amount of data
available increases, the problem of indexing the resources efficiently and
accessing them easily will also increase. There have been several approaches
proposed previously: these range from simple keyword retrieval (as in DIALOG
and  MEDLINE) to sophisticated approaches that analyze the internal document
structure and attempt a semantic and syntactic analysis in finding suitable
matches [14]. The primary characteristics of these libraries will be: <p>
* A very large repository of information distributed across a wide  area
network. <p>
* A very large range of users (ranging from naive to expert). <p>
* A significant rate of change in the  data leading  to high update rates. <p>
These characteristics will require  that an efficient  indexing scheme allowing
easy retrieval  across  a  distributed representation  be developed.  In this
paper, we review  some of the previous approaches and  then  present our
approach,  based  on  the  use  of a  large and sophisticated linguistic
ontology to  generate  suitable keyword matches. We compare our approach to the
previous ones and discuss the advantages and disadvantages  of these approaches
in  the context of large, full text,  digital  libraries distributed  across a
wide  area network. We conclude by discussing directions for future work.
<b><p>
<p>
Background Work</b><p>
Conventional automated library catalogues are typically based on a simple
keyword pattern matching strategy, augmented by a boolean retrieval (along with
some positional operators such as, ADJ, NEAR, SAME). This approach  suffers
from at least two problems: (1) not all related information is retrieved, and
(2) these searches often  lead to the extraction of irrelevant materials. The
assumption is that keywords, which typically occur in the titles and abstracts
are sufficient to represent the topics in the text. However, the success of
this approach is crucially dependent upon the accuracy of the searcher's
attempts to guess the keywords used by the original author.  In a recent study
of journal articles in the  pure sciences and social science, it was found that
few articles that shared common references (and hence presumably dealt with the
same or a related subject) shared common keywords [1].  Another study in
applied biomaterials reported the difficulty in choosing keywords for a
literature search owing to the usage of multiple terms representing the same
concepts [4].  A third study reported that nearly 11 of the searches at a
particular library retrieved 50 or more articles, while 10 of searches
retrieved no articles [15].  Both facts indicate a poor choice of appropriate
keywords connected by inappropriate boolean operators.  Further examples of
inefficient searches were encountered by the authors with queries executed on
several existing on line library catalogues.  <p>
In order to alleviate some of the problems with purely keyword based systems, a
number of approaches attempted to extend the paradigm by incorporating
strategies such as statistical and probabilistic models, fuzzy logic, and
several different linguistic techniques, e.g., [3,8,16]. There have also been a
number of studies that investigated the use of AI techniques, such as the use
of rule bases, spreading activation, semantic networks, search expertise and
user models in information retrieval [5,7,9,13]. However, these studies were
conducted in limited, specialized domains and did not consider issues of
scalability to general, multiple domain searches. Finally, there have been a
few attempts at characterizing the meta-structure of the text in order to
facilitate focused searching based on the role of the search terms in the text,
e.g. [9,14]. While such approaches represent a powerful searching strategy in
many cases, their strength lies in the ability to distinguish the different
roles that the same keyword can play in the text. Thus, these strategies can
prove helpful in domains such as medicine where the same drug can be used in
different ways (for instance, as a variable in an experiment, as a condition,
or as a measure in an observation). But, as pointed out by Rama and Srinivasan
[14], such interfaces are difficult to use by even expert searchers. In
addition, any approach based on a full text analysis would be difficult to
integrate in a distributed system, where different sites can store data in
different formats with different indices. Other problems with very
sophisticated full text approaches are:  <p>
* these are typically not real-time, especially as the amounts of  data grow,
<p>
* these are not scalable to distributed systems: since each site may  have
different requirements, the storage and indexing of documents at  each site
will be different<p>
* such approaches that compile their indices in advance cannot  handle changes
to the indexing mechanism gracefully -- for instance,  the text grammar
approach [14] would require that the  documents be reprocessed if  the
meta-structure were modified even  slightly. Flexibility is an important
criteria because of the dynamic  nature of the data and the constantly evolving
nature of information  terminology.  <b><p>
<p>
Our Approach: Using a Large Ontology </b><p>
 Our approach to designing an indexing scheme was motivated by the need to make
the scheme scalable to very large amounts of data distributed across a very
large number of sites. Since very large amounts of data are involved, the
scheme had to be fast; since terminology is dynamically evolving, it had to be
flexible, and since it had to be able to scale to distributed networks, with
autonomous sites, it had to minimize the amount of information transmitted
across the network: thus, it could not transmit the complete text of hundreds
of thousands of documents in response to a search query. Thus, full text
analyses, such as the meta-structure method of Rama and Srinivasan [14], or
Liddy [9] were not considered feasible. Keyword approaches were simple, fast
and scalable. The transmission of keywords, in addition to the titles, across
network connections would not not unduly slow down the searches. However, pure
keyword extensions are brittle to changes in terminology, and keyword
extensions, such as the use of thesaurus style keyword synonym lists had been
investigated previously. The previous approaches to the synonym lists did not
succeed because of  three  problems:  <p>
*  the keyword lists were not large, and general enough  <p>
*  they were implemented in very small and specialized domains  <p>
*  they did not attempt any level of word role assignment: i.e., if  a phrase
such as 'large aircraft' was used as a search pattern, the  lists were unable
to take into account the fact that 'large' was an  adjective and other possible
patterns could be 'big aircraft', 'large  planes', etc.<p>
<p>
Our  approach makes use of  a  very large, well developed,  linguistic ontology
to  generate alternative search  patterns,  such as the ones discussed above --
an approach that addresses all of the issues raised earlier. The ontology used
is the WordNet  Lexical Database [2,10], a very large  lexical  database
consisting  of  approximately  100,000 different word  forms,  organized into
70,000  sets of  synonyms. The database, which is organized according to
psycholinguistic principles, divides  the lexicon into  five  categories:
nouns, verbs, adjectives, adverbs  and function  words.  Nouns  are organized
as  topical hierarchies, verbs are organized  by a variety of entailment
relations and  adjectives  and  adverbs  are  organized  as  N-dimensional
hyperspaces. Each of these  lexical organizations reflects a different way  of
categorizing  experiences and the  psychological complexity of lexical
knowledge.<p>
Mappings between words and their meanings are  -- some words can have different
meanings (polysemy), and some meanings can be expressed by  many different
words (synonymy). In an information retrieval system, it is essential that the
system be able to deal with both these aspects of form and meaning.  <p>
-----------------------------------------------------------------------<p>
For each query, until (user satisfied) do:<p>
<p>
	* give results of  an  exact match  <p>
	* if insufficient matches, prompt user for other 	options by generating new
queries:<p>
		by substituting adjective synonyms<p>
		by substituting noun synonyms<p>
		by dropping adjectives<p>
		by generalizing nouns<p>
		by generating multiple queries from 		possible disjoint coverings of nouns
and 		adjectives in the original query <p>
<p>
Figure 1: Query algorithm used by the system. <p>
-----------------------------------------------------------------------<p>
In our approach, based on the user's query, which consists of a set of
keywords, or phrases, the system analyses the query locally and (possibly)
reformulates it as shown in Figure 1. The system first attempts an exact match
(as in traditional systems), and failing that, subsequently generates queries
by analyzing the terms and their roles: a search for 'house pet' will generate
successive queries as follows: <p>
*  family pet, housing pet, lodging pet, business pet <p>
*  house favorite, house animal, house fauna <p>
*  pet (and its synonyms) <p>
*  house animal, house life form, house entity <p>
*  (house dog  house cat  )  <p>
At each step, the  system presents the reformulated  query to the user for
approval and possible modification. This allows the user to select between
different  senses of  the word to  be  used  in  the retrieval process.  This
approach shifts  the  burden of  reasoning on the  site making the request:
sites responding to the request  need only be able to retrieve exact matches
of  words specified  in the request.  Thus, each  site may upgrade  or  refine
its query  generation and  keyword mapping algorithms locally.  This will allow
each site to also extend the lexical resources appropriately depending  upon
the range of users and  their application domains;  for  instance, sites
interested  in medical  applications  could add the  UMLS  [6,11] hierarchy to
this organization and modify it accordingly.  <b><p>
<p>
Conclusions and Future Work</b><p>
To achieve the objective of digital libraries, that of bringing information to
the common man, a lot will depend on how easy it is to obtain the information.
If the onus of finding the exact keywords for each query is made to rest on the
naive users, repeated failures to obtain the relevant data will discourage even
the experienced user, let alone a naive one.  Hence an effective
resource/knowledge discovery mechanism is the key to the success of digital
libraries.  <p>
The approach we have advocated here represents an attempt to integrate
developments in computational linguistics and keyword based database indexing.
While similar approaches have been investigated previously, none of these
attempts used psycholinguistic organizations of lexical resources that were
comparable to WordNet in scope or size.  Since the success or failure of this
approach depends upon the range of alternatives that the system can generate to
use as keywords in finding matches, the quality of the lexical resources used
is crucial.  <p>
This paper did not investigate the issue of indexing other media, such as audio
and video, both of which are expected to form a significant portion of on line
resources. However, we expect that this approach can be used to index embedded
sound and video in documents by  extending the matcher to match on the
captions. We intend to investigate this aspect in future work.  <p>
An important issue in accessing resources will be indexing distributed library
resources. This will require the construction of local, generalized indices,
similar to the construction and maintenance of gigantic filespaces in the
Prospero [12] system.  Each site in a network will be relatively autonomous and
cannot assume that other sites will maintain similar complicated indices, or
semantic or syntactic links to portions of the documents. However, getting each
site to maintain, and advertise, keywords in addition to the titles, is a
reasonable extension, and will allow each site to select and refine its own
matching algorithms. Retrieval approaches that depend upon indexing the
complete document cannot, in general, make such assumptions.  <b><p>
<p>
<p>
References  </b><p>
[1]	Ali, S. N.  Subject Relationship between Articles Determined by
Cooccurrence of  Keywords in Citing and Cited Titles  Journal of Information
Science 19, 3 (1993), 225-231.  <p>
<p>
[2]	Beckwith, R., Fellbaum, C., Gross, D., and Miller, G.A.  WordNet: A Lexical
Database Organized on Psycholinguistic  Principles  In  Using On-line Resources
to Build a Lexicon, U.Zernik, Ed.  Lawrence Erlbaum, Hillsdale, NJ, (in
press).<p>
<p>
[3]	Bookstein, A.  Probability and Fuzzy Set Applications to Information
Retrieval  Annual Review of Information Science Technology 20 (1985), 117-151<p>
<p>
[4]	Bush, R.B.  A Bibliography of Monographic Works on Biomaterials and
Biocompatibility  Journal of Applied Biomaterials 4, 2 (1993), 195-209.  <p>
<p>
[5]	Carbonell, J.G., Evans, D., Scott, D.S., and Thomason, R.H.  On the Design
of Bio-Medical Knowledge Bases  In  MEDINFO (Amsterdam, 1986), R.Salamon,
B.Blum, and  M.Jorgenson, Eds., Elsevier Science, pp.285-303.  <p>
<p>
[6]	Cimino, J.J., Hripcsak, G., Johnson, S.B., Fink, C. F. D.F., and  Clyton,
P.D.  UMLS as knowledge base - A rule-based expert system approach to
controlled medical vocabulary management.  In  Proceedings of the fourteenth
Annual Symposium on Computer  Applications in Medical Care (Washington, DC,
1990), pp.175-179.  <p>
<p>
[7]	Cohen, P., and Kjeldson, R.  Information retrieval by constrained spreading
activation in semantic networks.  Information Processing Management 23, 4
(1987), 255-268.  <p>
<p>
[8]	Evens, M., Vandendorpe, J., and Wang, Y.C.  Lexical-Semantic Relations in
Information Retrieval  In  Humans and Machines: Proceedings of the Fourth
Delaware  Symposium on Language Studies (Norwood, NJ, 1985), Ablex, pp.73-100.
<p>
<p>
[9]	Liddy, E.D.  Structure of information in full text abstracts.  In
Proceedings of the RIAO Conference (1988), pp.183-195.  <p>
<p>
[10]	Miller, G.A.  Word Net: A Dictionary Browser  In  Information in Data:
Proceedings of the First Conference of  the UW Center for the New Oxford
Dictionary (Waterloo, Canada, 1985),  University of Waterloo.  <p>
<p>
[11]	Nelson, S.J.  The semantic structure of the UMLS metathesaurus.  In
Proceedings of the sixteenth Annual Symposium on Computer  Applications in
Medical Care (Baltimore, MD, 1992), pp.649-653.  <p>
<p>
[12]	Neuman, B.C.  The Prospero File System: A Global File System based on the
Virtual  System Model  Computing Surveys 5, 4 (1992), 407-432.  <p>
<p>
13]	Pollitt, A.S.  A rule based system as an intermediary for searching cancer
therapy literature on MEDLINE.  In  Intelligent Information Systems: Progress
and Prospects (Chichester, 1986),  R.Davies, Ed., Ellis Horwood, pp.82-126.  <p>
<p>
[14]	Rama, D.V., and Srinivasan, P.  An Investigation of Content Representation
using Text Grammars.  ACM Transactions on Information Systems 11, 1 (January
1993), 51-75.  <p>
<p>
[15]	Wallace, P.M.  How do Patrons Search the Online Catalog when no ones
looking -  Transaction Log Analysis and Implications for Bibliographic
Instruction and System-Design.  RQ 33, 2 (1993), 239-252.  <p>
<p>
[16]	Wong, S. K.M., Ziarko, W., Raghavan, V.V., and Wong, P. C.N.  On Extending
the Vector Space Model for Boolean Query Processing.  In  Proceedings of the
ACM Conference on Research and Development  in Information Retrieval (Pisa,
1986), F.Rabitti, Ed., pp.175-185.<p>
 
<!--#include virtual="/DL94/footer.ihtml" -->
</body></html>
