<html>
<head>
<title>
DL94: Knowledge-Based Access to Heterogeneous Information Sources
</title>
</head>


<body>

<!--#include virtual="/DL94/header.ihtml" -->

<h1>
Knowledge-Based Access to Heterogeneous Information Sources
</h1>
<p>
<p>
Richard M. Tong and David H. Holtzman<p>
<i>
Advanced Decision Systems, <br>
division of Booz, Allen &amp; Hamilton, Inc.<br>
1500 Plymouth Street, Mountain View, CA 94043</i><p>
<p>
<p>
<p>
<b><p>
Abstract</b><p>
Booz, Allen &amp; Hamilton has developed a knowledge-based approach to
"mediated" cross-database access that uses an object-oriented knowledge
representation to enable the user to construct detailed models of the retrieval
domain of interest. The Booz, Allen &amp; Hamilton system that implements these
ideas is called MINERVA and currently uses two kinds of mediators to support
access. "Query Mediators" first transform user queries into a series of
finer-grained information requests using the knowledge-base of retrieval
models, and then probabilistically combine and evaluate the responses from the
individual sources. "Source Mediators" first transform the fine-grained
requests from the Query Mediators into source specific requests using local
knowledge of the sources and their information characteristics, and then
transform the responses from the specific sources into a form that the Query
Mediators can use for combination and evaluation.<p>
<b><p>
Keywords</b>: Heterogeneous databases, mediated systems, distributed
information retrieval, knowledge-based information access.<b><p>
<p>
<p>
<p>
1.  Introduction</b><p>
Perhaps the main challenge facing enterprises today is ensuring that
mission-critical information is made readily and easily available to those in the
enterprise who need it. Enterprises have traditionally used a variety of
database systems to store and organize data and have struggled with the
concomitant issues of disparate, and often conflicting, data models and access
mechanisms. In recent years, this problem has been exacerbated by the
availability of large numbers of information sources outside of the enterprise
that are also important to the operation of the enterprise. A further
complicating issue is that end users are also increasingly making use of
non-traditional data to support their activities. Images, video and sound are
now easy to capture and store in digital form, and potentially offer a wealth
of information not readily available in the past.<p>
In our view, information access is not a separate activity that a user performs
in isolation from other responsibilities, but rather is an integral part of the
overall working environment. As we move towards this new data rich world, we
need to develop advanced information access systems that can both deal directly
with the heterogeneity and autonomy of the sources in the underlying
information space, and give the user the ability to move easily between tools
that aid in the execution of specific tasks and those that help locate and
analyze information to support these tasks. Thus, given the disparate nature of
the emerging information sources, the need to maintain access to legacy
sources, the heterogeneity of the software and hardware systems used to host
the sources, and the need to provide end-users with transparent access to the
massive underlying information space, we need to develop new paradigms for
information search and retrieval that can form the basis of these future
information systems.<p>
A powerful general model of the software architecture needed for this next
generation of information systems was recently proposed by Wiederhold [1] and
involves the use of intelligent mediators that support various transformations
between end-users and the information sources. In the spirit of this
architectural vision, Booz, Allen has developed a prototype system, called
MINERVA, that provides knowledge-based, mediated access to heterogeneous
information sources.<p>
In this short paper we describe the overall MINERVA system, discuss our
approach to providing heterogeneous access to full-text databases, and
introduce our current research in extending the MINERVA model to structured
data access.<b><p>
<p>
<p>
2.  The MINERVA System</b><p>
MINERVA<a href="#fn1">[1]</a> is an open, client-server system that uses
TCP/IP for basic network connectivity, uses an extensible, object-oriented
meta-language for application-level communications, and encapsulates
information sources (especially legacy information sources) using
source-specific "wrappers." MINERVA uses two kinds of mediators to support access:<p>
*	Query Mediators that first transform user queries into a series of
finer-grained information requests, and then probabilistically combine and
evaluate the responses received from the individual sources; and<p>
*	Source Mediators that first transform the fine-grained requests from the
Query Mediator into source specific requests, and then transform the responses
from the specific sources into a form that the Query Mediator can use for
combination and evaluation.<p>
In the remainder of this section, we describe the underlying MINERVA
infrastructure, outline our approach to Query Mediation, and discuss the nature
of Source Mediators for full-text and structured databases.<b><p>
<p>
2.1.  The Distributed Information Operating Environment (DIOE)</b><p>
MINERVA is designed to implement mediated access against information sources
that are physically separated from each other and from the user. It is also
designed to provide access when the information sources have a native search
capability (e.g., a text search engine, or an RDBMS), without removing or
bypassing this capability. In this way it can operate in distributed
environments and with legacy systems. We call the environment that deals with
the distributed nature of the sources and users the Distributed Information
Operating Environment (DIOE).<p>
Figure 1 shows an instance of the DIOE that illustrates its main features and
is intended to give some context to the technical discussion in the following
sections. As the figure shows, the DIOE has three key functional components:<p>
*	Dispatchers that manage the passing of messages (related to queries, results
and various administrative functions) among the distributed resources;<p>
*	Query Mediators (we only show one in this figure) that perform query
decomposition and scoring; and<p>
*	Source Mediators that allow information sources to connect into the DIOE.<p>
Although the figure shows four Dispatchers with Desktops and services attached,
the ring can contain any number of Dispatchers and the services themselves can
exist anywhere in the ring. So, for example, the figure could just as easily
have shown two Dispatchers with the Desktops attached to one Dispatcher and all
the other services attached to the other Dispatcher. The point is that the DIOE
is a flexible, configurable environment that allows users, information sources,
and other services to be connected in completely unrestricted ways.<p>
 <p>
<p>
<img src="figures/tong1.gif"><p>
<i>Figure 1. The MINERVA DIOE.</i> <p>
<p>
The basis of the open architecture is the Dispatcher, which acts as a router
for its own local service network. The Dispatcher obtains requests from local
users (these originate at the desktops shown in the figure) and decides which
local service can handle these requests. In addition, the Dispatcher passes the
requests on to other available Dispatchers. Services can be any process that is
connected to the Dispatcher -- for example, applications (shown in the figure
as an Application Mediator), Query Mediators, any form of Source Mediator (we
show connectivity to text DBMSs, structured DBMSs, and image DBMSs), and
Gateway Mediators to other systems. Dispatchers connect to one another in a
virtual ring, and MINERVA contains capabilities for establishing connections to
the ring and for managing the processing in the ring.<p>
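To make the routing behavior concrete, the following Python sketch models Dispatchers linked in a virtual ring. It is an illustration only and not the MINERVA implementation: the class <tt>Dispatcher</tt>, the helper <tt>make_ring</tt>, the service names, and the rule that a request visits each Dispatcher exactly once are all assumptions made for the example.<p>
<pre><code>
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass(eq=False)
class Dispatcher:
    """Hypothetical router for one local service network in a virtual ring."""
    name: str
    services: Dict[str, Callable[[str], str]] = field(default_factory=dict)
    next_in_ring: Optional["Dispatcher"] = field(default=None, repr=False)

    def register(self, kind: str, handler: Callable[[str], str]) -> None:
        # A service is any process attached to this Dispatcher.
        self.services[kind] = handler

    def dispatch(self, kind: str, request: str) -> List[str]:
        """Collect replies from every Dispatcher in the ring that hosts `kind`."""
        replies: List[str] = []
        node = self
        while node is not None:
            handler = node.services.get(kind)
            if handler is not None:
                replies.append(handler(request))
            node = node.next_in_ring
            if node is self:  # back at the originating Dispatcher
                break
        return replies


def make_ring(dispatchers: List[Dispatcher]) -> None:
    # Link each Dispatcher to its successor; the last links back to the first.
    for i, d in enumerate(dispatchers):
        d.next_in_ring = dispatchers[(i + 1) % len(dispatchers)]


a, b = Dispatcher("A"), Dispatcher("B")
make_ring([a, b])
a.register("query_mediator", lambda q: "A handled " + q)
b.register("text_source", lambda q: "B searched " + q)
print(a.dispatch("text_source", "SCUD"))
</code></pre>
<p>
In this sketch a request issued at Dispatcher A for a service hosted only at Dispatcher B is still answered, which mirrors the point that services can sit anywhere in the ring.<p>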
<b><p>
2.2. Concept-Based Query Mediators</b><p>
Query Mediators in MINERVA make use of the concept-based retrieval technology
developed by ADS [2]. In this approach, users formulate queries using
"concepts" that have meaning in the application domain. Concepts are domain
dependent entities which carry the semantics of that domain and which can be
described and organized without reference to specific information sources. It
is important to realize, however, that the representations of domain knowledge
developed using this approach will not, in general, be exhaustive descriptions
of that domain, but rather will constitute a Retrieval Model that reflects the
information needs of the user (or user group) that developed them. That is,
this approach is intended as a way of describing valuable (to the user)
distinctions in the domain rather than as an "expert" on specific topics.<p>
To make some of these ideas more concrete, we will use a simple, running
example throughout the remainder of the paper. This will allow us to connect
together the various technical themes of our research effort. Our example
focuses on an intelligence analyst who is tasked with monitoring events that
might indicate the potential clandestine delivery of military equipment to
Iraq. The analyst has at his disposal various information sources, such as
message traffic, imagery and on-line technical information about military
equipment, which help him monitor such events. To support this activity, the
analyst will want to draw upon a concept knowledge base that describes events
of various kinds and which can be used to focus the search for relevant
information.<p>
In our approach, defining a concept corresponds to distinguishing a sub-class
of some larger class of entities. To illustrate, first let us suppose that we
are interested in various kinds of events, then naming trade events as a
sub-class implies that we have some way of distinguishing the elements of this
smaller class from those in the larger class. In our approach, the mechanism
for describing this distinction is to specify the "attributes" of the
sub-classes. A simple example of the representation is shown in Figure 2.<p>
<img src="figures/tong2.gif"><p>
<i>Figure 2. Example Concept Definition.</i><p>
<p>
Here we assume that our overall goal is to model the
domain of trade events and that for us a trade event can be defined in terms of
actions, customers, suppliers and products. In the figure, trade event is a
concept (shown as an oval) and the rest are attributes (shown as labelled arcs
between ovals). For the purposes of the example, we further assume that
customers are countries, suppliers are companies, products are various kinds of
equipment, and that a sea shipment is an action of interest. These concepts in
turn can have attributes (as shown). Any given concept can also have
sub-concepts, and we show one instance of this in the figure where military
equipment is defined as a sub-class of equipment (indicated by the solid
unlabelled arc).<p>
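The concept network described above can be sketched as a small Python data structure. This is a minimal illustration under stated assumptions, not the representation used in MINERVA: the class <tt>Concept</tt>, its field names, and the attribute labels are hypothetical, chosen to follow the trade event example.<p>
<pre><code>
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(eq=False)
class Concept:
    """A node in a concept knowledge base (hypothetical structure)."""
    name: str
    parent: Optional["Concept"] = None  # unlabelled sub-class arc
    attributes: Dict[str, "Concept"] = field(default_factory=dict)  # labelled arcs

    def is_a(self, other: "Concept") -> bool:
        # Walk the sub-class chain upward until `other` is found or the chain ends.
        node = self
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False


country = Concept("country")
company = Concept("company")
equipment = Concept("equipment")
military_equipment = Concept("military_equipment", parent=equipment)
sea_shipment = Concept("sea_shipment")

trade_event = Concept("trade_event", attributes={
    "action": sea_shipment,
    "customer": country,
    "supplier": company,
    "product": equipment,
})

print(military_equipment.is_a(equipment))
print(sorted(trade_event.attributes))
</code></pre>
<p>
Keeping sub-class links separate from attribute links preserves the distinction the figure draws between unlabelled and labelled arcs.<p>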
A simple query corresponding to the concept of a trade event would then be
successively decomposed by the Query Mediator into the lower-level domain
concepts using the information in the concept knowledge base. When the terminal
domain concepts are reached, the decomposition then proceeds by interpreting
the "source reference language" expressions attached to these nodes.<a
href="#fn2">[2]</a> This decomposition is shown schematically in Figure 3. <p>


<p>
<img src="figures/tong3.gif"><p>
<i>Figure 3. Knowledge-Based Mediation in MINERVA.</i><p>
<p>
The figure illustrates the process of decomposing a user query, which may
involve multiple concepts, using the network of concept definitions, down to
the source reference language components. In general, terminal nodes may have
multiple source reference language expressions attached to them, although in
the figure, for the sake of clarity, we separate out the text, structured data
and image reference components.<p>
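The decomposition step can be sketched as a recursive walk over the concept network. The Python below is illustrative only: the dictionaries <tt>SUB_CONCEPTS</tt> and <tt>SOURCE_REFS</tt>, the function <tt>decompose</tt>, and every expression string are hypothetical placeholders, and sub-class and attribute links are flattened into a single "lower-level concept" relation for brevity.<p>
<pre><code>
from typing import Dict, List, Tuple

# Hypothetical knowledge base: each concept maps to its lower-level concepts.
SUB_CONCEPTS: Dict[str, List[str]] = {
    "trade_event": ["sea_shipment", "country", "company", "equipment"],
    "equipment": ["military_equipment"],
}

# Hypothetical source reference expressions attached to terminal concepts,
# tagged by the kind of source they address (the strings are placeholders).
SOURCE_REFS: Dict[str, List[Tuple[str, str]]] = {
    "sea_shipment": [("text", "(OR SHIPMENT CARGO)")],
    "country": [("text", "(OR IRAQ)"), ("structured", "customer = 'Iraq'")],
    "company": [("structured", "supplier IS NOT NULL")],
    "military_equipment": [("text", "(OR SCUD SCUD-C)"), ("image", "tel_vehicle")],
}


def decompose(concept: str) -> List[Tuple[str, str, str]]:
    """Return (terminal concept, source kind, expression) triples for a query."""
    children = SUB_CONCEPTS.get(concept)
    if children is None:
        # Terminal concept: hand back its source reference expressions.
        return [(concept, kind, expr) for kind, expr in SOURCE_REFS.get(concept, [])]
    requests: List[Tuple[str, str, str]] = []
    for child in children:
        requests.extend(decompose(child))
    return requests


for request in decompose("trade_event"):
    print(request)
</code></pre>
<p>
Each resulting triple corresponds to one fine-grained request that would be routed to a Source Mediator of the matching kind.<p>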
The individual source reference expressions are communicated to the Source
Mediators, via the DIOE, and converted into native query language expressions
that can be applied to the actual sources. The information returned by the
individual sources (i.e., information about which data objects satisfied the
source reference expressions, together with a numerical measure of the degree
to which the expressions were satisfied) is passed back to the Query Mediator where
the individual pieces of "evidence" are combined to produce an overall measure
of the relevance of the data objects to the original request.<a
href="#fn3">[3]</a> Once this combination process is complete, the Query
Mediator communicates the final results back to the user Desktop from which the
query originated.<p>
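As a rough illustration of an interval-based evidential calculus of the kind mentioned in the footnote, the Python sketch below represents a belief as a pair of necessary and possible support. The combination rules shown (min for conjunction, max for disjunction, and a swapped complement for negation) are a common choice in possibility-style calculi and are an assumption here; the paper does not specify the rules MINERVA actually uses. The names <tt>Belief</tt>, <tt>IGNORANT</tt>, <tt>negate</tt>, <tt>conj</tt>, and <tt>disj</tt> are hypothetical.<p>
<pre><code>
from dataclasses import dataclass


@dataclass(frozen=True)
class Belief:
    """Interval of support for a concept, on a scale from 0 (none) to 1 (full)."""
    necessary: float  # support the evidence guarantees
    possible: float   # support the evidence does not rule out

    def __post_init__(self) -> None:
        if not (1.0 >= self.possible >= self.necessary >= 0.0):
            raise ValueError("bounds must lie in 0..1 with necessary not above possible")


# Total ignorance: the evidence tells us nothing about the concept.
IGNORANT = Belief(0.0, 1.0)


def negate(b: Belief) -> Belief:
    # Support for a concept denies its negation, so the bounds swap and flip.
    return Belief(1.0 - b.possible, 1.0 - b.necessary)


def conj(a: Belief, b: Belief) -> Belief:
    # Assumed rule: a conjunction is only as supported as its weakest part.
    return Belief(min(a.necessary, b.necessary), min(a.possible, b.possible))


def disj(a: Belief, b: Belief) -> Belief:
    # Assumed rule: a disjunction is as supported as its strongest part.
    return Belief(max(a.necessary, b.necessary), max(a.possible, b.possible))


strong = Belief(0.9, 1.0)
weak = Belief(0.6, 1.0)
print(conj(strong, weak))
print(disj(strong, IGNORANT))
print(negate(IGNORANT))
</code></pre>
<p>
Note that negating total ignorance yields total ignorance again, which is the property that distinguishes being ignorant from being uncertain in this kind of model.<p>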
<b><p>
2.3. Full Text Data Access</b><p>
The current implementation of MINERVA provides a flexible, scalable approach to
heterogeneous text database access. In this section we briefly describe the
techniques we use as a way of introducing some of the basic ideas that underpin
our solution to the more general problem.<p>
For accessing text databases we have developed a general "Text Reference
Language" (TRL) into which concept-based queries are decomposed. The TRL
provides a mechanism for defining patterns of text which, if they appear in the
body of the document, can be taken as evidence for a concept. To illustrate,
suppose we have the following definition of the concept of a SCUD missile in
our concept KB:<p>
<tt><p>
CONC:+:Scud_Missile<p>
SCOP:PARAGRAPH<p>
PATT:100:100:(OR SCUD SCUD-C)<p>
PATT:90:100:(SENTENCE MISSILE TEL)<p>
PATT:60:100:(OR SSM PHRASE(MOBILE MISSILE))<p>
</tt><p>
where the lines that begin PATT: indicate that the expressions that follow are
statements in the TRL. The pairs of integers associated with each PATT: line
represent the necessary and possible support that the evidence gives for the
concept. The SCOP: line that precedes the PATT: lines defines the "scope" within
which evidence for the concept is to be gathered. Here it has value PARAGRAPH,
indicating that the focus of the evidence gathering should be at the paragraph
level, so that a document that has all three pattern expressions satisfied
within the same paragraph will generate a higher score than a document that
contains them spread throughout the document. The TRL expressions themselves
are to be interpreted in an obvious way and are directly analogous to the
keyword-based query languages usually found in commercial text retrieval
systems (e.g., BRS/Search, NEXIS, DIALOG). The key distinction, however, is
that the MINERVA TRL is not specific to any particular retrieval system and is
designed to be a more general query language.<p>
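The record layout of the concept definition above is simple enough to read with a few lines of Python. The sketch below is an assumption-laden illustration, not MINERVA's parser: the names <tt>Pattern</tt>, <tt>ConceptDefinition</tt>, and <tt>parse_definition</tt> are hypothetical, the TRL expressions are kept as unparsed strings, and the "+" flag on the CONC: line is skipped because its meaning is not described in the text.<p>
<pre><code>
from dataclasses import dataclass, field
from typing import List


@dataclass
class Pattern:
    necessary: int   # first integer on the PATT: line
    possible: int    # second integer on the PATT: line
    expression: str  # TRL expression, kept as an unparsed string


@dataclass
class ConceptDefinition:
    name: str = ""
    scope: str = ""
    patterns: List[Pattern] = field(default_factory=list)


def parse_definition(text: str) -> ConceptDefinition:
    """Read CONC:, SCOP: and PATT: lines into a ConceptDefinition."""
    definition = ConceptDefinition()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        tag, _, rest = line.partition(":")
        if tag == "CONC":
            # "CONC:+:Scud_Missile": drop the flag, keep the concept name.
            definition.name = rest.split(":", 1)[1]
        elif tag == "SCOP":
            definition.scope = rest
        elif tag == "PATT":
            necessary, possible, expression = rest.split(":", 2)
            definition.patterns.append(
                Pattern(int(necessary), int(possible), expression))
    return definition


SCUD = """
CONC:+:Scud_Missile
SCOP:PARAGRAPH
PATT:100:100:(OR SCUD SCUD-C)
PATT:90:100:(SENTENCE MISSILE TEL)
PATT:60:100:(OR SSM PHRASE(MOBILE MISSILE))
"""

scud = parse_definition(SCUD)
print(scud.name, scud.scope, len(scud.patterns))
</code></pre>
<p>
Splitting each PATT: line at most twice keeps any colons inside the TRL expression intact.<p>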
The advantage of this strategy is that we can then develop intelligent Source
Mediators that transform the MINERVA TRL requests into native query language
requests. Not all text DBMSs support all the operators defined by the TRL, so
part of the knowledge contained in the Source Mediator is the definition of
these translations. So, for example, many DBMSs do not support the SENTENCE
operator and in such cases the Source Mediator might replace this with a
conjunction. Similarly, some DBMSs might not support the PHRASE operator and
the Source Mediator might replace it with some form of adjacency operator. The
net effect is that the Source Mediators provide a "fill in" capability that
allows each individual text source to behave like a fully functional TRL
source. This needs to be contrasted with some other approaches to heterogeneous
text database access in which the common query language is the "lowest common
denominator" of the individual source query languages. This is illustrated in
Figure 4.<p>

<p>
<img src="figures/tong4.gif"><p>
<i>Figure 4. Full-text Database Encapsulation.</i><p>
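The "fill in" behavior of a Source Mediator can be sketched as an operator substitution over a parsed TRL expression. The Python below is a hedged illustration: the tuple encoding of expressions, the <tt>FALLBACKS</tt> table, the adjacency operator name <tt>ADJ</tt>, and the infix output syntax all belong to a hypothetical text engine and are not taken from any real product or from MINERVA itself.<p>
<pre><code>
from typing import Dict, Tuple, Union

# A TRL expression is either a term or a tuple (OPERATOR, operand, operand, ...).
Expr = Union[str, Tuple]

# Hypothetical capability table for one text DBMS: TRL operators it lacks,
# mapped to the closest operator it does support.
FALLBACKS: Dict[str, str] = {"SENTENCE": "AND", "PHRASE": "ADJ"}


def to_native(expr: Expr, fallbacks: Dict[str, str]) -> str:
    """Render a TRL expression tree as an infix query for a hypothetical engine."""
    if isinstance(expr, str):
        return expr
    operator, *operands = expr
    # Use the substitute operator if the source lacks the TRL one.
    native_op = fallbacks.get(operator, operator)
    rendered = [to_native(operand, fallbacks) for operand in operands]
    return "(" + (" " + native_op + " ").join(rendered) + ")"


trl = ("OR", "SSM", ("PHRASE", "MOBILE", "MISSILE"))
print(to_native(trl, FALLBACKS))
print(to_native(("SENTENCE", "MISSILE", "TEL"), FALLBACKS))
</code></pre>
<p>
Because the substitution table lives in the Source Mediator, the Query Mediator can always issue full TRL requests regardless of what each engine supports.<p>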
<p>
<b><p>
2.4. Structured Data Access</b><p>
While many enterprises make extensive use of structured data,<a
href="#fn4">[4]</a> they also often use a variety of database systems to store
and organize the data, and have then to deal with the issues of disparate, and
often conflicting, data models and access mechanisms. This is a widely studied
problem and a great deal of literature has appeared in recent years. Within the
MINERVA framework proposed here, the challenge is to exploit this body of
research to design and develop Query Mediators that can treat text and
structured data in a uniform and transparent manner.<p>
The MINERVA approach is to exploit the existing knowledge-based Query Mediator
by extending the concept knowledge-base to incorporate (concept-based) models
of the structured databases, and by extending the source reference language so
that it can define evidential primitives both for text and structured data. The
advantage of this approach is that, from the user's perspective, all
information requests use the same query language and are always described in
terms of domain concepts, even though the underlying data models are quite
different.<p>
The key difference between the structured database access problem and the text
database access problem is that we have to consider the issue of representing
the semantics of each database instance. That is, it is not sufficient to
provide a generic Source Mediator for a particular DBMS, as we can do with
text, but rather we actually have to model the data stored in the database if
we are to access it intelligently. Thus if we are to connect to structured
databases as well as text databases, somewhere in the information system there
has to be a semantic model of this structured data.<p>
The primary research issues we are addressing in this approach are:<p>
*	the development of tools and techniques for building the concept-based model
of the structured data;<p>
*	the development of tools and techniques for merging the concept-based model
derived from the structured data with an existing concept KB;<p>
*	extensions to the knowledge representation to include a more general "Source
Reference Language" that subsumes the existing TRL and also provides a
mechanism for incorporating "SQL-like" expressions;<p>
*	modifications to the Query Mediator to exploit this extended knowledge
representation; and<p>
*	extensions to our model of what a retrieved object is so that the results
from text and structured databases can be "fused" in a coherent way.<p>
Our results to date are very preliminary, but to help illustrate the ideas we
are pursuing in addressing these research challenges, Figure 5 shows what a
model of a simple database might look like using an object-oriented
representation. The upper left part of the figure shows the schema for a part
of a database that deals with international trade events, and the lower right
shows what an object-oriented model of it might look like (we use the same
notation as in Figure 2). Obviously, there are a number of difficult problems
that need to be addressed in building this model and we are currently making
use of ideas developed by various groups (e.g., the Carnot Project at MCC [3] and
the Pegasus Project at Hewlett-Packard [4]) to develop tools that help
partially automate this process.<p>

<img src="figures/tong5.gif"><p>
<i>Figure 5. Object-Oriented Model of Relational Schema.</i><p>
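One simple way to picture the mapping from a relational schema to a concept-style model is to treat each table as a concept and each column as an attribute. The Python sketch below is only an illustration of that idea: the <tt>SCHEMA</tt> dictionary is an invented example and not the schema shown in Figure 5, and the function <tt>schema_to_concepts</tt> and the mapping rule are assumptions.<p>
<pre><code>
from typing import Dict, List, Optional, Tuple

# Hypothetical relational schema:
# table name -> list of (column name, referenced table or None).
SCHEMA: Dict[str, List[Tuple[str, Optional[str]]]] = {
    "shipment": [("shipment_id", None), ("customer", "country"),
                 ("supplier", "company"), ("product", "equipment")],
    "country": [("name", None)],
    "company": [("name", None)],
    "equipment": [("name", None), ("category", None)],
}


def schema_to_concepts(
        schema: Dict[str, List[Tuple[str, Optional[str]]]]) -> Dict[str, Dict[str, str]]:
    """Map each table to a concept whose attributes are its columns.

    A foreign-key column becomes an arc to the referenced table's concept;
    any other column becomes an arc to a plain value.
    """
    model: Dict[str, Dict[str, str]] = {}
    for table, columns in schema.items():
        model[table] = {column: (ref if ref is not None else "value")
                        for column, ref in columns}
    return model


model = schema_to_concepts(SCHEMA)
print(model["shipment"])
</code></pre>
<p>
A purely mechanical mapping like this captures structure but not meaning, which is why the semantic modeling step still calls for tool support and human judgment.<p>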
<p>
The next step in our proposed approach is to "merge" the model shown in Figure
5 with the one shown in Figure 2. Again there are some significant challenges
in attempting to do this automatically and for now we are pursuing machine-aided
approaches, rather than fully automatic ones.<p>
The main focus of our current activity is on the definition of a general
purpose Source Reference Language (SRL) and on the necessary changes we have to
make to the knowledge representation and to the inference mechanisms used by
the Query Mediator. We are exploring a structured SRL that uses ideas from SFQL
and object-oriented extensions to SQL, and are looking at a number of inference
strategies, including a single unified reasoning model that treats text and
structured data records as sub-classes of a more general retrieved data object.
This unified approach also allows us to begin addressing the problem of merging
search results when they come from heterogeneous sources.<p>
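The idea of treating text and structured records as sub-classes of a more general retrieved data object can be sketched as follows. This Python fragment is illustrative only: the class names, field names, sample data, and the ranking rule (necessary support first, possible support as a tie-breaker) are assumptions and are not drawn from the MINERVA design.<p>
<pre><code>
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RetrievedObject:
    """Common super-class for anything a source can return (hypothetical)."""
    source: str
    object_id: str
    necessary: float  # lower bound on support for the query concept
    possible: float   # upper bound on support for the query concept


@dataclass
class TextDocument(RetrievedObject):
    passage: str = ""


@dataclass
class StructuredRecord(RetrievedObject):
    fields: Dict[str, str] = field(default_factory=dict)


def merge(*result_lists: List[RetrievedObject]) -> List[RetrievedObject]:
    """Fuse results from several sources into one ranked list."""
    merged = [obj for results in result_lists for obj in results]
    # Assumed ranking rule: higher necessary support first, then higher possible.
    return sorted(merged, key=lambda o: (o.necessary, o.possible), reverse=True)


text_hits = [TextDocument("message_traffic", "msg-17", 0.9, 1.0, "SCUD-C sighted")]
db_hits = [StructuredRecord("trade_db", "row-4", 0.6, 1.0, {"customer": "Iraq"}),
           StructuredRecord("trade_db", "row-9", 1.0, 1.0, {"customer": "Iraq"})]

ranked = merge(text_hits, db_hits)
print([obj.object_id for obj in ranked])
</code></pre>
<p>
Because every result carries the same pair of support values, objects from different kinds of sources can be compared without the ranking code knowing where they came from.<p>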
<b><p>
3. Summary</b><p>
The MINERVA system described in this short paper is a prototype of the next
generation of information systems that will be needed to take advantage of the
evolving National Information Infrastructure. Applications such as Digital
Libraries are a natural use of the MINERVA technology.<p>
<p>
<p>
<b><p>
Acknowledgments</b><p>
We would like to acknowledge the rest of the MINERVA team: Steve Ciccarelli,
Steve Crutchfield, Dave Pool, Steve Sandke, Chris Smith, Bill Terry, and Lee
Appelbaum.<b><p>
<p>
<p>
References</b><p>
[1]	G. Wiederhold. "Mediators in the Architecture of Future Information
Systems." Computer, 25(3):38-49, March 1992.<p>
<p>
[2]	R. M. Tong, L. A. Appelbaum, V. N. Askman. "A Knowledge Representation for
Conceptual Information Retrieval." International Journal of Intelligent
Systems, 4(3):259-283, 1989.<p>
<p>
[3]	C. Collet, M. N. Huhns, W-M. Shen. "Resource Integration Using a Large
Knowledge Base in Carnot." Computer, 24(12):55-62, December 1991.<p>
<p>
[4]	R. Ahmed, P. De Smedt, W. Du, W. Kent, M. A. Ketabchi, W. A. Litwin, A.
Rafii, M-C. Shan. "The Pegasus Heterogeneous Multidatabase System." Computer,
24(12):19-27, December 1991.<p>
<hr>
<a name="fn0">[+]</a>Richard Tong is now with Verity Inc., 1550 Plymouth
Street, Mountain View, CA 94043 and can be reached on the Internet at
rtong@verity.com. All requests for information about MINERVA should be directed
to David Holtzman at davidh@sartre.ads.com.<p>
<a name="fn1">[1]</a> The MINERVA system is the result of an internally funded
Booz*Allen project that to date represents approximately 12 person-years of
investment. MINERVA currently provides a general purpose full-text database
access capability, and is being evaluated by several users within the DoD,
where it is being used to access a number of commercially available full-text
systems including Verity Inc.'s Topic&reg;, BRS' BRS/Search&reg;, Open Text's
PAT, and both the public domain and WAIS Inc. versions of WAIS.<p>
<a name="fn2">[2]</a> Source reference languages are the mechanism we use for
specifying the lowest-level information requests in the MINERVA
meta-language.<p>
<a name="fn3">[3]</a> In our model, the basic statement of belief is a
qualifier that attaches to a concept. Statements of belief are then statements
about the degree to which the evidence we have supports or denies that the data
object is relevant to the query concept. We make the further assumption that
support for a concept denies its negation, but we do not rule out the case in
which the evidence leaves us partially ignorant (as opposed to uncertain) about
the query concept. That is, we allow for the situation in which the evidence
tells us nothing about the concept. This results in an interval-based
evidential calculus in which we manipulate both the necessary and possible
degrees of support for concepts.<p>
<a name="fn4">[4]</a> By structured data we mean, of course, data that is
stored in a traditional DBMS, such as Sybase, IMS or IDMS, and which can be
represented using one of the standard data models -- that is, the relational,
hierarchical or network data models.

<!--#include virtual="/DL94/footer.ihtml" -->
Last Modified: <!--#echo var="LAST_MODIFIED" --> <br>
</body>
</html>
