<html>
<head>
<title>
DL94: An Interoperable Architecture for Digital Information Repositories
</title>
</head>

<body>

<!--#include virtual="/DL94/header.ihtml" -->

<h1>An Interoperable Architecture for Digital Information Repositories</h1>
<p>
S. Shen, R. Mukkamala, A. Wadaa, C. Zhang, H. Abdel-Wahab, 
K. Maly, A. Liu, and M. Yuan<p>
<i><p>
Department of Computer Science, Old Dominion University, Norfolk, VA
23529-0162<br>
email: shen@cs.odu.edu<p>
<p>
<p>
<p>
</i><b><p>
Abstract</b><p>
During the past decade we have seen a remarkable expansion of the Internet,
which connects millions of users at academic, industrial, and government
institutions nationwide and worldwide. This expansion has caused rapid growth
in the amount of available information and in the variety of forms in which it
is represented. Numerous Internet resource discovery systems have come into
existence to help users find this information. However, the interoperability of
these systems, and their efficiency as components of an integrated digital
library system, remain to be improved. We propose a three-layer
interoperability architecture that alleviates this problem. It supports a
large, distributed user base, autonomy of local systems, ease of integration of
heterogeneous systems, and efficient retrieval of information. We have
investigated the implementation issues of our architecture and believe that it
is feasible and effective.<b><p>
<p>
Keywords</b>: Interoperability, resource discovery  systems,  digital
libraries.<b><p>
<p>
<p>
<p>
1.  Introduction</b><p>
Retrieving information by computer has become very popular in recent years.
Free access to some Internet resources and the rapid expansion of the Internet,
particularly in the last year or two, have made the retrieval of remotely
located information a common exercise. Free access to systems such as the
World-Wide Web (WWW) even allows users to incorporate their own relatively
easy-to-implement hypermedia systems into WWW, so that they become part of an
integrated Internet resource system [2]. Several different Internet resource
systems have been linked with each other in reasonable, though not ideal, ways.
A practical digital library system<a href="#fn0">[1]</a> is becoming a reality.
An example of a subsystem implementation is WATERS, the Wide Area TEchnical
Report Service, jointly developed by ODU, UVA, SUNY Buffalo, and VPI [6]. In
that system WAIS is combined with WWW, allowing anybody with access to the
Internet to search a distributed database of computer science technical reports
by subject keywords, authors, university names, and issue dates, among others.
The user does not have to worry about where a particular technical report is
stored, and retrieval is optimal in the sense that technical reports are
stored locally under the control of the authors and sent directly by WWW to the
requester once WAIS has identified the location.<p>
    However, for a true digital library system, the resource space is huge. For
virtually any arbitrary query, the solution process and the potential solution
space can easily make the task infeasible in terms of the computational and
network resources required. In addition, the many available subsystems require
local autonomy, are not compatible with each other, and can hardly work
together smoothly. To resolve this problem, the overall system must use some
resource partitioning scheme, and for each query it must search the partitioned
space intelligently. We propose a three-layer interoperability architecture
that allows a large, distributed user base and varying degrees of local
resource system autonomy. Users at local resources will continue to have their
customary user interfaces; individual heterogeneous resource systems can easily
participate in the overall system; and information search over the integrated
system can be more efficient.<p>
    This paper is organized as follows. In Section 2 we summarize the features
of some of the existing information retrieval systems and discuss the
deficiencies of each. We then describe the proposed architecture and its
salient features in Section 3. The implementation issues associated with this
architecture are explored in Section 4. Finally, Section 5 concludes the
paper.<b><p>
2.  Background</b><p>
To a large extent, the Internet<a href="#fn1">[2]</a> collects and organizes
information around administrative units. This is not a surprise. Many
organizations that have benefited from the Internet also want to share their
knowledge with others. They put the information into repositories and try to
make it generally available.  Because the administration of the Internet's
networks is decentralized, no single organization has a clear idea of which
resources are available on the Internet to its end-users or how to guide them
to the information of interest. The large volume of resources and the lack of
guidance for accessing them have posed great challenges for users who need to
find, acquire, organize, retrieve, and use the information.<p>
Besides other technical aspects, a central problem for designers of document
and information retrieval systems is often related to the contents of the
documents rather than their organization. The contents, or semantics, of
documents are not well represented by surface features alone, such as
individual words taken from abstracts, titles, or even the entire documents.
The problem is complicated further when the information is image, audio, video,
or other media, where even the surface features are not generally available.<p>
    We see here a major conflict between the user's information needs and the
Internet's information organization principles: while the user wishes to locate
information based on the knowledge domain(s) it pertains to, the Internet
generally organizes information around administrative units. The conflict has
had profound effects on the design of Internet information discovery systems
and their search mechanisms.<p>
Archie addresses the problem of locating resources by filename in the Internet
[4]. Archie servers centralize indexing information on filenames, which is
collected periodically from known public Internet file archive sites. Users can
query Archie's index to locate files that are available from public Internet
file archive sites, also known as anonymous FTP sites.  As the Internet grows,
however, maintaining a global resource directory becomes infeasible.
Furthermore, in most cases the filename alone does not reflect the underlying
information content.<p>
    Gopher organizes information into a directed graph in which intermediate
nodes are servers, directories, or indexes, and leaf nodes are documents [7].
The structure is basically administration-centered. Although it simplifies the
registration and management of servers and documents, this structure leads to a
complicated mechanism for information search and retrieval. To obtain the
desired documents, the user may need to investigate a number of information
servers manually and issue the query. In the worst case, the user may need to
select a specific network to continue the search, a detail of which the user
should not need to be aware.<p>
    WAIS uses a centralized directory of services and divides its indexes among
the servers [10]. Though it enables keyword search among the servers and the
administrative units are not visible to the user at the beginning, the result
of the initial query is a list of potential servers rather than a list of
relevant documents.  The user is then required to select a few
administration-centered servers from the list to continue the search. Thus the
user is forced early on to give up on completeness ("recall") of the search;
precision is also limited to a great extent because most WAIS databases use
only very limited amounts of information to rank the list of "hit" documents.<p>
    WWW organizes data into a distributed hypertext in which nodes are either
full-text objects, directory objects called cover pages, or indexes [2].
Hypertext offers great flexibility in organizing and browsing information.
Cover pages, if properly compiled, provide a good overview of or reference to
the underlying data. The sheer volume of Internet information, however, makes
it especially difficult for end-users to locate information of interest. They
often get lost in the information space.  The manual hypertext compilation
process also places a great burden on hypertext designers, administrators, and
publishers, who must track and maintain the links in the dynamically changing
Internet environment as well as maintain the documents themselves.<p>
    Distributed information retrieval systems are emerging from the
research/development phase into the experimental deployment phase [1,8,9,11].
However, the existing systems (Archie, Gopher, WAIS and WWW) are still largely
inadequate in connecting the information needs of the end-user to the vast
Internet resources, given the diversity of user classes and resource
representations, the constraints on autonomous administration of the resources
and on the heterogeneity of information systems, applications, and user
interfaces.  Internet information discovery tools need non-trivial approaches
to map the massive administrative-centered resources into a kind of conceptual
information space that will closely reflect the user's information needs.<p>
    In this paper, we put forth a vision for a digitized information
architecture which meets the above requirements. The architecture integrates
different information services, resources, applications, user interfaces, and
end-users into a common framework.  The framework is itself decentralized, so
that it can adapt to the rapid growth of the available resources.<b><p>
<p>
3.  Proposed Architecture</b><p>
An important long-term goal of the Digital Library is to give a very large
number of heterogeneous classes of users (user groups) access to a very large
number of distributed, autonomous resource repositories in ways that are
seamless, timely, and economical. Here, we propose an architecture to achieve
this goal. Before describing the architecture, we make the following
assumptions about two important entities of the system: users and
resources.<p>
<b><p>
A1:</b>	<b>Autonomous resource management: </b>The physical resource space
would consist of a multitude of autonomous, geographically distributed
Published Resource Repositories (PRRs). Each PRR would have an owning entity
that manages the repository autonomously. Hence, for each PRR a management
system, called the PRR subsystem, is assumed. Also, for each PRR, a Published
Access Scheme (PAS) is defined in the framework of the PRR subsystem and is
supported by the owning entity as the scheme offered to potential users for
accessing that PRR. For political, performance, and other reasons, it can be
foreseen that entities offering to share their resources would continue to
exercise autonomy over their PRRs as described above. Owing to this autonomy,
several types of heterogeneity, in terms of resource representation, storage,
and access schemes, could exist.<p>
<b><p>
A2:	Multiplicity of user interfaces:</b> Different classes of users, running
different classes of applications, would be concurrently accessing the
underlying PRRs. The user-resource interface is a function of the class of
users and the class of applications at hand, as well as the characteristics of
the resources being accessed. Because the environment of the Digital Library is
of such a scale that the combinations of users, applications, and resources
cannot be accurately identified a priori, it is simply impractical to have a
single user interface for accessing the underlying PRRs. Thus there is a need
to support multiple customized (user, application, and resource dependent) user
interfaces.<b><p>
</b><p>
Notice that the need for multiple user interfaces expressed by the second
assumption is orthogonal to the first assumption of PRR autonomy.  Supporting
uniform integrated access implies that the design of the user interface(s)
would not be dictated by the constraints of the Published Access Schemes of the
different PRRs. We now present our architecture, which is designed to satisfy
both assumptions.<p>
    Our architecture has three distinct layers (see Figure 1): an
interoperability layer (IL) managed by an interoperability protocol suite; a
resource repository layer (RRL) containing the different participating PRR
subsystems; and a user interface layer (UIL). We now describe each of these
layers.<b><p>
<p>
User Interface Layer (UIL)</b><p>
Our top layer (UIL) may contain any number of User Interface Systems (UISs),
each customized to the particular specification of a (user, application,
resource) situation.  A UIS is correct, i.e., supported in the UIL, if and only
if it relies on a set (library) of access primitives (APs) defined and
supported by the Interoperability Layer (see below). The set of APs forms the
interface between the UIL and the interoperability layer, and is designed to
enable the UISs to view and access a virtual, integrated, and structured
resource space. A given user may use any number of UISs. A user accessing
resources by way of UISs is called a Digital Library User (DLU), since that
access is managed by the digital library system. We distinguish a DLU from a
user accessing a particular repository using the associated PAS, since in that
case the access is managed directly by the corresponding PRR subsystem.<b><p>
<p>
Interoperability Layer (IL)</b><p>
The IL is the second, middle layer defined by our architecture. It is managed
by an interoperability protocol suite. The major role of this protocol suite is
to define and support an appropriate set of APs. It effectively performs a
two-way mapping between the actual distributed physical resource space, which
contains the separate repositories, and the virtual space represented by the AP
interface. Additional functions would be to keep track of and manage events
such as repositories joining or leaving and resources being added to or deleted
from a repository, to administer accounting, and the like.<p>
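The two-way mapping and the join-in/walk-out bookkeeping described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the class name InteroperabilityRegistry, its method names, and the "dl:repository/id" naming scheme for the virtual space are all hypothetical and are not part of the proposed protocol suite.<p>

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

PhysicalRef = Tuple[str, str]  # (repository name, local resource id)


@dataclass
class InteroperabilityRegistry:
    """Hypothetical IL bookkeeping: joined repositories plus a two-way
    mapping between virtual names and physical (repository, id) pairs."""

    _to_physical: Dict[str, PhysicalRef] = field(default_factory=dict)
    _to_virtual: Dict[PhysicalRef, str] = field(default_factory=dict)
    _repositories: Set[str] = field(default_factory=set)

    def join(self, repo: str) -> None:
        """Join-in event: the repository becomes part of the library."""
        self._repositories.add(repo)

    def leave(self, repo: str) -> None:
        """Walk-out event: drop the repository and all of its mappings."""
        self._repositories.discard(repo)
        for ref in [r for r in self._to_virtual if r[0] == repo]:
            del self._to_physical[self._to_virtual.pop(ref)]

    def add_resource(self, repo: str, local_id: str) -> str:
        """Record a new resource and return its virtual name."""
        if repo not in self._repositories:
            raise KeyError(f"repository {repo!r} has not joined")
        virtual = f"dl:{repo}/{local_id}"  # assumed naming scheme
        self._to_physical[virtual] = (repo, local_id)
        self._to_virtual[(repo, local_id)] = virtual
        return virtual

    def delete_resource(self, repo: str, local_id: str) -> None:
        """Remove one resource from both directions of the mapping."""
        virtual = self._to_virtual.pop((repo, local_id))
        del self._to_physical[virtual]

    def resolve(self, virtual: str) -> PhysicalRef:
        """Virtual space to physical space."""
        return self._to_physical[virtual]

    def virtual_name(self, repo: str, local_id: str) -> str:
        """Physical space to virtual space."""
        return self._to_virtual[(repo, local_id)]

    def virtual_space(self) -> List[str]:
        """All virtual names currently visible through the AP interface."""
        return sorted(self._to_physical)
```

In this sketch a UIS would only ever see the names returned by virtual_space(), while the IL alone uses resolve() to reach a repository. Accounting is omitted.<p>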
The interoperability protocol suite integrates a multitude of algorithms and
tools for the organization, structure evolution, presentation, and manipulation
of both the virtual and actual resource spaces described above. Some examples
are hypermedia tools, collaboration tools, search algorithms, and so forth.<b><p>
<p>
Resource Repository Layer (RRL)</b><p>
The bottom layer defined by our architecture is the resource repository layer
(RRL). This layer simply contains all PRRs and their associated PRR subsystems.
Interoperability across heterogeneous PRRs in the RRL is accomplished by a
contractual commitment of each PRR to support a set of resource repository
primitives (RRPs).  Specifically, the protocol suite of the IL defines two sets
of primitives at the interface between the IL and the RRL. The function of
these primitives is to enable the IL protocol suite to view and uniformly
access a distributed physical resource space.  The uniformity results from
limiting the IL protocol suite to accessing resources in any contracted PRR
through a standard set of primitives, as discussed below. In essence, the role
of the standard primitives at the IL/RRL interface is to hide the inherent
heterogeneities of PASs across different PRRs by offering a uniform resource
repository interface that is minimal and extensible.<p>
    The remainder of this section elaborates on the properties and usage of the
two types of RRPs, defined by the IL protocol suite.<b><p>
</b><p>
*	<b>Standard RRPs.</b> These can be defined as the minimum set of functions
that a published resource repository must support in order to join the Digital
Library proper as a resource donor.  The objective of the standard RRPs is to
secure a minimally sufficient set of primitive functions for supporting a
uniform, practical resource repository interface. Some examples of standard
RRPs for a particular PRR follow. One example is TableOfContents(), which
returns a list of Digital Library descriptors for the atoms of the resource
space (e.g., books, articles, and software library programs) of the particular
PRR at hand.  A second example is GranularityStructure(), which returns a table
containing the type names of the different types of atoms recognized in the
particular resource space, together with their granularity hierarchy. For
instance, a repository might recognize proceedings, panel report, and technical
paper as distinct addressable atoms, and define the granularity hierarchy to
have {proceedings} as root and both {panel report} and {technical paper} as
children. A third example is SearchDimensions(), which returns a list of the
type names and type constructors of each dimension (the simplest form being
keywords and character strings) recognized by the local subsystem for searching
the resource space.<p>
<p>
*	<b>Non-standard (specialized) RRPs.</b> Owing to a special characteristic of
a resource, its corresponding subsystem can offer additional primitives to
enhance access to that resource. For example, assume a NASA repository of
timestamped states of the earth's atmosphere during a period of time, each
state being defined in terms of a set of quantitative measurements. The local
subsystem can offer a special primitive called ComputeAtmosphere
(InitialAtmosphere, Pressure+/-, Temperature+/-), which applies a proprietary
atmosphere behavior model to the atmosphere defined by InitialAtmosphere and
returns the new measurements (state) of the atmosphere after the changes in
pressure and temperature passed in the primitive call have been enforced. Our
IL protocol can automatically offer a user a menu of the special primitives
offered by a particular subsystem if the user identifies part of the
TableOfContents() of the subsystem as relevant.<p>
<p>
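The two kinds of RRPs described above can be sketched as a small interface. The sketch below is only one possible rendering, under stated assumptions: the Python names (PublishedResourceRepository, TechReportRepository, special_menu, count_by_type) and the return types are hypothetical choices made for illustration, and the toy repository stands in for a real PRR subsystem.<p>

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class PublishedResourceRepository(ABC):
    """Assumed rendering of the IL/RRL interface; the three abstract
    methods mirror the standard RRPs named in the text."""

    @abstractmethod
    def table_of_contents(self) -> List[str]:
        """TableOfContents(): descriptors for the atoms of this PRR."""

    @abstractmethod
    def granularity_structure(self) -> Dict[str, List[str]]:
        """GranularityStructure(): atom type name -> child type names."""

    @abstractmethod
    def search_dimensions(self) -> Dict[str, str]:
        """SearchDimensions(): dimension name -> type constructor name."""

    def special_primitives(self) -> Dict[str, Callable]:
        """Non-standard RRPs; a repository offers none by default."""
        return {}


class TechReportRepository(PublishedResourceRepository):
    """Toy repository (hypothetical) holding descriptor -> atom type."""

    def __init__(self, atoms: Dict[str, str]):
        self._atoms = dict(atoms)

    def table_of_contents(self) -> List[str]:
        return sorted(self._atoms)

    def granularity_structure(self) -> Dict[str, List[str]]:
        # The hierarchy used as the example in the text.
        return {
            "proceedings": ["panel report", "technical paper"],
            "panel report": [],
            "technical paper": [],
        }

    def search_dimensions(self) -> Dict[str, str]:
        return {"keyword": "string", "author": "string"}

    def special_primitives(self) -> Dict[str, Callable]:
        # A made-up specialized primitive, for illustration only.
        return {"count_by_type": self._count_by_type}

    def _count_by_type(self, type_name: str) -> int:
        return sum(1 for t in self._atoms.values() if t == type_name)


def special_menu(repo: PublishedResourceRepository, relevant: str) -> List[str]:
    """Offer the repository's special primitives only when the user has
    marked an entry of its table of contents as relevant."""
    if relevant in repo.table_of_contents():
        return sorted(repo.special_primitives())
    return []
```

The IL would call only the three standard methods on every contracted PRR. The menu of special primitives is built on demand, so a repository that offers none still satisfies the contract.<p>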
We now turn to some of the salient characteristics of our architecture.<p>
<p>
*	<b>Satisfying the user and resource needs:</b> Both of our basic assumptions
are satisfied under this architecture. Interoperability essentially enables
sharing resources across PRRs, under constraints of full local autonomy of PRR
subsystems; thus satisfying our first assumption. Note that our second
assumption is also satisfied in this architecture. This is achieved by
enforcing a separation between functions attributed to user interface
specifications, given a particular user, resource being accessed, and
application situation, and functions attributed to accessing the resource from
the  underlying PRRs until it is passed to the user interface. Accordingly, our
architecture allows for multiple customized user interfaces to be built on top
of a virtual access layer.<p>
	<p>
*	<b>Encapsulation of Resource Sharing Services:</b>  As a result of
restricting the UIL/IL interface to the set of APs, the algorithms of the
interoperability protocol suite are hidden from the UISs. Therefore, autonomous
maintenance to enhance the performance, efficiency, economy, and functionality
of service in the IL would not affect any of the functioning UISs. Hence, our
architectural framework affords the creation of any user interface system in
the user interface layer so long as its access to the interoperability layer is
limited to the AP interface.<p>
	<p>
*	<b>Incremental Extensibility of Services:</b> Our architectural framework
allows the AP interface to be incrementally enhanced, and enables different
levels of utilization. If the APs are designed to be independent of one
another, then adding new primitives does not affect old ones. A given UIS might
be built around a minimal subset of fairly simple APs, whereas advanced UISs
can utilize a larger subset of APs that incorporates sophisticated primitives,
e.g., Collaborative Access Support, Parallel Searching, etc.<p>
<p>
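The encapsulation and incremental extensibility properties above can be illustrated with a short sketch. Everything in it is an assumption made for illustration: the AccessPrimitives registry, the SimpleUIS class, and the AP names "search" and "parallel_search" are hypothetical and do not describe an actual AP set.<p>

```python
from typing import Callable, Dict, Iterable, List


class AccessPrimitives:
    """Hypothetical AP interface exposed by the IL to the UIL."""

    def __init__(self) -> None:
        self._aps: Dict[str, Callable] = {}

    def register(self, name: str, func: Callable) -> None:
        """Add a new AP; existing APs are never replaced, so UISs
        built on them keep working."""
        if name in self._aps:
            raise ValueError(f"AP {name!r} is already defined")
        self._aps[name] = func

    def supports(self, names: Iterable[str]) -> bool:
        """True if every named AP is offered by the IL."""
        return set(names) <= set(self._aps)

    def call(self, name: str, *args):
        """Invoke an AP; the algorithm behind it stays hidden."""
        return self._aps[name](*args)


class SimpleUIS:
    """A UIS built around a minimal subset of APs (here only one)."""

    required = ("search",)

    def __init__(self, aps: AccessPrimitives) -> None:
        if not aps.supports(self.required):
            raise RuntimeError("the IL does not offer the APs this UIS needs")
        self._aps = aps

    def find(self, keyword: str) -> List[str]:
        return self._aps.call("search", keyword)
```

A usage example under the same assumptions: registering a second AP leaves the behaviour of the existing UIS unchanged.<p>

```python
titles = ["digital libraries", "resource discovery"]
aps = AccessPrimitives()
aps.register("search", lambda kw: [t for t in titles if kw in t])
ui = SimpleUIS(aps)
aps.register("parallel_search", lambda kws: [ui.find(kw) for kw in kws])
print(ui.find("digital"))
```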
The next section describes a number of implementation issues of our
architecture.<b><p>
 <p>
4.  Implementation Issues</b><p>
In proposing the new digital information architecture, one of our prime
considerations has been the ease and flexibility of its implementation.  In
particular, we are concerned with the heterogeneity of users' hardware, the
heterogeneities in the underlying networks (e.g., differences in speeds), and
the autonomous nature of individual repositories.  These considerations are
reflected in the architecture as well as in the (proposed) implementation, as
shown below.  The implementation issues are discussed in terms of the
architecture's three layers.<b><p>
<p>
User Interface Layer </b><p>
This is an important component of the proposed system. Since the digital
library system is designed to attract a variety of users, it is almost
impossible to think of developing a uniform interface to meet all needs.
Instead, each community (e.g., educational institutions, physicians, service
organizations) will develop its own interfaces, most suitable to its own
members. In fact, we foresee the development and availability of a variety of
commercially developed interfaces to meet these needs. Some of the major issues
in developing such interfaces are:<p>
<p>
*	The <b>level of transparency</b> offered to the user in terms of the locality
of resources, the cost of access, the boundaries of search domain, etc.<p>
	<p>
*	<b>Accuracy of the retrieved information:</b> Often users want a quick
response to a query, even if approximate, rather than wait a long time for the
most accurate answer. For example, an economist studying economic trends over
the last 100 years does not care whether the data is current as of yesterday or
last week. The interface should offer an option for the user to specify the
required accuracy. The system can in turn use this information to choose an
appropriate copy of the data.<p>
	<p>
*	<b>Complexity of the interface primitives:</b> One of the key design and
implementation decisions at this layer is the nature of the primitives. One
approach (e.g., RISC-like) is to offer simple and efficient primitives to the
user interface so that customized UISs can be built using them. In this case,
the cost of building a UIS will be high.<p>
	<p>
 *	<b>Representational issues:</b> Since this layer has to handle all types of
data, including textual, pictorial, and audio (data, voice, and video in
network terminology), a decision has to be made as to the role of this layer in
handling the data. If a uniform access approach is adopted, then it could
handle all the data as just one kind: digital. In this case, the user interface
will be responsible for recognizing the differences in the data types and
handling them accordingly. Some of the issues that pose implementation
difficulties are the management of buffers, the management of computation and
communication resources, and dealing with any synchronization requirements. For
example, if the information retrieved from the underlying system consists of
independent audio and video sources that must be presented to the user in a
synchronized manner, additional effort may be necessary at the user
interface layer. It is also possible to transfer this responsibility to the
UIS.<b><p>
<p>
Interoperability Layer  </b><p>
This layer interfaces with a variety of underlying resources and offers a
uniform interface to the user interface layer. Some of the key implementation
issues to be dealt with at this layer are:<p>
<p>
*	<b>Tools:</b> In order to implement a variety of search/retrieve/combine
operations over a large domain and heterogeneous resources, we need to
integrate several existing commercial products. We certainly foresee the need
for hypermedia tools, collaboration tools, large-space search techniques, etc.
Selecting the tools as well as integrating them to achieve the desired task may
be quite a challenge.<p>
	<p>
*	<b>Interface routines: </b>The IL protocol has to deal with a large number of
resource clusters. Once again, depending on the uniformity (or non-uniformity)
offered by the interoperability layer, the routines could become quite complex.
Problems of heterogeneity, problems of size (a large set of repositories to
deal with), and the complexity of the services offered to the user interface
call for innovative design and implementation. In particular, we should be
concerned about the end-to-end delay expected by the users. The implementation
choices will be largely dictated by the performance requirements of the
end-user.<p>
	<p>
*	<b>Dealing with autonomous repositories:</b> Since each repository is
autonomous, there is no standard format for the data that it stores or the
primitives that it offers to the rest of the system. This creates problems in
the implementation of standard RRPs.<b><p>
<p>
Resource Repository  Layer  </b><p>
This is the layer over which we have the least control, since retaining
individual repository autonomy is our objective. However, to facilitate its integration
with the rest of the system, we propose a flexible interfacing system. The
interface will be developed by the repository maintainer.  However, the digital
library administrators will provide guidelines as to the minimal services
expected from each repository. Whether the semantics (and syntax) of these
minimal services are to be standardized and set by the digital library system
administrators is not clear at this point. Such a standardization at least on
the minimal subset would ease the development of the interoperability layer. In
addition, it will guarantee some services from each repository. Charging a fee
for access is at the discretion of individual repositories.  We do
not intend to look into the issues of privacy and charging by the local
sites.<b><p>
<p>
5.  Conclusion</b><p>
We have seen in the past decade a remarkable expansion of the Internet, which
connects millions of users at academic, industrial, and government institutions
nationwide and worldwide.  This expansion has caused rapid growth in the amount
of available information and in the variety of forms in which the information
is represented.  Numerous Internet resource discovery systems have come into
existence.  However, for a true digital library system, given the huge resource
space, the actual solution process and the potential solution space for an
arbitrary query can easily make the task infeasible in terms of the
requirements on the computational and network resources.  In addition, the many
available subsystems require local autonomy, are not compatible with each
other, and can hardly work together smoothly.<p>
To resolve this problem, the overall system must be able to utilize some
resource partitioning schemes and for each query it must be able to search the
partitioned space intelligently.  We have proposed a three-layer
interoperability architecture which alleviates this problem.  It allows a
large, distributed user base, autonomy of local systems, ease of integration of
heterogeneous systems, and efficient retrieval of information. The
implementation issues of our architecture have been investigated, and we
believe that it is both feasible and effective.<b><p>
<p>
<p>
References</b><p>
[1]	Andreessen, M. 1993. NCSA Mosaic for the X Window System. <i>Available via
anonymous FTP from ftp.ncsa.uiuc.edu:/Web/xmosaic</i>, Software Development
Group, National Center for Supercomputing Applications, University of Illinois
at Urbana-Champaign.<p>
<p>
[2]	Berners-Lee, T., Cailliau, R., Groff, J., and Pollermann, B. Spring 1992.
World-Wide Web: The Information Universe. In <i>Electronic Networking:
Research, Applications and Policy,</i> 1(2), Meckler Publications. <p>
<p>
[3]	Bowman, C. M., Danzig, P. B., and Schwartz, M. F. 1993. Research Problems
for Scalable Internet Resource Discovery.  Department of Computer Science
Technical Report, University of Colorado at Boulder, (Boulder, CO, March).<p>
<p>
[4]	Emtage, A., and Deutsch, P. 1992. Archie: An Electronic Directory Service
for the Internet. In <i>Proceedings of the 1992 USENIX Technical Conference,
</i>(January), 93-110.<p>
<p>
[5]	Litwin, W., Mark, L., and Roussopoulos, N. 1990. Interoperability of
Multiple Autonomous Databases. In <i>ACM Computing Surveys</i>, 22, 3,
267-293.<p>
<p>
[6]	Maly, K., Fox, E., French, J., and Selman, A. 1992. Wide Area Technical
Report Service.  Department of Computer Science Technical Report No. TR-92-44,
Old Dominion University, Norfolk, Virginia.<p>
<p>
[7]	McCahill, M. 1992. The Internet Gopher: A Distributed Server Information
System. In <i>ConneXions - The Interoperability Report</i>, Interop
Inc., 6, 7, 10-14.<p>
<p>
[8]	Schwartz, M. F., Emtage, A., Kahle, B., and Neuman, B. C. 1992. A Comparison
of Internet Resource Discovery Approaches. In <i>Computing
Systems</i>, 5, 4, 461-493, University of California Press.<p>
<p>
[9]	Source Book on Digital Libraries, Fox, E. A. (Ed.), Virginia Polytechnic
Institute and State University, Blacksburg, VA, December 1993.<p>
<p>
[10]	Thinking Machines Corp. 1993. WAIS Source Distribution, version 8-b5.
<i>Available via anonymous FTP from think.com:/wais</i>, Thinking Machines
Corp., Cambridge, MA.<p>
<p>
[11]	Weider, C., and Deutsch, P. 1993. A vision of Integrated Internet
Information Service. <i>Available via anonymous FTP from
ietf.cnri.reston.va.us:/internet-drafts/draft-ietf-iiir-vision-0.0.txt.<p>
</i><p>
<img src="figures/shen1.gif"><p>
<p>
<hr>
<a name="fn0">[1]</a>In this paper the terms digital library and digital
information repository are used synonymously.<p>
<a name="fn1">[2]</a>In this paper, the Internet is simply used as an example
framework that facilitates resource sharing. The architecture and the
discussions in this paper remain relevant even when a much broader framework is
considered.

<!--#include virtual="/DL94/footer.ihtml" -->

</body></html>
