<html>
<head>
<title>
DL94: Digital Library Infrastructure for a University Engineering Community
</title>
</head>

<body>

<!--#include virtual="/DL94/header.ihtml" -->

<h1>
Digital Library Infrastructure for a University Engineering Community
</h1>

Bruce Schatz[1,3], Ann Bishop[1], William Mischo[2], and Joseph Hardin[3]<p>

[1] <em>Graduate School of Library and Information Science, </em> <p>

[2] <em>University Library</em><p>

[3] <em>National Center for Supercomputing Applications<br>
University of Illinois at Urbana-Champaign</em><p>
<p>
<b>contact:</b> Bruce Schatz, NCSA, Beckman Institute, 405 N. Mathews Ave, Urbana, IL
61801<p>
<b>emails:</b> bschatz@ncsa.uiuc.edu, bishop@alexia.lis.uiuc.edu, mischo1@vmd.cso.uiuc.edu, hardin@ncsa.uiuc.edu <p>
<p>
<p>
<p>
<b><p>
Abstract</b><p>
In the world of the near future, the Internet of today will evolve into the
Interspace of tomorrow.   The international network will evolve from
distributed computer nodes supporting file transfer to distributed information
sources supporting object interaction.   Users will browse the Net by searching
digital libraries and navigating relationship links, as well as share new
information within the Net by composing and publishing new objects and links.
The Net will thus appear as interconnected spaces of information objects, the
<b>Interspace</b>.<p>
We propose two concurrent and complementary activities that will accelerate
progress towards building the Interspace.   These together construct a model
large-scale digital library and investigate how it can scale up to the National
Information Infrastructure.<p>
*  Construction of a digital library <b>testbed</b> for a major university
engineering community, in which a large digital collection of interlinked
documents and databases will be maintained, software to browse and share within
this library developed, and usage patterns of thousands of users spread across
the Net evaluated.<p>
*  Investigation of fundamental <b>research</b> issues in information systems,
information science, computer science, sociology and economics that will
address the scalable organization of a large digital collection to provide
transparent access for a broad spectrum of users across national networks.
Our analysis will center on the testbed experiment and will form the basis for
future system design.<p>
<b><p>
Keywords</b>: Digital libraries, National Information Infrastructure,
information spaces, network information systems, Interspace<p>
<p>
<b><p>
Introduction</b><p>
The Illinois Digital Library project is constructing a large-scale digital
library for engineering documents and databases.   This project consists of two
inter-related parts.   The first is building a testbed of materials obtained
from professional and commercial publishers, with software that will be used
by an engineering community of thousands of users.   The second is performing
research in technology and in sociology to understand how to scale the testbed
model to the National Information Infrastructure.<p>
On the testbed side, the project is a joint effort of the University Library
(UL) and the National Center for Supercomputing Applications (NCSA); on the
research side, of the Graduate School of Library and Information Science (GSLIS)
and the Department of Computer Science (CS).   The heads of these organizations
form an executive committee to coordinate and support the project: Robert
Wedgeworth (UL), Larry Smarr (NCSA), Leigh Estabrook (GSLIS), Duncan Lawrie
(CS).   Many other faculty and staff besides the authors here participate in
the project, e.g. Hsinchun Chen (University of Arizona), Roy Campbell (CS),
Leigh Star (Sociology), Larry DeBrock (Economics), Charles Catlett (NCSA),
Michael Folk (NCSA), David Stern (UL), Pauline Cochrane (GSLIS).<p>
This paper is the Executive Summary of a proposal submitted to the
NSF/ARPA/NASA Digital Library Initiative in February 1994.<p>
 <b><p>
Digital Library Testbed</b><p>
 The testbed centers around the new Grainger Engineering Library Information
Center at the University of Illinois in Urbana-Champaign (UIUC).   This $26M
Center is intended as a showcase for state-of-the-art digital libraries and
electronic information distribution.   The University Library at UIUC is one of
the nation's best and largest university libraries.   The Engineering College
at UIUC is one of the nation's best and largest engineering colleges.   <p>
Construction of this national digital library testbed is possible through the
active participation of two major institutions at the University of Illinois,
the University Library and the National Center for Supercomputing Applications
(NCSA).   The former has extensive experience with maintaining large digital
collections and supporting large user populations within the campus university
community.   The latter has extensive experience with developing generic
computing software and supporting large hardware configurations within the
national scientific community.   Together, they provide the institutional
infrastructure which enables substantial research to be undertaken within the
testbed, and each is committed to this project as crucial to its future
activities.   Each is also currently supporting a major heavily-used service to
access network information sources, represented by the Engineering Library
on-line system and NCSA Mosaic software respectively.<p>
Operation of the testbed will be supervised by the Director of the Grainger
Center, co-PI Mischo.   He will be advised by the Associate Director for
Software Development at NCSA, co-PI Hardin, and representatives from the
faculty of the Graduate School of Library and Information Science specializing
in design and in analysis of information systems, respectively PI Schatz and
co-PI Bishop.   Other investigators include faculty from the Departments of
Computer Science, Management Information Systems, Sociology, and Economics, in
addition to others from the Library and NCSA.   An Executive Committee,
consisting of the heads of all the major participating organizations, will help
ensure institutional support.<p>
The Engineering Library currently processes a million queries a month to its
expanded on-line catalog including a digital collection of journal citations.
Through a variety of collaborations, the user population for the testbed will
expand beyond the Engineering College to the University community as a whole
(including the Chicago and regional campuses served by the University Library)
to the entire Midwest regional university system (via the CIC network) to the
national scientific community (via the NCSA metacomputer center).   This
provides a national testbed across the Internet of over 100,000
university-level users.<p>
The digital library itself will be centered around a collection of engineering
journals and magazines, obtained through collaboration with a range of major
professional and commercial publishers.   The intention is to attract a broad
range of usage from a broad range of users.   All documents will be structured
and complete, that is, encoded in SGML and containing all pictorial material.
The documents will include general engineering magazines (e.g. computer science
from IEEE), specific engineering journals (e.g. aeronautical engineering from
AIAA), and specific scientific journals (e.g. physics from APS).   Finally,
articles from commercial engineering publishers (e.g. Wiley &amp; Sons) will be
collected for use in our economics trials.<p>
We plan to gather a significant new digital collection of structured documents
in the engineering literature and combine this with existing sources available
from our front and back end software (see below).   For example, these
full-text materials will be integrated into an expanded on-line catalog
including access to major periodical indexes in science and engineering
(Current Contents, Engineering Compendex, INSPEC) which will be linked to the
SGML documents.   Collections on the Internet will also be made transparently
available, e.g. the physics preprints at Los Alamos, the Unified Computer
Science Technical Reports at Indiana University, and the international
collection of on-line library catalogs.   <p>
In addition to the document collections, a number of databases will be gathered
into the digital library and cross-linked to the documents where possible.
These include significant databases generated by other NSF-funded projects,
e.g. the BIMA Grand Challenge database in radio astronomy and the WCS National
Collaboratory database in molecular biology.   Associated GIS satellite image
databases include the NASA-funded data supported by the NCSA HDF project.
These projects are local to the University of Illinois and supervised by
collaborators on the digital library project.<p>
The testbed software will go through two primary phases within the proposal
period of four years.   The goal of version 1 is to leverage off our
substantial existing resources to build a functional digital library with a
large collection used by a substantial user population.   Concurrently during
this period, the technology research will be developing significant new
functionality and the sociology research will be observing the significant
usage patterns of the existing functionality.   Together, these efforts will
enable us to develop and deploy scalable digital library technology on a
national testbed.   The goal of version 2 is to demonstrate the technical
feasibility of a fully functional Interspace system and test its sociological
utility on a segment of our user population.<p>
The version 1 software will evolve from two of our existing projects.   The
first is the existing information retrieval system in the current Engineering
Library developed by co-PI Mischo.   This is based around a PC front end to a
full-text retrieval search from the major commercial vendor BRS.   It currently
serves the base user population with an on-line catalog connected to a large
collection of engineering journal citations.   This is in production use with 1
million search queries issued and 3 million items displayed monthly, and is a
supported product of the University Library.   <p>
The front end to this back end will be the NCSA Mosaic software developed under
the supervision of co-PI Hardin.   This is one of the most widely used
information services currently in the Internet, with a user base of nearly 1
million sites.   The NCSA server where the Mosaic Home Page resides is now
processing 1 million connections a week.   The software provides an easy-to-use
interface, on the three major current user platforms, for transparently
retrieving documents across the Net.   It supports display for the HTML subset
of SGML, and for pictorial displays including embedding of images within text.
This software is a supported product of NCSA and is being rapidly enhanced.<p>
Together, these two software systems, plus enhancements, will provide a search and display
capability for full-text documents with pictures.   This will be a
representative system of the large-scale functionality available today.   The
Mosaic software will serve as the interface and gateway for two database search
engines that will support the structured full-text and image documents and
databases.   BRS Search, which is widely used in libraries to provide full-text
retrieval, will be interfaced via the standard protocol Z39.50.   Microsoft
Server, which is widely used to provide simple access to non-textual materials
such as images, video, and sound, will be interfaced via the standard protocol
SQL.   The gathered collection of documents (and databases) will be transformed
and indexed within this search system, then displayed using the internal and
external viewers provided by Mosaic for the user's local platform.<p>
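As an illustration only, the gateway role described above can be sketched in
Python as a routing decision.   The engine and protocol pairings come from the
text; the media categories, the <code>Route</code> type, and the
<code>route_query</code> function are hypothetical names introduced for this
sketch, and no real Z39.50 or SQL client is invoked.<p>

```python
from dataclasses import dataclass

# Hypothetical media categories; the proposal only distinguishes full text
# from non-textual material such as images, video, and sound.
TEXT_MEDIA = {"text"}
NON_TEXT_MEDIA = {"image", "video", "sound"}


@dataclass(frozen=True)
class Route:
    engine: str    # search engine named in the proposal
    protocol: str  # interface the proposal pairs with that engine


def route_query(media_type):
    """Pick the back end for a query, based only on the requested media type."""
    if media_type in TEXT_MEDIA:
        return Route(engine="BRS Search", protocol="Z39.50")
    if media_type in NON_TEXT_MEDIA:
        return Route(engine="Microsoft Server", protocol="SQL")
    raise ValueError(f"no search engine configured for media type {media_type!r}")
```

Under these assumptions, a query for text is routed to BRS Search over Z39.50,
a query for an image is routed to Microsoft Server over SQL, and any other
media type raises <code>ValueError</code>.<p>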
Given the large-scale user population and the significant digital collection,
we will be able to evaluate the nature of usage of a digital library.   The
evaluation effort will cover a broad range of methodologies and usages with the
goal of answering a broad range of research questions.   This information on
effective/non-effective usage patterns will be fed back into the future system
design.   Methodologies will include ethnographic observation and interviews,
controlled experiments and surveys, and system instrumentation and transaction
logs.   Both individual behavior, via interviews, and group statistics, via
surveys, will be observed.   Different samples of the broad user population
will be used as appropriate for these studies.<p>
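To make the transaction-log methodology concrete, the following Python sketch
tallies a log into group-level statistics.   The log format (one user
identifier and one action per line) and the sample entries are assumptions
made for this example; the proposal does not specify a log format.<p>

```python
from collections import Counter

# Hypothetical log format: one "user_id action" pair per line.
SAMPLE_LOG = """\
u1 search
u1 display
u2 search
u2 search
u3 display
"""


def summarize(log_text):
    """Return (actions per type, actions per user) from a transaction log."""
    by_action = Counter()
    by_user = Counter()
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        user_id, action = line.split()
        by_action[action] += 1
        by_user[user_id] += 1
    return by_action, by_user


by_action, by_user = summarize(SAMPLE_LOG)
```

For the sample log, <code>by_action</code> counts three searches and two
displays, and <code>by_user</code> shows that user <code>u2</code> issued two
actions.<p>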
The version 2 software will incorporate the research discussed below to
demonstrate a large-scale example of the digital library functionality
available tomorrow.   It will be a new system, designed from scratch for this
application, using the experience from the testbed and other projects of the
investigators to provide a scalable architecture for digital library
infrastructure.   We plan to implement this architecture and gradually
introduce it into the testbed.   The degree to which the new software ends up
being adopted is a key question for the technology and the sociology research.
<p>
Version 2 will demonstrate the range of functionality
possible in the Interspace.   It will be built upon information spaces and
support archive browsing and community sharing of objects within these spaces.
The design will be generalized from the Worm Community System (WCS) developed
by PI Schatz, which supports this range of functionality in a small specialized
scientific domain.   WCS was developed under one of the main project grants of
the previous NSF IRIS-CISE information systems program in National
Collaboratories, and has been featured as a national model for science
information systems in National Academy of Sciences reports and lead news
articles in Science magazine.   Other major inputs will come from the research
projects summarized below and from the IETF (Internet Engineering Task Force)
efforts on evolving existing architectures, in which NCSA is participating.<p>
The primary goal of version 2 is to deepen the level of interaction and of
integration.   For documents, search will support semantic retrieval with
concept matching and display will be comparable to printed journals or
magazines.   There will be user profiles supporting customized retrieval, where
virtual magazines are delivered containing sets of desired articles displayed
with good layout.   For databases, there will be live manipulation comparable
to direct use without the system, plus links to related items and to documents.
 There will also be communications support for messages and annotations linked
to the documents and databases.   The system will be symmetric so that any type
of object or link which can be retrieved can also be added by users.   In this
sense, the system becomes a dynamic library which supports a complete
publishing cycle for the Net.<p>
<b><p>
Digital Library Research</b><p>
The complement of the testbed in our project is the research effort.   Each
research component is strong enough to stand on its own, while producing
results relevant to the critical system issues.   Projects were selected which
could make good use of the library testbed as an experimental vehicle and which
had the potential of generating results which could be used in version 2 of the
testbed or in subsequent projects which built on the testbed foundations.   The
goal was to build a group of collaborators who professionally would span the
range of topics necessary for the complete infrastructure and who personally
were willing to actively participate in the project as a whole.   Research
components include: information systems, information science, computer science,
sociology, and economics.   As opposed to the testbed efforts which are carried
out primarily by professional programmers and librarians, the research efforts
are carried out primarily by academic faculty and students.<p>
The Information Systems Research centers around designing an architecture for
the Interspace, supervised by PI Schatz.   This architecture will consist of an
information space environment along with protocols for plugging objects into
the space.   The information space representation is a schema for federating
heterogeneous objects distributed across a network via the use of relationship
links.   The protocols include support for information search and display
(object typing), interconnection forging and following (object linking), and
publishing control and communications (object distributing).   With the
protocols, it is possible to add new documents and databases to an information
space with full interactive capability, and to communicate with other users via
messages and links to any other objects.   In this project, the architecture
will be implemented and used as the basis for version 2 of the digital library
testbed.   When combined with the other technology research, it should provide
a much deeper level of interaction and of integration than in version 1.<p>
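One way to picture an information space is as a set of typed objects joined by
relationship links.   The Python sketch below is a toy model under that
reading, not the architecture itself: the class names, field names, and sample
objects are all hypothetical, and distribution across a network is reduced to
a <code>source</code> label.<p>

```python
from dataclasses import dataclass, field


@dataclass
class InfoObject:
    object_id: str
    object_type: str   # e.g. "document" or "database" (object typing)
    source: str        # hypothetical label for where the object is held


@dataclass
class InformationSpace:
    objects: dict = field(default_factory=dict)
    links: list = field(default_factory=list)  # (from_id, relation, to_id)

    def publish(self, obj):
        """Add an object to the space (object distributing)."""
        self.objects[obj.object_id] = obj

    def link(self, from_id, relation, to_id):
        """Forge a relationship link between two published objects."""
        if from_id not in self.objects or to_id not in self.objects:
            raise KeyError("both ends of a link must already be published")
        self.links.append((from_id, relation, to_id))

    def follow(self, from_id, relation=None):
        """Follow links out of an object, optionally filtered by relation."""
        return [
            self.objects[to_id]
            for src, rel, to_id in self.links
            if src == from_id and (relation is None or rel == relation)
        ]


space = InformationSpace()
space.publish(InfoObject("article-1", "document", "publisher"))
space.publish(InfoObject("dataset-1", "database", "campus"))
space.publish(InfoObject("note-1", "annotation", "user"))
space.link("article-1", "cites-data", "dataset-1")
space.link("note-1", "annotates", "article-1")
```

In this toy space, following links from <code>article-1</code> reaches the
database object, and the user-contributed annotation is published and linked
through the same calls as the archival material, which mirrors the symmetry
between retrieving and adding objects.<p>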
The Information Science Research centers around semantic retrieval and user
customization, supervised by co-PI Chen.   The semantic retrieval supports a
higher level of abstraction in user search which can help overcome the
vocabulary problem for information retrieval.   Rather than searching for words
within the object space, the search is for terms within a concept space.   A
concept space is a graph of terms occurring within the objects linked to each
other by the frequency with which they occur together.   This graph can be used
to suggest alternative ("related") terms that a user may wish to search for.
Co-occurrence graphs seem to provide good suggestive power in specialized
domains, such as biology.   The research questions revolve around their
effectiveness in the more general domains considered here.   Using the same
sort of statistical methods, it is possible to infer terms of interest to the
users from the objects that have been retrieved.   These techniques will be
used to provide a form of customized retrieval, where a user profile consisting
of terms and demographics specified by the users orients the search matching
towards more preferred objects.   In this project, the semantic retrieval and
user customization will be used to supplement the full-text search in the
testbed.<p>
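The concept space idea can be illustrated with a small Python sketch.   It
links terms by the raw number of objects in which they occur together, which
is a simplification chosen for this example; the weighting actually used in
the research is not described here.   The sample term sets and the function
names are hypothetical.<p>

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical index terms for four objects; real input would come from
# the indexed documents.
DOCS = [
    {"gene", "protein", "sequence"},
    {"gene", "protein"},
    {"gene", "mutation"},
    {"protein", "structure"},
]


def build_concept_space(docs):
    """Map each term to {other term: number of objects containing both}."""
    graph = defaultdict(lambda: defaultdict(int))
    for terms in docs:
        for a, b in combinations(sorted(terms), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def related_terms(graph, term, limit=3):
    """Suggest co-occurring terms, strongest first, ties in alphabetical order."""
    neighbours = graph.get(term, {})
    ranked = sorted(neighbours.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _count in ranked[:limit]]


concept_space = build_concept_space(DOCS)
suggestions = related_terms(concept_space, "gene")
```

For the sample data, a search on "gene" suggests "protein" first (two shared
objects), followed by "mutation" and "sequence" (one shared object each).   A
term that never appears in the collection yields no suggestions.<p>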
The Computer Science Research centers around operating systems for network
information services, supervised by co-PI Campbell, in collaboration with co-PI
Catlett from NCSA.   This concentrates on the physical performance of the
objects rather than on the logical functionality as with the information
science (and information system) research.   Investigation will be made of the
bottlenecks occurring within the information space and how operating system
solutions can alleviate them.   Measurements will be first made on Mosaic in
the Internet, then on the Testbed Facility.   Although the latter is also
based upon Mosaic, the traffic pattern will likely differ, because the most
common interaction will be search rather than navigation.   Issues revolve around
both the retrieval fetching itself (caching across the network memory
hierarchy) and the link following to navigate to other objects in the Net (name
resolution within a large distributed system).   In this project, the
measurements will be used to guide both short-term solutions for the testbed
and long-term solutions with new object-oriented operating systems for
supporting information space architectures.<p>
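As a hedged illustration of the caching issue, the Python sketch below places
a small cache in front of a slower fetch and counts hits and misses, the kind
of measurement that could guide such work.   The least-recently-used eviction
policy, the class name, and the stand-in fetch function are assumptions made
for this example; the proposal does not name a caching policy.<p>

```python
from collections import OrderedDict


class ObjectCache:
    """Least-recently-used cache in front of a slower fetch function."""

    def __init__(self, fetch, capacity):
        self.fetch = fetch          # hypothetical callable: name -> object
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, name):
        if name in self.entries:
            self.hits += 1
            self.entries.move_to_end(name)  # mark as most recently used
            return self.entries[name]
        self.misses += 1
        value = self.fetch(name)
        self.entries[name] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return value


# Stand-in for a network retrieval; a real fetch would cross the Net.
cache = ObjectCache(fetch=lambda name: f"contents of {name}", capacity=2)
for name in ["a", "b", "a", "c", "b"]:
    cache.get(name)
```

With a capacity of two, this request sequence produces one hit and four
misses, and the cache ends up holding "c" and "b".   A search-heavy traffic
pattern would change the sequence of names and therefore the hit rate.<p>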
The Sociology Research centers around user behavior studies, supervised by
co-PI Star, in collaboration with co-PI Bishop.   This research is the
evaluation component of the testbed discussed above.   Part will concentrate on
ethnography (Star), seeking descriptions of the conceptual structures needed
for users to effectively interact with a digital library, with the goal of
influencing the design of future systems.   Part will concentrate on user-based
methods (Bishop), such as surveys and interviews, seeking group-level
statistical information about patterns of usage. These two approaches will be
performed on a range of different groups of users, to get a broad and detailed
picture of the digital library.   In addition, a methodological investigation
will be done to attempt nethnography (net-ethnography), where the facilities of
the system itself are also used to observe user behavior remotely across the
network.   Success with this new methodology would enable user studies to be
done on much larger national systems in the future.<p>
The Economics Research centers around charging schemes for network access,
supervised by co-PI DeBrock.   The sociology studies will discover patterns of
usage in a large testbed, when there are no limits on user access.   But in the
real world, cost plays a significant role in determining
usage.   In some sense, the main testbed will be studying a flat-rate fee,
which is being absorbed in the experimental phases by our project.   But many
NII applications will require per-use charges for economic viability.   In this
project, different fee schemes will be applied to selected portions of the
user population, e.g. to a small subset of remote users for access to the
commercial materials.   This will investigate how actual costs affect actual usage.
These economics experiments will give an indication of what people might be
willing to pay in a national digital library, just as the sociology experiments
will give an indication of what people might be able to do.<p>
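The contrast between flat-rate and per-use charging can be made explicit with
a small Python sketch.   The function names and any fee values passed to them
are hypothetical; the proposal does not state actual prices.<p>

```python
def monthly_cost_flat(flat_fee, uses):
    """Flat-rate scheme: the cost does not depend on the number of uses."""
    return flat_fee


def monthly_cost_per_use(price_per_use, uses):
    """Per-use scheme: the cost grows linearly with the number of uses."""
    return price_per_use * uses


def break_even_uses(flat_fee, price_per_use):
    """Number of uses at which both schemes cost the same."""
    return flat_fee / price_per_use
```

For example, with an assumed flat fee of 20.0 and an assumed per-use price of
0.5, the two schemes cost the same at 40 uses; a lighter user pays less under
per-use charging and a heavier user pays less under the flat rate.<p>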
<b><p>
Conclusion</b><p>
Together, the large-scale testbed and the broad-spectrum research will provide
a significant demonstration of a model information system for a national
digital library, along with an analysis of the requirements and a design of a
system that can scale up to the National Information Infrastructure.<p>

<!--#include virtual="/DL94/footer.ihtml" -->
Last Modified: <!--#echo var="LAST_MODIFIED" --> <br>

</body>
</html>

