<html>
<head>
<title>
DL94: Access to Large Digital Libraries of Scientific Information Across Networks
</title>
</head>

<body>

<!--#include virtual="/DL94/header.ihtml" -->

<h1>
Access to Large Digital Libraries of Scientific Information Across Networks
</h1>
<p>
Jos&eacute;-Marie Griffiths[1] and Kimberly K. Kertis[2]<p>
<p>
<i>
[1] Graduate School of Library and Information Science,
The University of Tennessee<p>
[2] Center for Information Studies, The University of
Tennessee and Martin Marietta Energy Systems, Inc.<p>
</i>
<p>
<p>
<p>
<b><p>
1.  Introduction</b><p>
The University of Tennessee submitted a proposal in conjunction with the
University of Pittsburgh in response to NSF's Digital Library Initiative.  The
primary concern of the proposal is the information content of digital libraries
and its usefulness and meaning to multiple user communities.  As digital
information resources available via interconnected networks proliferate, what
can be done to facilitate the identification, selection, retrieval and delivery
of needed information content to users, in form and medium preferred, in a
cost-effective manner?  How can we improve the ability of access mechanisms to
extract relevant content, reduce duplication, analyze conflict and present
information content in an optimum manner consistent with the users' needs and
preferences?  Information access mechanisms are complex in scope.  For purposes
of this paper they contain the following interrelated components:<p>
<p>
*	access to collections of multimedia information built upon the
integration of text, image, graphics, audio, video (and other continuous
media)<p>
<p>
*	representation of information content in an organized way so that users can
identify and select both from among and within various information resources<p>
<p>
*	navigation through and retrieval from both representational and primary
information<p>
<p>
*	presentation of both representational and primary information to users<p>
<p>
Each of these components is integrally related to the others.
Information resources are generally developed with specific groups of potential
users in mind.  Therefore, approaches and methods developed for any one
component need to be tested in conjunction with the others so that the entire
access mechanism can be tested from the perspective of users in an attempt to
answer the questions of:  how can interfaces supporting such access be
customized for different groups of users; how well do they perform from the
user perspective; and how well do the proposed techniques for access perform at
varying levels of scale?<p>
The availability of information resources on the ever-expanding network
infrastructure makes all resources potentially available to anyone with network
access.  However, users have varying needs and requirements for information.
They have different preferences and behaviors for identifying, locating,
selecting, retrieving, receiving and using information.  One aspect of this
research is aimed at investigating the preferences and behaviors of various
users and potential users of environmental information; specifically,
identification, representation and selection, retrieval and navigation,
presentation and performance approaches that facilitate access to and use of
information by these various groups.  Approaches to information system
development have tended to provide the same user interface to all users,
regardless of needs or preferences. Building on results in information science,
computer science and cognitive psychology that define different cognitive
styles and information-seeking habits, we hope to research the effectiveness of
different interfaces on users with different styles.  <p>
A second key aspect of this research is to test approaches and methods to
facilitate user group access to digital libraries at varying levels of scale.
Many approaches and methods researched have performed well in experimental and
prototype environments, most of them on a relatively small scale.  It needs to
be determined whether they are robust in terms of performance at various levels
of scale, i.e., at what point does performance degrade to levels unacceptable
by users.  <p>
The two key areas referred to above can be organized into four components of
access: representation, identification and selection; navigation and retrieval;
presentation; and performance.  Since methods of representation to support the
identification and selection of appropriate information resources influence
retrieval methods which, in turn, influence presentation options, all of which
affect performance from the user perspective, research needs to be conducted
across these four components. <b><p>
2. Representation, Identification, Selection</b><p>
The issue of appropriate representation of information resources for successful
identification and selection of resources and the information they contain is
fundamental to successful access to digital libraries.  Many of the methods in
use in today's networked environment do not contain sufficient information to
enable users to identify and select appropriate resources, nor are they
sufficiently discriminating for helping users select information from within
the resources selected.  In the emerging networked environment, with large
numbers of large-scale digital resources spread throughout the network,
containing many forms of information (text, numeric/statistical, still and
moving images, sound, software, complex objects/structures, etc.), it is
increasingly difficult for users to know what resources are available and how
they differ in terms of scope and content, and then how to extract the specific
information they need from appropriate resources.  An approach to this problem
is the design and development of metadata for access to digital libraries.<p>
Metadata are representations of the structure, organization and content of
information resources and associated representations (e.g., thesauri).
Metadata must include sufficient information about the available resources for
various user groups (or intelligent agents on behalf of users) to be able to
do several things.  It must allow them to identify appropriate resources for
addressing their information needs, to select from among the possible
appropriate resources, to select from the resources the information
relevant/pertinent to their needs, and then to combine information from
multiple resources (as needed) in a valid way, all prior to actual retrieval of
information from the selected resources.  Metadata must therefore be designed
to provide information about the information contained in the resources
(already in existence but to greatly varying degrees) and information to
provide a context for deciding whether the resource contains the information
actually needed, in the needed form.   <p>
Early approaches to metadata development started in the late 1970s and early
1980s relative to online textual and numeric database access.  Approaches
included the development of centralized metadatabases [1] and incorporation of
metadata into intelligent gateways [2].  The Data Resources Directory project
of the U.S. Department of Energy's Energy Information Administration (EIA)
included the design and implementation of a metadatabase describing EIA's
300-plus numeric and statistical databases.  The resulting metadatabase
included description of the data collection instruments, components of the
sample design (including the sample frame, specific design, selection process,
weighting schemes, arithmetic calculations, statistical models, etc.), the data
elements themselves, the software, models and algorithms used to process the
data, the tables generated and the reports and publications in which they were
presented.  This project was unique in that it applied bibliographic methods of
classification, thesauri, indexing and abstracting to description of numeric
and statistical data elements.  Using the metadatabase, a user could start at
any point in the lifecycle of EIA data management - a table in a publication or
a potential question in a survey instrument - and track its origin and
destination.  In the case of a published table, the analyst could track back
through all operations performed on the data element and find its original
ancestor as well as related and derived data elements and where they could be
found.  In the case of the survey question, the instrument designer could
determine whether an existing survey collected the same information, whether
collected from the same respondent community, how frequently, etc. and where
the results were published.  This metadatabase was the primary resource for
data retrieval for EIA - users queried the metadatabase to determine whether
needed data were available, what additional information was available, and
whether sufficient contextual information existed to select specific data
elements and, if necessary, to determine whether multiple data elements could validly be
combined.  It was also used to establish the exact meaning of information found
in tables and texts.  One additional feature of the metadata was that the
classification scheme and thesaurus were designed to incorporate numerical
properties of additivity, etc.  In this way data could be retrieved from the
data resources even though they were not explicitly included in the resources.
By selecting the appropriate term in the thesaurus, the system could retrieve
the data explicitly described by that term, and data aggregated across all the
narrower terms (with conditions of inclusivity imposed).<p>
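This additivity can be sketched in a few lines of Python; the term names and figures below are invented for illustration, not drawn from the EIA thesaurus:<p>

```python
# A sketch, with invented terms and figures, of the additivity property
# described above: a query on a broad thesaurus term returns data
# aggregated across its narrower terms even though no value is stored
# for the broad term itself.

NARROWER = {                         # broader term -> narrower terms
    "energy production": ["coal production", "gas production"],
    "coal production": [],
    "gas production": [],
}

DATA = {                             # values explicitly indexed by term
    "coal production": 120.0,
    "gas production": 80.0,
}

def aggregate(term):
    """Value for `term`, summing across narrower terms when no value
    is stored for the term itself."""
    if term in DATA:
        return DATA[term]
    return sum(aggregate(t) for t in NARROWER.get(term, []))

print(aggregate("energy production"))   # 200.0: 120.0 + 80.0
```

In this way a query on the broad term retrieves data never explicitly indexed under it, exactly as the thesaurus design intended.<p>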
The results of this research and development were further explored through a
grant from NSF in 1984 [3].  This research effort focused on the cost versus
performance issues of integrating access to alternative data structures
(hierarchical and relational in particular).  Using a simulation approach the
study looked at alternative forms for implementing metadata - as a separate
metadatabase, through an intermediate lexicon [4], and through direct mapping
of query language capability and indexes [5].  The optimum cost-performance was
achieved through separate metadata.  <p>
One of the limiting factors of the research described briefly above was the
relative lack of appropriate networking infrastructure for cost-efficient data
value retrieval from the heterogeneous distributed resources.  There exists a
potential for distributing metadata to reside with the resources they describe,
which would affect the cost and performance of systems, and for developing a
design of "intelligent metadata" - self-maintaining metadata.  In 1989 the
issue of heterogeneous distributed databases re-emerged as an issue of
considerable importance [6]. In 1992, UT received a contract from Martin
Marietta Energy Systems, Inc. to further explore metadata design and
development for access to heterogeneous distributed databases in a multilevel
secure environment [7, 8].  This is a collaborative effort between UT's Center
for Information Studies and MMES's Center for Computer Security.  To date, a
trusted multilevel secure heterogeneous network has been implemented and
metadata are in the design stage for environmental information resources.
Metadata designs need to be developed specifically for the available digital
collections and then tested at varying levels of scale.  In addition to the
description of information content, structure, organization and context, the
metadata should include information about parameters that affect performance of
digital library access such as size, form, compression, transfer times, etc.
under various access conditions.  This performance information will extend
metadata descriptions beyond those currently available and will provide users
with yet another perspective to consider in their selection of information
resources and content. <p>
<b><p>
3. Navigation and Retrieval</b><p>
A significant challenge for digital libraries is to render the vast and growing
quantity and variability of information on the Internet to users in a practical
and user-friendly way. To fully utilize the assets available on the Internet, users must have a
reasonably efficient means of navigating through metadata to determine several
key factors such as the most promising candidate sources for needed
information, requirements for accessing those sources, public availability,
charges (if any) for the information, languages in which the information is
stored, and network addresses.  New forms of metadata will require new
capabilities for navigation and browsing through metadata, and the multiplicity
of primary information formats also require new navigation and retrieval
approaches.<p>
Latent Semantic Indexing (LSI) is a method for developing conceptual indices at
the file network level which can be rapidly constructed and searched to aid
users in finding textual and image data. LSI [9, 10] addresses the problems of
word-based access by treating the observed word to text-object association data
as an unreliable estimate of the true, larger pool of words
that could have been associated with each object. It is assumed there is some
underlying latent semantic structure in word usage data that is partially
obscured by the variability of word choice. By using mathematical models based
on the singular value decomposition [11], the latent structure can be estimated
and obscuring noise removed. Such models allow the closeness of objects to be
determined by the overall pattern of term usage, so that documents can be
classified together regardless of the precise words that are used to describe
them. In other words, a document's description depends on a consensus of its
term meanings, thus dampening the effects of polysemy.<p>
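The mechanics can be illustrated with a toy example in Python using numpy; the five-term, three-document collection is invented for illustration:<p>

```python
# A toy sketch of LSI: the term-document matrix is factored by the
# singular value decomposition, the two largest singular values are
# kept, and documents are compared in the reduced space.
import numpy as np

# rows: ozone, smog, air, opera, music; columns: documents d0, d1, d2
A = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [0, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                 # truncate to the latent dimensions
docs = (np.diag(s[:k]) @ Vt[:k]).T    # document vectors in latent space

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# d0 and d1 share only the term "ozone", yet in the truncated space
# they become essentially identical, while the opera/music document
# remains orthogonal to both.
print(cosine(docs[0], docs[1]), cosine(docs[0], docs[2]))
```

Here the truncation discards the dimension that distinguished "smog" from "air", so the two environmental documents are classified together despite their differing vocabulary - the dampening of word-choice variability described above.<p>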
Information scientists have long been aware that most information system users
are ill-equipped to translate statements of information need into the precise
queries required by conventional information retrieval systems.  Such users
often prefer to use browsing, a combination of heuristics and serendipity, as a
retrieval strategy.  For browsing to be effective, however, the document space
has to be organized in a manner that is readily understood by users.<p>
In a traditional library, documents are physically arranged into subject
clusters using a classification scheme.  A common retrieval strategy employed
by users is to locate subject clusters of potential interest and then browse
these clusters for documents of interest.  Similar strategies are currently
used on the Internet where searchers use tools such as Archie, Gopher and
Mosaic to locate and browse clusters of digital documents, but, as regular
users of the Internet are aware, browsing rapidly becomes ineffective as the
size of the document space increases.<p>
Given this problem, the use of a concept space as an aid to retrieval from
large document collections may be a useful approach.  Such use of concept
spaces is common in traditional libraries where thesauri are used as aids in
searching large online databases [12].  The thesauri used in traditional
libraries, however, are usually generated manually and are often available only
in printed form.  A consequence of the latter is that the concept space
represented by the thesaurus is not browsable online and cannot be linked to
the online database.<p>
The concept space, derived algorithmically from the document collection, is
organized as a semantic net.  The nodes of the semantic net represent concepts,
and links between nodes represent the relationships between concepts.
Searchers will use online browsing tools to traverse the net.  Once an
interesting concept has been located, links to the document collection could be
dynamically instantiated, providing the user with a filtered view of the
collection. Previous work [13] suggests that users can effectively navigate a
concept space of approximately ten thousand nodes and sixty-six thousand links
with an associative browser.  Hierarchical browsers have been tested with
rather smaller concept spaces and found to be effective [14]. The scalability
of these browsers to concept spaces of different sizes will be of particular
interest.<p>
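The dynamic instantiation of links from concepts to documents can be sketched as follows; the concepts, links and document identifiers are invented for illustration:<p>

```python
# A toy concept space organized as a semantic net: browsing traverses
# concept-to-concept links, and selecting a concept yields a filtered
# view of the documents indexed under it and its neighbours.

LINKS = {   # semantic net: concept -> related concepts
    "pollution": ["air quality", "water quality"],
    "air quality": ["ozone"],
    "water quality": [],
    "ozone": [],
}

INDEX = {   # concept -> documents indexed under it
    "air quality": ["doc-14", "doc-31"],
    "ozone": ["doc-7"],
    "water quality": ["doc-2"],
}

def browse(start, depth=1):
    """Concepts reachable from `start` within `depth` link traversals."""
    frontier, seen = {start}, {start}
    for _ in range(depth):
        frontier = {n for c in frontier for n in LINKS.get(c, [])} - seen
        seen |= frontier
    return seen

def view(concept, depth=1):
    """Filtered view of the collection for a concept and its neighbours."""
    return sorted(d for c in browse(concept, depth) for d in INDEX.get(c, ()))

print(view("pollution"))   # documents under the directly linked concepts
```

Scaling this traversal to the ten-thousand-node concept spaces cited above is precisely where the browser comparisons become interesting.<p>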
An important element in the implementation of digital libraries is the search
and retrieval of nontextual data, such as images and graphics.  Many search and
retrieval techniques exist for structured and/or textual data; however,
digitized images are inaccessible except through textual descriptions of image
content.  Effectively searching image data represents an important problem for
libraries that contain non-textual data.  One possible solution is pattern
matching algorithms that can be used to search and detect patterns in image and
other types of non-textual data.<p>
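One such algorithm can be sketched with normalized correlation; this is an illustrative stand-in rather than a specific technique proposed here, and the arrays are tiny synthetic examples:<p>

```python
# A minimal sketch of template matching: a small pattern is slid over
# a grayscale image and each window is scored by normalized
# correlation, so the best-matching position can be located without
# any textual description of the image content.
import numpy as np

def match(image, template):
    """Return ((row, col), score) for the best-matching window."""
    th, tw = template.shape
    t = (template - template.mean()) / (template.std() + 1e-12)
    best, where = -np.inf, (0, 0)
    for r in range(image.shape[0] - th + 1):
        for c in range(image.shape[1] - tw + 1):
            w = image[r:r + th, c:c + tw]
            w = (w - w.mean()) / (w.std() + 1e-12)
            score = float((w * t).mean())   # approaches 1.0 for a match
            if score > best:
                best, where = score, (r, c)
    return where, best

img = np.zeros((8, 8))
img[3, 4] = img[4, 5] = 1.0        # a diagonal 2x2 pattern at (3, 4)
tpl = np.array([[1.0, 0.0],
                [0.0, 1.0]])
print(match(img, tpl))             # best window found at (3, 4)
```

Real image collections would demand far more efficient formulations (e.g., correlation in the frequency domain), but the principle of content-based rather than description-based search is the same.<p>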
<b><p>
4.  Presentation</b><p>
At the core of each information representation and presentation system is an
underlying mechanism that is capable of fusing information from different
sources, to allow for its presentation to the user for assimilation in a
coherent and efficient manner.  This is the component of an information access
system that significantly affects the users' perception of success of the
system, yet it has received considerably less attention from researchers whose
experimental and prototype systems have typically presented information to
users in the same way, regardless of need or preference.  Alternative
presentation formats for user groups with differing needs and preferences are
needed such as a parallel data fusion paradigm to integrate spatially and/or
contextually incongruous multisensory data.  This paradigm is a non-Bayesian
uncertainty and data fusion approach.  This new fusion algorithm is based on
interaction between two constraints: (1) the principle of data corroboration,
which tends to maximize the final belief in a given proposition if either of
the knowledge sources supports the occurrence of this proposition, and (2) the
principle of belief enhancement/withdrawal which adjusts the belief of one
knowledge source according to the belief of the second knowledge source by
maximizing the similarity between the two sources.  These two principles are
combined by maximizing a positive linear combination of these two constraints
related by a fusion function, to be determined.  The latter maximization is
achieved and the fusion function is uniquely determined using the
Euler-Lagrange equations in calculus of variations.  This method has been
tested using various features from synthetic and real data of various types and
of many dimensionalities resulting in fused data which satisfy both of the
principles mentioned above.  <p>
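The fusion function itself is derived via the Euler-Lagrange equations and is not reproduced here; purely as an illustration of the corroboration principle, the probabilistic sum below never yields a fused belief lower than either input:<p>

```python
# An illustrative stand-in (NOT the derived fusion function): the
# probabilistic sum of two beliefs in [0, 1] satisfies the
# corroboration principle, since support from either source can only
# raise the final belief in the proposition.

def corroborate(a: float, b: float) -> float:
    """Probabilistic sum of two beliefs in [0, 1] (illustrative only)."""
    return a + b - a * b

for a, b in [(0.6, 0.7), (0.9, 0.1), (0.0, 0.4)]:
    fused = corroborate(a, b)
    assert fused >= max(a, b)   # corroboration: fused belief never drops
    print(a, b, fused)
```

The derived function must additionally satisfy the belief enhancement/withdrawal principle, which this simple stand-in does not attempt to model.<p>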
There are four basic categories of methods for inferring knowledge from two or
more knowledge sources.  The first class of techniques is based on the
Super-Bayesian Theory.  These techniques are centered around Bayes' theorem
which uses past knowledge about the occurrence of an event to infer the chances
of occurrence of that event in the future.  One of the difficulties with
Bayesian schemes is their high sensitivity to prior information which, in many
practical applications, is not available.  The second class of techniques is
based on Belief (or Evidence) Theory.  These techniques are founded on
Dempster's rule of evidence combination where the belief in the occurrence of a
given event is computed as a function of two or more assessments provided by
different knowledge sources.  Two weaknesses of fusion techniques which are
associated with evidence theory are:  (1) failure to accommodate highly
conflicting sources of information; and (2) numerical sensitivity of the final
fusion to fluctuations in the inputs to be fused.  A third class of inference
mechanisms includes those based on evidential reasoning functions often defined
in a fuzzy framework.  Various evidential reasoning functions have been
proposed over the years.  These functions are two-dimensional functions of the
knowledge sources' assessments regarding the occurrence of a given event.  The
lack of justification of newly proposed fusion functions led to a number of
solutions which were often contradictory. The fourth class of techniques
includes methods that do not fit any of the three categories mentioned
above.  These techniques infer knowledge from two or more assessments by using
constraints on the evidence collected to compute the final assessment.  They
are often classified as analytic or geometric techniques.  The fundamental
issue that underlies some of the difficulties that this category has is the
diversity in procedures and the lack of standards in the way knowledge sources
are dealt with.<p>
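Dempster's rule and its conflict sensitivity can be illustrated with a small Python example over a two-hypothesis frame; the mass values are invented for illustration:<p>

```python
# A sketch of Dempster's rule of combination on a two-hypothesis frame
# {H, notH}. For brevity no mass is assigned to the whole frame.

def dempster(m1, m2):
    """Combine two mass functions over {'H', 'notH'}."""
    K = m1["H"] * m2["notH"] + m1["notH"] * m2["H"]   # conflicting mass
    if K >= 1.0:
        raise ValueError("total conflict: combination undefined")
    return {h: m1[h] * m2[h] / (1.0 - K) for h in ("H", "notH")}

# Two broadly agreeing sources reinforce the shared hypothesis.
agree = dempster({"H": 0.8, "notH": 0.2}, {"H": 0.7, "notH": 0.3})

# Two nearly contradictory sources leave almost no joint mass (1 - K
# is tiny), and the rule is decided by the sliver they share: here the
# result is an uninformative 50/50 split despite two confident inputs.
clash = dempster({"H": 0.99, "notH": 0.01}, {"H": 0.01, "notH": 0.99})
print(agree, clash)
```

The second case exhibits both weaknesses named above: highly conflicting inputs are poorly accommodated, and because the normalizing term 1 - K is near zero, tiny fluctuations in the inputs swing the final assessment sharply.<p>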
<b><p>
5.  Performance</b><p>
Performance issues fall into two distinct categories.  First is the inclusion
of information about and parameters affecting performance in the metadata
design.  This would give users the information they need to understand the
likely performance characteristics of accessing information identified (e.g.,
time to process and transfer, storage requirements, etc.).  The second area of
performance is how well the access mechanism in its entirety performs from the
user perspective. To this end research into user information-seeking behaviors
and preferences of the user groups would enable customized interfaces to be
designed reflecting user needs and preferences, and provide input into issues
of development of digital collections.<p>
The size and nature of the potential user community must be considered a major
issue facing digital libraries, or any other new media, if they are to be
useful, accessible, and sufficiently accountable to recover investments.  User
community analysis may be separated into three components: demographics;
general media use and information-gathering habits; and behavior and learning
styles associated with use of digital libraries.<p>
The first component, demographics, permits us to find out who will/might use the
digital library.  This is a fundamental step in any user analysis because such
descriptive variables as education, income, age, gender, race, etc. are the
building blocks for more involved behavioral analyses.  It is generally safe to
assume that different groups will have different skills, motives and demands as
they approach digital libraries.  Furthermore, these behaviors will change as
familiarity with digital libraries and associated access mechanisms change.  In
addition, these descriptors become important factors in future decisions that
involve library content and pricing, e.g., how it is marketed.  <p>
The second component is general media use and information-gathering habits.
This builds on the first stage by putting the demographic portrait into a
meaningful and useful informational-lifestyle context, allowing better
comprehension of the roles media and information play, and the decisions they
affect, in people's lives.  From an understanding of media habits, a better
understanding of why people use digital libraries would
be reached.  This information could be derived, in part, by knowing what media
are being displaced when people use digital libraries.  Because people have
finite time and money, the introduction of new media usually means the
displacement of an existing one.  Discovering which media are displaced will
provide valuable insight into the purpose the digital library serves (such as
basic research, applied research, current awareness, education, etc.).  In
addition, it would provide input to the development of a rational pricing
system.<p>
Additional research questions that need to be addressed through user group
analysis include:<p>
<p>
*	What are the information needs of a particular audience?<p>
<p>
*	Which of these information needs could best be met by the digital library and
why?<p>
<p>
*	What expectations might users have of the digital library?<p>
<p>
*	What are these expectations based on or where did they come from?<p>
<p>
*	What are the variables associated with early adoption or use of the digital
library?<p>
<p>
*	How was the digital library actually used?  This includes analysis of
transaction logs or equivalent to see when the information was used and how
time was spent in the discovery and retrieval of the information.<p>
<p>
*	How does use of the digital library change over time, especially when the
novelty has worn off?<p>
<p>
*	To what degree were expectations met and why?  This includes exploration of
the degree to which the digitized information was used, useful (made a
difference or contributed to productivity), and useable (easy to use).  The
productivity issue is an important one.  Much digital information distributed
on the net may actually lower productivity by stealing time away from what
really needs to be done.  This is especially true if the discovery and
retrieval tools are not precise enough.<p>
<p>
*	How has use of the digital library affected the use of other information
sources, including personal and institutional libraries and information centers
and informal information sources such as colleagues?<p>
<p>
The third component is research into the behaviors associated with use of the
digital library, people's learning styles, and their levels of satisfaction
both with their skill in using the new technology and with the information they
access through it.  Though the list of possible behaviors is endless, starting
points would include:  time spent with the system; demonstrated access skill;
level of frustration; indications of curiosity provoked (or stifled);
downloading and/or printing of information; information searching strategies
other than use of the digital library; satisfaction with attributes of the
information retrieved (accuracy, relevance/pertinence, comprehensiveness,
specificity, assimilability, currency, etc.); and satisfaction with attributes
of the access mechanism (timeliness, ease of use, training burden, cost to use,
etc.).<p>
Finally, issues related to the lifecycle management of digital library
collections such as the following need to be studied:<p>
<p>
*	How are collections of digitized information actually developed (what
policies govern the selection of information to be included)?<p>
<p>
*	How are these collections selected for the digital library?  For example, how
does the owner or manager of a server decide what files or databases to mount?
What criteria are used? How is quality control exercised?<p>
<p>
*	What criteria (policies and procedures) are used in maintaining, archiving or
disposing of these files or databases from the collection?<p>
<p>
*	How are digitized collections evaluated?<p>
<p>
*	How are electronic materials preserved, especially with software and hardware
changes that may require reformatting on a somewhat regular basis?<p>
<p>
*	How are electronic collections promoted or made visible to potential users?<p>
<b><p>
6.  Domain of Networked Information</b><p>
The domain of networked information incorporates three components: (1) digital
collections, (2) telecommunications network, (3) user groups.  The digital
collections contain a variety of information forms including data, full text,
bibliographic, still and moving images and sound.  Many terabytes of data are
contained in these collections which can be configured in several ways by type
of format, and by size of data collection.  The telecommunications network and
its capabilities directly impact the performance of access to scaled digital
collections.  But is there an optimum set of capabilities that are needed by
certain user groups or collections?<p>
Research conducted into information users has demonstrated changing needs,
behaviors and expectations on the part of various subsets of the total
potential user community.  Considerable attention has been paid to scientists
as users - tending to focus on their use of formal and informal information
resources.  More recently, users have been divided into their cognitive
information seeking and assimilation behaviors where different categorizations
of information seeking behaviors have evolved, such as the hunter, gatherer,
etc., particularly in regard to networked information.   Unless they are
scientists themselves, managers of research scientists are often excluded from
information user studies.  Managers need access to scientific information but
often in more condensed, synthesized forms.  The availability of vast amounts
of scientific information on networks offers great opportunities to educators,
students and librarians at all levels.  The ability to demonstrate specific
points or to further illustrate an educational objective using information
collected elsewhere can enhance the educational process significantly.   The
network and the information resources distributed on it can expand the range of
experiences that can be brought into the classroom no matter what level the
class:  university/college, high school, elementary school, kindergarten and
primary school.<p>
<b><p>
7.  Expected Accomplishments</b><p>
The two primary goals of our research effort are to (1) perform research
leading to the design of customized user interfaces for access to digital
libraries and (2) perform research on the effects of scale (along three
dimensions - digital collections, telecommunications network, and user groups)
on the cost and performance of the components of access (representation,
identification and selection; navigation and retrieval; presentation; and
performance).  The accomplishments we expect to emerge from our lines of
research are:<p>
<p>
*	an understanding of how representational schema can be designed to
sufficiently discriminate among and within digital collections<p>
<p>
*	an understanding of appropriate metadata components and designs that
facilitate access to digital collections<p>
<p>
*	an understanding of the cost and performance implications of implementing
self-updating metadata in a distributed manner<p>
<p>
*	an understanding of the cost and performance implications of latent semantic
indexing, concept space browsing, and metadata navigation at various levels of
scale and for various user groups<p>
<p>
*	an understanding of the cost and performance implications of various pattern
recognition algorithms for non-textual search and retrieval at various levels
of scale and for various user groups<p>
<p>
*	an understanding of the cost and performance implications of fuzzy logic
applications for representation, retrieval and presentation of multimedia
information at various levels of scale and for various user groups<p>
<p>
*	an understanding of how representation, navigation, retrieval and
presentation approaches combine to optimize cost-performance at various levels
of scale and for various groups of users<p>
<p>
*	an understanding of the transferability of the research results to other
digital library domains<p>
<p>
*	an understanding of the information needs and information delivery
preferences of different groups of users of digital libraries<p>
<p>
*	an understanding of the information-seeking behaviors and cognitive styles of
various groups of users of digital libraries  <p>
<p>
*	an understanding of how information seeking behaviors and information
delivery preferences have changed as a result of exposure to digital
libraries<p>
<p>
*	an understanding of multi-institutional collaborative research patterns, of
problems that inhibit the research process, and of potential solutions that
enhance it<b><p>
<p>
<p>
References</b><p>
1.  J.M. Griffiths and J.A. Lipkin (1984).  "Investigation of Alternative
Approaches to Data Structure Integration: Final Report."  Prepared for the
Division of Information Science and Technology, National Science Foundation.
Rockville, MD: King Research, Inc.<p>
<p>
2.  M.E. Williams and S.E. Preece (1981).  "A mini-transparent system using an
alpha microprocessor."  Proceedings of the Second National Online Meeting,
499-592.<p>
<p>
3.  J.M. Griffiths et al. (1982).  "Research into the Structure, Accessing,
and Manipulation of Numeric Databases."  Rockville, MD: King Research, Inc.<p>
<p>
4.  V. Horsnell (1976).  "The Intermediate lexicon."  Paper presented at
Informatics III, Emmanuel College, Oxford, England.<p>
<p>
5.  M.E. Williams and S.E. Preece (1981).  "A mini-transparent system using an
alpha microprocessor."  Proceedings of the Second National Online Meeting,
499-592.<p>
<p>
6.  P. Scheuermann and C. Yu (chairs) (1989).  "Heterogeneous Database Systems:
Report of the NSF Workshop."  Northwestern University, Evanston, IL, December
11-13, 1989.<p>
<p>
7.  J.M. Griffiths, R.O. Chester and K.K. Kertis (1993).  "Environmental
Databases and their Metadatabases."  Proceedings of the National Online
Conference, New York City, New York, May 6, 1993.<p>
<p>
8.  J.A. Rome, L.R. Bayor and P.W. Payne (1994).  "Using Trusted Databases for
Unclassified Purposes."  Proceedings of the 16th U.S. Department of Energy
Computer Security Group Training Conference, Denver, Colorado, May 2 - 5, 1994.
Preconference Draft.<p>
<p>
9.  S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman (1990).
"Indexing by latent semantic analysis."  Journal of the American Society for
Information Science 41:6, 391-407.<p>
<p>
10.  P. Foltz, and S. Dumais (1992).  "Personalized information delivery: An
analysis of information-filtering methods."  Communications of the ACM 35:12,
51-60.<p>
<p>
11.  M.W. Berry (1992).  "Large scale singular value computations."
International Journal of Supercomputer Applications 6:1, 13-49.<p>
<p>
12.  L.M. Chan, and R. Pollard (1988).  Thesauri used in online databases: An
analytical guide.  Westport, CT: Greenwood Press.<p>
<p>
13.  R. Pollard (1993). "A hypertext-based thesaurus as a subject browsing aid
for bibliographic databases."  Information Processing and Management 29:3,
345-357.<p>
<p>
14.  A. Simpson, and C. McKnight (1990).  "Navigation in hypertext: structural
cues and mental maps."  In:  R. McAleese &amp; C. Green (eds.) Hypertext: State
of the Art.  Norwood, NJ: Ablex, 73-83.<p>
<p>
<p>
<p>
<p>
<!--#include virtual="/DL94/footer.ihtml" -->
</body></html>