I love the Web. You can find all kinds of interesting information. At least theoretically you can. Of course, things come and go, and sometimes you can't find what you're looking for. -- Professor of Computer Science.
KEYWORDS: digital library, collection, maintenance, World Wide Web, organizational memory, usability
To distinguish our concerns from traditional collection management, we call these concerns collection maintenance. We use "maintenance" to deliberately invoke "software maintenance" and its often ignored importance for software systems.[1] As will be discussed below, collection maintenance is likely to be a significant problem in the digital library -- more so than in traditional libraries or current organizational memory repositories.
The conception of collection advocated here includes access to informal and dynamic information, albeit with a strong caveat. In our view, informality and dynamism materially increase accessibility and content issues over the long run. Collection maintenance for this expanded notion of collection will be difficult, but absolutely critical since:
The necessities and problems of collection maintenance are intertwined with institutional viability over the long run. The traditional, or paper-based, library has established methods of maintaining access over the long run. However, we do not yet have maintenance methods for the digital library, especially within the conception that includes informal and dynamic materials.
This paper begins by discussing the differing notions of the digital library, anchoring the issues in an analysis of institutional needs and practices. We then examine the various types of collections, including those that include dynamic and informal materials. This consideration of collection types and their control lends itself to analyzing the institutional arrangements and resulting maintenance issues for digital libraries.
This analysis proffers two conclusions. First, maintaining collections that are extensions of traditional collections (with delineated boundaries) not surprisingly requires only extensions of traditional methods. Existing institutional arrangements and resources can be modified to handle these requirements. Second, maintaining collections that include dynamic and informal information will be possible only with new technical solutions. We therefore end the paper with some prototypical software tools, using the World Wide Web as an example. One, called MOMspider, checks links within a defined area of the Web. The other, called Web:Lookout, checks for new links and information in previously viewed Web pages.
We argue ... that the design of digital libraries must take into account a broader range of materials, technologies, and practices -- transient as well as permanent documents, fluid as well as fixed materials, paper as well as digital technologies, and collaborative as well as individual practices. (p. 163)
In a related paper, Marshall, Shipman, and McCall [20] argue for fluid access and integration among many disparate sources. They further offer an idealistic conception of what technology could allow. On the other hand, the narrowly-construed library is a single collection, defined sharply from materials not in the collection. As opposed to the broadly-construed library, the narrowly-construed library lacks integration with other information sources and community practices. It presumably does include new possibilities of access, but perhaps only those pre-defined by the library designers.
Miksa and Doty [21] argue for the narrowly-construed digital library. This view idealizes the role of the collection and the additional indexing above the raw material:
...the idea of the library includes the construction of a set of arrangements that overcomes the disparateness of the individual sources by relating them to one another in terms of a single, operational, intellectually structured whole. (p. 4)
Both of these papers attempt to confront the boundaries of the collection, and those boundaries are in different places for the two papers. Miksa and Doty emphasize the collection and intellectual access to it. Levy and Marshall's emphasis is on access and use of the collection by a community, and since their emphasis is on practice, the collection is used in conjunction with other information sources.
These Digital Library '94 papers offer visions of what is important in a digital library, and as such, offer considerable insight into two sets of important core beliefs. These two sets of beliefs are not completely separate; we tease them apart here because of their subtle implications for maintenance and the kinds of institutional arrangements required for that maintenance in the long run. One view, more intrinsic to the library community, has found it important to consider collections, the intellectual access to them, and classification (e.g., [12, 16, 24]). Another view, held more strongly by the hypertext community, has an interest in direct-manipulation access and integration issues among heterogeneous materials (e.g., [22]). Although not completely separable, these views point to different emphases on what is important to maintain. One emphasizes selection from a bibliographic universe; the other, interaction among materials.
These architectures suggest that by extending other metaphors, we can find other results. For example, if we extend "personal library", we find a need for integrating information sources, not for all of society but for an individual. We can also extend "corporate library" or "organizational memory" to consider information repositories for groups or organizations, intermediate groupings between society and the individual.
Such architectures provide not just intriguing technical possibilities for computer scientists. They also provide ways for the narrowly-construed digital library to be incorporated within personal, group, and organizational information repositories. These intermediate architectures suggest that the idea of collection will be much more porous in the digital library: The intermediaries that are only implicit in the traditional library will be much more tightly tied into the digital library. However, the linking in and arrangement for multiple collections, intermediate layers, and informal materials makes some current institutional practices difficult, especially those related to the maintenance of the library over time.
Figure 1: Potential configurations of personal clients, intermediate collections, formal collections, and informal sources
In this institutional analysis, any set of practitioners develops normative structures over time to promote the maintenance and promulgation of their community (as in [11]). (This is the larger context of technical maintenance.) It is not just that the institutions themselves provide for their continued existence; the community members promote such activities for the institutions [5].
It is essential to note that this set of practices and ideals maintains the traditional library as institution and as community of practice together over time. This equilibrium is in the process of being disturbed or destroyed with the intrusion of new technical possibilities.
We therefore ask: What will be required to maintain continued access and availability of the digital library over time?
Only some of the practices of the traditional library will carry into the digital world, perhaps only for a narrow conception of the digital library. Furthermore, the hypertext and computer science communities have no such traditions and practices for the broadly-construed library.
If we wish either vision to be successful, the above question must be answered. Some of the answer will be institutional and some of it will be technical. The mix that is possible, however, will be dependent on the types of desired collections as well as the potential control mechanisms for those collections.
In a traditional, or paper-based, library, there is considerable control over the collection. Versions of publications are stable (i.e., the contents of non-ephemera do not change from copy to copy). More importantly, because the collection is physically contained, library staff can decide what is and what is not in the collection. Maintenance of the collection is within the purview of the institutional members.
At one opposite is the World Wide Web. The Web nodes often change, in content, location, and even existence. On the other hand, the content in a location does not shift rapidly; it tends to remain relatively stable. There is no control by any given individual over the entire Web. An individual has control only over his nodes and the selection of pointers to others' nodes (URLs) that provide the capability for extended collections.
| digital library type | traditional library | organizational information repository (organizational memory) | World Wide Web | Usenet or other computer-mediated communication system |
| --- | --- | --- | --- | --- |
| type of collection | monographs, serials, special collections | documents and files | text and multimedia nodes | messages |
| authoring agencies | authors and their publishers | organizational members or sanctioned individuals | anyone | anyone |
| collection control | selection by organizational members (e.g., bibliographers) | selection by organizational members. May have requirements for official approval | none (or individual) | none (or moderator) |
Table 1: Collection control in digital libraries (by type)
At another opposite is Usenet or similar computer-mediated communication (CMC) systems. The locations (i.e., channel or topic) do not change, but the contents of any given location (e.g., comp.sys.laptops) change constantly. The control over the collection for this type is also very low for any given individual.
In the middle of both dimensions is the organizational information repository (organizational memory). Since the information resides within an organization and since the organization generally provides a sanctioned organizational member to manage the system, there is considerable control over its collection. Nonetheless, organizational memories tend to be more dynamic than traditional library collections; for example, they may include informal information, time-sensitive information, or bulletin board mechanisms.
Figure 2: Dynamism (volatility) in types of digital library collections
Digital libraries, if self-contained in a manner similar to traditional libraries (i.e., the narrowly-construed library) will be able to draw upon traditional methods and practices. Traditional libraries, as noted above, have developed methods for maintaining their core set of institutional ideals, their community of practice, and their collections of materials. Although many maintenance practices will need to be adapted -- such as preservation and circulation control as extreme examples -- the existing practices can serve as the bases for new norms and practices. We have argued above that such norms and practices are required for the long-term viability of the digital library as institution. Therefore, if the adaptation of current practices is successful, then the narrowly-construed digital library will stand a much greater chance of success in the long run.
These traditional practices, however, have had their limitations. They were based on a constrained collection; i.e., a selection from the bibliographic universe. A traditional library could never cope with much ephemera; it would require too many resources. To adequately index, catalog, and otherwise conform to the institutional ideals for the broad range of ephemera would be overwhelming. In the past, there were strong institutional reasons to constrict the bibliographic universe.
Going digital, however, changes the cost structure, and collection and maintenance costs need to be revisited. It is not necessary, for example, in a digital universe to hand-index all materials; there are other methods of access.
There is considerable evidence that this change in the cost structure is affecting organizational memories (see [15] for a popular summary). As Figure 2 shows, however, the dynamism and volatility of organizational memories tend to be close to that of traditional libraries. That is, organizational members can reliably return to the same location in the organization's information repositories to find the same materials. However, the widespread interest in Lotus Notes, which combines electronic messaging and document storage, argues that combining formal organizational materials with informal is valuable (e.g., [23]). It will be interesting to see how existing maintenance procedures within organizations will adapt [7, 8, 25, 27].
Nonetheless, because institutional adaptation is much more manageable and amicable than extreme change, we expect that there will be pressure to have digital libraries as discrete and contained collections. The resource limitations of any social institution argue for defining a discrete collection as well [19]. Of course, it will be possible to incorporate these narrowly-construed libraries in architectures that promote broad access.
In addition, the broadly-construed digital library lacks practices that will maintain it over the long run. The same incentive that promotes use of the Web or of Usenet -- namely that there is no intervening institution between the author and his potential audience -- also brings its own disincentives to systematic use. As discussed, both the Web and Usenet have high levels of volatility. The location of Web nodes can change without notice. Usenet groups (e.g., comp.multi-media) do not change locations, but the message traffic changes constantly. And as Table 1 shows, there is little, if any, control over either of these informal collections.
The individualistic orientation of both the Web and Usenet argues against easy control over the collection. Without institutional control, each individual using the collection must deal with the maintenance issues himself. If the informal materials are dynamic or volatile, then there is a constant maintenance problem.
Requiring every user to provide for collection maintenance raises the costs considerably. Users must determine whether materials will continue to be available, accessible, and even understandable.
Collection maintenance, then, will be a critical limitation of the broadly-construed digital library. If the broadly-construed library, with its access to multiple sources, is to be viable, users cannot have a high maintenance cost over the long run. It will be necessary to reduce the cost of long-term use, and without mediating institutions, this reduction will need technical solutions.[3] The following section examines two such maintenance tools.
The Web shares the same maintenance problems as most hypertext systems. For example, in a hypertext collection such as the Web, one needs to:
* Check that there are nodes at the other end of a link. In a distributed system such as the Web, the site may no longer exist, or the node may have moved. If a node or link has changed or has vanished, there will not be a local notification.
* Determine that all nodes have links into them (i.e., there are no orphans other than entry points into a local area).
* Check that the information in nodes is not obsolete (e.g., by examining expiration dates for objects).
Since standard Web clients do not provide type information for nodes or links, consistency checking for types is not feasible. Similarly, since versioning is not generally supported, no checking for the correct version can be made.
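The three checks listed above can be sketched over a small in-memory model of a collection. This is only an illustration under stated assumptions: the node paths, the `links` and `expires` fields, and the function names are hypothetical, and a real tool would obtain this information by fetching pages over the network rather than from a dictionary.

```python
from datetime import date

# Hypothetical collection: each node lists its outgoing links and an
# optional expiration date (None means no expiration is recorded).
COLLECTION = {
    "/index.html": {"links": ["/papers.html", "/people.html"], "expires": None},
    "/papers.html": {"links": ["/index.html", "/drafts/old.html"], "expires": None},
    "/people.html": {"links": ["/index.html"], "expires": date(1995, 1, 1)},
    "/notes.html": {"links": ["/index.html"], "expires": None},
}
ENTRY_POINTS = {"/index.html"}


def dangling_links(collection):
    """Return (source, target) pairs whose target node does not exist."""
    return sorted(
        (src, dst)
        for src, node in collection.items()
        for dst in node["links"]
        if dst not in collection
    )


def orphans(collection, entry_points):
    """Return nodes with no incoming link that are not entry points."""
    linked = {
        dst
        for src, node in collection.items()
        for dst in node["links"]
        if dst != src  # a self-link does not make a node reachable
    }
    return sorted(n for n in collection if n not in linked and n not in entry_points)


def expired(collection, today):
    """Return nodes whose recorded expiration date has passed."""
    return sorted(
        name
        for name, node in collection.items()
        if node["expires"] is not None and node["expires"] < today
    )


if __name__ == "__main__":
    print(dangling_links(COLLECTION))
    print(orphans(COLLECTION, ENTRY_POINTS))
    print(expired(COLLECTION, date(1995, 6, 1)))
```

Type and version checks are deliberately absent from the sketch, for the reasons given above: the information they would need is not available from standard Web clients.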
Below are two semi-autonomous agents that attempt to ameliorate Web maintenance problems. Neither the set of agents nor the list of problems is exhaustive, and again, these mechanisms are meant to be suggestive of work that will be required for the broadly-construed library.
This automatic maintenance obviates the need for continuous, manual traversal of a Web collection. The Web may be traversed by owner, site, or document tree, and it is possible to mark areas for the spider to avoid. However, for efficiency reasons, MOMspider performs the traversal from the perspective of a site administrator. Furthermore, MOMspider cannot analyze changed materials to see if the changes are significant to the user. This centralized focus makes it harder for individuals to use MOMspider for personal collections and limits their ability to customize the maintenance process (and focus it) to their needs.
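To make the idea of a bounded traversal with excluded areas concrete, the following is a minimal sketch and not MOMspider's actual implementation. The `traverse` function, its parameters, and the example URLs are all hypothetical; the link structure is supplied as a dictionary so the example runs without network access, and a link is treated as broken simply when its target is absent from that dictionary.

```python
from collections import deque


def traverse(links, root, within, avoid=()):
    """Breadth-first walk from `root`, restricted to URLs that start with
    the prefix `within` and skipping any URL under a prefix in `avoid`.

    `links` maps each existing page to the list of URLs it points to.
    Returns (visited pages in order, list of (source, broken target))."""
    visited, broken = [], []
    seen = {root}
    queue = deque([root])
    while queue:
        url = queue.popleft()
        visited.append(url)
        for target in links[url]:
            if any(target.startswith(prefix) for prefix in avoid):
                continue  # area marked for the spider to avoid
            if not target.startswith(within):
                continue  # outside the defined area; not followed here
            if target not in links:
                broken.append((url, target))  # no node at the end of the link
                continue
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited, broken


# Hypothetical site used only for illustration.
SITE = {
    "http://example.org/": [
        "http://example.org/a.html",
        "http://example.org/private/x.html",
        "http://other.example/",
    ],
    "http://example.org/a.html": [
        "http://example.org/",
        "http://example.org/gone.html",
    ],
    "http://example.org/private/x.html": [],
}

if __name__ == "__main__":
    pages, bad = traverse(
        SITE,
        "http://example.org/",
        within="http://example.org/",
        avoid=("http://example.org/private/",),
    )
    print(pages)
    print(bad)
```

The sketch reports that a link is broken but says nothing about whether a change matters to a particular user, which is the limitation noted above.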
Unfortunately, without examination, there is no way to know when an author has added a new link. Similarly, one cannot know when an author has added nodes to his Web area, perhaps providing new descriptions or progress on a project. Instead of having to manually traverse these lists to look for new items, Web:Lookout allows users to automate this process.
Unlike MOMspider or other robots that periodically examine specific locations for any change, Web:Lookout notifies the user only upon interesting changes. Its heuristic examines nodes for the specifics of content and link changes. For example, if the user were looking for new publications by a colleague, he could request that Web:Lookout note whether a local link has been added to the colleague's publication page. In this scenario, the user would probably not be interested that links have been deleted or text has been added; Web:Lookout would not notify the user of these things if so instructed. Web:Lookout can also determine whether the content has significantly changed (measured by a similarity metric which the user can set), whether a localized Web area has changed shape, and whether other individuals have stopped finding a particular link interesting. Web:Lookout can also place its results in a Web page for others to use.
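The kind of selective notification described above might look roughly like the following. This is a sketch under assumptions, not Web:Lookout's actual heuristic: the snapshot format, the function name `interesting_changes`, the change labels, and the default threshold are hypothetical, and the similarity metric is stood in for by the ratio from Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def interesting_changes(old, new, watch=("links_added",), min_similarity=0.8):
    """Compare two snapshots of a page and report only watched changes.

    Each snapshot is a dict with a "text" string and a "links" list.
    `watch` names the change types the user wants to hear about;
    `min_similarity` is the user-set threshold below which the content
    counts as significantly changed."""
    changes = {}
    added = sorted(set(new["links"]) - set(old["links"]))
    removed = sorted(set(old["links"]) - set(new["links"]))
    similarity = SequenceMatcher(None, old["text"], new["text"]).ratio()
    if added:
        changes["links_added"] = added
    if removed:
        changes["links_removed"] = removed
    if similarity < min_similarity:
        changes["content_changed"] = round(similarity, 2)
    # Drop everything the user did not ask to be notified about.
    return {kind: detail for kind, detail in changes.items() if kind in watch}


# Hypothetical snapshots of a colleague's publication page.
OLD = {
    "text": "Publications of a colleague.",
    "links": ["/pubs/1994a.ps", "/old.html"],
}
NEW = {
    "text": "Publications of a colleague (updated).",
    "links": ["/pubs/1994a.ps", "/pubs/1995b.ps"],
}

if __name__ == "__main__":
    # Only added links are watched by default, so the deleted link and
    # the small text edit produce no notification.
    print(interesting_changes(OLD, NEW))
```

A user who also cares about deletions would pass `watch=("links_added", "links_removed")`. Restricting the report to local links, as in the publication-page scenario, would need an additional filter on the link targets.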
The ability to obtain automatic notifications of changes in other parts of the Web complements the social practice of providing interesting links. As such it serves as a useful social filter, similar in effect but not in mechanism to Maltz and Ehrlich's pointer filter [18]. Web:Lookout is, in their parlance, an active filter; however, Maltz and Ehrlich require each author to notify potential users. Web:Lookout removes this notification bottleneck.
However, the costs to any given individual user may not be large. It is possible that any given user would not require substantial storage, and a simple mirroring process might be adequate. Empirical investigation of user habits, however, is required to determine whether this is feasible.
A further limitation is the inadequacy of any computational method of examining content. Robots cannot fully examine the intellectual content of informal information, and any attempt to even partially examine contents may require extensive network resources. We have tried to consider this in the design of Web:Lookout, which uses no more resources than manual traversal. However, careful consideration will need to be paid to the tradeoff between efficiency and usefulness.
In the broadly-construed digital library, users are able to access diverse materials. The inclusion of dynamic and informal materials in the collection, however, leads to serious control and long-run maintenance issues. Because of the lack of control over the collection (or collections), technical mechanisms will be needed for collection maintenance.
While this paper ended with several technical possibilities for collection maintenance, we wish to emphasize that we perceive this problem to be both technical and institutional. Many proposals for digital libraries remove social exchange and interaction, focusing narrowly on the technical mechanisms of information; however, a strictly technical emphasis will not lead to an adequate understanding of the long-run issues in digital library use. Accordingly we have tried to emphasize both the technical and the social perspectives. The digital library is more than a set of technologies; it is also a social institution with long-term needs and maintenance requirements.