KEYWORDS: technical reports, full text indexing, mg, bibliometrics, scientometrics
This paper describes a pilot project designed to explore the potential of such digital libraries in the context of a national research community that is relatively small and focused. The project will provide a full-text index to computer science technical reports accessible via the Internet, and make it available to academic researchers in New Zealand computer science departments. Two factors heightened by geographical isolation are network transmission costs and response time variability. To reduce trans-Pacific network traffic and local storage requirements, the full text of the technical reports will not be stored locally; instead, individual reports will be downloaded on demand. Particular attention is being paid to scalability, and it is intended that growth be socially controlled in a manner described below.
By far the greatest problem in setting up a digital library is in obtaining and digitizing the raw material. Computer science is an appropriate choice for a digital library because a huge amount of high-quality information already exists in digital form and is freely accessible on the Internet in the form of technical reports. For example, Indiana University maintains a Unified Computer Science Technical Report Index which, as of October 1994, contained 10,500 items found at 180 Internet sites (an estimated 10 gigabytes of data). The list of sites is growing quickly and is potentially much larger.
Our design goals for the system include:
* Low maintenance. The index construction process will be automated as much as possible, and documents will be added to the index with little or no intervention from a system administrator.
* Logically central index, physically distributed documents. The New Zealand site will hold only an index and search engine; the documents themselves remain in their original repositories.
* Full text indexing. The system will index the entire contents of the documents, rather than being restricted to file information or title/author/abstract summaries.
* Transparency for providers. The system will not require any effort on the part of participating technical report repositories, and indeed these providers will in general not even be aware of their inclusion in our index. No special software, archive organizations, or file formats will be required of the providers.
This paper is organized as follows: the next section compares our proposal with existing technical report indexing schemes; we then describe the proposed system architecture in detail, and briefly discuss the potential for gathering scientometric/bibliometric information on New Zealand researchers and the computer science literature; and the final section presents our conclusions.
The provision of a full-text index of the entire contents of each document is unique to our proposed system. Other schemes index on user-supplied document descriptions, abstracts, or similar document surrogates. UCSTRI, for example, provides a searchable text index based on text obtained by parsing the index file that is present by convention in most ftp directories of technical reports. This text does not necessarily characterize the report very closely, and in any case is only a small subset of the full text in the document. Moreover, the parsing procedure is sensitive to the format of the index file, and cannot be guaranteed to succeed. HARVEST's ESSENCE sub-system ([9], [10]) extracts "content summaries" whose composition may vary widely. ESSENCE relies on filetype-specific procedures to extract relevant information from the document itself; for example, LaTeX documents can be parsed for author and title information. ESSENCE's success in extracting an appropriate document surrogate depends on its ability to cope with the file type of the document and the semantic cues provided by that type.
The remaining systems under consideration - DIENST, NTRS, WATERS, and the physics E-PRINT ARCHIVES - require the submitter of the technical report to provide cataloging information; WATERS additionally requires a designated site librarian to maintain a local catalog.
UCSTRI, HARVEST, and our proposed system primarily provide keyword searches, as their indices do not contain formal bibliographic catalogs. The DIENST, NTRS, WATERS, and physics E-PRINT ARCHIVES can support more detailed information about each report (such as author, title, and CR category field searching), but this more sophisticated search functionality comes at the expense of requiring participating repositories to use specific software. As a consequence, these latter systems provide access to only a handful of sites, whereas UCSTRI, HARVEST, and our system can access a broad range of providers.
The scheme rests on the feasibility of full-text indexing of large corpora of text. The public-domain system mg can store a full-text index to a large collection of text in only 5% of the size of the original text [18]. Further, it provides a search engine that can process queries efficiently. Experiments with the 750,000 document TREC collection give response times of three to five seconds to produce ranked output for queries of forty to fifty terms.
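The space saving comes largely from compressing the posting lists that map each word to the documents containing it. The sketch below illustrates the general idea in Python, using simple gap and variable-byte coding of document numbers; it is an illustration only, not mg's actual data structures, which use considerably more sophisticated coding schemes [18].

    from collections import defaultdict

    def build_postings(docs):
        """Map each term to the sorted list of document numbers containing it."""
        postings = defaultdict(set)
        for docnum, text in enumerate(docs):
            for term in text.lower().split():
                postings[term].add(docnum)
        return {t: sorted(d) for t, d in postings.items()}

    def vbyte_encode(doc_numbers):
        """Store each posting as the gap from its predecessor, variable-byte coded."""
        out = bytearray()
        prev = 0
        for n in doc_numbers:
            gap, chunk = n - prev, []
            prev = n
            while True:
                chunk.append(gap & 0x7F)   # low-order 7 bits first
                gap >>= 7
                if gap == 0:
                    break
            chunk[0] |= 0x80               # mark the final (low-order) byte
            out.extend(reversed(chunk))    # emit high-order bytes first
        return bytes(out)

    if __name__ == "__main__":
        docs = ["full text retrieval of technical reports",
                "compressed full text index",
                "technical reports in postscript form"]
        postings = build_postings(docs)
        raw = sum(4 * len(p) for p in postings.values())    # 4 bytes per entry
        packed = sum(len(vbyte_encode(p)) for p in postings.values())
        print(f"postings: {raw} bytes uncompressed, {packed} bytes compressed")

Because most gaps are small, common words cost only a byte or two per posting, which is why the complete index remains a small fraction of the text it describes.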
Computer science departments generally make their technical reports available in PostScript form, although our system could support other formats such as DVI and RTF should this prove necessary. Software is available for extracting plain text from such files, and this text can be indexed to allow searching. Some sites (for example, Cornell University) provide access to their older, non-machine-readable technical reports by storing them as TIFF files of page images, together with an ASCII version of the text obtained by OCR [5]. These can be incorporated naturally into our scheme, with the ASCII text being indexed and the page images retrieved when the report is requested.
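As an indication of how simple the text-stripping step can be, the sketch below drives Ghostscript's ps2ascii converter from Python. The choice of converter is an assumption; any PostScript-to-text filter could be substituted, and reports that fail conversion (for example, scanned page images with no embedded text) would fall back on whatever ASCII version the repository supplies.

    import subprocess
    import sys

    def extract_text(ps_file):
        """Return the plain text of a PostScript file, or None if conversion fails."""
        try:
            result = subprocess.run(["ps2ascii", ps_file],
                                    capture_output=True, text=True, check=True)
            return result.stdout
        except (OSError, subprocess.CalledProcessError):
            return None   # converter missing, or the file contains no extractable text

    if __name__ == "__main__":
        text = extract_text(sys.argv[1])
        if text is None:
            print("could not extract text; report may need OCR or a supplied ASCII version")
        else:
            print(f"extracted {len(text.split())} words")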
With the cooperation of a site geographically close to the text archive being indexed, there is no need for the entire remote archive to be transmitted. Only the vocabulary list and word counts, along with an abridged version of the document, are needed to update the local information base. A typical technical report occupies 1 megabyte in PostScript form, or 250 kilobytes compressed, while the compressed indexing information (a list of words and their frequencies) occupies only 10 kilobytes. A cooperating site will download all the documents, strip out the text, and send the appropriate information to the New Zealand host. This activity can be carried out when the machines and the network are lightly loaded. Given the currently high Internet costs to New Zealand, this scheme will be significantly less expensive than downloading directly to New Zealand and performing the indexing there. If sufficiently many sites cooperated, new indexing information could be exchanged systematically in much the same way that Internet news is propagated.
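A minimal sketch of the information a cooperating site might forward for each report follows: a compressed list of the document's words and their frequencies, together with the report's retrieval address. The record layout, identifier, and URL in the example are illustrative assumptions, not a defined protocol.

    import gzip
    import json
    from collections import Counter

    def indexing_record(report_id, text, url):
        """Bundle one report's word-frequency vector and retrieval URL for transmission."""
        freqs = Counter(text.lower().split())
        record = {"id": report_id, "url": url, "terms": dict(freqs)}
        return gzip.compress(json.dumps(record).encode("utf-8"))

    if __name__ == "__main__":
        # Hypothetical report identifier and retrieval URL, for illustration only.
        sample = "a small sample of extracted report text " * 200
        payload = indexing_record("site-tr-95-01", sample,
                                  "ftp://ftp.example.edu/pub/tr/tr-95-01.ps.Z")
        print(f"{len(payload)} bytes to transmit for this report")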
* Timestamp: A coarse type of publication date search is supported by specifying a desired range of dates in which the technical report was entered into its repository. Given that a number of repositories are digitizing their older paper reports, this type of search is likely to produce uneven results (since the timestamp can only record the date that the report was inserted into the database, not the date that it was originally produced). However, we expect timestamp search to become more accurate as the repositories "catch up" on their retrospective conversion.
* Initial page: In the vast majority of reports, the first page contains important bibliographic information such as the title, the author, and the author's institutional affiliation. By limiting a search to this first page, the user can approximate a search on these fields (for example, an initial-page search for "Knuth" will retrieve documents authored by Knuth, but not documents that merely cite his previous work).
This approach avoids requiring system administrators or report providers to provide formal cataloging for technical reports, in accordance with our goal to eliminate the need for active participation from repositories. An intelligent document parser such as HARVEST's ESSENCE summarization system ([9], [10]) could potentially provide more precise bibliographic information, but at the expense of requiring a significantly more complex and filetype-dependent indexing system.
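The sketch below shows how both restricted search fields might be derived with no cataloging effort: the timestamp is taken from the file's modification date, and the initial page is approximated by the text preceding the first form-feed character that many PostScript-to-text converters emit between pages. Both heuristics are assumptions made for illustration; a fixed-length prefix serves as a fallback.

    import os
    import time

    def document_fields(text_file):
        """Derive timestamp and initial-page fields from an extracted-text file."""
        mtime = os.path.getmtime(text_file)                  # when the report appeared locally
        with open(text_file, encoding="utf-8", errors="replace") as f:
            text = f.read()
        first_page = text.split("\f", 1)[0] or text[:2000]   # fall back to a fixed prefix
        return {
            "timestamp": time.strftime("%Y-%m-%d", time.gmtime(mtime)),
            "initial_page": first_page,
            "full_text": text,
        }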
Mg uses standard techniques for producing ranked query output. A list of index terms and term frequency statistics from the documents in the collection are used to assign weights to terms. Using the vector document representation, similarity between a query and document can be measured as the cosine of the angle between their two vectors. The user then need only provide a list of words relevant to the topic of interest, and the system automatically ranks documents according to their "closeness" to the query terms.
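A minimal sketch of this ranking process over the word-frequency vectors described earlier follows, using standard TF-IDF weighting and cosine similarity in the style of Salton and McGill [16]; mg's own weighting scheme and index structures are considerably more refined [18].

    import math
    from collections import Counter

    def rank(query, docs):
        """docs: {doc_id: Counter of term frequencies}; returns (score, doc_id) pairs."""
        n = len(docs)
        df = Counter()                                   # document frequency of each term
        for freqs in docs.values():
            df.update(freqs.keys())
        idf = {t: math.log(1 + n / df[t]) for t in df}

        def weights(freqs):
            return {t: (1 + math.log(f)) * idf.get(t, 0.0) for t, f in freqs.items()}

        q = weights(Counter(query.lower().split()))
        ranked = []
        for doc_id, freqs in docs.items():
            d = weights(freqs)
            dot = sum(q[t] * d.get(t, 0.0) for t in q)
            norm = (math.sqrt(sum(v * v for v in q.values())) *
                    math.sqrt(sum(v * v for v in d.values())))
            ranked.append((dot / norm if norm else 0.0, doc_id))
        return sorted(ranked, reverse=True)

    if __name__ == "__main__":
        docs = {"tr-1": Counter("full text retrieval of technical reports".split()),
                "tr-2": Counter("compressing and indexing documents".split())}
        print(rank("full text indexing", docs))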
However, an all-inclusive information collection policy is inherently unscalable and will become infeasible if the Internet continues to grow exponentially. One alternative is a socially facilitated, access-dependent scheme for pruning the information base. This works by monitoring every participating user's (that is, every New Zealand computer science researcher's) access to technical reports, and in particular noting the sites that see the least use. These sites are prime candidates for removal when the size of the collection becomes unmanageable. The idea is that only potentially `interesting' sites are included, where `interesting' is defined as `has been accessed by a colleague.' This means that the rate of growth of the collection, and hence the resources it consumes, is governed by the size, diversity, and level of activity of the user population rather than by the rate of growth of the bibliographic universe.
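A sketch of the bookkeeping this policy requires appears below; the log format and the fraction of sites retained are assumptions chosen for illustration.

    from collections import Counter

    def pruning_candidates(access_log, all_sites, keep_fraction=0.8):
        """access_log: iterable of (user, site, report_id) retrieval events."""
        use = Counter(site for _user, site, _report in access_log)
        ranked = sorted(all_sites, key=lambda s: use[s], reverse=True)
        keep = int(len(ranked) * keep_fraction)
        return ranked[keep:]          # the least-used sites, candidates for removal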
An interesting extension would be to measure the "potential for interest" of each new site before adding it to the collection. Here, we would match the index terms for the proposed new site against the terms found in those sites receiving the heaviest use. If the new site's terms do not sufficiently overlap the topics covered in the well-used sites (and, by extension, the areas that New Zealand computer scientists currently work in), then the site would not be included.
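The overlap test could be as simple as the following sketch, which compares the candidate site's vocabulary with the combined vocabulary of the most heavily used sites; the overlap measure and threshold are illustrative assumptions that would need tuning in practice.

    def potential_for_interest(candidate_terms, heavy_use_term_sets, threshold=0.2):
        """candidate_terms: set of index terms for the proposed site."""
        popular = set().union(*heavy_use_term_sets)
        overlap = len(candidate_terms & popular) / max(len(candidate_terms), 1)
        return overlap >= threshold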
On a more technical level, the current implementation of the indexing platform (mg) requires that the entire index be rebuilt each time a new document is added to the collection. A project at the University of Canterbury (Christchurch, New Zealand) has developed an extension (mgmerge) that allows several existing indices to be merged, so that an index can be accumulated incrementally [12].
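In outline, merging two partial indexes amounts to renumbering the documents of one and splicing its posting lists onto the other, as in the sketch below; this illustrates the principle only and is not mgmerge's implementation.

    def merge_indexes(index_a, ndocs_a, index_b):
        """Each index maps term -> sorted list of document numbers; ndocs_a is the
        number of documents covered by index_a."""
        merged = {t: list(p) for t, p in index_a.items()}
        for term, postings in index_b.items():
            shifted = [d + ndocs_a for d in postings]   # renumber past index_a's documents
            merged.setdefault(term, []).extend(shifted)
        return merged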
Surprisingly few previous reports on index or repository projects provide even a cursory analysis of the usage data they collect on their systems. As a sample of the types of analysis possible, Paul Ginsparg notes a weekly periodicity in the number of search requests made to the physics e-print archives. From this he infers that many physicists do not yet have weekend access to the Internet (an alternative, slightly more cynical hypothesis is that even high-energy theoretical physicists take the weekend off) [8].
In addition to monitoring user access times, we will also record the specific documents retrieved by users. Analysis of the index terms for these documents will allow us to create user and site profiles that characterize the types of research carried on in these departments.
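A sketch of this profiling step, reusing the illustrative log and word-frequency structures from the earlier sketches: the index terms of every document fetched by a department's users are accumulated, and the most frequent terms characterize that department's interests. The mapping from users to departments is an assumed input.

    from collections import Counter, defaultdict

    def department_profiles(access_log, doc_terms, user_department, top_k=20):
        """access_log: (user, site, report_id) events; doc_terms: report_id -> Counter."""
        profiles = defaultdict(Counter)
        for user, _site, report_id in access_log:
            profiles[user_department[user]].update(doc_terms.get(report_id, {}))
        return {dept: [t for t, _ in counts.most_common(top_k)]
                for dept, counts in profiles.items()}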
Finally, there has to date been no attempt to seriously examine the characteristics of the computing literature as represented by the contents of the technical report repositories. Studies that could be supported by the information contained in our index include:
* examining the "physical" characteristics of computer science technical reports (for example, the range of the size of these documents as measured by their word count)
* determining the obsolescence rate of computing literature by analyzing the range of dates in the references of technical reports (a rough sketch of this measurement follows the list)
* tracking shifts in the focus of individual computer science departments through analysis of changes in the terms used to index their technical reports
* detecting cycles or regularities in the rate of production of computing research, as measured by the timestamp of documents added to the repositories (for example, is more research produced over the summer, when the teaching load is lighter? or is research steadily produced throughout the year?)
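As a rough sketch of how the second of these studies might be approached, and under strong assumptions, four-digit years can be pulled from the reference section of each report's extracted text and compared with the report's own date; real reference parsing would of course need to be far more careful than a single regular expression.

    import re
    from statistics import median

    YEAR = re.compile(r"\b(19\d\d)\b")

    def citation_ages(full_text, report_year):
        """Ages (in years) of the works cited by one report."""
        refs = full_text.rsplit("References", 1)[-1]       # crude reference-section split
        cited_years = [int(y) for y in YEAR.findall(refs)]
        return [report_year - y for y in cited_years if 0 <= report_year - y <= 60]

    def median_citation_age(reports):
        """reports: iterable of (full_text, report_year) pairs."""
        ages = [a for text, year in reports for a in citation_ages(text, year)]
        return median(ages) if ages else None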
Above all, the project demonstrates one way of dealing with the new realities of publishing, in which information is provided in a widely distributed manner and it is up to the information consumer to locate what is needed.
<URL:http://www.cs.cmu.edu:8001/afs/cs.cmu.edu/user/jblythe/Mosaic/cs-reports.html>
2. Bowman, C., Danzig, P., Hardy, D., Manber, U., and Schwartz, M. Harvest: A scalable, customizable discovery and access system, Technical Report CU-CS-732-94, Department of Computer Science, University of Colorado, Boulder, Colorado, 1994.
<URL:ftp://ftp.cs.colorado.edu/pub/cs/techreports/schwartz/Harvest.ps.Z>
3. Bowman, C.M., Danzig, P.B., Manber, U., and Schwartz, M.F. Scalable Internet resource discovery: Research problems and approaches, Communications of the ACM 37(8), 1994, 98-107.
4. Davis, J. and Lagoze, C. Dienst, a protocol for a distributed digital document library, Internet Draft (work in progress), 1994.
<URL:http://cstr.cs.cornell.edu/Info/dienst_protocol.html>
5. Davis, J. and Lagoze, C. "Drop-in" publishing with the World Wide Web, Proceedings of the Second International WWW Conference, Chicago, 1994.
<URL:http://www.ncsa.uiuc.edu/SDG/IT94/Proceedings/Pub/davis/davis-lagoze.html>
6. Davis, J. and Lagoze, C. A protocol and server for a distributed digital technical report library, Technical Report 94-1418, Computer Science Department, Cornell University, 1994.
<URL:http://cs-tr.cs.cornell.edu/TR/CORNELLCS:TR94-1418>
7. Ginsparg, P. After dinner remarks: 14 Oct '94 APS meeting at LANL, 1994.
<URL: http://xxx.lanl.gov/blurb>
8. Ginsparg, P. First steps towards electronic research communication, Computers in Physics 8(4), 1994, 390-401.
9. Hardy, D., and Schwartz, M.F. Essence: A resource discovery system based on semantic file indexing, Proceedings of the USENIX Winter Conference, 1993, 361-374.
10. Hardy, D., and Schwartz, M.F. Customized information extraction as a basis for resource discovery, Technical Report CU-CS-707-94, Department of Computer Science, University of Colorado, Boulder, Colorado, 1994. To appear in ACM Transactions on Computer Systems.
<URL:ftp://ftp.cs.colorado.edu/pub/techreports/schwartz/Essence.Jour.ps.Z>
11. Harris, R. Computer science technical reports archive sites.
<URL:http://www.rdt.monash.edu.au/tr/siteslist.html>
12. Hudson, S. Dynamic Inverted Files for Full-text Retrieval. Proceedings of the Second New Zealand Research Students Conference, Waikato University, Hamilton, New Zealand, April 1995, 103-110.
13. Maly, K., Fox, E.A., French, J.C., and Selman, A.L. Wide area technical report server, Technical Report, Department of Computer Science, Old Dominion University, 1994.
<URL: http://www.cs.odu.edu/WATERS/WATERS-paper.ps>
14. NASA. Technical reports, preprints and abstracts (list of sites supporting digital libraries or indexing services).
<URL:http://www.larc.nasa.gov/org/library/abs-tr.html>
15. Nelson, M.L., Gottlich, G.L., and Bianco, D.J. World Wide Web implementation of the Langley Technical Report Server, NASA Technical Memorandum 109162, Langley Research Center, Hampton, Virginia, 1994.
<URL: ftp://techreports.larc.nasa.gov/pub/techreports/larc/94/tm109162.ps.Z>
16. Salton, G., and McGill, M.J. Introduction to modern information retrieval, McGraw-Hill Book Company, 1983.
17. VanHeyningen, M. The Unified Computer Science Technical Report Index: Lessons in indexing diverse resources, Proceedings of the Second International WWW Conference, Chicago, 1994.
<URL: http://www.cs.indiana.edu/ucstri/paper/paper.html#ref-odlyzko>
18. Witten, I., Moffat, A., and Bell, T. Managing Gigabytes: Compressing and indexing documents and images, Van Nostrand Reinhold, 1994.