We present two simple ideas: text-image maps coordinate text and image representations of a document, and our probabilistic search heuristics generate hypertext links from the text of citations. Using the World-Wide Web, we built an interface that lets readers move from a bibliography entry to the cited document with a mouse click. Similarly, readers can click on entries in the table of contents and move directly to them.
The use of bitmapped images also poses problems for one of the most basic uses of a digital library: following a reference from one document to another. A user is browsing a document -- looking at images of the document pages-- and discovers a reference to another work that is potentially of interest. The user highlights the reference with his or her mouse and clicks a button; the document that was referenced appears in a new window on the user's screen.
The preceding vision of a digital library -- a collection of page images with a hypertext-like interface -- is due to Saltzer [8]. Our work examines some of the issues involved in implementing this vision.
The use of images as the primary data format gives rise to a common question. What is the advantage of images over Postscript or SGML or some other format? We see two primary motivations for page images. First, images are a good format for long-term, archival storage for three primary reasons:
Changes in disk, network, and display technology are rapidly making it feasible to build a large library of digital images, but the problem we address is the software support that will be needed to build hypertext documents with images. First, we describe text-image maps that allow users to treat images the same way they would treat text when they highlight regions and select references. Second, we use those text-image maps to generate hypertext links between documents on-the-fly.
Several other digital library projects have used images to display documents, including CORE [2], Dienst [6], RightPages [3], and TULIP [7]. Our work is different primarily because it focuses on a closer integration between text and image than has been reported before; we link text and image at the word level instead of the paragraph level. Also, much of the published work on CORE and RightPages discusses image processing and recognition, which we do not deal with.
When a user sees an interesting reference and wants to get a copy, he clicks on it; the system will receive the coordinates of the user's mouse click. Before the system can attempt to locate the document the user wants, it must use the image coordinates to locate the actual text that the user clicked on. Once this text is located, the system can use the search heuristics we describe to find the document cited.
The text-image map is nothing more than a list of words, and the coordinates of their location on the page. Given a pair of x,y coordinates, it requires a simple lookup to determine what word (if any) is at that location; given a word, the coordinates of its location on the page can be looked up.
The text-image maps we currently use are generated by a Perl script that parses the XDOC source. XDOC describes pages as a series of lines; for each line, the format provides the location of the line on the page, typographic information about the text, the actual words, and the location of spaces between the words. Based on the line's location, the height of the font, and the width of the inter-word spaces, we can reconstruct the coordinates of a rectangle that bounds the word. The text-image map is a simple listing of each word along with its bounding rectangle. On an RS/6000 PowerStation 250, it takes between a half second and a second to generate a single image map using a Perl interpreter.
Our system allows users to view images a page at a time using a World-Wide Web browser. We use the imagemap interface to allow users to click on a single point on the image. The service locates the single word the user clicked on and the text of the entire line clicked on. The demonstration returns the text that the user clicked on and a hypertext link that will cause the entire line of text to be used as input to our search heuristics.
The position information provided by a text-image map (or by the raw XDOC files) can also be used to identify particular features of a document and build links to them. For example, the page number can often be located on a page by looking for a number with little or no text nearby that falls at the top or bottom of the page. The extra position information also distinguishes a heading labeled "Bibliography" from an occurrence of the word "Bibliography" in the text of a document.
Our system is somewhat primitive, limited primarily by its use of the World-Wide Web as an interface. The Web interface is limiting because the browser only allows users to send the coordinates of a single mouse-click in an HTTP request; thus we can identify only the exact word and line the user clicks on, which does not always provide enough information to perform the catalog lookup. In an earlier paper [4], Hylton recommended several changes to the Web that would allow it to handle image-based documents better.
The difficulty with document sources is that there is no standard for bibliography creation which is strictly adhered to. Different style guides recommend many different formats for bibliographies. Even within the same style guide, the citation is different for books, journals, and video tapes.
Human entry is also a problem. Even with strictly defined rules for bibliography formation, humans will take certain liberties in the creation of the text. For example, the name "Guy Lewis Steele Jr." can be written at least seven ways, from Guy Steele to G. L. Steele Jr.. The problem is compounded by the fact that the order of first and last names varies between bibliography types.
Human entry is also very error-prone. Although most published documents have been checked for accurate bibliographies, typographical errors still abound, especially in older documents. An incorrect date could result in the inability to find the document in the library database. While an obvious misspelling may be easy to catch, the appearance of 1982 instead of 1983 is more subtle. It is also possible that the difference is not actually a mistake. A paper may have been accepted for publication in 1982, but not published until 1983. Depending on when the reference was created, the bibliography will reflect this fact.
The intermediate step between image and text is also riddled with problems. Although optical character recognition technology is improving, it is not yet perfect, and most packages lack useful features such as word to location mapping. There are still a number of cases in which OCR fails to properly read a document, and typographical mistakes result. Just as with human typographical errors, it is not an easy task to counter this problem when parsing the bibliography information.
Our search heuristic functions less than optimally when parsing partial bibliographies. The limited Web interface makes this problem more pronounced.
On the server side we again encounter the same two problems as the document source. Library databases are stored as MARC records, RFC-1357, WAIS, and many other formats that are library specific.
Additionally, the amount of information represented in the database is varied. For example in the RFC-1357 format [1], the only three mandatory fields are BIB-VERSION, ID, and ENTRY date. With only the bibliography information in hand, it would be impossible to ascertain if the item exists in a library whose database contains only these three fields.
As an experiment a heuristic was created that attempted to extract the author's name and the date from the bibliography string, because the author's name was generally unique and the date would limited the number of results when the author's name was common. For example, looking up the name "Smith" in the local reading room catalog (which contains on the order of 20,000 records) reveals 127 entries. Looking up "Smith and Mar," narrows the returned records to 10, and looking for "Smith and Mar and 1988," brings the number of records to 2. Because of the arbitrary use of abbreviations, searching the catalog for "Smith and March" yields no results.
To deal with the date problem, a database was added to the heuristic which contained possible abbreviations for month names. For September the entries were, "Sept Sep September 9." The heuristic would then iteratively search the library for the author's name, and each of the possible abbreviations for the month element in the bibliography.
The name and date approach failed not because of the complex way in which dates were extracted, but rather the way in which the author's name was determined. It was based on the assumption that the first element in the bibliography string was the author's name. Even if it was not the author's name, chances were that it was part of the title, and so was unique in that way. However, there is no guarantee that the user in clicking on the imagemap was able to select a complete bibliography entry. The line received may only contain the title, leading us to the following problem: no date can be extracted, and the first word can be a commonly used word, e.g. "the." Experiments showed that the name and date extraction heuristic performed poorly, and the idea was discarded.
We also explored using WAIS[5], or similar relevance ranking search methods. In WAIS, entries are returned and are ranked based on the number of times the search words occur in each entry. However, in order to use this scoring method it would be necessary to convert the catalog of the library to WAIS, not a minor task for any library, and not one which we attempted. The number of services that would break with a change to the library interface is too large for such a change to even be considered. However, the general idea of relevance searching is very powerful, and is similar to the heuristic we use.
1. All stop words are removed from the bibliographic string, abbreviations and punctuation marks are deleted. These include words such as "of, the, and." In addition, words such as "computer" in a computer science library resulted in a large number of database hits. The useful feature of a stop word list is that it acts to focus the result set. Our particular service uses a stop word list that is optimized for a computer science library, and contains over 60 words. The list will change by necessity for different types of libraries. For example, we might wish to make "DNA" a stop word for a biology library.
2. From the resulting cleaned bibliography string the heuristic attempts to create at most 12, three-word sets. The words are selected randomly.
3. Each of the three-word sets is sent to the library search interface. The library database returns a list of identifiers which are unique pointers to database entries. The identifiers are "scored" by the number of times they are returned. For example: if an identifier is returned by the server as a result of ten searches, it has a score of ten.
We chose a 12/3 system for the following reasons: If the number of word sets is decreased from twelve, the scoring of the identifiers does not allow for a definite answer. However, increasing the number beyond twelve slows the search process to a speed unacceptable to users. Additionally, we do not want to choose less than three words for each set because the number of results would be high. However, we don't want to choose a number much greater then three because the logical "and" operation when used with a large number of words in a search tends to fail.
4. Identifiers are sorted by their score, top scoring identifiers first. The identifiers are handed to the library server, and the returned database entries are displayed to the user.
This heuristic has proven to be remarkably successful for a number of reasons: it is fast, it does not require any knowledge of bibliography formats, and it deals well with bibliography strings with errors.
As an example of error handling performance, the original bibliography string: "d. knuth, the art of computer programming volume 2 seminumerical algorithms 2nd edition addison-wesley 1981," was mutated to include two errors (seminumrical and algothims). Running the second (erroneous) string through the heuristic 50 times yielded the following results. The percent of tests in which the correct library record was returned as the top scoring result was 57.3. The percent of times the correct record was returned at all was 99.8. Probabilistically, it is possible that the entry is in fact in the library database but the heuristic is unable to find it because it has chosen the wrong random words.
Probably the best way to categorize the heuristic is as a "best-effort." Statistically, the results are very optimistic. However, a user may find the randomness to be discomforting. That is, running the same bibliographic string through the heuristic may yield two different results. It is therefore necessary to attach a disclaimer, that if a item is not found by the heuristic, it does not mean it is not in the library catalog.
In the current demonstration, we needed to identify a correspondence between image numbers, which typically start with a cover page image, and page numbers, which often begin after the table of contents or acknowledgments. It is not unusual for a document to start with 10 unnumbered pages, so page 1 is actually image 11. Our simple search heuristic identified page number as a number with no other text on the same line that fell at either the top or bottom of the page.
From the heuristic viewpoint there is also room for improvement. The 12/3 combination of sets and words was seemed to yield the best results. By adjusting the numbers of sets and words we determined the most useful balance between accuracy and speed. However, the tests were done on a rather stable group of bibliography strings. We know from the experience of pulling text from images that the assumption of receiving useful information from the reference is a false one.
It is probably here where we can most improve the heuristic. No matter what the size of the bibliography string, the heuristic will still use twelve word sets, three words each. It may be of benefit both in speed and accuracy to use a variance system. By varying the number of word sets and words(within some limits) depending on the number words in the bibliography string, it may be possible to attain this improvement.
The second improvement can be obtained on a higher level. By improving the quality and quantity of material retrieved from the images the heuristic will greatly benefit. In its current state a user can click on the wrong line of text in the image and get the line "2nd edition, Addison-Wesley 1981." The heuristic works best when it is given "quality" information. So a string such as, "D. Knuth, The Art of Computer Programming volume 2 Seminumerical," will work far better then the line described above.
This work was supported in part by the IBM Corporation, in part by the Digital Equipment Corporation, and in part by the Corporation for National Research Initiatives, using funds from the Advanced Research Projects Agency of the United States Department of Defense under grant MDA972-92-J1029.