<html>
<head>
<title>
DL94: Data Retrieval and the Realities of Document Conversion
</title>

</head>
<body>
<!--#include virtual="/DL94/header.ihtml" -->

<h1>
Data Retrieval and the Realities of Document Conversion
</h1>

<author>
William B. Cavnar and Andrew M. Gillies
</author>
<i>
Environmental Research Institute of Michigan, P.O. Box 134001, Ann Arbor, MI 48113-4001, cavnar@erim.org
</i>
<h3>
1.0.  Motivation</h3>
One of the main thrusts of the Digital Libraries Initiative is the conversion
of large numbers of paper and microfilm documents into electronic form using
scanning coupled with optical character recognition (OCR). There are
commercial ventures attempting similar projects, such as producing CD-ROMs
that hold both document images and text derived from OCR of those images.
Because of the inevitable problems with recognition, these efforts require
expensive manual cleanup and indexing of the generated text in order to
provide reasonable access to the documents. <p>

These problems will not be trivial to solve. Electronic documents produced by
scanning and OCR, even under ideal circumstances, can contain ten to twenty
recognition errors per page of text, and significantly more if the document
image is in some way degraded. OCR errors are especially prevalent in
documents with image-quality problems, such as discolored backgrounds and
faded ink from aging in paper documents, or noise and low contrast in
microfilm. Such documents can then become inaccessible to conventional
retrieval over their OCR output. <p>

Another important area that needs considerable attention is converting the
non-text portions of documents, e.g., images, graphs, diagrams, and tables,
into retrievable electronic form. Conventional OCR is of no use in accessing
the text embedded in these portions, and retrieval methods are also required
for their non-textual content. <p>

A third area that has received relatively little attention is the conversion
of handwritten documents. Although these are obviously much harder to convert
than machine-printed documents, they are still important, especially in
dealing with historical records. <p>

Our research group at the Environmental Research Institute of Michigan (ERIM)
has been pursuing a number of different research efforts that address these
data conversion problems. All of these efforts have in common an emphasis on
inexact matching. In other words, we build all our databases and indexes under
the assumption that there will be errors that prevent perfect matches, and
that our systems must deal with this reality. We feel that the application of
inexact-matching techniques is crucial to accessing the products of
large-scale document conversion projects. <p>

<h3>

2.0. N-Gram-Based Inexact Matching</h3>
One approach to text retrieval that overcomes some of these problems with
electronic text generated by conventional OCR is to use various forms of
high-performance N-gram analysis on the documents. In this context, an N-gram
is an N-character slice of a character string. By treating a word not as a
single unit but as a set of overlapping N-grams, this approach gains a natural
redundancy that provides a means of accessing text containing these kinds of
errors. Suppose a word in a document, <i>RETRIEVAL</i>, has been misrecognized
by an OCR system in a given passage of text as a second word,
<i>RETBIEUAL</i>. A conventional word-based index would miss a passage
containing this misrecognized form of the word. However, by breaking the text
passage into its constituent N-grams, we find that the two words share 6 out
of 10 possible bi-grams. Another advantage is that N-gram analysis essentially
provides word stemming for free. For example, <i>RETRIEVAL</i> and
<i>RETRIEVING</i> share 7 out of 10 bi-grams. This 'free' word stemming
becomes even more attractive in a multi-lingual environment, since the system
can perform this kind of analysis without any language-specific knowledge at
all. Both aspects, retrievability and word stemming, are completely automatic
and require no manual cleanup of the text.<p>
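
To make the bi-gram arithmetic above concrete, here is a minimal Python sketch
of the decomposition and overlap count. (The padding convention, one
underscore at each end of the word, is our assumption; it is one common choice
and yields the 10 bi-grams per word used in the example.) <p>
<pre>
def ngrams(word, n=2, pad="_"):
    # Padded N-gram list; bi-grams of "CAT" are _C, CA, AT, T_
    padded = pad * (n - 1) + word + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def shared(a, b, n=2):
    # Number of distinct N-grams the two strings have in common
    return len(set(ngrams(a, n)) & set(ngrams(b, n)))

print(len(ngrams("RETRIEVAL")))             # 10 bi-grams
print(shared("RETRIEVAL", "RETBIEUAL"))     # 6 shared bi-grams
print(shared("RETRIEVAL", "RETRIEVING"))    # 7 shared bi-grams
</pre>
<p>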

We have been working with several N-gram-based representations of documents,
each of which has advantages in different cases. One method uses superimposed
coding techniques to produce compact indexes for fast retrieval of
sub-documents given a query string [Cavnar92], [Cavnar93]. Another approach
uses N-gram frequency statistics to characterize a document for matching
purposes [Cavnar94]. Both methods are active research areas at this time. <p>
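
As an illustration of the frequency-statistics approach, the following sketch,
patterned after the profile comparison described in [Cavnar94], reduces a text
to a ranked list of its most frequent N-grams and measures how far each
N-gram's rank shifts between two such profiles. The tri-gram size and profile
length of 300 are illustrative choices, not parameters taken from the paper. <p>
<pre>
from collections import Counter

def profile(text, n=3, top=300):
    # Ranked list of the text's most frequent N-grams, most frequent first
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(p1, p2):
    # Sum over p1's N-grams of how far each one's rank shifts in p2;
    # N-grams absent from p2 incur a fixed maximum penalty
    rank = {g: i for i, g in enumerate(p2)}
    penalty = len(p2)
    return sum(abs(i - rank[g]) if g in rank else penalty
               for i, g in enumerate(p1))
</pre>
A new document is then matched to whichever stored profile yields the smallest
out-of-place distance. Because scattered OCR errors corrupt only a few
N-grams, they merely perturb the distance rather than destroying the match. <p>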

<h3>

3.0. Oversegmentation-Based Word Matching </h3>
Another approach we have been pursuing is recognizing and matching characters
in the context of whole words. ERIM has developed an extremely successful
approach to word recognition based on text oversegmentation. In this approach,
words are segmented into primitive elements that are character-sized or
smaller. These elements are then reassembled into characters and words using
feedback from a neural-network-based recognition system. The method makes use
of domain-specific lexicons, that is, word lists, to constrain the recognition
process. Using these techniques, a practical system for handwritten mail
sorting was developed [Gillies92], [Ganz92]. An early prototype of the system
was able to sort 34% of the handwritten mail with a 3.4% error rate. <p>
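
The lexicon-constrained reassembly can be sketched as a dynamic program. In
the toy version below (our illustration, not ERIM's implementation), a
hypothetical recognizer has already assigned log-scores to merges of adjacent
primitive segments against candidate characters; the program then finds, for
each lexicon word, the best partition of the segment sequence into that word's
characters. <p>
<pre>
def word_score(seg_scores, n_segs, word, max_merge=3):
    # seg_scores[(i, j)][ch]: hypothetical recognizer log-score that
    # primitive segments i..j-1, merged together, form character ch
    NEG = float("-inf")
    # best[i][k]: best score for reading word[:k] from segments[:i]
    best = [[NEG] * (len(word) + 1) for _ in range(n_segs + 1)]
    best[0][0] = 0.0
    for i in range(n_segs):
        for k in range(len(word)):
            if best[i][k] == NEG:
                continue
            # Try merging 1..max_merge primitives into the next character
            for j in range(i + 1, min(i + max_merge, n_segs) + 1):
                s = seg_scores.get((i, j), {}).get(word[k], NEG)
                if s != NEG:
                    best[j][k + 1] = max(best[j][k + 1], best[i][k] + s)
    return best[n_segs][len(word)]

def recognize(seg_scores, n_segs, lexicon):
    # The lexicon constrains recognition: only listed words are candidates
    return max(lexicon, key=lambda w: word_score(seg_scores, n_segs, w))
</pre>
A confidence-based rejection criterion, such as the one used in the census
test described next, could compare the best word's score against the
runner-up and reject the field when the margin is too small. <p>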

In February 1994, a more advanced version of this technology was used in a
competitive test of handwritten OCR systems conducted by NIST for the Census
Bureau. In the test, 9000 occupation fields from 1990 census forms were
processed. The system used a lexicon of 8000 occupation-related words and
phrases, which contained the correct response for 62% of the 9000 fields; the
remaining fields contained responses not present in the lexicon. Using a
confidence-based rejection criterion, the ERIM system was able to recognize
40% of the fields with a 3.6% error rate. An error was counted whenever the
response differed from the correct response in any fashion; the average field
contained less than one character error. ERIM's system was the most accurate
of the eight systems tested, with error rates significantly lower than those
of the nearest competitor. <p>

More recently, ERIM has demonstrated a system for keyword-based retrieval of
documents from a handwritten document database. The system uses the
oversegmentation technology to prepare documents for entry into the database.
The operator can query the database for documents containing keywords or
phrases, and the system presents the operator with images of the retrieved
documents, allowing for verification and perusal of their contents. This
technique is also applicable to recognizing machine-printed text: we are
currently employing oversegmentation-based matching to recognize
machine-printed Arabic words. This work is in progress, but is already showing
very promising results. <p>

<h3>

4.0. Image-Based Retrieval</h3>
Yet another class of approaches that we have recently begun investigating
bypasses OCR altogether by attempting to perform matching in the image domain.
One early system we developed retrieved technical abstracts by converting
query words into image surrogates and then matching them against image
surrogates of the words in the abstracts. We are also in the preliminary
stages of developing several systems that query image databases using image
queries for medical and radar imagery. These systems allow the user to
retrieve documents based on whether they contain sub-images like those
provided in the query. Image-based retrieval must provide inexact matching to
have any hope of success. <p>
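
As one illustrative possibility, an assumption on our part rather than a
description of the actual system, a query word can be reduced to a coarse
word-shape code (ascender, descender, or x-height class per letter) and
matched inexactly, via edit distance, against shape codes extracted from word
images, all without performing OCR. <p>
<pre>
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")

def shape_code(word):
    # Coarse image-domain surrogate for an ASCII query word:
    # A = ascender, D = descender, x = x-height letter
    return "".join("A" if c in ASCENDERS else
                   "D" if c in DESCENDERS else "x"
                   for c in word.lower())

def shape_distance(a, b):
    # Inexact comparison of two shape codes (Levenshtein edit distance)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[len(b)]
</pre>
<p>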

<h3>


References
</h3>

[Cavnar92] Cavnar, William B., and Vayda, Alan J., "Using Superimposed Coding
of N-gram Lists For Efficient Inexact Matching", Proceedings of the Fifth USPS
Advanced Technology Conference, Washington, D.C., 1992, pp. 253-267. <br>
<br>
[Cavnar93] Cavnar, William B., "N-Gram-Based Text Filtering For TREC-2", to
appear in the Proceedings of the Second Text REtrieval Conference (TREC-2),
ed. D.K. Harman, NIST, Gaithersburg, Maryland, 1993. <br>
<br>
[Cavnar94] Cavnar, William B., and Trenkle, John M., "N-Gram-Based Text
Categorization", to appear in the Proceedings of the 1994 Symposium on
Document Analysis and Information Retrieval, University of Nevada, Las Vegas,
April 1994. <br>
<br>
[Ganz92] Ganzberger, Margaret J., Rovner, Richard M., Hepp, Daniel J.,
Gillies, Andrew M., Lake, Charles D., and Wood, Cyrus R., "A System For
Handwritten Address Interpretation", Proceedings of the Fifth USPS Advanced
Technology Conference, Washington, D.C., 1992, pp. 337-351. <br>
<br>
[Gillies92] Gillies, Andrew M., "Cursive Word Recognition Using Hidden Markov
Models", Proceedings of the Fifth USPS Advanced Technology Conference,
Washington, D.C., 1992, pp. 557-562. <br>


<!--#include virtual="/DL94/footer.ihtml" -->
Last Modified: <!--#echo var="LAST_MODIFIED" --> <br>
</body>
</html>
