<html>
<head>
<title>
DL94: Intelligent Data Retrieval from Raster Images of Documents
</title>
</head>

<body>

<!--#include virtual="/DL94/header.ihtml" -->

<h1>
Intelligent Data Retrieval from Raster Images of Documents
</h1>

Sargur N. Srihari, Stephen W. Lam, Jonathan J. Hull,
Rohini K. Srihari, and Venugopal Govindaraju<p>

<em>Center of Excellence for Document Analysis and Recognition (CEDAR),<br>
State University of New York at Buffalo, UB Commons, 520 Lee Entrance, Suite 202,
Amherst, NY 14228-2567</em><p>
<b>Email:</b>  srihari@cedar.buffalo.edu, <b>Voice:</b> (716) 645-6164, 
<b>Fax:</b>  (716) 645-6176<p>
<p>
<b>Abstract</b><p>
Documents on conventional media, such as books, newspapers and microfiche, can
be converted into the digital form of raster images by the use of scanners. A
digital library is a server on a computer network that can respond to user
requests by retrieving relevant data contained within raster image documents as
well as in other formatted ASCII documents. The task is to automate the
analysis of data contained in raster image documents for the purpose of
intelligent information retrieval in a digital library.<p>
The goal is to develop computational theories and algorithms, with
contributions to the fields of <i>document understanding</i> and <i>intelligent
interactive information retrieval</i>. The limitations of a technology
necessary to convert books to text-searchable form, <i>viz.</i>, optical
character recognition, will be addressed. Specific research agenda items
include: adaptive document understanding, robust page layout analysis, table
understanding, intelligent text recognition, graphics analysis, topic
categorization, content-based retrieval of captioned pictures and query
processing for information retrieval.<p>
<p>
<b>Keywords</b>: Document image understanding, OCR, pattern recognition,
artificial intelligence, page layout analysis, document scanning.<p>
<b>Introduction</b><p>
Advances in computer technology during the past decade have resulted in
electronic means for generating, storing, retrieving and using data of all
types. Yet there remain large volumes of documents that exist only in paper
form, <i>e.g.</i>, books, or in the form of images of paper documents,
<i>e.g.</i>, microfiche. It is now straightforward to convert such paper
documents and their images into digital form by a process of scanning that
yields a raster image of each page. In a raster image the reflected light
intensities of discrete points on the page are stored as pixel values. Such
images not only consume enormous amounts of storage compared to their formatted
ASCII counterparts, but the technology for searching such documents by content
(<i>e.g.</i>, keywords) is also seriously lacking. In the future, many documents
will continue to be available only in paper/raster form for reasons which
include convenience (paper is easy to carry around), non-standardization of
document preparation formats, and the global and contextual cues that images of
paper documents provide to humans that are absent in pure ASCII text
presentation.<p>
This paper describes the project of automating the task of analyzing data
contained in raster image documents for the purpose of intelligent information
retrieval from a digital library [DL]. A DL is envisioned as a server connected
to a computer network of users where the server has the ability to respond to
user requests by retrieving relevant information from document data stored on
an electronic storage medium. The specific research topics are determined by
the architecture of the DL, which is shown in Figure&nbsp;1. It consists of an
off-line process wherein document data is captured and prepared for storage,
and an on-line process where queries are received and processed for information
retrieval. There are three major computational processes shared between the
off-line and on-line processes of the DL: document data capture, data
integration/indexing and information retrieval.<p>
The three processes mentioned above are described in more detail below.<p>
<b>Document Data Capture</b><p>
Conversion of raster images of text into text-searchable form is the technology
popularly known as optical character recognition [OCR]. At present, although
there are a half-dozen commercial software packages available, the technology
is widely acknowledged to be lacking&nbsp;[1]. Recognition problems occur when
there is a variety of page layouts, fonts, print qualities and contextual
differences. Such variations are commonly found in a typical library archive. A
recent study&nbsp;[2] suggested that such capability in reading machines
is still decades away. Several promising methods have recently been developed
at CEDAR. This research will focus on improving the methods for processing
large volumes of documents. <p>

<p>
<img src="figures/srihari1.gif"><p>
<i>Figure 1. Major components of the digital library.</i><p>

The algorithms necessary for conversion of raster image data into a form that
supports intelligent retrieval can be categorized as performing one of three
functions: analysis, recognition or understanding. Analysis algorithms map
images to other images. Recognition algorithms map images to symbolic classes.
Understanding algorithms map images to symbol structures. The following five
topics are of importance for DLs: adaptive document understanding, robust page
layout analysis, intelligent text recognition, tabulated data interpretation,
and graphics recognition. Success in these tasks will allow a majority of
documents to be processed within the DL.<p>
<b>Adaptive Document Understanding</b><p>
A document understanding [DU] system begins with a raster image representation
and obtains a high-level "useful" representation of document content. DU
systems can be said to be either "closed" or "open," depending on whether or
not they are designed for a particular document type. DU systems in use today
are all closed systems, <i>i.e.</i>, designed for a particular document domain
(such as forms, specific journals and business letters). Closed systems employ
knowledge specific to the document type to be processed. They cannot be adapted
easily to other types of documents.<p>
An open system architecture has been developed at CEDAR. It is designed for
multi-page documents, <i>i.e.</i>, it generates a logical interpretation of a
document by combining information on different pages of the document. The
architecture (see Figure&nbsp;2) consists of three components: control,
knowledge base and tool box. The <i>Control</i> possesses some general purpose
document analysis strategies and the <i>Tool Box</i> contains a set of generic
document image processing tools that are applicable to different documents. The
system has no prior knowledge about document domain. The use of strategies and
tools relies solely on the knowledge of the document of interest defined in the
<i>Knowledge Base</i>. Since document-specific knowledge is not part of the
system, it can be viewed as input (in addition to the document images) to the
system. A prototype system based on this architecture has been developed at
CEDAR to process a variety of documents such as forms, IEEE journals and postal
mailpieces&nbsp;[3, 4]. The research focus is on how to use knowledge to adjust
the system reading strategies for handling different types of library
documents.<p>
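To make the separation between control, tools, and knowledge concrete, the
following simplified Python sketch drives generic tools from a document-specific
knowledge base supplied as input; the function names, tool registry, and sample
knowledge entry are hypothetical illustrations rather than the CEDAR
implementation.<p>
<pre>
# Minimal sketch of an "open" document understanding architecture:
# generic tools plus a general-purpose control loop, driven entirely
# by a knowledge base supplied as input alongside the document images.

def segment_blocks(page):
    """Generic tool: split a page into regions (stub)."""
    return page.get("blocks", [])

def recognize_text(block):
    """Generic tool: OCR one block (stub)."""
    return block.get("text", "")

TOOL_BOX = {"segment": segment_blocks, "recognize": recognize_text}

def control(pages, knowledge):
    """General-purpose strategy: apply generic tools to every page and
    let the document-specific knowledge decide which regions matter."""
    interpretation = []
    for page in pages:
        for block in TOOL_BOX["segment"](page):
            label = knowledge["classify"](block)   # document-specific
            if label in knowledge["text_regions"]:
                interpretation.append((label, TOOL_BOX["recognize"](block)))
    return interpretation

# Knowledge for one document type (e.g., a journal page) is pure data:
journal_knowledge = {
    "text_regions": {"title", "body"},
    "classify": lambda block: block.get("label", "body"),
}
pages = [{"blocks": [{"label": "title", "text": "Digital Libraries"}]}]
print(control(pages, journal_knowledge))  # [('title', 'Digital Libraries')]
</pre>
<p>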
<b>Page Layout Analysis</b><p>
Layout analysis starts with block segmentation which decomposes the digital
image of a document page into regions. The process locates large streams of
white space (background analysis) running horizontally or vertically. Streams
that are considered region boundaries are used to partition the image into
regions. A region must be bounded by two horizontal and two vertical background
boundaries. The criteria used to decide whether a white stream is a region
boundary are derived from an analysis of all the white streams in the page
image. Non-boundary white streams, which are usually smaller in size, separate
elements of the same region, such as those between text lines. This global
statistical approach does not require that the size of the white streams, which
varies from page to page, be predefined. The adaptive method increases the
robustness of the segmentation process. Figure&nbsp;3 shows the regions of a
document after block segmentation.<p>
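As a rough illustration, the Python sketch below locates horizontal white
streams in a binary page image and keeps only those wider than a page-global
statistic (here the median run length, a hypothetical stand-in for the actual
criteria), so that no stream size needs to be predefined.<p>
<pre>
# Sketch: find horizontal white streams in a binary page image
# (0 = background, 1 = ink) and keep only the wide ones as region
# boundaries, using a global statistic over all streams.

def white_streams(rows):
    """Return (start, length) of each maximal run of empty rows."""
    streams, start = [], None
    for i, row in enumerate(rows):
        if sum(row) == 0:                 # row contains no ink
            start = i if start is None else start
        elif start is not None:
            streams.append((start, i - start))
            start = None
    if start is not None:
        streams.append((start, len(rows) - start))
    return streams

def boundary_streams(rows):
    """Keep streams wider than the median stream: a page-global
    criterion, so no fixed size threshold is predefined."""
    streams = white_streams(rows)
    lengths = sorted(length for _, length in streams)
    median = lengths[len(lengths) // 2] if lengths else 0
    return [s for s in streams if s[1] > median]

page = [[0, 0], [1, 1], [0, 0], [0, 0], [0, 0], [1, 1], [0, 0]]
print(boundary_streams(page))   # the wide 3-row gap, not the 1-row gaps
</pre>
<p>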
The next step is to classify a region into one of the structural categories
such as text, line drawings, tables and photographs. The classification of a
region is performed by matching a set of features extracted from the region
against the predefined reference features of a category. New categories can be
added as long as they have distinctive features. The approach is therefore
flexible and able to handle various types of documents.<p>
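As a rough illustration of this step, the following Python sketch assigns a
region to the category with the nearest reference feature vector; the two
features shown (ink density and mean ink run length) are hypothetical stand-ins
for the actual feature set.<p>
<pre>
# Sketch: classify a region by matching its feature vector against
# predefined reference features, one per structural category.

REFERENCES = {            # hypothetical reference feature vectors:
    "text":       (0.10, 3.0),   # (ink density, mean ink run length)
    "line art":   (0.05, 1.5),
    "photograph": (0.45, 8.0),
}

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify(features):
    """Nearest reference category; a new category is added simply by
    inserting a distinctive reference vector into REFERENCES."""
    return min(REFERENCES, key=lambda c: distance(features, REFERENCES[c]))

print(classify((0.12, 2.8)))   # -> "text"
</pre>
<p>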

<p>
<img src="figures/srihari2.gif"><p>
<p>
<i>Figure 2. Components of an adaptive document understanding system.</i><p>
<p>
<b>Text Recognition Using IR Technology</b><p>
The recognition of word images is an approach to text recognition in which
images of entire words are transformed into their ASCII equivalents. Word recognition
algorithms are an alternative to traditional character recognition techniques
that rely on the segmentation of a word into characters. This is sometimes
followed by a postprocessing step that uses a dictionary of legal words to
select the correct choices.<p>
Errors in the output of a word recognition system can be caused by several
sources. When a noisy document image is input, the top choice of a word
recognition system may be correct only a relatively low percentage of the time.
However, the ranking of the dictionary may include the correct choice among its
top <i>N</i> guesses (<i>N</i>=10, for example) in nearly 100% of the cases.<p>
Solutions to improving the performance of a text recognition system have
utilized the context of the language in which the document was written. An
observation about context beyond the individual word level that is used here
concerns the vocabulary of a document.
<p>
<pre>
   <img src="figures/srihari3a.gif">         <img src="figures/srihari3b.gif">
</pre>
<p>
<pre>               (a)                            (b)              
</pre><i>Figure 3. Result of block segmentation. (a) Original document image. (b)
Regions located by the block segmenter.</i><p>
A methodology is proposed to predict the vocabulary of a document from its word
recognition decisions. The <i>N</i> best recognition choices for each word are
used in a probabilistic model for information retrieval to locate a set of
similar documents in a database. The vocabulary of those documents is then used
to select the recognition decisions from the word recognition system that have
a high probability of correctness. Those words could then be used as "islands"
to drive other processing that would recognize the remainder of the text. A
useful side effect of matching word recognition results to documents from a
database is that the topic of the input document is indicated by the titles of
the matching documents from the database. The algorithmic framework discussed
in this paper is presented in Figure&nbsp;4. Word images from a document are
input. Those images are passed to a word recognition algorithm that matches
them to entries in a large dictionary. <i>Neighborhoods</i>, or groups of words
from the dictionary, are computed for each input image. The neighborhoods
contain words that are <i>visually</i> similar to the input word images.<p>
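The simplified Python sketch below illustrates this framework: a toy document
database stands in for the probabilistic retrieval model, and the neighborhoods
are given directly rather than computed from word images.<p>
<pre>
# Sketch of the neighborhood-based framework: the neighborhoods of all
# word images are matched against a document database, and the
# vocabulary of the best-matching document selects the final decisions.
# The database and neighborhoods below are hypothetical; a real system
# would use a probabilistic retrieval model over many documents.

DATABASE = {
    "doc-ocr":   {"character", "recognition", "image", "text"},
    "doc-birds": {"feather", "migration", "nest"},
}

def predict_vocabulary(neighborhoods):
    """Score each document by how many neighborhoods its vocabulary
    can explain; return the vocabulary of the best match."""
    def score(vocab):
        return sum(1 for n in neighborhoods if vocab.intersection(n))
    best = max(DATABASE, key=lambda d: score(DATABASE[d]))
    return DATABASE[best]

def decide(neighborhoods):
    """From each neighborhood (ranked by visual similarity), keep the
    highest-ranked word that appears in the predicted vocabulary;
    such words are the high-confidence 'islands'."""
    vocab = predict_vocabulary(neighborhoods)
    return [next((w for w in n if w in vocab), n[0]) for n in neighborhoods]

neighborhoods = [["recoqnition", "recognition", "resignation"],
                 ["image", "inane", "mage"]]
print(decide(neighborhoods))   # ['recognition', 'image']
</pre>
<p>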
               
<img src="figures/srihari4.gif"><p>
<i>Figure 4. Word matching algorithm.</i><p>

<b>Tabulated Data Recognition and Interpretation</b><p>
Tables are used in documents as an effective means of communicating
attribute-value information pertaining to several data items (keys). The
spatial layout of these items communicates the desired associations. In many
cases, tables are intentionally composed so as to allocate the logical items
and the relationships among items into the physical layout structures.<p>
The goal of table understanding is to recognize the data items in the table and
to derive a logical interpretation of the table based on contextual information
of the data items and the physical layout of these items. The task of table
understanding can be distributed over three tightly coupled modules: (i)
alignment detection, (ii) layout understanding, and (iii) data entry
recognition.<p>
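As an illustration of alignment detection, the simplified sketch below groups
table cells into columns by their left x-coordinates; the fixed pixel tolerance
is a hypothetical stand-in for the actual criteria.<p>
<pre>
# Sketch of alignment detection: group the cells of a table into
# columns by clustering their left x-coordinates.

def detect_columns(cells, tolerance=10):
    """cells: list of (x, y, text). Returns columns as lists of text,
    built by grouping cells whose left edges roughly align."""
    columns = []                     # each column: [x_ref, [cells...]]
    for x, y, text in sorted(cells):
        for col in columns:
            if tolerance >= abs(col[0] - x):
                col[1].append((y, text))
                break
        else:
            columns.append([x, [(y, text)]])
    # sort each column's cells top to bottom and drop coordinates
    return [[t for _, t in sorted(col[1])] for col in columns]

cells = [(12, 0, "Item"), (110, 0, "Price"),
         (10, 20, "pen"),  (112, 20, "1.50")]
print(detect_columns(cells))   # [['Item', 'pen'], ['Price', '1.50']]
</pre>
<p>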
<b>Graph/Chart Recognition</b><p>
Input to this process is a graphic block and text from the caption. The
objective is two-fold: (i) classify the graphic block as one of several graphic
types (mechanical drawings, schematics, flow diagrams, intensity images, x-y
plots, bar charts, pie charts, etc.); and (ii) decompose the graphic block into a set
of primitives and capture the relationships between them (vectorization).
Classification of graphics is necessary to keep the graphics processing
streamlined and domain specific&nbsp;[5]. This knowledge can be obtained from
collateral textual sources like the caption or derived from preliminary image
processing as described below in brief.<p>
Graphics classification can be effectively performed using shape-based
heuristics&nbsp;[6]. A histogram of greyscale values can discriminate between
intensity images and binary line drawings. A pair of distinct peaks indicates
an intensity image. The presence of circles, boxes, and diamonds indicates a flow
chart or a schematic diagram. Further discrimination between classes can be
accomplished by detecting the presence of connecting lines between symbols. Global
properties like the location, length, and number of all the lines are also useful
indicators.<p>
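A simplified sketch of the histogram heuristic follows; the bin count and the
peak-height threshold are hypothetical choices, and the mapping from peak count
to class follows the rule stated above.<p>
<pre>
# Sketch: build a greyscale histogram for a block, count its distinct
# peaks, and classify per the rule stated in the text above.

def histogram(pixels, bins=8):
    counts = [0] * bins
    for p in pixels:                        # greyscale values 0..255
        counts[min(p * bins // 256, bins - 1)] += 1
    return counts

def count_peaks(counts, min_share=0.15):
    """A bin is a peak if it beats both neighbours and holds at least
    min_share of all pixels."""
    total = sum(counts) or 1
    padded = [0] + counts + [0]
    peaks = 0
    for i in range(1, len(padded) - 1):
        high = padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
        if high and padded[i] / total >= min_share:
            peaks += 1
    return peaks

def classify_block(pixels):
    peaks = count_peaks(histogram(pixels))
    return "intensity image" if peaks == 2 else "line drawing"

block = [10] * 40 + [245] * 60              # two distinct grey populations
print(classify_block(block))                # -> intensity image
</pre>
<p>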
<b>Multimodal Data Integration/Linking</b><p>
The task of multimodal data integration/indexing is to remove the physical
barrier imposed on the document content. It creates a logical interpretation of
the documents in the DL database by linking spatially and contextually
correlated document elements together. The hierarchy can be constructed by
following the hierarchical structure of the document description defined for
document understanding. Two objects will form a linkage when they satisfy
certain spatial or contextual constraints. For example, a spatial link is
formed between a photograph and the text below, and a contextual link is formed
between a paragraph of text which has an explicit figure callout (such as "see
Figure 1") and the photograph and caption which are indexed by that figure
number. There are two types of links: (i) generic links -- the links that apply
to all documents and (ii) specific links -- the links that are only applicable
to a particular type of document. Specific links must be predefined at the
knowledge acquisition stage.<p>
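As an illustration of contextual link formation, the simplified sketch below
detects explicit figure callouts such as the one quoted above; the callout
pattern and the link representation are hypothetical simplifications.<p>
<pre>
import re

# Sketch: form contextual links between paragraphs and figures by
# detecting explicit callouts such as "see Figure 1".

CALLOUT = re.compile(r"\b(?:see\s+)?Figure\s+(\d+)", re.IGNORECASE)

def contextual_links(paragraphs, figures):
    """paragraphs: list of strings; figures: {number: figure object}.
    Returns (paragraph_index, figure_number) pairs."""
    links = []
    for i, text in enumerate(paragraphs):
        for match in CALLOUT.finditer(text):
            number = int(match.group(1))
            if number in figures:
                links.append((i, number))
    return links

paragraphs = ["The layout is decomposed into regions (see Figure 1).",
              "Recognition results are discussed next."]
figures = {1: "photograph-and-caption object"}
print(contextual_links(paragraphs, figures))   # [(0, 1)]
</pre>
<p>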
Research will focus on developing techniques to automate the creation and
organization of linkages in a way that will facilitate effective information
retrieval. Research subtopics are: (i) topic categorization, (ii) picture-text
integration, and (iii) linkage detection.<p>
<b>Topic Categorization</b><p>
Categorization of the topic of a passage of text is an important part of
entering it in the database of the DL system. Given an input document image and
a fixed number of categories, the objective is to assign the document to the
category that it best matches. This problem is compounded by the imprecision
that is present in OCR results (word decisions offered by the OCR may not be
correct).<p>
The following solution is proposed (see Figure&nbsp;5). The top <i>N</i> decisions
for each word are computed by postprocessing the OCR results. These words are
then input to a keyword selection algorithm that chooses a fixed number of
words that characterize the topic of the text&nbsp;[7]. Those keywords are then
matched to lists of keywords that have been pre-compiled for each of the
categories. The output is a ranked list of a small number of categories that
contain the input document.
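<p>
A simplified Python sketch of the matching step follows; the categories and
their pre-compiled keyword lists are hypothetical.<p>
<pre>
# Sketch: score each category's pre-compiled keyword list by overlap
# with the keywords selected from the top-N OCR decisions, and return
# a small ranked shortlist of candidate categories.

CATEGORY_KEYWORDS = {
    "computing": {"algorithm", "recognition", "image", "database"},
    "medicine":  {"patient", "clinical", "dosage"},
    "finance":   {"market", "equity", "dividend"},
}

def categorize(keywords, shortlist=2):
    scores = {c: len(CATEGORY_KEYWORDS[c].intersection(keywords))
              for c in CATEGORY_KEYWORDS}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [c for c in ranked[:shortlist] if scores[c]]

doc_keywords = {"recognition", "image", "database", "noise"}
print(categorize(doc_keywords))   # ['computing']
</pre>
<p>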

<p>
<img src="figures/srihari5.gif">
<p>
<i>Figure 5. Topic categorization from OCR results.</i> <p>

Research issues are choosing the keywords from the OCR results and matching the
derived keywords to the category-specific keyword lists. Current work has shown
that a methodology based on the Salton method for keyword selection&nbsp;[8]
provides reliable performance in the presence of noise. This technique will be
extended and other algorithms for keyword selection will be tested.<p>
<b>Content-based Retrieval of Captioned Photographs</b><p>
This research explores the interaction of textual and photographic information
in document understanding. Information contained in both pictures and captions
enhances overall understanding of the accompanying text, and often contributes
additional information not contained specifically in the text. The research
described here is designed to be applied at document processing time, permitting
more intelligent retrieval of photographs than is currently available.
Specifically, the research has two main objectives: (i) natural language
analysis of the caption at document processing time designed to extract
significant attributes of the picture which can be used to automatically "tag"
the pictures, and (ii) content-based retrieval of photographs.<p>
<i>PICTION: A Caption-Based Face Identification System</i>: A prototype system,
PICTION&nbsp;[9,10], which understands photographs based on collateral text, has
been developed at CEDAR. More specifically, when given a text file
corresponding to a newspaper caption and a digitized version of the associated
photograph, the system is able to locate, label, and give information about
objects which are relevant to the communicative unit. A common representation
for the information content from both the picture and the caption is
employed.<p>
A key module in PICTION is the face locator&nbsp;[11], which locates human
faces in arbitrarily complex photographs. PICTION is noteworthy since it
provides a computationally less expensive alternative to traditional methods of
face recognition in situations where pictures are accompanied by descriptive
text. Traditional methods&nbsp;[12] employ model-matching techniques and thus
require that face models be present for all people to be identified by the
system; our system does not require this. In PICTION, spatial constraints
obtained from the caption, along with some key discriminating visual
characteristics (e.g., male/female, color of hair) are sufficient for
eliminating false face candidates and correctly identifying faces.
Figure&nbsp;6 is an example of a digitized newspaper photograph and
accompanying caption that PICTION can handle.<p>
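The following simplified Python sketch conveys how caption-derived constraints
can prune face candidates; the names, attributes, and constraint format are
hypothetical illustrations rather than the actual PICTION representation.<p>
<pre>
# Sketch of constraint-based identification: candidates from the face
# locator are ordered left to right, and caption-derived constraints
# (relative position plus a discriminating visual characteristic)
# pick out each named person.

candidates = [                       # stubbed output of the face locator
    {"x": 40,  "gender": "male"},
    {"x": 200, "gender": "male"},
    {"x": 350, "gender": "female"},
]

constraints = {                      # stubbed output of caption analysis
    "Sean Penn":     {"order": 0, "gender": "male"},   # ", left,"
    "Robert DeNiro": {"order": 1, "gender": "male"},
    "Toukie Smith":  {"order": 2, "gender": "female"},
}

def identify(candidates, constraints):
    faces = sorted(candidates, key=lambda f: f["x"])
    labels = {}
    for name, c in constraints.items():
        face = faces[c["order"]]
        if face["gender"] == c["gender"]:   # visual check prunes errors
            labels[name] = face["x"]
    return labels

print(identify(candidates, constraints))
# {'Sean Penn': 40, 'Robert DeNiro': 200, 'Toukie Smith': 350}
</pre>
<p>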
<b>Linkage Detection</b><p>
A critical aspect of the DL project research is data linking -- the linking
together of items within a document or between several different documents.
Links will be used by the system to facilitate user navigation through the
documents and to support system reasoning to retrieve items from the database.
Important research issues include: (i) what should be linked? (ii) using the
information from the DU module, caption understanding module, and topic
categorization module, how can link creation be automated as much as possible?
(iii) how can the links be grouped into concepts? and (iv) what relationships
should be maintained between concepts?
<pre>
    <img src="figures/srihari6a.gif">   <img src="figures/srihari6b.gif">
              (a)                         (b)

    <img src="figures/srihari6c.gif">   <img src="figures/srihari6d.gif">
              (c)                         (d)
</pre>
<i>Figure 6. (a) photograph with caption "Actors Sean Penn, left, and Robert DeNiro pose
with Toukie Smith, sister of the late fashion designer Willi Smith, at a New
York celebrity auction Sunday in memory of Smith" (The Buffalo News, Feb. 27,
1989); (b) output of face locator; (c,d) output of PICTION.</i><p>

The system cannot create and explicitly store all possible links between all
low level items (for example, all numbers in a table). The storage required
would be immense, many of the links would be rarely used, and processing time
in following useful links would be slower. Instead, the system will use
knowledge of structural units and a hierarchy of concepts to extract detailed
linking information.<p>
Three types of links have been identified: structural, temporal, and
conceptual. All three will be stored external to the document. <i>Structural
links</i> reflect structure in the original document and will be created from
the output of the DU module. A structural link may point from the ASCII
document text to a graphic located at that position in the original document.
Links will be created from the table of contents, section indices, section
headings, figures, figure references, figure captions, bibliographic references,
footnotes, and internal page references. A
<i>temporal link</i> connects items by time. For example, all documents in a
series produced over a five-year period will be connected by a temporal link.
Temporal links between whole documents are simple to implement once the time
relationship is known. A <i>conceptual link</i> will connect items containing
related information. Links may be internal to a document or may be between
documents. Conceptual links arise from understanding of the content (caption
understanding and topic characterization) and not exclusively from position in
a document.<p>
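As a simplified illustration, the sketch below stores typed links externally to
the documents as plain records; the record layout and the example links are
hypothetical.<p>
<pre>
from collections import namedtuple

# Sketch: links of all three types are stored externally to the
# documents as typed (source, target) records.

Link = namedtuple("Link", ["type", "source", "target"])

links = [
    Link("structural", ("doc1", "paragraph 7"), ("doc1", "figure 2")),
    Link("temporal",   ("doc1", None),          ("doc2", None)),
    Link("conceptual", ("doc1", "section 3"),   ("doc5", "section 1")),
]

def follow(links, source_doc, link_type=None):
    """Targets reachable from a document, optionally restricted to
    one link type."""
    return [l.target for l in links
            if l.source[0] == source_doc
            and (link_type is None or l.type == link_type)]

print(follow(links, "doc1", "conceptual"))   # [('doc5', 'section 1')]
</pre>
<p>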
<b>Information Retrieval</b><p>
The IR module is the mechanism by which a user accesses the DL. Intelligence of
the IR module is defined by its ability to retrieve correct information based
on imprecise and insufficient user descriptions as well as imperfect OCR
results in the document database. Retrieval systems can be evaluated by two
quality measures: <i>precision</i>, the proportion of retrieved documents which
are relevant, and <i>recall</i>, the proportion of relevant documents which are
retrieved. Research in IR will focus on two areas: (i) NLP techniques for query
analysis, and (ii) intelligent retrieval methods. Both of these will be
incorporated into a graphical, interactive user interface which assists the
user in accessing the DL. Experienced users will have the choice of entering
query phrases directly (consisting of search terms and boolean operators).
Other users will be able to express their queries in terms of English
sentences. Much of the research in intelligent IR assumes a text-only document
database. The proposed research will consider the additional requirements posed
by a multimodal (text, tables, graphics, photographs) database. Statistical
methods such as vector-based models are proposed for use in the retrieval
system. To date, very little improvement has been demonstrated by using NLP
methods during the retrieval stage.<p>
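The two quality measures defined above can be stated directly in code; the
short sketch below computes them for one query against a known relevant set.<p>
<pre>
# Sketch of the two evaluation measures for a single query: precision
# is the proportion of retrieved documents that are relevant; recall
# is the proportion of relevant documents that are retrieved.

def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved.intersection(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of 4 retrieved documents are relevant; they cover 3 of 6 relevant:
print(precision_recall(["d1", "d2", "d3", "d9"],
                       ["d1", "d2", "d3", "d4", "d5", "d6"]))
# (0.75, 0.5)
</pre>
<p>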
<b>Conclusions</b><p>
This paper presented three major processes which help to build a DL. Several
components of the system are shown to be useful for creating a DL database. The
design of these components stresses the importance of robustness and ease of
adaptation to the processing of different documents. An adaptive approach to
document understanding was presented in this paper. Its robustness was shown to
be crucial to success in processing varied library documents. This paper
also presented an adaptation of the vector space model for information
retrieval to improve the performance of the text recognition module. Output
generated by the document data capture and data integration/linking modules
enables the IR mechanism to produce intelligent responses to user
queries.<p>
<b>References</b><p>
[1]	S. N. Srihari. "From Pixels to Paragraphs: The Use of Contextual Models in
Text Recognition." <i>Proc. Second International Conference on Document
Analysis and Recognition</i>. Tsukuba Science City, Japan. October 1993.<p>
[2]	G. Nagy. "What Does a Machine Need to Know to Read a Document?" <i>Proc.
Symposium on Document Analysis and Information Retrieval</i>. Las Vegas, NV.
April 1992.<p>
[3]	S. W. Lam and S. N. Srihari. "Multi-domain Document Layout Understanding."
<i>Proc. First International Conference on Document Analysis and Recognition</i>.
Saint-Malo, France. September 1991.<p>
[4]	S. W. Lam and S. N. Srihari. "Frame-based Knowledge Representation for
Multi-domain Document Layout Analysis." <i>Proc. IEEE International Conference
on Systems, Man, and Cybernetics</i>. Charlottesville, VA. October 1991.<p>
[5]	S. N. Srihari. "Document Image Understanding." <i>Proc. IEEE Fall Joint
Computer Conference</i>. Dallas, TX. 1986.<p>
[6]	R. Kasturi, R. Raman, C. Chenubhotla and L. O'Gorman. "Document Image
Analysis: An Overview of Techniques for Graphic Recognition." <i>Proc. IAPR
Workshop on Syntactic and Structural Pattern Recognition</i>. Murray Hill, NJ.
July 1990.<p>
[7]	J. J. Hull and Y. Li. "Interpreting Word Recognition Decisions with a
Document Database Graph." <i>Proc. Second International Conference on Document
Analysis and Recognition</i>. Tsukuba Science City, Japan. October 1993.<p>
[8]	G. Salton. <i>Automatic Text Processing</i>. Addison-Wesley. 1988.<p>
[9]	R. K. Srihari. "PICTION: A System that Uses Captions to Label Human Faces
in Newspaper Photographs." <i>Proc. 9th National Conference on Artificial
Intelligence</i>. Anaheim, CA. 1991.<p>
[10]	R. K. Srihari. "Intelligent Document Understanding: Understanding Photos
with Captions." <i>Proc. Second International Conference on Document Analysis
and Recognition</i>. Tsukuba Science City, Japan. October 1993.<p>
[11]	V. Govindaraju, S. N. Srihari and D. B. Sher. "A Computational Model for
Face Location Based on Cognitive Principles." <i>Proc. 10th National Conference
on Artificial Intelligence</i>. San Jose, CA. 1992.<p>
[12]	R. Weiss, L. Kitchen and J. Tuttle. "Identification of Human Faces Using
Data-driven Segmentation, Rule-based Hypothesis Formation and Iterative
Model-based Hypothesis Verification." University of Massachusetts at Amherst,
COINS Technical Report 86-53. 1986.

<!--#include virtual="/DL94/footer.ihtml" -->
Last Modified: <!--#echo var="LAST_MODIFIED" --> <br>

</body>
</html>

