James Fullton
Clearinghouse for Networked Information Discovery and Retrieval (CNIDR)
P.O. Box 12889, RTP, NC 27709
Tel: 1-919-248-9247
E-mail: jmf@cnidr.org
KEYWORDS: spatial information, geographic information, search and retrieval, Z39.50, data discovery
Traditional library catalogs contain references to documents held in the library in conventional formats. Automated card catalogs and related search and retrieval systems have provided the ability to find documents based on field search of data elements such as title, author, and subject as per existing subject heading systems. Classification systems, by their nature, seek to organize information into hierarchies that allow users to search on more general or more specific terms to discover documents held by the library. A document may also have several subject headings associated with it because most information can be categorized or approached by different disciplines in different ways.
Representative geographic locations of documents, for the purposes of cataloging and retrieval, are not amenable to cataloging within a hierarchical classification system, except where documents are referenced to well-defined political boundaries. Whereas one can conveniently subdivide the Earth into a set of (sometimes disputed) nested political subdivisions, this organization may not necessarily be appropriate for information discovery by the earth scientist, oceanographer, or climatologist interested in a user-defined region that has not been formally classified.
Geographic location can be described using mathematical constructs defining position with respect to an origin using latitude and longitude as measured in degrees away from the origin. Point locations on the Earth and other bodies can be described reliably with such a common spatial reference system. Moreover, bounding rectangles and complex chains of coordinates can be developed in this coordinate space to circumscribe edges or `footprints' of the coverage of a map, digital data set, or even an environmental publication that references a specific geography. The search of relevant documents with complex geographic footprints has been an operation traditionally restricted to geographic information systems (GIS) software. Its incorporation into the indexing and search capabilities of a digital library is a logical extension to accommodate geographically referenced data of all types.
There are a number of efforts on the Internet to catalog digital spatial data -- either as maps (pictures) that may be printed or as digital spatial data sets that may be loaded into a desktop mapping system or GIS [1]. Most of these activities communicate through the GeoWeb project supported by the U.S. Bureau of the Census and hosted by the State University of New York at Buffalo. GeoWeb, through on-line World-Wide Web pages linking to its participants and an Internet `list server' that reflects mail messages to all subscribers, provides a forum to discuss and implement geographic information retrieval systems on the Internet. Services such as the Virtual Tourist and an extended interface to the Xerox Map Browser allow Internet users to generate simple maps for immediate display. Other systems, such as those of the U.S. Environmental Protection Agency and the National Oceanic and Atmospheric Administration's Geophysical Data Center, provide a point-and-click interface for the public to download large volumes of spatially-referenced data through custom interfaces.
As the volume of digital geographic information and the number of information providers on the Internet increase, the ability of a user to discover, evaluate, and download appropriate information is greatly hindered. The hypertext paradigm of the Internet allows every site to organize and connect its holdings to other holdings in an arbitrary way, making browsing for specific data an unpredictable undertaking. Indexes of the entire World-Wide Web, such as the Lycos system offered by Carnegie Mellon University, provide some means to identify data, but are limited to pure text searching and cannot search for information with spatial or temporal extent in a consistent way. A systematic approach to serving geographic information on the Internet is required.
The Alexandria Digital Library (ADL), a project funded by the National Science Foundation Digital Libraries Initiative, is developing a comprehensive digital library capability for the Map and Image Library at the University of California, Santa Barbara. The project includes applied research on spatial data cataloging, scanning and metadata creation (ingest), data compression and enhancement, search and on-line service of raster and vector data for local and, eventually, remote data repositories. Providing a means to search and retrieve data on text and spatial characteristics is a high priority for the project.
The Federal Geographic Data Committee (FGDC) has been developing a spatial data clearinghouse capacity over the past year. Member agencies are encouraged to develop metadata records, serve these records as searchable documents on the Internet, and link the records to on-line stores of digital spatial data, where available. The ability to search multiple data servers for data sets that are spatially relevant is a key element for success of this distributed clearinghouse concept.
There are a number of protocols relevant to the service of digital spatial information on the Internet. These include markup and cataloging conventions and data service protocols used by libraries and the wider Internet community.
The U.S. Library of Congress, through its MAchine-Readable Catalog (USMARC), implements a storage and classification system that provides for human-readable and machine-searchable characteristics of catalogued documents [2]. Where relevant, library holdings with an explicit geographic reference (e.g. `Geologic Map of the Golden Quadrangle, Jefferson County, Colorado') are catalogued using the USMARC Geographic Subject Subdivision (USMARC 65x subfields a and z, with searchable element 052). Additionally, maps in a catalog will be coded with the bounding latitude and longitude coordinates using the searchable USMARC field 034 subfields e,f,g,h (coordinates) and a human-readable counterpart, field 255 subfield c. The bounding latitude and longitude fields define a bounding rectangle that encloses the area of interest. These coordinate fields provide a limited capability for the description of map information but are not customarily applied to non-map data. In addition these coordinate fields do not provide for the encoding of complex geographic footprints (e.g. river basins, congressional districts, study areas) that describe the true, searchable extent of digital spatial data and related reports.
The FGDC, through Executive Order 12906 [3] signed by President Clinton in April 1994, has directed all federal agencies to use the `Content Standards for Digital Geospatial Metadata' (CSDGM) -- a federally-developed standard to establish a formal vocabulary for digital spatial data set descriptions. Among the approximately 300 data elements described in this standard are a set of bounding coordinates that correspond to the USMARC subfields given above, and the ability to encode one or more coordinate chains that describe the true footprint of a digital map document. Although the CSDGM only requires the bounding rectangle of a document to be recorded, it provides for the more complex footprints to be stored and searched for spatial relevance.
The bounding rectangle used in both USMARC and CSDGM is defined by bounding lines of latitude and longitude which makes it useful for describing many traditional maps, such as topographic quadrangles, that also follow such lines. Aerial photography, satellite imagery, and data sets whose edges are defined by political or other application-defined boundaries are examples of information whose footprints can be approximated through the use of a bounding rectangle but are more truthfully represented by a complex footprint defined by many points.
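The relationship between a complex footprint and its bounding rectangle can be sketched in a few lines of Python. This is only an illustration: the function names, the triangular footprint, and the planar "square degree" area are assumptions made here, not part of USMARC or the CSDGM, and the sketch ignores footprints that cross the 180-degree meridian.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in decimal degrees


def bounding_rectangle(footprint: List[Point]) -> Dict[str, float]:
    """Return the extreme coordinates of a footprint chain."""
    if not footprint:
        raise ValueError("footprint must contain at least one point")
    lats = [lat for lat, _ in footprint]
    lons = [lon for _, lon in footprint]
    return {"west": min(lons), "east": max(lons),
            "south": min(lats), "north": max(lats)}


def planar_area(footprint: List[Point]) -> float:
    """Shoelace area of the chain in square degrees (illustration only)."""
    total = 0.0
    closed = footprint[1:] + footprint[:1]  # pair each vertex with the next
    for (lat1, lon1), (lat2, lon2) in zip(footprint, closed):
        total += lon1 * lat2 - lon2 * lat1
    return abs(total) / 2.0


# Hypothetical triangular footprint, invented for this example.
triangle = [(35.0, -83.0), (36.0, -83.0), (35.0, -84.0)]
rect = bounding_rectangle(triangle)
rect_area = (rect["east"] - rect["west"]) * (rect["north"] - rect["south"])
fill_ratio = planar_area(triangle) / rect_area  # share of the rectangle covered
```

In this made-up case the footprint fills only half of its bounding rectangle, so a rectangle-only search could report the document as relevant for a query region it never touches.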
The World-Wide Web uses the Hyper-Text Markup Language (HTML) as the primary method for document linkage and presentation on the Internet. HTML is a simple, but still unofficial, subset of the Standard Generalized Markup Language (SGML) with adequate functionality to display text documents with in-line graphics. Both text and graphics can be used in HTML as a hyper-text link to another place in the current document or to another document on the Internet. In-line graphics, known as imagemaps, allow the user to click on regions within the bitmap and traverse a link to a specific document. This interface has been demonstrated as a retrieval mechanism for individual maps by a number of organizations, including the U.S. Geological Survey.
The American National Standards Institute Z39.50-1992 standard is being used within the library community for catalog and document search and retrieval [4]. The Z39.50 standard provides for the use of common attribute sets whose use and operations are well-known to both client and server. The latest version of the standard also allows a server to `explain' its searchable attributes and operators to a client to permit an intelligent query of non-common attributes. A geographic data profile (GEO) is being defined by the FGDC to incorporate the data elements of the CSDGM including bounding coordinate and footprint fields and is being implemented in a freely-available Z39.50 server (I-Site) developed by the Clearinghouse for Networked Information Discovery and Retrieval (CNIDR) in Research Triangle Park, North Carolina.
CNIDR was formed in 1992 through a three-year grant from the National Science Foundation to sponsor a development center for wide-area network search and retrieval software. Initially proposed as a maintainer for the public-domain version of the Wide-Area Information Server (WAIS) software, CNIDR saw its scope expanded to focus on the integration of the various Internet access protocols (ftp, Archie, Gopher, World-Wide Web, and WAIS). Commercial and public-domain versions of the WAIS software are based on the 1988 version of the Z39.50 standard. The 1988 version is limited to free-text search of documents, whereas the 1992 version of the standard supports fielded search. CNIDR developed a series of public-domain releases of the WAIS software known under the freeWAIS name, versions 0.1, 0.2, and 0.3.
Figure 1. Configuration of I-Site Z39.50-compliant software developed by the Clearinghouse for Networked Information Discovery and Retrieval.
In 1994 CNIDR released server software that supports the Z39.50-1992 standard and an Application Programming Interface (API) that permits users to integrate the search engine or database of choice with the information server process. This ZDist software is a dramatic departure from the tightly coupled index and search provided through freeWAIS, allowing for extensibility as well.
The I-Site package was developed in late 1994 to include the ZDist server, a World-Wide Web (WWW) gateway, the search API, and a text search engine known as ISearch. Together these provide a complete information service that is accessible to Z39.50 clients and WWW clients such as Mosaic and Netscape without requiring a commercial database. Users requiring special search engines or databases can incorporate them, replacing the default search engine through use of the search API. The I-Site software package is described on the Internet at the URL:
http://vinca.cnidr.org/software/Isite/Isite.html
and may be downloaded using anonymous ftp to the following location:
ftp://ftp.cnidr.org/pub/NIDR.tools/Isite/
in which executables for SunOS, Ultrix, Solaris, OSF, and Linux are available. Source code is available from the same directory for other platforms. I-Site is written in C++ for g++, the GNU C++ compiler, which is required to compile it on platforms not listed above.
The configuration of the I-Site software suite is illustrated in figure 1, including its interaction with a WWW server and multiple client types. The I-Site software includes the Zclient gateway, the Zdist server, and the search Application Programming Interface (API). Data are indexed using default free-text indexing (I-Index), an external database management system, or another search engine supplied by the user. The configuration allows multiple search engines -- including one being developed for spatial search -- to be coupled to perform a search. The server is connected to the search and retrieval systems through the search API.
The search API currently supports free-text indexing and search of text documents (Isearch) and a command-line based search protocol (Script) that allows one to define a search script to pass along query terms and perform a retrieval from a database or other organized collection of information. A simple C-based API for direct software integration is available for these basic functions (sapi.c) to enable programmers to make direct connections into databases that have embedded C interfaces.
The Zserver software, the core of I-Site, is a Z39.50-1992 service implementation that is designed to accept a request from a Z39.50 client, translate the search request through the search API to one or more local or remote stores of information, and return a list of relevant documents. These documents may be returned as Simple Unstructured Text Record Syntax (SUTRS) records, as Generic Record Syntax (GRS) records -- a way to `wrapper' data objects -- or as USMARC records.
Client access provided with the I-Site package includes a Zclient query program that can be used in building other interfaces or can be incorporated into a WWW server as a gateway script. Zclient is not an interactive client but can be used by programmers as an example of how the Z39.50 client library can be used. With this gateway installed, forms can be written in Hyper-Text Markup Language (HTML) to customize a WWW query interface. I-Site also supports Z39.50-1992 clients such as Willow, available from the University of Washington, and, through an integrated gateway, clients using forms-capable WWW browsers. An interface has also been provided to accept formatted requests using an electronic mail gateway for users without WWW or Z39.50 clients.
Z39.50-1992 supports a series of implementation profiles; the most commonly used profile is `bib-1' -- a field-level definition for cataloging of bibliographic entries. A profile includes a set of numbered attributes (field-like constructs) that may be queried, along with the operations or characteristics that apply to each attribute. This set of attributes may be registered as part of the standard to ensure that implementors can support well-known set objects in server and client software. Once an entry, or document, reference is located by a query the user may retrieve the document in one of several formats. The structural contents of a document, within a given profile, are defined by a schema and can be used within the server to convert documents from one format to another.
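The idea of numbered attributes can be sketched as a small lookup table. The Use and Relation numbers below are the registered bib-1 values as best understood here; the dictionary layout and the `attribute_term` helper are hypothetical and do not reflect the I-Site API or the wire encoding of a Z39.50 query.

```python
from typing import Dict, Union

# A few registered bib-1 Use attributes (attribute type 1).
BIB1_USE: Dict[str, int] = {"title": 4, "subject": 21, "author": 1003, "any": 1016}

# bib-1 Relation attributes (attribute type 2).
BIB1_RELATION: Dict[str, int] = {"<": 1, "<=": 2, "=": 3, ">=": 4, ">": 5}


def attribute_term(field: str, term: str,
                   relation: str = "=") -> Dict[str, Union[int, str]]:
    """Pair a search term with the numbered attributes that qualify it.

    Hypothetical helper: it only shows that client and server agree on
    numbers, not on field names.
    """
    if field not in BIB1_USE:
        raise KeyError(f"no bib-1 Use attribute known for field {field!r}")
    if relation not in BIB1_RELATION:
        raise KeyError(f"unknown relation {relation!r}")
    return {"use": BIB1_USE[field],
            "relation": BIB1_RELATION[relation],
            "term": term}


title_query = attribute_term("title", "climate")
```

Because both sides refer to attribute 4 rather than to a locally chosen field name, a client written for one server can query a title field on another without prior arrangement.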
A prototype spatial search system was integrated into a version of the public-domain WAIS (version 8-b5.1) software in 1992 by CNIDR to index and retrieve documents based on text and spatial characteristics. The indexing routine was modified to recognize a string construct in text documents being indexed that contained a series of ordered coordinates defining a bounding chain, or footprint, of the document. All other words were indexed as searchable text in the dictionary. A mixed query using words and a spatial term would be processed such that documents were first ranked based on the word score (the default behavior in this version of WAIS) and then separately flagged based on spatial relevance with respect to the search area. The two document scoring arrays were multiplied together to present a final set of relevant documents to the user -- those documents that had certain words and fell within the search region. A query in the general-purpose text window would have the form:
pipelines or roads inside(35,-83 36.2,-83.4 35.4,-83.8 35,-83)
where the term `inside' was used to declare the string of latitude and longitude coordinates within parentheses that define the search footprint.
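The mixed query handling described above can be sketched as follows. This is a minimal illustration, not the WAIS code: the function names, the regular expression, and the sample scores are assumptions, and spatial relevance is reduced to a 0/1 flag per document.

```python
import re
from typing import List, Tuple

Point = Tuple[float, float]  # (latitude, longitude)

# Matches the spatial term and captures the coordinate chain inside it.
_INSIDE = re.compile(r"inside\(([^)]*)\)")


def parse_query(query: str) -> Tuple[str, List[Point]]:
    """Split a mixed query into its text part and its coordinate chain."""
    match = _INSIDE.search(query)
    if match is None:
        return query.strip(), []
    chain = []
    for pair in match.group(1).split():
        lat, lon = pair.split(",")
        chain.append((float(lat), float(lon)))
    text = (query[:match.start()] + query[match.end():]).strip()
    return text, chain


def combine_scores(word_scores: List[float],
                   spatial_flags: List[int]) -> List[float]:
    """Multiply the two per-document arrays element by element."""
    if len(word_scores) != len(spatial_flags):
        raise ValueError("arrays must describe the same documents")
    return [score * flag for score, flag in zip(word_scores, spatial_flags)]


text, chain = parse_query("climate inside(60,-150 62,-150 62,-145 60,-150)")
final = combine_scores([0.8, 0.5, 0.0], [1, 0, 1])
```

Multiplication acts as a logical AND here: a document keeps a nonzero score only if it matched the words and was flagged as spatially relevant.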
To avoid forcing the user to enter latitude and longitude values by hand, a map query tool was added to WAIS clients for the Windows, Macintosh, and X-Windows environments. This interface enabled the user to enter search points or regions graphically against an orthographic map of the world, and the software pasted the coordinate string into the text query window in the above format.
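How such a tool might turn mouse clicks into a coordinate string can be sketched as below. The sketch deliberately assumes a simple equirectangular world image (longitude and latitude scale linearly with pixel position) rather than the orthographic map used by the actual clients, and the function names and image size are hypothetical.

```python
from typing import List, Tuple


def pixel_to_latlon(x: int, y: int, width: int, height: int) -> Tuple[float, float]:
    """Convert a pixel on an equirectangular world image to (lat, lon).

    Assumes the image spans -180..180 degrees left to right and
    90..-90 degrees top to bottom.
    """
    lon = x * 360.0 / width - 180.0
    lat = 90.0 - y * 180.0 / height
    return lat, lon


def build_inside_term(pixels: List[Tuple[int, int]], width: int, height: int) -> str:
    """Format clicked pixels as an `inside(...)` term for the text query window."""
    points = [pixel_to_latlon(x, y, width, height) for x, y in pixels]
    if points and points[0] != points[-1]:
        points.append(points[0])  # close the chain on its starting point
    body = " ".join(f"{lat:g},{lon:g}" for lat, lon in points)
    return f"inside({body})"


# Three hypothetical clicks on a 360 x 180 pixel map (one pixel per degree).
term = build_inside_term([(97, 55), (96, 54), (96, 55)], 360, 180)
```

The resulting string can be appended to the words typed by the user, so the server sees the same query format whether the coordinates were typed or drawn.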
This prototype system worked well for small collections of geographic footprints, particularly those with a convex footprint. The use of a concave search or target footprint would yield unpredictable results because of the polygon overlay algorithm used. Also, the prototype software compared every target footprint with the search region, which worked reasonably well on small collections but took a very long time on large collections. This serial, non-indexed search implementation was not suited to collections of more than hundreds of documents. As a result of the unpredictable spatial search behavior and its limited scalability, these features were not incorporated into the general distribution of freeWAIS.
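Whether a footprint is convex can be checked before it is handed to an overlay routine that only behaves predictably for convex shapes. The sketch below is not part of the prototype; it is a standard cross-product test, assuming a simple (non-self-intersecting) chain of (latitude, longitude) points treated as planar coordinates.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (latitude, longitude)


def is_convex(footprint: List[Point]) -> bool:
    """True if the chain turns in the same direction at every vertex."""
    pts = list(footprint)
    if len(pts) > 1 and pts[0] == pts[-1]:
        pts = pts[:-1]  # drop the repeated closing point
    n = len(pts)
    if n < 3:
        return False
    sign = 0
    for i in range(n):
        (y0, x0), (y1, x1), (y2, x2) = pts[i], pts[(i + 1) % n], pts[(i + 2) % n]
        # z-component of the cross product of two consecutive edges
        cross = (x1 - x0) * (y2 - y1) - (y1 - y0) * (x2 - x1)
        if cross != 0:
            turn = 1 if cross > 0 else -1
            if sign == 0:
                sign = turn
            elif turn != sign:
                return False  # the chain bends both ways: concave
    return sign != 0


square = [(0, 0), (0, 2), (2, 2), (2, 0)]
l_shape = [(0, 0), (0, 4), (1, 4), (1, 1), (4, 1), (4, 0)]
```

A server could use such a test to route concave footprints to a more general algorithm, or at least to warn that results may be unreliable.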
In 1994 the Informatics Department at the University of Dortmund in Germany released an enhanced version of the freeWAIS product called freeWAIS-sf. This software added the ability to index discoverable portions of text documents as fields for direct query. Field types include text, date, and numeric data and permit queries more like those associated with a database than with free-text search of entire documents, which is still supported. The FGDC adopted the freeWAIS-sf software for testing within the Clearinghouse and defined four consistently-named fields for the bounding coordinates to be used in Clearinghouse servers (ebndgcoord, nbndgcoord, wbndgcoord, and sbndgcoord for the East, North, West, and South-bounding coordinates, respectively). Through use of these coordinates and an intelligent entry form, users can specify a search rectangle and quickly identify target documents whose rectangles overlap, include, or are included within the search rectangle. The search is conducted as a single text query whose compound expression uses `greater-than' and `less-than' constructs to rapidly find the targets against the indexed fields. Because freeWAIS-sf (like all other freely-available WAIS-derivative software) was not based on the current version of the Z39.50 protocol, an alternate solution was sought: one that would provide interoperability with other Z39.50 services and take advantage of new service features.
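The comparisons behind such a rectangle search can be sketched with the four Clearinghouse field names. Only the field names come from the text; the functions, the sample rectangles, and the result labels are assumptions for illustration, and the sketch does not reproduce the freeWAIS-sf query syntax or handle rectangles that cross the 180-degree meridian.

```python
from typing import Dict

Rect = Dict[str, float]  # keys: wbndgcoord, ebndgcoord, sbndgcoord, nbndgcoord


def overlaps(search: Rect, target: Rect) -> bool:
    """True if the two rectangles share any area or edge."""
    return (target["wbndgcoord"] <= search["ebndgcoord"]
            and target["ebndgcoord"] >= search["wbndgcoord"]
            and target["sbndgcoord"] <= search["nbndgcoord"]
            and target["nbndgcoord"] >= search["sbndgcoord"])


def contains(outer: Rect, inner: Rect) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return (outer["wbndgcoord"] <= inner["wbndgcoord"]
            and outer["ebndgcoord"] >= inner["ebndgcoord"]
            and outer["sbndgcoord"] <= inner["sbndgcoord"]
            and outer["nbndgcoord"] >= inner["nbndgcoord"])


def classify(search: Rect, target: Rect) -> str:
    """Name the relationship of a target rectangle to the search rectangle."""
    if not overlaps(search, target):
        return "disjoint"
    if contains(target, search):
        return "target includes search"
    if contains(search, target):
        return "target within search"
    return "partial overlap"


def rect(w: float, e: float, s: float, n: float) -> Rect:
    return {"wbndgcoord": w, "ebndgcoord": e, "sbndgcoord": s, "nbndgcoord": n}


search = rect(-170, -130, 54, 72)  # hypothetical search rectangle
```

Each test is a conjunction of four inequalities, which is why it maps naturally onto a compound `greater-than' and `less-than' expression against indexed numeric fields.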
Work being conducted by CNIDR for both the U.S. Geological Survey and National Aeronautics and Space Administration (NASA) indicated a need to extend the I-Site Z39.50-1992 software suite to include a basic spatial search engine. Both organizations have large collections of documents and digital data sets that have a defined geographic extent. As a testbed, a collection of several thousand NASA data set descriptions in Directory Interchange Format (DIF) was extracted from the NASA Master Data Directory and indexed for search in I-Site using a subset of the bib-1 registered attributes so they could be accessed by Z39.50 clients commonly used in the library environment. The bounding coordinates described in the DIF files did not have equivalent attributes in bib-1, so elements from the draft GEO profile of Z39.50 were used instead.
The collection was indexed using a parser provided with the I-Site software to recognize the location of the bibliographic and coordinate fields in the target collection and produce a searchable index that can be accessed using Z39.50 clients. A query form was generated in HTML to collect general text and field (attribute) query including the spatial coordinates and selection of a spatial operator to consider in the search as shown in figure 2.
In this example, a user is searching for all data sets using climate as a topical search term within the full document (DIF Full Text) and a bounding rectangle set of search coordinates, as entered under the Spatial Search Parameters. Only those documents whose footprint overlaps the query region will be returned. The first 15 documents selected will be provided to the user in the form of `headlines' or document titles.
Figure 2. HTML form interface to the NASA DIF collection viewed in NCSA X-Mosaic.
Figure 3 illustrates the result set returned to the user from the query. Several global data set references and one Alaskan weather reference were found as a result of this query. Clicking the highlighted `Full' hypertext marker will retrieve the document in its full form. Yet to be implemented are summary records (a subset of all attributes) or variants such as a USMARC representation of the DIF entries.
Thus far, the prototype has demonstrated the use of bounding coordinates in an approach similar to that taken in the freeWAIS-sf implementation used in the FGDC Spatial Data Clearinghouse. A library of spatial processing routines has been acquired from the Defense Intelligence Agency for use in the FGDC effort. It includes indexing and the processing of point-in-polygon and polygon overlap, even in mathematically difficult regions such as the poles and 180 degrees longitude. CNIDR and the FGDC will be working on integrating this indexing code into the I-Site implementation in the near future to provide robust spatial data search for documents with rectangular, concave, or convex polygon footprints. This software will be available for use by the Clearinghouse and the general public by mid-summer 1995.
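Why 180 degrees longitude is difficult can be seen with bounding rectangles alone. The sketch below is not the acquired library; it assumes the convention that a rectangle whose west coordinate is greater than its east coordinate crosses the 180-degree meridian, and splits such a rectangle into two ordinary longitude intervals before comparing. Function names and sample rectangles are hypothetical.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (west, east, south, north)


def split_at_dateline(west: float, east: float) -> List[Tuple[float, float]]:
    """Return longitude intervals that do not cross 180 degrees."""
    if west <= east:
        return [(west, east)]
    # Assumed convention: west > east means the rectangle wraps around.
    return [(west, 180.0), (-180.0, east)]


def lon_overlap(w1: float, e1: float, w2: float, e2: float) -> bool:
    """True if any pair of split longitude intervals intersects."""
    return any(a_w <= b_e and b_w <= a_e
               for a_w, a_e in split_at_dateline(w1, e1)
               for b_w, b_e in split_at_dateline(w2, e2))


def rect_overlap(a: Rect, b: Rect) -> bool:
    """Overlap test that tolerates rectangles crossing 180 degrees."""
    lat_ok = a[2] <= b[3] and b[2] <= a[3]
    return lat_ok and lon_overlap(a[0], a[1], b[0], b[1])


wrapping = (170.0, -160.0, 50.0, 60.0)    # spans the 180-degree meridian
east_side = (-179.0, -170.0, 51.0, 55.0)  # lies just east of it
far_away = (0.0, 10.0, 51.0, 55.0)
```

A plain comparison of west and east values would report no overlap between the first two rectangles, because 170 is not less than -170, even though one lies inside the other on the globe. Polar regions need separate treatment that this sketch does not attempt.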
Figure 3. The set of documents returned for the climate query for Alaska.
The forms-based interface to the I-Site server allows one to find information using several information fields in a format similar to that used to access a relational database. Interfaces to information collections that include spatially-referenced documents would benefit from having a map-based interface. Research into more complex mechanisms to visualize documents in multiple topical and temporal dimensions is underway [5], but HTML does not offer the complexity and versatility needed for more advanced spatial and temporal searches. At present, even the imagemap linkages are restricted to a single click. This precludes a user from defining a complex search region with multiple points such as a polygon, rectangle, or circle -- basic features of a geographic user interface required by ADL and other projects.
Inclusion of a geographic query tool is being considered in two forms by the HTML developer community. Within HTML 3.0 (in draft)[6] is a feature called the `scribble widget' that allows the user to enter many coordinates over an existing bitmap and forward the coordinates to the server to perform an action such as data retrieval. This would allow the unmodified WWW client to access and interact with spatial information in a more sophisticated way. A second method of providing a WWW client with a map interface would be the inclusion of a geographic `helper application' that would display geographic information and allow for the preparation of a geographic query similar to the map query tool built within the prototype spatial WAIS software. Such an application would be launched when a spatial data file is received or a special instruction is given by the client.
The scribble widget option places most of the control and query burden on the server, whereas the client-side helper application lets the client do more of the interface work. For information providers it may be more desirable to focus resources on the development of robust servers rather than worry about both client and server software development and support.
The development of a very large number of spatial data services on the Internet -- either WWW or Z39.50 servers using a common protocol -- will at some point make gateways and referral services a bottleneck. The Harvest System from the University of Colorado employs the concept of automated information brokers that search the Internet for information resources and summarize the information for more rapid and relevant retrieval without placing the burden on a single computer or index[7]. Use of a system such as Harvest, that is not restricted to a specific information protocol, with well-known spatial and temporal attributes could complement the development of a network of Z39.50 servers with a high degree of interoperability. The success of any attempt to federate digital spatial information will require agreement on the searchable attributes to be posted to the Internet -- a task being undertaken by the FGDC.
Indexing of information of geographic interest by bounding coordinates is not commonly done for non-map data. The use of flexible, freely-available software that uses the Z39.50-1992 search and retrieve standard makes such indexing possible. As more digital spatial information, reports, and reconnaissance data come on-line, it is necessary to provide reliable means of accessing them without being restricted to a geographic place names hierarchy.
The prototype spatial search demonstrated in this paper provides examples of how conventional stores of catalog information from a non-library setting can be indexed and presented using known Z39.50 attribute tags including elements that describe spatial characteristics of target data sets. Although only rectangular search has been demonstrated to date, the spatial data are accessible through standard Z39.50 clients and WWW clients. Spatial search capability will be provided with the I-Site software as part of the search engine toolbox.