<html>
<head>
<title>
DL94: Viewing the U.S. Government Budget as a Digital Library
</title>
</head>

<body>

<!--#include virtual="/DL94/header.ihtml" -->

<h1>Viewing the U.S. Government Budget as a Digital Library</h1><p>
<p>
R. L. Grossman[1], A. Sundaram[1], H. Ramamoorthy[1], M. Wu[1], S. Hogan[2],
J. Shuler[2], and O. Wolfson[1]<p>
<i>
[1] Laboratory for Advanced Computing, University of Illinois at Chicago, Dept. of
Mathematics, Statistics and Computer Science, 851 S. Morgan Street, 322 SEO
mail code 249, Chicago, IL 60607-7045, nisp@math.uic.edu<p>
[2] Main Library, University of Illinois at Chicago, 851 S.
Morgan Street,  1-280 LIB, mail code 234, Chicago, IL 60607-7045 </i><p>
<p>
<p>
<p>
<b><p>
Abstract</b><p>
We developed a prototype of a digital library designed to browse, query, mine
and visualize large amounts of scientific, numerical and statistical data. The
system currently provides access to the U.S. Government Budget for FY93, 94 and
95. Our point of view is to regard the data as collections of objects
distributed over a wide area network. We manage the objects using a high
performance, low overhead object manager we have developed called ptool. Ptool
interfaces to a hierarchical storage system including tape to provide the
potential of accessing terabyte size data sets. The system caches, migrates and
replicates collections of objects over a wide area network to achieve higher
performance. We have also developed specialized tools to query, analyze, mine,
and visualize the data. Additional economics and statistics data should be
available through the system soon.<b><p>
<p>
Keywords:</b> Digital library, object manager, scientific &amp; statistical
database, visualization.<b><p>
<p>
<p>
<p>
1.  Introduction</b><p>
We describe a prototype digital library that we developed to
browse, query, mine and visualize large amounts of scientific, numerical and
statistical data. The prototype exploits a hierarchical storage system
including tape to provide the potential of accessing terabyte size data sets.
We view the data as collections of objects distributed over a wide area
network; we use low overhead, high performance persistent object stores to
access the data; we cache, migrate, and replicate collections of objects over a
wide area network to achieve higher performance;  and we query, analyze, mine,
and visualize the data with a suite of modular software tools. Further details
are in [7].<p>
Our prototype provides distributed access to the U.S. Government Budget for FY
93 and FY 94. FY 95 data will be available shortly. The budget for each fiscal
year contains approximately 5000 tables and 50,000 line items, as well as a
modest amount of accompanying text.  The challenge was to provide distributed
access and analysis tools for tabular data of this type.  A typical query
retrieves all line items containing the keyword "research" in which fiscal
year 94 outlays were over $100 million.<p>
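As a rough illustration of the typical query just described, the following Python sketch filters a small in-memory collection of line items by keyword and outlay threshold. The <i>LineItem</i> class, its field names, and the sample figures are assumptions made for this example; the prototype itself expresses such queries in OQL through qtool.<p>
<pre>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineItem:
    """One budget line item; the field names are illustrative assumptions."""
    description: str
    fiscal_year: int
    outlays_millions: float  # outlays in millions of dollars


def select_line_items(items, keyword, fiscal_year, min_outlays_millions):
    """Return items whose description contains the keyword (case-insensitive)
    and whose outlays for the given fiscal year exceed the threshold."""
    needle = keyword.lower()
    return [
        item for item in items
        if needle in item.description.lower()
        and item.fiscal_year == fiscal_year
        and item.outlays_millions > min_outlays_millions
    ]


# Made-up sample data, for illustration only.
sample = [
    LineItem("Basic research grants", 1994, 250.0),
    LineItem("Research facilities", 1994, 40.0),
    LineItem("Highway construction", 1994, 900.0),
    LineItem("Applied research", 1993, 300.0),
]

hits = select_line_items(sample, "research", 1994, 100.0)
print([item.description for item in hits])  # ['Basic research grants']
```
</pre><p>
Only the first sample item satisfies all three conditions: it mentions the keyword, belongs to fiscal year 94, and exceeds the $100 million threshold.<p>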
The prototype uses a standard object oriented data model. This model provides
the data with enough structure  for queries such as the one just described. To
manage and query the data, we used two software tools developed by us for
related projects:  <i>ptool</i>, a low overhead high performance persistent
object manager and <i>qtool</i>, a companion tool which implements a subset of
the ODMG-93 emerging standard for OQL (Object Query Language) queries. To
provide wide area access to the data, we used the Forms Package in NCSA's
Mosaic to send OQL queries to a server, which then returned the requested data.
A variant of the prototype offers other analysis tools, such as spreadsheets,
for clients that can access the server as an X client.<p>
We have used the same technology to develop digital libraries for high energy
physics data [2] and [3].  We have also used this technology to implement data
intensive algorithms in high performance computing [5].<b><p>
<p>
2.  Related Work</b><p>
Our prototype is designed to handle digital libraries which contain numerical,
statistical or scientific data.  Some of the important differences between
digital libraries which contain textual or multi-media data and those which
contain numerical, statistical or scientific data are: <p>
<p>
	<i>Data Model.</i><p>
	Textual and multi-media digital libraries are usually "document based."  By
browsing, navigating, or searching, one identifies the document of interest
and then browses or retrieves the document as appropriate.  Whether the
document is a compound document, multi-media document or hypermedia document
does not fundamentally change this. Of course, some documents have a complex
or hierarchical structure and contain a variety of data types. In contrast,
numerical, statistical or scientific data are usually organized into
attributes, which may themselves be further divided into additional
attributes. The typical access exploits the attributes to return the data of
interest, which often requires a statistical or numerical computation, as in
"return all line items in which there is a research related expenditure
greater than $100 million." The objects returned are usually not from just one
data set, but more often from several.<p>
<p>
	<i>Searching.</i><p>
	Searching textual and multi-media digital libraries is usually by keyword,
tag, or through some type of full text retrieval.  On the other hand,
searching numerical digital libraries is often done by applying a statistical
or numerical filter to the data.  For example, "return all line items from
the FY 94 budget which differ by more than 10% from the estimates in the
FY 93 budget."<p>
 <p>
	<i>Use.</i><p>
	Information from a textual or multi-media digital library is usually read or
viewed, while information from a numerical digital library is usually used as
the basis for further numerical analysis.  For example, after all line items
which involve more than $100 million of research are retrieved, the data is
usually further analyzed with a variety of statistical or visualization
tools.<p>
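The numerical filter mentioned under <i>Searching</i> can be sketched in Python as follows. This is a minimal illustration, assuming that the FY 93 estimates and the FY 94 amounts are available as mappings from line item name to dollar amount; the function names and the sample figures are hypothetical and are not part of the prototype.<p>
<pre>
```python
def relative_change(estimate, actual):
    """Fractional difference of actual from estimate; None if estimate is zero."""
    if estimate == 0:
        return None
    return abs(actual - estimate) / abs(estimate)


def changed_items(fy93_estimates, fy94_amounts, threshold=0.10):
    """Return the names of line items present in both mappings whose FY 94
    amount differs from the FY 93 estimate by more than the threshold."""
    selected = []
    for name, amount in sorted(fy94_amounts.items()):
        if name not in fy93_estimates:
            continue  # no estimate to compare against
        change = relative_change(fy93_estimates[name], amount)
        if change is not None and change > threshold:
            selected.append(name)
    return selected


# Made-up figures, for illustration only.
fy93_estimates = {"Program A": 100.0, "Program B": 200.0, "Program C": 50.0}
fy94_amounts = {"Program A": 125.0, "Program B": 205.0,
                "Program C": 40.0, "Program D": 10.0}

print(changed_items(fy93_estimates, fy94_amounts))  # ['Program A', 'Program C']
```
</pre><p>
Items without an estimate, or with a zero estimate, are skipped rather than reported, which is one of several reasonable choices for such a filter.<p>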
<p>
A variety of technologies have been used to build digital libraries.  Many are
document based and use the native file system to manage the data.  Others use a
database to manage the data.  Our prototype, in contrast, uses a low overhead,
high performance persistent object store to manage the data and World Wide Web
(W3) applications to provide wide area access to the data.  Since the data in
our digital library was historical, most of the additional functionality
provided by a database was not needed and a persistent object manager
sufficed.<b><p>
<p>
3. Design</b><p>
The design of our system was based on just a few basic principles.<p>
<p>
	<i>Objects and collections.</i><p>
	Our system is based upon objects and collections of objects.  For the Federal
Budget, we chose the fundamental objects to be the budget tables. The budget
tables have an internal structure so that one can query by row or column.  In
contrast, with a conventional document based system, it would be very
difficult to query by row or column.<p>
<p>
	<i>Object Manager.</i><p>
	We used a high performance, low overhead persistent object manager we
developed called ptool to manage these objects.  Ptool interfaces to an IEEE
compliant hierarchical storage system in order to provide transparent access
to data on secondary and tertiary storage.  The data was accessed through a
variant of a subset of OQL (Object Query Language). The variant supported some
table-specific operations we found useful.<p>
<p>
	<i>Integrated Analysis Tools.</i><p>
	Rather than design stand alone applications, we designed a number of small
tools which accepted an input collection of objects and produced an output
collection by selecting some objects and computing derived objects.<p>
<p>
	<i>Wide Area Access.</i><p>
	We provided wide area access to the data by using Mosaic Forms to send OQL
queries to a WWW server which accessed the required data.  The query could
specify whether to return the objects themselves, so that they could be
further analyzed using local tools, or simply a file containing the
objects' attributes.<b><p>
<p>
4. Implementation</b><p>
As already mentioned, the prototype was implemented using a low overhead, high
performance object manager we developed called ptool [4] and [6] and a
companion tool called qtool we developed which supports a subset of OQL.<p>
A major part of the implementation was to migrate the legacy data into a usable
form. The U.S. Government Budget, as published by the U.S. Government Printing
Office, is available both as a print document and in electronic form. The
electronic form contains the data in a proprietary markup language used by the
Government Printing Office called <i>Microcomp</i>. We reverse engineered the
Microcomp data, translated it into a set of files containing the data and a
matching set of files describing the logical format of the data, and then
populated an object store with this data using ptool.<p>
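The last two steps of this migration, pairing a data file with a file describing its logical format and loading the result into a store, can be sketched in Python. Everything concrete here is an assumption made for illustration: the "name: type" descriptor layout, the "|" delimiter, the sample contents, and the plain dictionary standing in for the ptool object store. The sketch does not attempt to parse Microcomp itself.<p>
<pre>
```python
def parse_format(format_text):
    """Parse one 'name: type' pair per line into a list of (name, converter)."""
    converters = {"str": str, "int": int, "float": float}
    fields = []
    for line in format_text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, type_name = line.split(":")
        fields.append((name.strip(), converters[type_name.strip()]))
    return fields


def load_rows(data_text, fields, delimiter="|"):
    """Convert each delimited data line into a dict keyed by field name."""
    rows = []
    for line in data_text.splitlines():
        if not line.strip():
            continue
        values = [value.strip() for value in line.split(delimiter)]
        if len(values) != len(fields):
            raise ValueError(
                f"expected {len(fields)} fields, got {len(values)}: {line!r}")
        rows.append({name: convert(value)
                     for (name, convert), value in zip(fields, values)})
    return rows


# Hypothetical format descriptor and data file contents.
format_text = """
description: str
fy93: float
fy94: float
"""
data_text = """
Program A | 100.0 | 125.0
Program B | 200.0 | 205.0
"""

fields = parse_format(format_text)
# A dict keyed by table name stands in for the persistent object store.
store = {"FY93/sample-table": load_rows(data_text, fields)}
print(store["FY93/sample-table"][0])
```
</pre><p>
Keeping the format description separate from the data means the same loader can be reused for tables with different columns.<p>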
For this prototype, we used qtool to select tables, rows, columns, and fields
from the data.  Qtool supports a variant of OQL. We also wrote some specialized
functions for the statistical analysis of rows and tables.<p>
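Selections of rows and columns, followed by a statistical function, compose naturally when each tool takes a table and returns a table. The Python sketch below illustrates that style under stated assumptions: a table is modeled as a dictionary with a title, column names, and rows, and the function names and figures are invented for the example. It does not reproduce the qtool interface.<p>
<pre>
```python
from statistics import mean


def select_rows(table, predicate):
    """Return a new table holding only the rows for which predicate is true."""
    return {"title": table["title"], "columns": table["columns"],
            "rows": [row for row in table["rows"] if predicate(row)]}


def select_columns(table, names):
    """Return a new table restricted to the named columns."""
    indexes = [table["columns"].index(name) for name in names]
    return {"title": table["title"], "columns": list(names),
            "rows": [[row[i] for i in indexes] for row in table["rows"]]}


def column_mean(table, name):
    """Mean of a numeric column, standing in for a specialized function."""
    i = table["columns"].index(name)
    return mean(row[i] for row in table["rows"])


# Made-up figures, for illustration only.
table = {
    "title": "Sample table",
    "columns": ["function", "fy93", "fy94"],
    "rows": [
        ["Function X", 120.0, 130.0],
        ["Function Y", 80.0, 60.0],
        ["Function Z", 200.0, 210.0],
    ],
}

growing = select_rows(table, lambda row: row[2] > row[1])
narrow = select_columns(growing, ["function", "fy94"])
print(narrow["rows"])               # [['Function X', 130.0], ['Function Z', 210.0]]
print(column_mean(narrow, "fy94"))  # 170.0
```
</pre><p>
Because every step returns a new table and leaves its input unchanged, the steps can be chained in any order that keeps the needed columns.<p>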
We developed a WWW server to provide distributed access to the data, which
could return either the requested objects themselves or ASCII files containing
the attributes of the data in HTML format.<p>
We also developed X-based client-server variants of the system which used the
commercial spreadsheet Wingz for the analysis of retrieved and selected data.
We put together a simple user interface for the X-based version using Tcl/Tk to
integrate the various tools.<p>
The architecture is illustrated in Figure 3.  Figure&nbsp;1 contains a typical
query.  The objects returned by the query, viewed as a spreadsheet, are
displayed in Figure&nbsp;2.<p>
<p>
<p>
               <b>select all<p>
               from * in FY93<p>
               where * = "outlays by function"<p>
</b><p>
Figure 1.  The query uses the software tool qtool to scan all tables in the
collection of tables FY93 and locate all rows containing the string <i>outlays
by function.</i><p>
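The scan performed by the query in Figure 1 can be pictured with the following Python sketch. It is an illustration under stated assumptions and not the qtool implementation: tables are modeled as dictionaries, matching is a case-insensitive substring test, and the <i>whole_table</i> flag (a name invented here) switches between returning entire tables and returning only the matching rows.<p>
<pre>
```python
def scan_collection(tables, text, whole_table=True):
    """Scan every table; keep tables with at least one row containing text.

    With whole_table=True the matching tables are returned unchanged;
    otherwise each returned table holds only its matching rows.
    """
    needle = text.lower()
    results = []
    for table in tables:
        matching = [row for row in table["rows"]
                    if any(needle in str(cell).lower() for cell in row)]
        if not matching:
            continue
        results.append(table if whole_table
                       else {"title": table["title"], "rows": matching})
    return results


# A tiny made-up collection standing in for FY93.
fy93 = [
    {"title": "Table 1",
     "rows": [["Outlays by function", 10.0], ["Receipts", 5.0]]},
    {"title": "Table 2",
     "rows": [["Receipts by source", 7.0]]},
]

tables = scan_collection(fy93, "outlays by function")
rows_only = scan_collection(fy93, "outlays by function", whole_table=False)
print(len(tables), rows_only[0]["rows"])  # 1 [['Outlays by function', 10.0]]
```
</pre><p>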
Variants of the query allow just selected attributes of the row to be returned,
and either the entire table or just the selected rows in the table to be
returned.  This particular query retrieved 38 out of approximately 4000 tables
in the collection FY93.<b><p>
<p>
5. Conclusion and Future Directions </b><p>
Our prototype demonstrates the feasibility of building scalable digital
libraries for numerical, statistical or scientific data using wide area object
stores.   To make effective use of digital libraries of this type, further work
is required in a number of areas, especially in developing more efficient
methods for migrating unstructured legacy data into object stores; in
visualizing large amounts of numerical data; and in providing better techniques
for mining such data.<b><p>
<p>
<p>
References</b><p>
[1]	"Mass Storage System Reference Model, Version 4," edited by Sam Coleman
and Steve Miller, IEEE.<p>
<p>
[2]	C. T. Day, S. Loken, J. F. MacFarlane, E. May, D. Lifka, E. Lusk, L. E.
Price, D. Baden, R. Grossman, X. Qin, L. Cormell, P. Leibold, D. Liu, U.
Nixdorf, B. Scipioni, T. Song, "Database Computing in HEP - Progress
Report," <i>Proceedings of the International Conference on Computing in High
Energy Physics '92,</i> C. Verkerk and W. Wojcik, editors, CERN-Service
d'Information Scientifique, 1992, ISSN 0007-8328, pp. 557-560.<p>
<p>
[3]	R. L. Grossman, X. Qin, D. Valsamis, D. Lifka, E. May, D. Malon, and L.
Price, "The Architecture of a Multi-level Object Store and its Application to
the Analysis of High Energy Physics Data," <i>Laboratory for Advanced
Computing Technical Report,</i> Number LAC 94-R8, University of Illinois at
Chicago, December 1993.<p>
<p>
[4]	R. L. Grossman, D. Lifka, and X. Qin, "An object manager utilizing
hierarchical storage," <i>Twelfth IEEE Symposium on Mass Storage Systems,</i>
IEEE Press, Los Alamitos, 1993, pp. 209-214.<p>
<p>
[5]	R. L. Grossman, D. Valsamis and X. Qin, "Persistent stores and Hybrid
Systems," <i>Proceedings of the 32nd IEEE Conference on Decision and
Control,</i> IEEE Press, 1993, pp. 2298-2302.<p>
<p>
[6]	R. L. Grossman and X. Qin, "Ptool: a low overhead, scalable object
manager," <i>Proceedings of SIGMOD 94,</i> to appear.<p>
<p>
[7]	R. L. Grossman, X. Qin, A. Sundaram, M. Wu, and W. Xu, "Software Tools for
Working with Large Amounts of Complex Tabular Data: An Application to the
U.S. Government Budget," <i>Laboratory for Advanced Computing Technical
Report,</i> Number LAC 94-R11, University of Illinois at Chicago, December
1993.<p>

<!--#include virtual="/DL94/footer.ihtml" -->
</body></html>