Dean Wilder
Library of Congress
ITS 9332, 101 Independence Avenue
Washington, D.C. 20540
Tel: 202-707-9629
E-mail: dwil@dwil.loc.gov
The Library of Congress was designated as the primary site for this information, all or part of which had previously been made available through various gopher and telnet sites at the Senate, the House, and the Government Printing Office, and through a variety of commercial services. Furthermore, the new congressional leadership wanted the new system (THOMAS) to incorporate advanced retrieval techniques that would give the general public simple access to legislative information through the World Wide Web (WWW).
THOMAS is a joint project between the Library and the Center for Intelligent Information Retrieval (CIIR) at the University of Massachusetts. The CIIR is an NSF-funded consortium involving the university, the State of Massachusetts, and both business and government partners. The CIIR does basic research and technology transfer in the area of text-based information systems. For THOMAS, the CIIR provided the retrieval software (INQUERY) and customized it as needed. The Library of Congress staff developed the interface and the extensions for converting the databases to HTML (the markup language for the WWW).
The Library has previously integrated INQUERY with its Web server to make a variety of searchable databases available to the public, such as Civil War photographs, early motion pictures, sound recordings, POW/MIA records from the Vietnam War, and Country Studies from the Federal Research Division. This material falls naturally into the category of a digital library, and we believe that THOMAS is also a prototype of a digital library based on government information. From our point of view, the primary characteristic of a digital library is that it provides access to a significant amount of valuable information over a network. The additional feature of "free" access to the general public makes THOMAS even more similar to a public library model.
The high usage of THOMAS from all parts of the United States, as well as many foreign countries, provides the basis for an invaluable testbed for studying how people use a network-based information system. A wide variety of people have accessed THOMAS, including experienced searchers at academic institutions and libraries, high school teachers, lobbyists, congressional staffs, and thousands of individual citizens interested in the process of government.
A full evaluation of the techniques that are being deployed in THOMAS will require recall/precision comparisons based on sample query sets and relevance judgments [8]. Although we intend to carry out this evaluation, at this early stage of the development of the system, new techniques are being introduced primarily on the basis of a small number of test queries and user feedback about specific problems. In this paper, we focus on these initial experiences with THOMAS, including descriptions of usage and query statistics and how query processing has changed to reflect user feedback. We also describe how current research on query expansion and morphological processing could provide techniques that will improve THOMAS and similar systems.
In the next section, we describe the THOMAS database. We then outline the basic features of the INQUERY retrieval system and their importance in the THOMAS system. Following sections give usage statistics for THOMAS, including query length and frequency, describe the query processing that is used to improve the retrieval performance for the type of queries and data in this system, and present some current problems and possible solutions.
The legislative database consists of all bills introduced in the House and Senate. These bills can exist in as many as 10 versions as they pass through the legislative process. Examples of versions are: "Introduced in the House", "Introduced in the Senate", "Engrossed in the House", "Enrolled Bill Sent to President", etc. Each version is treated as a separate document. In the 103rd Congress, there were about 13,000 separate documents comprising almost 300MB of data. Bills can range in size from 1K to 2MB.
The Congressional Record is published on a daily basis when at least one chamber is in session. Each record consists of a Daily Digest, which is a summary of the day's events, a House Section, a Senate Section, and an Extension of Remarks section, as appropriate. The Extension of Remarks section contains statements by members which were not actually made on the floor of Congress, but were inserted into the Record. The House and Senate sections are divided into debates or discussions on specific subjects, each with its own title, such as "Balanced Budget Amendment". These subdivisions form the basic documents in the database. As with the bills, there is a large variance in size of the basic documents, from about 1K to 700K. The Congressional Record for the 103rd Congress contains about 30,000 of these documents, for a total size of about 600 MB.
All legislative information is received via FTP from the Government Printing Office on a daily basis. Both the bills and the Record are passed through a preprocessing program which establishes document and title tags for INQUERY, converts GPO publication codes to HTML, and creates tables of contents where appropriate. In order to avoid unnecessary workload for WWW browsers with low-bandwidth connections, a maximum segment size of 10K has been established. Tables of contents are created for bills and Record segments which exceed 10K. These tables of contents are returned to the searcher when the full document display is requested.
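As an illustration of the segmentation step, the following sketch (in Python) splits a document into segments of at most 10K and builds a simple hypertext table of contents; the function names, heading selection, and link format are hypothetical and are not taken from the actual preprocessing program.

    # Minimal sketch of the 10K segmentation described above.
    # Function names, heading selection, and link format are assumptions.
    MAX_SEGMENT = 10 * 1024  # 10K limit to spare low-bandwidth browsers

    def segment_document(doc_id, paragraphs):
        """Split a document into <=10K segments and build a table of contents."""
        segments, current, size = [], [], 0
        for para in paragraphs:
            if current and size + len(para) > MAX_SEGMENT:
                segments.append("\n".join(current))
                current, size = [], 0
            current.append(para)
            size += len(para)
        if current:
            segments.append("\n".join(current))
        # One hypertext entry per segment, labelled by its first line.
        toc = ['<a href="%s-%d.html">%s ...</a>' % (doc_id, i, seg.splitlines()[0][:60])
               for i, seg in enumerate(segments, 1)]
        return segments, toc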
Additional navigational aids are created for the Congressional Record. A hypertext table of contents is created for each day's Senate, House, and Extension of Remarks sections. The Daily Digest is converted to HTML and provided with page references linked to the data sections of the day's Record. An overall calendar is created to reference the four sections of the Record for any given day of the year.
Once the preprocessing is complete, INQUERY is invoked to index the data. Bills are indexed by title, bill number, and the text of the bill. The Congressional Record is indexed by title, document identifier, date, speaker, and page number. Indexing using INQUERY requires approximately 1 hour and 15 minutes to process the 600MB Record for the 103rd Congress on an RS6000 Model 990. Due to the speed of indexing, the entire database is indexed on a daily basis, with no attempt at updating. In order to provide uninterrupted service, two copies of each database are maintained, and the indexing is performed off-line. When the indexing is complete, a current production pointer is switched to the updated database. This also provides immediate backup if the production run should fail. INQUERY also provides an incremental update capability to avoid re-indexing, and this can be done in parallel with queries using concurrency control. In the first version of THOMAS, however, these features were not fully available and were judged to be not necessary.
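The two-copy arrangement can be sketched as follows; the directory names, the symlink used as the production pointer, and the build_index callable are illustrative assumptions rather than the actual THOMAS operational scripts.

    import os

    # Two copies of each database; a symlink acts as the production pointer.
    COPIES = ["/data/thomas/index_a", "/data/thomas/index_b"]   # hypothetical paths
    CURRENT = "/data/thomas/index_current"

    def reindex_offline(build_index):
        """Rebuild the idle copy off-line, then switch production to it."""
        live = os.path.realpath(CURRENT)
        spare = COPIES[0] if live == COPIES[1] else COPIES[1]
        build_index(spare)            # e.g. run the INQUERY indexer against the spare copy
        tmp = CURRENT + ".tmp"
        os.symlink(spare, tmp)
        os.replace(tmp, CURRENT)      # queries now use the fresh index
        # the previous copy is left intact as an immediate backup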
A better way to measure serious usage of the system is by examining searches. In the period between January 6 and March 20, 1995, there were 2,302,589 WWW transactions, which encompassed 294,575 accesses to the THOMAS home page. Of these home page accesses, there were 196,724 accesses to query pages. From the access log, we can determine that there were 94,911 queries where at least one item was examined.
Query                         Count
balanced budget               2,600
crime                         1,057
gun(s)                          994
balanced budget amendment       991
s 314                           902
telecommunications              888
welfare                         846
budget                          753
abortion                        678
line item veto                  610
gun control                     539
unfunded mandates               532
welfare reform                  513
education                       441
tax                             415
term limits                     401
crime bill                      375
contract with America           366
public broadcasting             333
decency                         333
immigration                     316
balanced                        315
health care                     305
baseball                        303
firearms                        300
TOTAL                        16,106

Table 1: 25 most common queries in THOMAS
An examination of the text of these queries provides valuable information on user behavior when presented with the opportunity to enter free-form natural language queries. Many of the same queries are repeated many times. Table 1 shows some of the more popular queries with a count of the number of times they were entered. Of the 94,911 separate queries recorded, only 25,321 were unique.
Words     Unique Queries
1                  5,767
2                  9,646
3                  6,905
4                  2,240
5                    656
6                     87
7                     19
8                      1
Total             25,321

Table 2: Number of words in queries
The data recorded from THOMAS indicates that users tend to enter very simple queries. Table 2 shows the number of search terms (including stopwords) recorded in the 25,321 unique searches logged. The fact that 88 percent of all queries contain 3 or fewer words suggests that most queries in this application consist of a single concept expressed as a word or phrase. An examination of the searches most frequently submitted, shown in Table 1, tends to confirm this hypothesis. Although a number of studies have been done on the types of queries submitted to information services [9], there is not a large amount of data on what happens in systems with free-form or "natural language" queries. Examples of natural language searching of legal material at West Publishing [10] suggest that, in that environment, queries tend to be somewhat longer than those seen in THOMAS. Some of the common test collections used in information retrieval research, together with the average number of words in their natural language queries, are Cranfield (9.2), CACM (13), Time (8.9), NPL (7.1), INSPEC (15.6), and West (9.6). An additional factor that tends to constrain query length in real environments is that many computer users have been trained by the text search capabilities of other systems to believe that longer queries will fail to retrieve any documents.
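The word-count distribution in Table 2 is easy to reproduce from a query log; the sketch below assumes a hypothetical plain-text log with one query per line, which is not necessarily how the THOMAS access log is stored.

    from collections import Counter

    def query_length_histogram(log_path):
        """Count words (stopwords included) for each unique query in a log file."""
        unique = set()
        with open(log_path) as log:
            for line in log:
                query = " ".join(line.strip().lower().split())
                if query:
                    unique.add(query)
        lengths = Counter(len(q.split()) for q in unique)
        return dict(sorted(lengths.items()))   # e.g. {1: 5767, 2: 9646, ...}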
The first problem noted in THOMAS was that searchers looking for bills with specific titles would not find those bills ranked near the top of the search results. For example, searchers looking for the "Defense Appropriations Act" would not find the bill in the first 20 items in the retrieved list. Many bills have the words "defense", "appropriations", and "act" in great profusion, while the bill itself often does not include its own name or subject matter within the text. This is a phenomenon common to many documents, where the actual subject matter of the document is assumed by the author and is not mentioned in the text.
After some experimentation, it was found that weighting words (using #WSUM) occurring in the title of the bill by a factor of 20 over words in the text produced good results for the queries that had been pointed out by users to have this problem. As mentioned earlier, query processing techniques should be evaluated using recall/precision experiments, but these have not yet been done.
The next factor which appeared to have an important influence on the success or failure of searches was word proximity. Since it was evident that the vast majority of searches consist of a single phrase, it would seem reasonable to give additional credit to documents that contain the query words in close proximity. For example, a document containing "state department" would be more likely to be relevant than one containing the words "state" and "department" in completely different sections.
Adding increased weight to occurrences of the search terms in ordered proximity to each other resulted in a considerable improvement in the relevancy of bills returned. Because document sizes in the database vary so widely, very large documents containing highly relevant passages could otherwise be ranked below very small documents in which the search terms are widely scattered. After considerable experimentation, a weight of 90 (using #WSUM with a base weight of 1) was assigned to any occurrence of all of the search terms within an ordered proximity of 3 words of each other.
A further improvement was obtained by adding weight to the unordered occurrence of all search terms within a window of a specified size in the document. This was accomplished using the INQUERY #UWn operator. The value of n was arbitrarily set to 10 times the number of search terms. The weight of any occurrence of the terms within the specified unordered window was set to 50% of the weight given to terms found in ordered proximity to each other.
Some users complained that even with the above weighting, documents containing all of their search terms would be ranked below documents containing only a subset of their search terms. This is a problem with the method used to weight terms when documents show a great variance in size and queries are short and simple. The term weighting in INQUERY is a variation of that used in other systems such as SMART [8], and is known as tf.idf weighting. The tf (term frequency) component of this weight depends on the within-document frequency of the term, and the idf (inverse document frequency) component varies inversely with the frequency of the term in the corpus.
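As a point of reference, one common form of this weighting (not necessarily the exact variant used in INQUERY) scores a document $d$ against a query $q$ as

    $\mathrm{score}(q,d) = \sum_{t \in q} \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)$, with $\mathrm{idf}(t) = \log(N / n_t)$,

where $N$ is the number of documents in the collection and $n_t$ is the number of documents containing term $t$. Because the score is a sum over matching terms, a small document in which one rare query term occurs repeatedly can outscore a large document that contains every query term a few times, which is the behavior users complained about.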
When queries such as "pressler public broadcasting" are entered, the proximity rules set forth above are of little use. The searcher is particularly interested in Senator Pressler and public broadcasting, but the high frequency of the words "public" and "broadcasting" causes the relevant documents to occur far down the list, behind documents that contain many occurrences of "pressler". This problem has been rectified to some extent by adding a high weight to the occurrence of the Boolean AND of the search terms in any document. The INQUERY #BAND operator was used with a weight of 90 to achieve this effect. This problem with the ranking algorithm indicates that further research is needed on methods of combining evidence that emphasize the number of matching terms more strongly.
#WSUM ( 1.0 balanced
        1.0 budget
        1.0 amendment
        90.0 #3(balanced budget amendment)
        45.0 #UW30(balanced budget amendment)
        90.0 #BAND(balanced budget amendment)
        20.0 #FIELD(TITLE
             #WSUM(1.0
                   1.0 balanced
                   1.0 budget
                   1.0 amendment
                   20.0 #3(balanced budget amendment)
                   10.0 #UW30(balanced budget amendment)
                   1.0 #BAND(balanced budget amendment) ) )
        10.0 #PARSUM200(balanced budget amendment) )

Figure 1: Transformed query for "balanced budget amendment"
An example of the current weighted search algorithm is shown in Figure 1. Note that the entire query is repeated inside the #FIELD operator, where it is restricted to the Bill title. The #PARSUM or passage operator is also used as part of the standard INQUERY query processing.
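The complete transformation can be expressed compactly; the sketch below (a hypothetical Python function, not the actual THOMAS code) applies the weights described above: 90 for ordered proximity of 3, 45 for an unordered window of 10 times the number of terms, 90 for the Boolean AND, 20 for the title field, and 10 for the passage operator.

    def transform_query(raw_query):
        """Build a THOMAS-style weighted INQUERY query (cf. Figure 1)."""
        terms = raw_query.split()
        phrase = " ".join(terms)
        window = 10 * len(terms)             # unordered window: 10 x number of terms

        def clauses(prox_w, uw_w, band_w):
            parts = ["1.0 %s" % t for t in terms]
            if len(terms) > 1:               # proximity evidence only for multi-word queries
                parts.append("%.1f #3(%s)" % (prox_w, phrase))
                parts.append("%.1f #UW%d(%s)" % (uw_w, window, phrase))
                parts.append("%.1f #BAND(%s)" % (band_w, phrase))
            return " ".join(parts)

        body  = clauses(90.0, 45.0, 90.0)                                   # whole-text evidence
        title = "20.0 #FIELD(TITLE #WSUM(1.0 %s))" % clauses(20.0, 10.0, 1.0)
        return "#WSUM(1.0 %s %s 10.0 #PARSUM200(%s))" % (body, title, phrase)

    # transform_query("balanced budget amendment") yields a query along the
    # lines of Figure 1 (differences in whitespace and layout aside).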
The effectiveness of the above query processing techniques rests on the assumptions that (1) queries are not of arbitrary length but rarely exceed 4 terms, and (2) most queries contain a single concept or phrase. This is a pragmatic approach to query processing which produces a significant improvement in the ranking of relevant documents in the vast majority of cases and makes no difference in the small number of remaining cases. If an experienced searcher enters a complex search statement, these techniques will have little impact. Of course, such queries are precisely the ones which perform well without further processing.
To correct users' misunderstanding of how their searches are processed, THOMAS now generates several informative messages following a search. The following header is printed:
IMPORTANT: Read the following before examining the list of bills:
followed by one of the following messages:
    The following words were not found in the database: word1, ..., wordn.

    All of your search terms did not occur in any single bill.

    All of your search terms did not occur within 50 words of each other in any one bill.

    The phrase "[search terms]" did not occur in any bill.
In the case of the Boston Globe search, the second message would have been printed.
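One way such messages could be chosen is sketched below; the four helper functions are hypothetical stand-ins for checks against the index and are not part of the INQUERY API.

    def diagnostic_messages(terms, in_vocabulary, bills_with_all,
                            bills_within_50, bills_with_phrase):
        """Choose the informative message(s) to print after a search.

        in_vocabulary(t)       -> True if term t occurs anywhere in the database
        bills_with_all(terms)  -> number of bills containing every term
        bills_within_50(terms) -> number of bills with all terms within 50 words
        bills_with_phrase(ts)  -> number of bills containing the exact phrase
        (all four are hypothetical helpers)
        """
        messages = []
        missing = [t for t in terms if not in_vocabulary(t)]
        if missing:
            messages.append("The following words were not found in the database: "
                            + ", ".join(missing) + ".")
        elif bills_with_all(terms) == 0:
            messages.append("All of your search terms did not occur in any single bill.")
        elif bills_within_50(terms) == 0:
            messages.append("All of your search terms did not occur within 50 words "
                            "of each other in any one bill.")
        elif bills_with_phrase(terms) == 0:
            messages.append('The phrase "%s" did not occur in any bill.' % " ".join(terms))
        return messages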
The search for "elderly black Americans" points out another problem - the need for some kind of thesaurus substitution. The concept of "elderly" could occur as "older" or "aged", while "black Americans" could occur as "African Americans", and be related to "minorities". We are currently investigating the possibility of integrating a pre-existing thesaurus into THOMAS, or making use of an automatic association thesaurus such as INQUERY's PhraseFinder [4]. When a version of PhraseFinder built using newspaper databases was used with this query, it found the following phrases that were automatically added to the query:
retired persons, poverty line, poverty rate, elders, health statistics
Making that change to the query improved the retrieval performance so that 9 of the top 10 bills were related to the elderly and poor. The potential confusion of the name "Elders" with "elderly" is related to the issue discussed next.
The use of automatic stemming in THOMAS has both positive and negative aspects. Stemming provides a powerful method for linking different word forms into a single concept cluster. However, it can cause problems when the exact form of a word is required. Since INQUERY indexes only word stems, a search for "Representative Franks" will also return references to two other representatives named "Frank". A search for "Billy" will return not only references to everyone named "Bill", but also every reference to "bills". This came up, for example, when someone wanted to find references to Senator Byrd's dog Billy in the Congressional Record.
This problem can be rectified to some extent by substituting the new stemming algorithm KSTEM developed at the University of Massachusetts for the standard Porter Stemmer supplied with INQUERY [5]. KSTEM is a more conservative, dictionary-oriented system that also permits the database designer to set up exception lists of words which are not to be stemmed. This would solve the two problems above, but requires the designer to anticipate all stemming problems which may occur. A better solution might be to store exact word forms in the database, and carry out stemming at query time. For example, the query-based stemmer could decide that a query word "bill" should be expanded to #SYN( bill bills ), but this could be overruled by the user specifying that only the form "bill" is acceptable. Research is now in progress to add such a feature to INQUERY [2].
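A minimal sketch of this query-time expansion is given below; the variant table and the exception list are illustrative assumptions and do not reflect the actual KSTEM dictionaries or the feature being added to INQUERY [2].

    # Query-time stemming: expand a query word into a #SYN of its surface forms,
    # unless the word is on an exception list or the user asks for the exact form.

    VARIANTS = {                       # hypothetical word-form table
        "bill": ["bill", "bills"],
        "appropriation": ["appropriation", "appropriations"],
    }
    NO_STEM = {"franks", "billy"}      # proper names that must not be conflated

    def expand_term(word, exact=False):
        forms = VARIANTS.get(word.lower(), [word])
        if exact or word.lower() in NO_STEM or len(forms) == 1:
            return word
        return "#SYN( " + " ".join(forms) + " )"

    # expand_term("bill")             -> '#SYN( bill bills )'
    # expand_term("bill", exact=True) -> 'bill'   (user overrides the expansion)
    # expand_term("Franks")           -> 'Franks' (exception list blocks stemming)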
Further experiments based on formal relevance judgments are planned to determine the level of effectiveness improvement from the new techniques and to rank performance against similar systems.
Overall, our experience with the THOMAS system shows that it is very important to tune an information system to the user population. The query processing algorithms used in the current system would probably not be appropriate for expert searchers. An advanced retrieval engine such as INQUERY provides the query processing and indexing flexibility to accommodate rapid tuning while retaining efficient performance. The lessons learned from the large variety of users in THOMAS emphasize different issues than can be studied in a formal experimental environment. This has motivated more research in query processing, term weighting, relevance feedback and stemming algorithms.
2. W.B. Croft and J. Xu. Corpus-specific stemming using word form co-occurrence. In Fourth Annual Symposium on Document Analysis and Information Retrieval, pages 147-159, 1995.
3. D. Harman. Overview of the Third Text REtrieval Conference (TREC-3). In D. Harman, editor, Proceedings of the Third Text REtrieval Conference (TREC-3), pages 1-120. NIST Special Publication 500-225, 1995.
4. Y. Jing and W.B. Croft. An association thesaurus for information retrieval. In Proceedings of RIAO 94, pages 146-160, 1994.
5. R. Krovetz. Viewing morphology as an inference process. In Proceedings of the 16th International Conference on Research and Development in Information Retrieval, pages 191-202, 1993.
6. M. Putzel. Room for doubting Thomas. Boston Globe, page 92, January 27, 1995.
7. T.B. Rajashekar and W.B. Croft. Combining automatic and manual index representations in probabilistic retrieval. Journal of the American Society for Information Science, 46(4):272-283, 1995.
8. G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
9. K. Sparck Jones, editor. Information Retrieval Experiment. Butterworth, 1981.
10. H. Turtle. Natural language vs. Boolean query evaluation: A comparison of retrieval performance. In Proceedings of ACM SIGIR 94, pages 212-220, 1994.
11. H.R. Turtle and W.B. Croft. Evaluation of an inference network-based retrieval model. ACM Transactions on Information Systems, 9(3):187-222, 1991.
12. H.R. Turtle and W.B. Croft. A comparison of text retrieval models. Computer Journal, 35(3):279-290, 1992.