Management of the National HPCC Software Exchange - a Virtual Distributed Digital Library

The work described in this paper was sponsored by NASA under Grant No. NAG 5-2736

Shirley Browne
Computer Science Department
University of Tennessee
Knoxville, TN 37996-1301
Tel: 1-615-974-5886
E-mail: browne@cs.utk.edu

Jack Dongarra
Computer Science Department
University of Tennessee
Knoxville, TN 37996-1301
Tel: 1-615-974-8295
E-mail: dongarra@cs.utk.edu

Ken Kennedy
Center for Research on Parallel Computation
Rice University
Houston, TX 77005
Tel: 1-713-285-5188
E-mail: ken@rice.edu

Tom Rowan
Mathematical Sciences Section
Oak Ridge National Laboratory
Oak Ridge, TN 37831-6367
Tel: 1-615-574-3131
E-mail: rowan@msr.epm.ornl.gov

ABSTRACT

The National HPCC Software Exchange (NHSE) is a distributed collection of software, documents, and data for the high performance computing community. Our experiences with the design and initial implementation of the NHSE are relevant to a number of general digital library issues, including publication, quality control, authentication and integrity, and information retrieval. This paper describes an authenticated submission process that is coupled with a multilevel review process. Browsing and searching tools for aiding information retrieval are also described.

KEYWORDS: electronic publication, information retrieval, high performance computing, quality control

INTRODUCTION

The National HPCC Software Exchange (NHSE) (1) is an Internet-accessible resource that provides access to software and other information related to High Performance Computing and Communications (HPCC). The NHSE facilitates the development of discipline-oriented software and document repositories. Furthermore, it promotes contributions to and use of such repositories by members of the high performance computing community, via a common World Wide Web interface. The NHSE is also a valuable resource for technology transfer and educational purposes.

The effectiveness of the NHSE depends on discipline-oriented groups having ownership of independently maintained repositories. The information and software residing in these repositories is best maintained and kept up-to-date by its developers, rather than by centralized administration. Developers may wish to provide specialized access methods or services, a remote execution capability for example. Central administration is used instead to handle interoperation and to meet common needs.

Although the different disciplines maintain their own software repositories, users should not need to access each of these repositories separately. Rather, the NHSE provides a uniform interface to a virtual HPCC software repository built on top of a distributed set of discipline-oriented repositories, as shown in Figure 1. The interface assists the user in locating and retrieving relevant resources.

Figure 1: Virtual Repository Architecture.

In order for the NHSE to provide an information retrieval interface to the distributed collection of materials, it must have the raw material available from which to build indexes and other searching and browsing aids. Various techniques for collecting and indexing descriptive material are used in the NHSE, including manual construction of catalog records, collection and indexing of unstructured text, and computer-assisted construction of a hypertext roadmap.

Users of the NHSE need to have confidence that the software they obtain is high quality and well-tested. If the software is experimental or untested, they should be made aware of this. The NHSE has developed a review process that allows authors to submit software for consideration at different levels of review classification, with the rigor of the review process increasing with increasing levels.

A contributor to the NHSE makes a contribution available by placing it on a file server accessible via the FTP or HTTP file access protocols and informing the NHSE of its existence. The NHSE can then provide a pointer in the form of a URL, along with a description of the contribution. For review, version control, and tracking of software contributions it is important to ensure fixity of publication -- i.e., that the software has not been changed since the time of submission unless the NHSE has been informed of the change. Because of copyright, liability, and other legal issues, it is also important that someone not be able to masquerade as someone else or to make unauthorized changes to someone else's contributions. For these reasons, the NHSE has developed authenticity and integrity checking mechanisms for software submissions based on file fingerprints and a public-key cryptosystem.

SOFTWARE SUBMISSION AND REVIEW

Contributors submit software to the NHSE by filling out an HTML form using a forms-capable WWW browser such as Mosaic or Netscape (2). This form explains the submission and review process, including the authentication procedures, and gives an example of a completed submission form. The form asks the user to fill in values for several attributes, some required and some optional. These attributes form a subset of those specified in the Reuse Library Interoperability Group (RIG) Basic Interoperability Data Model (BIDM) [RIG-BIDM]. The remaining BIDM fields are generated by the NHSE librarian or from default values. The RIG has been chartered by the IEEE to develop standards for reuse library interoperation. Use of the BIDM standard by the NHSE will facilitate interoperation with other reuse libraries adopting this standard, including a number of existing government and industry reuse libraries (e.g., ASSET, CARDS, DSRS, ELSA).

Some contributors may have large collections that are already indexed using a different data model. The NHSE will provide assistance to such contributors in converting their indexing information to the form required for submission to the NHSE and in submitting such collections en masse.

Review Levels

Currently three levels of software are recognized in the NHSE, described as follows:

To receive the partially reviewed rating, software submitted to the NHSE should conform to the following guidelines:

To be accorded the reviewed status, the software must first have been accorded the partially reviewed status. This precondition ensures that reviewers will be able to access all the information needed to carry out the review over the National Information Infrastructure.

Software submitted for full review is reviewed according to the following criteria:

After software has been submitted for full review, it is assigned to an area editor, who recruits two to six reviewers to peer review the software according to the above criteria. To qualify for full review, an author must provide sample data and, for each sample, either the output or a description of the results. Each reviewer is asked to read the software documentation and try the software on some of the data sets provided by the author. In addition, it is recommended that a reviewer test the software on inputs not provided by the author. If source is available, the reviewer examines the source to ensure that the methods and programming methodology are of acceptable quality. Each reviewer prepares all comments in electronic form and returns these, along with a recommendation, to the editor in charge of the review. After the peer reviews are returned, the editor makes the final decision as to whether to accept the software and informs the author of the decision. If the software is accepted, the area editor prepares a review abstract for use by the NHSE.

Once the software has been reviewed, one of two things happens. If it is not accepted, the author will be so informed and anonymous copies of the reviews will be provided. The author may then choose to address the reviewers' comments and resubmit the revised software. If the software is accepted, the author will be shown a review abstract summarizing the reviewer comments. This abstract will be available to anyone who accesses the software through the NHSE. An author who finds the abstract unacceptable may withdraw the software and resubmit it for review at a later date.

Authentication Procedures

After a contributor fills out the NHSE software submission form and submits it, a program is invoked at an NHSE server that checks the form for any obvious errors, such as omission of required attributes, incorrectly formed email addresses, or unretrievable URLs. If no errors are found, a plain-text version of the catalog record is returned to the client program, along with instructions to save the plain text version to a file and to carry out one of the following authentication procedures:
  1. PGP Authentication. [Zimmerman-PGP] The author signs the catalog record with the private PGP key whose corresponding public key has been certified by the NHSE, and then mails the signed record to a designated address. The mail server at that address verifies the PGP signature and processes the submission only if the signature is valid.
  2. Notarization. The author prints out the plain text form, signs it, has the signature notarized, and sends the document back via surface mail. When the form is received, the NHSE librarian PGP-signs the electronic version of the form (using a special proxy key reserved for this purpose) on behalf of the author.

Before using Method 1, the author must have PGP installed locally and must have obtained a PGP key pair. The author's public key must have been certified by the NHSE librarian. An author may obtain this certification either in person, via a trusted third party who signs the author's key, or by a method similar to Method 2 above: print out the key fingerprint, sign it, have it notarized, and surface mail it to the NHSE librarian.
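
The error check that the NHSE server applies to a submitted form, before either authentication procedure begins, might look roughly like the following. This is a minimal sketch, not the NHSE program itself: the attribute names, the email pattern, and the `url_is_retrievable` callback are all assumptions introduced for illustration (the actual required attributes are a subset of the BIDM and are not listed in this paper).

```python
import re

# Hypothetical attribute names; the real form uses a subset of the RIG BIDM.
REQUIRED_ATTRIBUTES = ("title", "abstract", "contact_email", "urls")

# Deliberately loose pattern: it catches malformed addresses,
# not addresses that are well formed but undeliverable.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def check_submission(record, url_is_retrievable):
    """Return a list of error messages; an empty list means the form passed.

    record             -- dict of attribute name -> value from the form
    url_is_retrievable -- callable(url) -> bool, supplied by the caller
                          (e.g. an FTP/HTTP probe); assumed, not specified
    """
    errors = []
    for name in REQUIRED_ATTRIBUTES:
        if not record.get(name):
            errors.append(f"missing required attribute: {name}")
    email = record.get("contact_email", "")
    if email and not EMAIL_PATTERN.match(email):
        errors.append(f"incorrectly formed email address: {email}")
    for url in record.get("urls", []):
        if not url_is_retrievable(url):
            errors.append(f"unretrievable URL: {url}")
    return errors
```

Passing the retrieval check in as a callback keeps the sketch independent of any particular network library; only a record that yields an empty error list would be returned to the contributor as a plain-text catalog record for signing.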

We considered other authentication methods, such as email addresses and userid/password based accounts, but rejected such methods as providing insufficient security.

Identification, Cataloging, and Integrity

Once an author's software submission has been authenticated, it is processed before being placed in the NHSE on-line software catalog. This processing involves retrieval of the files specified by the author as making up the contribution, fingerprinting these files, assigning the contribution a unique identifier, and additional cataloging by the NHSE librarian. If the software has been submitted for partial review, the NHSE librarian also inspects the submission for adherence to the NHSE software guidelines.

After the files making up a contribution have been retrieved, each file is fingerprinted using the MD5 secure hash function [Rivest-MD5]. The (URL,MD5) pairs for the files are then placed in another file which is itself fingerprinted. This top-level fingerprint is used to construct a unique identifier for the submission, which we call a LIFN, or Location Independent File Name. The submission can subsequently be retrieved from the NHSE software catalog by specifying its LIFN.
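
The two-level fingerprinting described above can be sketched as follows. The layout of the fingerprint file (one "URL MD5" pair per line, sorted by URL) and the `lifn:` prefix are assumptions made for illustration; the paper does not specify the actual file format or LIFN syntax.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """MD5 fingerprint of a byte string, as a hexadecimal digest."""
    return hashlib.md5(data).hexdigest()


def build_fingerprint_file(files: dict) -> bytes:
    """Serialize the (URL, MD5) pairs for a contribution.

    files maps each URL to the bytes retrieved from it. Pairs are sorted
    by URL so the same contribution always yields the same file
    (an assumed convention).
    """
    lines = [f"{url} {md5_hex(content)}" for url, content in sorted(files.items())]
    return ("\n".join(lines) + "\n").encode("utf-8")


def make_lifn(files: dict):
    """Return (lifn, fingerprint_file) for a contribution.

    The top-level fingerprint is the MD5 of the fingerprint file itself;
    the "lifn:" prefix is a hypothetical naming syntax.
    """
    fingerprint_file = build_fingerprint_file(files)
    return "lifn:" + md5_hex(fingerprint_file), fingerprint_file
```

Because the top-level fingerprint covers every (URL, MD5) pair, a change to any single file produces a different identifier, which is what allows the identifier to stand for one fixed version of the contribution.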

The LIFN concept is part of a more general naming structure that is being developed to provide for transparent mirroring of files and to address other scalability and reliability problems that will result from the expected growth of the NHSE [ssr95].

As part of the processing, the NHSE librarian categorizes the software submission into one of four main categories: application libraries and programs, data analysis and visualization tools, numerical libraries and routines, and parallel processing tools. Software falling under parallel processing tools is categorized further into one of eight subcategories. The NHSE librarian also assigns keywords drawn from the HPCC thesaurus (currently under development) and, for mathematical software, from the GAMS classification scheme [BoHK91].

The NHSE provides a form, called the LIFN verification form, that allows a user to verify the integrity of a submission (3). A contributor may use this form to check whether any of the files have changed since their submission. To use the form, the user or contributor enters the LIFN to be verified and presses the Verify button. This action causes a program to be invoked on an NHSE server that carries out the following steps:

  1. retrieves the fingerprint file that was constructed when the LIFN was assigned and that contains the URLs and the stored fingerprints for the files making up the submission,
  2. retrieves the files using the designated URLs,
  3. computes the MD5 fingerprint for each of the retrieved files and compares it with the stored fingerprint that was previously computed for the same URL,
  4. flags any file that has been changed since the LIFN was assigned and gives the user the option of retrieving the original file as archived by the NHSE.
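
The verification steps above amount to recomputing each fingerprint and comparing it with the stored one. A minimal sketch, assuming the stored fingerprint file has already been parsed into a mapping from URL to MD5 digest and that a caller-supplied `fetch` function retrieves a file's bytes (both are illustrative assumptions, not the NHSE implementation):

```python
import hashlib


def verify_submission(stored_fingerprints: dict, fetch) -> list:
    """Return the URLs whose current contents differ from the stored MD5.

    stored_fingerprints -- dict of URL -> MD5 hex digest recorded when
                           the LIFN was assigned
    fetch               -- callable(url) -> bytes (hypothetical retrieval hook)
    """
    changed = []
    for url, stored_md5 in stored_fingerprints.items():
        current_md5 = hashlib.md5(fetch(url)).hexdigest()
        if current_md5 != stored_md5:
            changed.append(url)  # flag the file as modified since submission
    return changed
```

An empty result means every file still matches its fingerprint; for any flagged URL, the user would be offered the archived original.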

Updating a Previous Submission

A contributor may update or withdraw a previous submission by using the NHSE software submission change form (4). This form asks the contributor to enter the LIFN for the previous submission. A contributor who does not know the LIFN can search for the submission in the NHSE software catalog in order to determine it. After entering the LIFN, the contributor presses a button that causes the catalog record for the LIFN to be retrieved and displayed in a second form. The contributor may then specify any files that have been changed or added, describe changes made to the files, and/or update cataloging information.

After filling out the change form and submitting it, the contributor authenticates the change request using one of the two authentication procedures described in the previous section. Note, however, that if the submission was initially authenticated using PGP, the NHSE will be extremely cautious about accepting updates authenticated using the notarization method.

INFORMATION RETRIEVAL AIDS

Depending on the size, rate of change, and nature of the underlying software or document database, the NHSE uses different techniques for assisting the user in searching and browsing the information. Small or fairly stable collections permit labor-intensive indexing and abstracting, with resulting benefits of improved recall and precision for searches. Large or rapidly changing collections require the use of less precise automatic indexing techniques.
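
For readers less familiar with the two measures, recall is the fraction of the relevant items that a search retrieves, and precision is the fraction of the retrieved items that are relevant. The short sketch below computes both for a single query; the function and the example identifiers are illustrative only and are not part of the NHSE software.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved -- identifiers returned by the search
    relevant  -- identifiers that are actually relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, a search that returns four entries, two of which are among the three relevant ones, has precision 2/4 and recall 2/3.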

The current NHSE software catalog (5) is fairly small, with fewer than 300 entries. Thus, it has been possible to manually abstract and index this collection. The cataloging process has been carried out jointly by the software authors and the NHSE librarian, with the software authors providing the title and abstract fields, and the NHSE librarian categorizing each entry and assigning thesaurus keywords. The NHSE software catalog is available in the following formats:

  1. An HTML version that can be browsed by category.
  2. A searchable version that allows the user to search separately by different attributes or to do a free-text search on the catalog records. A link to an on-line copy of the HPCC thesaurus is provided so that users can select controlled vocabulary terms for searching. The current interface requires users to cut and paste thesaurus terms into the search form. We plan to develop a hypertext version of the thesaurus that will statically link thesaurus terms to scope and definition notes and to related terms (also broader terms and narrower terms), as well as dynamically link thesaurus terms to indexed catalog entries.
  3. A PostScript version that can be downloaded and printed.

A number of sites involved with the NHSE maintain collections of technical reports on numerical or high performance computing. These collections are frequently already indexed and abstracted, although they may use different indexing formats. One such collection is maintained at the University of Tennessee Computer Science Department (UTKCS). UTKCS is joining the Computer Science Technical Report (CSTR) project, and other NHSE sites will be encouraged to do likewise. The CSTR project is developing standards and technologies for digital document repositories (6). The Dienst server software available from Cornell University facilitates searching for and retrieving documents from a repository and linking together different repositories so that all may be searched from any site. Dienst also provides utilities that assist sites with installing the document database and converting from other indexing formats (7).

In addition to the software catalog, the NHSE has a distributed hypertext structure that contains a variety of information on high performance computing. Most of this information is in the form of HTML pages, but there are also links to documents in other formats, such as plain text and PostScript. Links are provided to various HPCC programs and activities, to descriptions of Grand Challenge applications, and to other software repositories. Because the collection of information has grown very large, a search interface has been provided. This search interface currently uses the Harvest system [Bowman-Harvest] to collect information from remote sites, index that information using WAIS, and process queries from users. The Harvest system worked satisfactorily at first, but the underlying database has now grown so large and diverse that 1) the gathering takes on the order of several days to a few weeks, so that the search interface becomes out-of-date in the meantime, and 2) many searches return extremely large result sets. Work is underway both by the Harvest development group and by NHSE researchers at Argonne National Laboratory to address these scalability problems.

Hypertext roadmaps are being developed at Syracuse University to provide guided tours to HPCC software and technologies (8). The roadmap consists of encyclopedia-style articles written by experts in the field, with links to relevant software and technologies. Because construction of such a guide is labor-intensive and because the resulting structure is static, the roadmap can encompass only a portion of the available information. However, we hope to use semantic indexing techniques such as LSI [Deerwester-LSI] to simplify the work by automatically inferring relationships between the roadmap and new material.

RELATED WORK

Digital Libraries

Digital libraries have been identified as a National Challenge by the Information Infrastructure Technology and Applications (IITA) component of the HPCC program. A joint initiative by NSF, ARPA, and NASA has funded six four-year research projects to develop new technologies for digital libraries (9). The goals of this initiative are to advance the techniques for collecting, storing, and organizing information in digital forms, and for searching and retrieving the information over communications networks. Each project is centered at a university and is focused on a particular area. For example, Carnegie Mellon University is developing an on-line digital video library system, while the University of California at Berkeley is developing a digital library focused on environmental information.

The NHSE is an example of a digital library that is focusing on a particular type of resource, software, in a particular subject area, high performance computing. Software-specific issues include the following:

An ARPA-funded project led by the Corporation for National Research Initiatives (CNRI) is developing the network infrastructure for a distributed digital library system (10) [Kahn-CSTR]. CNRI is addressing only the network-based aspects of the infrastructure, and not the content-based aspects, which are expected to be addressed by specialized communities. The infrastructure defines the basic components of a distributed digital library system, including digital objects, repositories, naming authorities, and properties records. Repositories provide access to digital objects and also provide value-added services such as organizing, cataloging, searching, and evaluating. Each repository is responsible for providing meta-information about its own collection of digital objects in the form of a set of properties records.

The NHSE is an example of a repository, but it is a virtual repository that provides access to resources maintained by a distributed collection of autonomously maintained physical repositories. The physical repositories store and provide access to files, and the NHSE provides the value-added services. For example, software available through the NHSE is maintained on local file servers by its contributors, but is searchable and retrievable through the NHSE software catalog. The NHSE software catalog is an example of a set of properties records.

CNRI's infrastructure includes a system for assigning globally unique names to digital objects and for using these names to retrieve objects. An object's name is called its handle. A distributed system of handle servers maps handles to the repositories that contain the objects. The NHSE developers have designed a similar system for name assignment and resolution [ssr95]. The use of LIFNs described earlier in this paper is a forerunner of the full deployment of our naming system. We are currently investigating the possibility of merging our naming system with CNRI's system.

Software Reuse

Software reuse is the process of creating software systems from existing software rather than building software systems from scratch [Krueger-survey]. Software reuse may be more broadly defined as the use of engineering knowledge or artifacts from existing systems to build new ones [Frakes-success]. Reuse of software components ranges from black-box reuse of domain-specific components to white-box reuse through modification and adaptation of existing components [Prieto-Diaz-status]. A problem domain is defined as a class of problems considered to be significant and related by members of a particular applications community [Arango-domain-analysis]. Systematic reuse requires domain engineering, which consists of the following two phases:

  1. domain analysis, the process of discovering and recording the commonalities and variabilities of the systems in a domain, and
  2. domain implementation, the use of the information uncovered in domain analysis to create reusable components and new systems.

Parallel high performance computing will realize its full potential only if it is accepted and adopted in the real world of industrial applications. Cost-effective parallel computing will require widespread reuse of parallel software and related artifacts. The NHSE will support software reuse by providing access to the following resources:

Software Repositories

A number of software repositories that provide access to reusable software components have been established over the last decade. These include the Netlib and GAMS mathematical software repositories, as well as government-sponsored reuse libraries such as Ada-IC, CARDS, DSRS, ELSA, and STARS. Information about all these repositories is available from the NHSE.

Netlib began operation in 1985 to fill a need for cost-effective, timely distribution of high-quality mathematical software to the research community [Dongarra-netlib]. Netlib is accessible through an email interface or from a World Wide Web browser such as Mosaic or Netscape (11). The number of Netlib servers has grown from the original two, at Oak Ridge National Laboratory (initially at Argonne National Laboratory) and Bell Laboratories, to servers in Norway, the United Kingdom, Germany, Australia, Japan, and Taiwan. A mirroring mechanism keeps the repository contents at the different sites consistent on a daily basis [Grosse-mirroring].

The Guide to Available Mathematical Software (GAMS) project of the National Institute of Standards and Technology (NIST) studies techniques to provide scientists and engineers with improved access to reusable computer software components available to them for use in mathematical modeling and statistical analysis. One of the products of this work is the GAMS system, an on-line cross-index and virtual repository of mathematical software (12) [Bois94a]. GAMS performs the function of an interrepository and interpackage cross-index, collecting and maintaining data about software available from external repositories and presenting it as a homogeneous whole. GAMS currently contains information on more than 9800 problem-solving software modules from about 85 packages in four physically distributed software repositories (three maintained at NIST and Netlib).

The NHSE is similar to GAMS in that both are virtual repositories, but the NHSE encompasses a much larger number of physical repositories than GAMS. GAMS indexes the contents of a handful of repositories, while the NHSE provides access to software residing at hundreds of sites. Netlib and GAMS both specialize in the fairly narrow domain of mathematical software. Although the NHSE collection includes mathematical software written for high performance machines, the coverage of the NHSE is much broader, ranging from data visualization and parallel processing tools to software for individual application areas. The NHSE uses the GAMS classification scheme to classify mathematical software, as does Netlib. A new classification will need to be devised for the general area of high performance computing, however, and portions of it will need to be refined by sub-communities and specialists in different areas.

The Reuse Library Interoperability Group (RIG) was founded in 1991 for the purpose of developing standards for interoperability between software reuse libraries. The RIG has developed and approved the Basic Interoperability Data Model (BIDM) as a minimum standard data model for interoperability, and the BIDM has been submitted for balloting as an IEEE standard [RIG-BIDM]. The NHSE is working with the RIG to develop and promote standard data models for software repositories. There is close correspondence between BIDM concepts and the digital library framework proposed by CNRI [Kahn-CSTR]. Ideally the software reuse library community and other digital library communities should work together to promote interoperability between all types of digital libraries.

CONCLUSIONS

We have described a digital library of software and related artifacts that is being developed for the HPCC community. Rather than being a single central repository, this library provides a uniform interface to a distributed collection of autonomously maintained repository sites. Although some of our concerns are specific to software repositories, much of our work will be applicable to management of other types of digital data. In particular, we have constructed a mechanism that allows individuals to contribute material from a World Wide Web browser. We have implemented an authentication mechanism that uses public key cryptography and file signatures to prevent impersonation and unauthorized changes to contributed material. The review process we have set up is similar to the peer review process used by refereed journals, and our experiences in applying this process to software will help determine whether the peer review concept generalizes to non-document and electronically available resources. The solutions we devise for providing searchable access to a large quantity of diverse information available from geographically dispersed sources will be applicable as well to other distributed digital library systems.

REFERENCES

[RIG-BIDM] Standard reuse library Basic Interoperability Data Model (BIDM). Technical Report RPS-0001, Reuse Library Interoperability Group, 1993.

[Arango-domain-analysis] G. Arango. Domain analysis methods. In W. Schafer, R. Prieto-Diaz, and M. Matsumoto, editors, Software Reusability, chapter 2, pages 17--49. Ellis Horwood, 1992.

[Bois94a] R. F. Boisvert. The architecture of an intelligent virtual mathematical software repository system. Math. & Comp. in Simul., 36:269--279, 1994.

[BoHK91] R. F. Boisvert, S. E. Howe, and D. K. Kahaner. The Guide to Available Mathematical Software problem classification system. Comm. Stat. -- Simul. Comp., 20(4):811--842, 1991.

[Bowman-Harvest] C. M. Bowman, P. B. Danzig, D. R. Hardy, U. Manber, and M. F. Schwartz. Harvest: A scalable, customizable discovery and access system. Technical Report CU-CS-732-94, Department of Computer Science, University of Colorado - Boulder, Aug. 1994.

[ssr95] S. Browne, J. Dongarra, S. Green, K. Moore, T. Pepin, T. Rowan, and R. Wade. Location-independent naming for virtual distributed software repositories. In ACM-SIGSOFT 1995 Symposium on Software Reusability, Seattle, Washington, Apr. 1995.

[Deerwester-LSI] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391--407, Sept. 1990.

[Dongarra-netlib] J. J. Dongarra and E. Grosse. Distribution of mathematical software via electronic mail. Commun. ACM, 30(5):403--407, May 1987.

[Frakes-success] W. B. Frakes and S. Isoda. Success factors of systematic reuse. IEEE Software, pages 15--19, Sept. 1994.

[Grosse-mirroring] E. Grosse. Repository mirroring. ACM Trans. Math. Softw., 21(1), Mar. 1995.

[Kahn-CSTR] R. Kahn and R. Wilensky. Accessing digital library services and objects: A frame of reference, draft 4.4 for discussion purposes. Available on-line at http://www.cnri.reston.va.us/home/cstr/arch.html, Feb. 1995.

[Krueger-survey] C. W. Krueger. Software reuse. ACM Computing Surveys, 24(2):131--183, June 1992.

[Prieto-Diaz-status] R. Prieto-Diaz. Status report: Software reusability. IEEE Software, pages 61--66, May 1993.

[Rivest-MD5] R. Rivest. The MD5 message-digest algorithm. Internet Request for Comments, 1321, Apr. 1992.

[Zimmerman-PGP] P. Zimmermann. PGP user's guide. PGP Version 2.6.2, Oct. 1994.