60th IFLA General Conference - Conference Proceedings - August 21-27, 1994

The Retrospective Conversion of the Handbook of Latin American Studies, Volumes 1-49

Sue Mundell
Assistant Editor
Handbook of Latin American Studies
Hispanic Division, Library of Congress


ABSTRACT

The Handbook of Latin American Studies, edited by the Hispanic Division of the Library of Congress and published by the University of Texas Press, is the bibliography most widely consulted by Latin Americanists throughout the world. The Handbook's current data resides in MARC format in one of the Library of Congress' MUMS files and includes annotated records for books, serial articles, book chapters, and conference proceedings. The Handbook staff is currently working toward retrospective conversion to electronic format of volumes 1-49 of the Handbook, a project that would entail digitizing about 215,000 annotated citations and the author and subject indexes to the past 49 volumes. This paper discusses the Handbook staff's decision-making process for this project. The ideal solution would be to convert the retrospective volumes directly to MARC format and merge the ensuing records with the current Handbook file. However, because of fiscal and time constraints, we have decided to approach this project in a more modest way, creating a simple ASCII database on CD-ROM. As funds become available, we hope to merge the current database with the retrospective one and provide Internet access to all the volumes.


PAPER

I. Introduction to the Handbook content

The print edition of the Handbook of Latin American Studies is an annual annotated bibliography of about 5,000 items of scholarly interest, alternating yearly between the social sciences and the humanities. Published continuously since 1936, the Handbook is considered to be the most important bibliographic reference work for Latin American studies.1 It has been edited for more than 50 years by the Hispanic Division of the Library of Congress. Under the editorship of Dolores Moyano Martin since 1977, the Handbook is presently published by the University of Texas Press.2 Today over 130 scholars from the United States and several foreign countries evaluate material selected by the Handbook staff or gathered from other sources around the world. About 60% of Handbook entries refer to monographs; the remainder are citations to articles selected by the Handbook staff from over 1,600 serials, as well as to chapters from books, papers from conference proceedings, etc.

II. Automated Technologies for Producing the Handbook

With the publication of volume 50 in 1991, the Handbook's editorial process became fully automated. The Handbook's four staff members now work on the Library of Congress' mainframe, using the record input application, called MUMS, to create and edit a separate MARC record for each bibliographic entry in the Handbook. At the end of the yearly production cycle, five additional files, produced using WordPerfect macros, are uploaded to the mainframe and merged with the online data to produce proofs and a computer tape containing the data for that year's print volume. This form of the data contains computer-generated document markup codes conforming to a Document Type Definition (DTD) that the Library defined following the Standard Generalized Markup Language (SGML) standard.

III. Current Electronic Products of the Handbook

From its inception in 1988, the Handbook file has been accessible from any Library of Congress terminal. Since April 1993 this online working file has also been available over the Internet.3 This file now contains about 49,000 bibliographic records for the Handbook, some 25,000 of them annotated. The Handbook file is growing by about 11,000 records annually; about 5,000 of these are eventually annotated by the Handbook's contributors. For those preferring "cleaner" data than our working file permits, the verified MARC records for each published volume of the Handbook are made available by the Library of Congress' MARC Distribution Service.4 This tape includes the 5,000 annotated entries which appear in the print edition of the Handbook, as well as any additional bibliographic citations for works not selected for annotation. Since the Handbook records have been input in MARC format, they are suitable for mounting on an online utility,5 a library's online public access catalog,6 or for producing other electronic products, such as a CD-ROM.7

IV. The Handbook's Retrospective Project

The next major goal for the Handbook is to provide machine-readable access to volumes 1 through 49 for cumulative searching of some 215,000 annotated bibliographic citations. This project is complicated somewhat by the fact that over the years the Handbook has been published by three different university presses: Harvard University Press (v. 1-13); University Presses of Florida (v. 14-40); and the University of Texas Press (v. 41 to date). In addition to taking into account the incremental changes in the Handbook's bibliographic and editorial style that have evolved over the years, the retrospective conversion project must also be able to deal with the distinct formats and typefaces used by these three publishers of the Handbook. For this reason, we are approaching the project in three stages. For the University of Texas volumes (v. 41-49), the original typesetting tapes still exist; we hope to use them to extract the Handbook data for electronic publication. The earlier volumes (v. 1-40) will require either optical scanning or some other means of retrospective conversion. If optical scanning is used, the optical character recognition (OCR) software will have to be fine-tuned for both presses, since the printed volumes published by University of Florida and Harvard vary greatly in format, layout, typeface, etc. In addition, before each phase of the project begins, the Handbook needs clearance from the corresponding publisher. As of March 1994 only the University of Florida has agreed to waive its rights to this scanning project, although discussions with representatives from University of Texas Press and Harvard University Press indicate that permission will soon be worked out.

We have investigated three different formats for the Handbook's retrospective database: 1) a MARC formatted file, either keyed or scanned; 2) scanned image files, with indexes based on the binary representation of the images' ASCII text; or 3) scanned ASCII textual files.

Because the Handbook's current database is in MARC format, ideally the retrospective data would also be in MARC. This could be accomplished by contracting with traditional retrospective conversion services such as OCLC, Retro-Link, or Library Systems and Services, which would copy records already contained in their conversion databases, create new records for any items not found, and then input the Handbook's annotations and subject index terms into each record. This approach seems somewhat tedious, quite costly, and very time-consuming, especially for non-monographic records, so it has not been pursued at this time. A second, more innovative, approach to MARC conversion has been developed by Verba Logica, a software development team at the Universidad Complutense de Madrid. Verba Logica uses artificial intelligence to scan, perform optical character recognition, and insert correct MARC tags, fields, and subfields into the scanned text. Their program uses contextual information to accomplish this; it has been used successfully for conversion of several university card catalogs in Spain. Although this option is somewhat less expensive than using a traditional MARC conversion service, it is still perceived as too costly.

Another option we have considered is using high-tech software to eliminate the need to correct scanning errors that might otherwise affect the accuracy of the search results. For example, one of the software tools we have examined is Excalibur Technologies' PixTex/EFS software, which maintains both an image and ASCII text for each page scanned. Excalibur builds its indexes on the binary representation of the ASCII text, not on the ASCII text itself, and the searches examine the binary files rather than the ASCII files. By assigning probabilities as to whether binary patterns match the search criteria, the PixTex software is able to retrieve words even if there are OCR scanning errors in them. Once an item is retrieved, the user has the option to read from the original image file, thus avoiding seeing OCR scanning errors in the ASCII file. The advantage of this type of retrospective conversion process is that it eliminates the need for costly and time-consuming correction of OCR errors, since the user always has the option to consult the scanned image directly.8 Although several years ago this seemed like the Handbook's best solution, the recent dramatic improvements in the accuracy of scanners and OCR software have made it a less attractive alternative. For one thing, it seems unwise to maintain cumbersome, large image files as the primary text, with "dirty" OCR'd text as the finding aid, since this requires tremendous amounts of disk space, more expensive hardware, etc. For another, it seems inappropriate to create a database that would be effective only if the user has access to the PixTex/EFS software, an extremely expensive item that would be likely to increase mediated searches rather than improve direct end-user access, another goal of the Handbook staff.
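The general idea of retrieval that tolerates OCR errors can be illustrated with a minimal Python sketch. It is not Excalibur's method: PixTex/EFS is described above as matching binary patterns, whereas this sketch substitutes simple character-level similarity from Python's standard difflib module. The function name fuzzy_find, the 0.8 threshold, and the sample text are all hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_find(query, ocr_text, threshold=0.8):
    """Return words in ocr_text whose similarity to query meets the threshold.

    Illustrative stand-in only: uses character-level similarity, not the
    binary pattern matching attributed to PixTex/EFS in the text.
    """
    matches = []
    for word in ocr_text.split():
        # Strip surrounding punctuation and compare case-insensitively.
        candidate = word.strip(".,;:()\"'").lower()
        if not candidate:
            continue
        score = SequenceMatcher(None, query.lower(), candidate).ratio()
        if score >= threshold:
            matches.append(candidate)
    return matches


# Hypothetical OCR output in which "Mexico" was misread as "Mexic0".
sample = "Colonial history of Mexic0 and Peru."
print(fuzzy_find("mexico", sample))
```

A search of this kind finds the misread word without anyone correcting the text first, but the threshold is a trade-off: a lower value tolerates more OCR damage and also returns more false matches.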

Because of the relatively high costs associated with the above alternatives, we have decided to create a simple ASCII database for the retrospective records by scanning, OCR'ing, and correcting OCR errors, paying particular attention to the Handbook's author and subject indexes and author/title information for each citation. Although the resulting database would not be immediately compatible with our current one, once funds are available we could either convert our current MARC database to ASCII format, or use artificial intelligence to convert the retrospective ASCII data into MARC format, thereby merging the retrospective and current files into one single database.

Several issues have arisen during the planning phase for this project. One of the most persistent concerns how to treat the serial abbreviations found in analytic citations. Should they be expanded automatically to full titles during the conversion process, or should we somehow link each analytic record to the corresponding volume's key to abbreviations? Since these abbreviations have changed over the years, and indeed non-unique abbreviations may exist, we have decided simply to use automated technologies to replace the serial abbreviations found in each volume with their corresponding full serial titles. This should eliminate any confusion over which serial is being cited.
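The replacement of serial abbreviations volume by volume might be sketched as follows. This is a minimal Python illustration under stated assumptions, not the project's actual procedure: the abbreviation keys, the serial titles, the citation layout, and the function name expand_serial_abbreviations are all hypothetical. The same abbreviation is deliberately given a different title in each volume to show why the lookup must use that volume's own key.

```python
import re

# Hypothetical keys to abbreviations, one per volume. The abbreviation "ABC"
# stands for a different (invented) serial in each volume.
ABBREVIATION_KEYS = {
    20: {"ABC": "Example Review A"},
    45: {"ABC": "Example Bulletin B", "XYZ": "Example Journal C"},
}


def expand_serial_abbreviations(citation, volume):
    """Replace serial abbreviations with full titles from that volume's key."""
    titles = ABBREVIATION_KEYS.get(volume, {})
    if not titles:
        return citation
    # Try longer abbreviations first and match whole words only, so that a
    # short abbreviation is not replaced inside a longer word.
    ordered = sorted(titles, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(a) for a in ordered) + r")\b")
    return pattern.sub(lambda m: titles[m.group(1)], citation)


# Hypothetical analytic citation as it might appear after OCR correction.
citation = "Smith, J. Land reform. (ABC, 12:3, 1950, p. 1-20)"
print(expand_serial_abbreviations(citation, 20))
print(expand_serial_abbreviations(citation, 45))
```

Because each record carries its full serial title after conversion, it no longer depends on the printed key to abbreviations of its volume.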

Another problem is how to deal with each volume's author index and subject index. Should we integrate the actual data into each record, simply discard this data, or maintain the data elsewhere with automatic links back and forth to and from the record referred to? Since we view the Handbook's indexes as an integral part of the volume, discarding this data has never been seriously considered. And, because MARC conversion is not feasible at present, there is little point in actually integrating the indexes with the bibliographic citations and annotations. Instead, we hope to save time and expense by simply linking each index entry with its corresponding annotated citations. In the cases of volumes 1-15 (which do not include a subject index) and volumes 16-17 (which include rudimentary subject terms in the author index), we have decided to include macro-level indexing based on the volume's chronological order (e.g., History-Mexico). In this way, we will maintain some degree of standardized subject access to all records.
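One way to picture the linking of index entries to citations is the minimal Python sketch below. It assumes, purely for illustration, that each citation can be identified by a (volume, item number) pair and that an index entry is a heading plus such a pair; the function names build_index_links and lookup and all sample data are hypothetical.

```python
from collections import defaultdict


def build_index_links(index_entries):
    """Map each index heading to a sorted list of (volume, item) targets."""
    links = defaultdict(set)
    for heading, volume, item_number in index_entries:
        links[heading].add((volume, item_number))
    return {heading: sorted(targets) for heading, targets in links.items()}


def lookup(links, citations, heading):
    """Return the citation text for every item linked from an index heading."""
    return [citations[t] for t in links.get(heading, []) if t in citations]


# Hypothetical citations keyed by (volume, item number).
citations = {
    (12, 1001): "Example citation from an early volume.",
    (45, 2002): "Example citation from a later volume.",
}

# Hypothetical index entries. The early volume has no subject index, so it
# receives only a macro-level heading such as "History-Mexico".
index_entries = [
    ("History-Mexico", 12, 1001),
    ("History-Mexico", 45, 2002),
    ("Land reform", 45, 2002),
]

links = build_index_links(index_entries)
print(lookup(links, citations, "History-Mexico"))
```

Keeping the links separate from the citation text leaves the indexes intact as published while still allowing a cumulative search across volumes.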

As part of the project, we have examined several possible media for the electronic publication of this data. Because of the quantity of the data (about 500 megabytes) and our interest in keeping the project as simple and economical as possible, we have decided to first publish the retrospective data on CD-ROM. However, worldwide distribution via Internet is also of great interest, and could be accomplished in several ways. Once the retrospective ASCII database is available, we hope to mount it on a shared file-server at the Library of Congress (or elsewhere). Later, if and when the database is converted to an acceptable MARC format, it would then be technologically possible to batch load the retrospective records into our current Handbook file, a shared file which is presently available over the Internet. Furthermore, if the data is converted to MARC format, we would also be interested in publishing the retrospective records as part of the Library of Congress' MARC Distribution Service, hopefully merged with the current Handbook records.

Obviously the media we select will determine to some extent our options as to search software. For the CD-ROM publication, we plan to use hypertext linking to connect the bibliographic citations with their corresponding author/subject index entries. At this date, we have not yet determined whether we will use an off-the-shelf commercial software package or have someone write new search software specifically for this project. Since the additional cost per CD-ROM for software rights adds significantly to the cost of the final product, we must analyze this option closely before making our decision. As for Internet access, searching unfielded ASCII text may pose some problems for users. In its present incarnation, GOPHER would be very cumbersome for cumulative searching. And, although WAIS provides full-text searching capabilities, it would currently be unable to smoothly link the indexes to the actual bibliographic citations. Thus, the data from the CD-ROM product would not fit well in the Internet environment. This problem could be eliminated if we were to contract with WAIS to add hypertext linking capabilities. Finally, if the data were to be converted to MARC format and batch loaded into our present file, the current MUMS search software would be adequate. One major drawback to relying upon current MUMS search software, however, is that it does not permit online searches of the annotation field (tag 520). For the earlier volumes which contain little or no subject indexing, this would reduce the researchers' ability to conduct subject searches.

One final, and perhaps deciding, factor in the planning process for this project involves its funding. Given the severe budget constraints forecast for the Library of Congress and the United States federal government for the next several years, we assume that the Handbook's retrospective conversion project will not be funded by government appropriation. However, there is widespread support for this project among professional librarians, academics, and scholarly organizations alike. For instance, in June 1991, at the Seminar on the Acquisition of Latin American Library Materials (SALALM), the Latin American Database Interest Group urged the Editor of the Handbook and LC management to actively pursue retrospective conversion. In September 1992 and March 1994, the Task Force on Scholarly Resources, a joint task force of SALALM and the Latin American Studies Association (LASA), also expressed strong support for such a project. In November 1992, the Handbook's Advisory Board moved that the Handbook staff begin studying the project immediately, with the expectation that such a project be completed by January 1995.

An international organization has also expressed interest in collaborating with the Handbook on this project. In July 1993 this organization supported a trip to Madrid by Handbook personnel to discuss how retrospective conversion might be accomplished cooperatively. This organization has generously offered to underwrite much of the cost of the project, provided that matching support can be found. We are currently attempting to secure this additional funding.

V. Conclusions

The Handbook staff is committed to making the Handbook's retrospective data available electronically as soon as practical in order to facilitate access to this vast wealth of historical information. As envisioned today, there will be at least three separate projects: 1) to scan, perform optical character recognition, and clean up OCR'd text to produce a simple ASCII database on CD-ROM; 2) to provide Internet access to the retrospective data; and 3) to merge the retrospective database with the current one for cumulative searching of the entire Handbook.9

The first project -- to produce a simple ASCII database on CD-ROM with hypertext links to the indexes for each volume -- will begin as soon as we receive matching funds. We plan to begin with the volumes published by the University of Texas Press (v. 41-49) and work our way backwards, assuming that the more recent volumes are of greater interest to scholars today. We will take care during this first project to permit possible later conversion of the ASCII text to MARC format.

As to the second phase, we plan to provide Internet access to the retrospective data as soon as possible. Once that is accomplished, we anticipate working toward the final goal of merging the two databases harmoniously - whether by converting the retrospective data to MARC or converting the current database to ASCII format.

We are dividing this project into multiple stages in order to make the retrospective data available relatively quickly, rather than waiting until full funding is available for MARC conversion.10 Because we feel it is imperative to provide cumulative electronic access immediately, we have selected this more modest approach. We hope this decision will provide researchers with electronic access to volumes 1-49 of the Handbook sometime in 1995.

NOTES

1. For instance, one Latin Americanist commented that "the Handbook of Latin American Studies is, without a doubt, the basic tool without which this vibrant and multifaceted interdisciplinary field would be entirely different or even unthinkable." (See "Latin American Databases: Technological Resources for Scholars in the Nineties," by Nicolás Hernández, Jr., a paper delivered at the CALICO Conference held at the College of William and Mary, March 11-13, 1993).

2. To purchase the print edition of the Handbook, contact University of Texas Press, P.O. Box 7819, Austin, TX 78713, USA. Telephone (512) 471-4032; Fax (512) 320-0668.

3. To access the Handbook's file, telnet to marvel.loc.gov (login=marvel) and choose "Library of Congress Online Systems" (#4) and then choose "Connect to LOCIS" (#5). Once in LOCIS, choose the LC Catalog (#1) and then choose books cataloged since 1968 (#1 again). For documentation, from the main menu choose "LC Online Systems," choose "Quick search guides to LOCIS" and then select the Handbook guide. If using gopher, point to marvel.loc.gov on port 2070, and the main menu will appear.

4. For information on subscribing to the MARC Distribution Service for the Handbook, please write to: Library of Congress, Cataloging Distribution Service, Customer Services Unit, Washington, DC 20541-5017, USA. Or call (202) 707-6100 or FAX (202) 707-1334.

5. In December 1993, the Research Libraries Group purchased the Handbook tape to add to its EUREKA! service.

6. The Princeton University Library has purchased the Handbook tape and mounted it on their mainframe using SPIRES software.

7. The National Information Services Corporation (NISC) has combined the data from the Handbook, the Hispanic American Periodicals Index (HAPI) and the computerized catalog of the University of Texas' Benson Latin American Collection on a CD-ROM under the NISC-DISC title Latin American Studies, Volume 1. For further information, contact NISC, Suite 6, Wyman Towers, 3100 St. Paul Street, Baltimore, MD 21218 USA.

8. For a good analysis of PixTex/EFS, see "Excalibur's PixTex: A Retrieval Alternative," in Seybold Report on Publishing Systems, Vol. 20, No. 13, March 25, 1991.

9. I should point out that this paper is being written in March 1994, five months prior to its presentation in August. There will doubtless be many changes once the project is formally launched. My oral presentation will include an update on the status of the project.

10. Estimates for MARC conversion we have received thus far range from 11 to 21 times higher than the matching funds required to assist in converting volumes 1-49 to a simple ASCII database.