
63rd IFLA General Conference - Conference Programme and Proceedings - August 31- September 5, 1997

Cataloging in SGML : from tagging to markup

Catherine Lupovici
Jouve Systèmes d'Information
clupovici@jouve.fr
http://www.jouve.fr
http://www.jouve.com


PAPER

Introduction

SGML (Standard Generalized Markup Language) is a ten-year-old ISO International Standard (1) that originated in the publishing community. Its primary objective was to provide a standard language for marking up the logical structure of textual documents independently of the software and hardware used, in order to facilitate the exchange of documents along the publishing process chain. It was then used to produce several different output products from a single database.

SGML is generic: it defines a logical data structure and marks up document instances that follow that structure. The language can be applied to any document type, such as a book, a journal article, aircraft technical documentation, a dictionary and, of course, bibliographic records.

SGML documents are coded in a standard, platform-independent format, guaranteeing the permanence and reusability of the information over a very long time, which is very important in the library environment.

SGML is currently used for data creation, data exchange, data storage, indexing, searching and retrieval, printing and data viewing. Commercial, professional-grade tools are available covering the whole range of possible applications.

SGML is being implemented in large corporations and publishing companies for data production and storage, and sometimes for data delivery. It is also being considered by library projects building digital document programs as a generic format that can cover both the logical structure of the electronic textual document and the bibliographic information associated with that document.

The Text Encoding Initiative is an international application of SGML in the humanities and the language industries. It allows researchers to extend the format for their own use, using the SGML language and coding to mark up the primary textual document.

As SGML is often cited as a potential competitor or successor to the MARC formats, it is important to understand how it works, the different levels at which it can be used for bibliographic information, and the corresponding interest for libraries that are rethinking the cataloging process in light of the functional requirements of bibliographic information in the electronic environment.

SGML

General principles

If we consider SGML as a logical data format to be compared with the MARC format we have had for years, the main features are as follows :

Character coding

In the declaration of the document type you specify the basic character sets you will use in the document. Other characters can be added as external character entities with a specific notation written in basic ASCII. Such a notation can easily be keyed with basic input capabilities and transferred in data exchange, including over the networks we have today. The entities just have to be properly interpreted when printing, displaying or transcoding in specific applications.

Example :

&eacute;   for the letter e with an acute accent is displayed   é
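
The interpretation step can be sketched in Python. This is only an illustration under stated assumptions: it assumes the entity names are those of the ISO Latin 1 set, which Python's standard `html.entities` table also covers, and the function name `resolve_entities` is hypothetical.

```python
import re
from html.entities import name2codepoint

# Matches a named entity reference such as &eacute;
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def resolve_entities(text, extra=None):
    """Replace named entity references with the characters they stand for.

    `extra` maps document-specific entity names (hypothetical parameter)
    to their replacement text. Unknown references are left untouched.
    """
    extra = extra or {}

    def replace(match):
        name = match.group(1)
        if name in extra:
            return extra[name]
        if name in name2codepoint:
            return chr(name2codepoint[name])
        return match.group(0)

    return ENTITY_REF.sub(replace, text)


print(resolve_entities("Jouve Syst&egrave;mes d'Information"))
```

The text stays plain ASCII while it is keyed and exchanged; only the application that prints or displays it needs the table that maps names to characters.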

The SGML document and the marking up process

The SGML document is basically a text with start and end tags marking up the logical elements, together with attributes and references to entities, which can be textual information, images or video. An article coded in SGML looks like this :

	 <!DOCTYPE ARTICLE PUBLIC "-//USA/AAP//DTD ART-1//EN" "article.dtd"
[
 <!ENTITY Darc CDATA "SoftQuad Explorer(tm)">
 <!ENTITY nbsp CDATA " ">
]>
 <ARTICLE> <FM> <TIG> <ATL>Flexible Management of SGML-encoded Documents  </ATL>
 <SBT>Design Principles in &Darc; </SBT> </TIG>
 <AU> <FNM>Donald </FNM> <SNM>Broady </SNM> <DEG>Ph.D. </DEG>
 <AFF> <ONM>University of Stockholm </ONM>
 <ODV>Project manager </ODV>
 <EAD>broady@nada.kth.se </EAD> </AFF> </AU>
 <AU> <FNM>Hasse </FNM> <SNM>Haitto </SNM> <DEG>M.Sc. </DEG>
 <AFF> <ONM>Royal Institute of Technology </ONM>
 <ODV>Project coordinator </ODV>
 <EAD>haitto@nada.kth.se </EAD> </AFF>
 </AU>
 <ABS> <P>&Darc; is a multi-user, cross-platform (PC/Windows 3.1 and Sun SPARC/X11) database and information retrieval application designed primarily for documents marked-up with SGML . Among its features is a  full-text document browser, in which markup-based  hypertext linking is complemented by interactive, on-line linking and annotation facilities through concurrent webs. Cooperative work is supported through a novel hierarchical user group mechanism </P> </ABS> </FM>
 <BDY> <SEC> <ST>Keywords </ST>
 <L1> <LI> <P>SGML </P> </LI>
 <LI>
 <P>Hypertext </P> </LI>
 <LI>
 <P>Databases </P> </LI>
 <LI>
 <P>Information Retrieval </P> </LI> </L1>

 </SEC>

You can key in this information, or insert the tags into an existing ASCII file, with a regular text processing system or with an SGML tool offering WYSIWYG display, interactive structure control and contextual help listing the tags allowed at a given place in the structure.
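
The contextual help mentioned above can be sketched in Python. The table below is an assumption: it only records the nesting observed in the sample article and is not taken from the actual DTD, which may allow more. The names `OBSERVED_CHILDREN` and `allowed_tags` are hypothetical.

```python
# Parent element -> child elements observed in the sample article.
# Assumption: inferred from the sample only, not from the real DTD.
OBSERVED_CHILDREN = {
    "ARTICLE": ["FM", "BDY"],
    "FM": ["TIG", "AU", "ABS"],
    "TIG": ["ATL", "SBT"],
    "AU": ["FNM", "SNM", "DEG", "AFF"],
    "AFF": ["ONM", "ODV", "EAD"],
}


def allowed_tags(open_elements):
    """Return the tags an editor could offer inside the innermost open element.

    `open_elements` is the list of currently open elements, outermost first.
    """
    if not open_elements:
        return ["ARTICLE"]
    return OBSERVED_CHILDREN.get(open_elements[-1], [])


# With the cursor inside <ARTICLE><FM><AU>, offer the author sub-elements.
print(allowed_tags(["ARTICLE", "FM", "AU"]))
```

A real SGML editor derives this information from the content models declared in the DTD rather than from a hand-written table.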


The SGML tools also let you view the SGML document with the tags hidden, using a style sheet for the layout. Of course, different style sheets can produce different layouts for the same SGML document.
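
A minimal Python sketch of this idea follows, assuming a style sheet can be reduced to an ordered list of elements with the text to print before and after each one. The two style sheets and the function `render` are hypothetical. The fragment is taken from the sample article.

```python
import re

# One author element from the sample article.
FRAGMENT = "<AU> <FNM>Donald </FNM> <SNM>Broady </SNM> <DEG>Ph.D. </DEG> </AU>"

# Hypothetical style sheets: (element name, text before, text after),
# listed in the order the elements should appear in the layout.
BYLINE_STYLE = [("FNM", "", " "), ("SNM", "", ""), ("DEG", ", ", "")]
INDEX_STYLE = [("SNM", "", ", "), ("FNM", "", ""), ("DEG", " (", ")")]

# Matches elements that contain text only, such as <FNM>Donald </FNM>.
LEAF_ELEMENT = re.compile(r"<([A-Z0-9]+)>([^<]*)</\1>")


def render(fragment, style):
    """Lay out the text-only elements of `fragment` according to `style`."""
    content = {name: text.strip() for name, text in LEAF_ELEMENT.findall(fragment)}
    return "".join(
        before + content[name] + after
        for name, before, after in style
        if name in content
    )


print(render(FRAGMENT, BYLINE_STYLE))  # Donald Broady, Ph.D.
print(render(FRAGMENT, INDEX_STYLE))   # Broady, Donald (Ph.D.)
```

The marked-up source is the same in both calls; only the style sheet changes.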

Rationale for using SGML to handle bibliographic data

From the outset the library community has considered SGML capable of handling bibliographic information at different levels. The Electronic Manuscript Project, initiated in 1984 in the United States, already intended to consider the document and the bibliographic information at the same time, with possible application to the CIP (Cataloguing In Publication) process.

Currently SGML is regarded either as an exchange format or as a creation and handling format, depending on how the cataloguing process and its objectives are seen, and also on the type of document.

Exchange format

Exchange of bibliographic records

SGML is obviously a standard exchange format for any structured data and can be applied to the exchange of bibliographic information. It is possible to write an ISO 2709/MARC DTD reflecting the structure of an ISO 2709 record associated with a specific MARC format and specific character sets.

Several USMARC DTDs are already available. The most detailed one was produced by the Library of Congress, with an alpha test version available at the web site of the Library of Congress Network Development and MARC Standards Office. There is one DTD for bibliographic data and one for authority data. The objective of this project is to create a standard SGML DTD to support the conversion of cataloging data from the ISO 2709/USMARC data structure to SGML (and back) without loss of data. The project also includes the development of software utilities capable of converting between the two encoding standards. The bibliographic DTD describes the record structure down to the subfield level.
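
The principle of a conversion to SGML and back without loss can be sketched in Python. This is an illustration only: the element names `rec`, `fld` and `sf`, the in-memory record model and the function names are all hypothetical and are not those of the Library of Congress DTD or its utilities. Control fields and the record leader are left out for brevity.

```python
import re

# A record as a list of (tag, indicator 1, indicator 2, subfields),
# where subfields is a list of (code, value) pairs.
RECORD = [
    ("100", "1", " ", [("a", "Lupovici, Catherine.")]),
    ("245", "1", "0", [("a", "Cataloging in SGML :"),
                       ("b", "from tagging to markup /"),
                       ("c", "Catherine Lupovici.")]),
]


def escape(value):
    # Protect the two characters that would be read as markup.
    return value.replace("&", "&amp;").replace("<", "&lt;")


def unescape(value):
    return value.replace("&lt;", "<").replace("&amp;", "&")


def marc_to_sgml(fields):
    """Serialize the record model with hypothetical element names."""
    out = ["<rec>"]
    for tag, ind1, ind2, subfields in fields:
        out.append(f'<fld tag="{tag}" i1="{ind1}" i2="{ind2}">')
        for code, value in subfields:
            out.append(f'<sf code="{code}">{escape(value)}</sf>')
        out.append("</fld>")
    out.append("</rec>")
    return "\n".join(out)


FIELD = re.compile(r'<fld tag="(\d{3})" i1="(.)" i2="(.)">(.*?)</fld>', re.S)
SUBFIELD = re.compile(r'<sf code="(.)">(.*?)</sf>', re.S)


def sgml_to_marc(text):
    """Rebuild the record model from the markup produced above."""
    fields = []
    for tag, ind1, ind2, body in FIELD.findall(text):
        subfields = [(code, unescape(value))
                     for code, value in SUBFIELD.findall(body)]
        fields.append((tag, ind1, ind2, subfields))
    return fields


print(marc_to_sgml(RECORD))
```

The round trip is lossless for this model because tags, indicators, subfield codes and subfield order are all kept explicitly in the markup, so `sgml_to_marc(marc_to_sgml(RECORD))` gives back `RECORD`.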

The U.C. Berkeley University Library also offers a less sophisticated ISO 2709/USMARC DTD, together with conversion tools. This DTD is designed for use in an on-line catalog employing SGML as its underlying record format on the U.C. Berkeley campus.

Bibliographic link with the publishers

Large publishers, mainly in the STM (Science, Technology and Medicine) area, are moving their production chains towards SGML. This allows them to create several products from a single input.
For instance, Elsevier Science offers libraries and databases the bibliographic records of articles coded in SGML through the CAP-CAS electronic service.
One could ask for the same information for books, using the SGML format.

Creation format for cataloging the electronic document

You can catalogue with an SGML tool using a standard MARC DTD, but there is no real benefit in doing so if the cataloging function remains exactly the same, unless you plan to replace your catalogue and OPAC with an SGML system.

I think that the main interest of using SGML rather than the traditional format is :

In the second case, all the descriptive information of the textual document, which is copied directly from the source document, can be used in place in the document itself if it is properly marked up by the publisher or during the cataloguing process.

Several approaches are under development. One is the Text Encoding Initiative (TEI), where a header is added to the document itself in order to carry the bibliographic information. Another is the set of metadata initiatives that grew out of the needs of HTML on the Web.

But they all result from the same analysis : there is a need to carry the descriptive data and the organizing data (access points) along with the electronic document.
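
As an illustration of descriptive data travelling with the document, here is a Python sketch that reads the title and author from a TEI-style header. The sample document is abridged and is not a valid TEI instance (a complete header requires further elements). The function name `header_description` is hypothetical, and the pattern matching is a simplification of what a real SGML parser would do.

```python
import re

# Abridged, illustrative document: header first, then the text itself.
DOCUMENT = """<TEI.2>
<teiHeader>
<fileDesc>
<titleStmt>
<title>Cataloging in SGML : from tagging to markup</title>
<author>Catherine Lupovici</author>
</titleStmt>
</fileDesc>
</teiHeader>
<text><body><p>Text of the paper.</p></body></text>
</TEI.2>"""


def header_description(document):
    """Return the title and authors found in the header, or None if absent."""
    header = re.search(r"<teiHeader>(.*?)</teiHeader>", document, re.S)
    if header is None:
        return None
    block = header.group(1)
    title = re.search(r"<title>(.*?)</title>", block, re.S)
    authors = re.findall(r"<author>(.*?)</author>", block, re.S)
    return {
        "title": title.group(1).strip() if title else None,
        "authors": [author.strip() for author in authors],
    }


print(header_description(DOCUMENT))
```

A catalogue or index can be built from such headers without a separate bibliographic record, because the description is part of the document file.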

Conclusion

To conclude on the current situation, cataloguing in MARC format means creating a bibliographic record with MARC tags, as a translation into that format of an ISBD-like card or an entry in a printed catalogue. Cataloguing in SGML can simply mean doing the same thing with SGML MARC tags using an SGML tool, but that brings no real benefit. It can also be a new approach that considers the full electronic document, to which an SGML editing process is applied in order to qualify the descriptive information and to add organizing information that facilitates access to the document.

References

Gaynor, Edward, 1996. From MARC to Markup : SGML and Online Library Systems.
http://www.lib.virginia.edu/speccol/scdc/articles/alcts_brief.html

ftp://library.berkeley.edu/pub/sgml/marcdtd

ftp://ftp.loc.gov/pub/marcdtd

Footnotes

  1. ISO 8879 : Standard Generalized Markup Language, 1986