65th IFLA Council and General Conference
Script | Character Category | USMARC/UNIMARC | JIS X 0208 7 | Unicode Standard Version 3.0 |
---|---|---|---|---|
Cyrillic | Letters | 102 | 66 | 237 |
Latin | Additional unaccented letters | 21 | 0 | 163 |
Arabic | Letters | 124 | none | 141 |
East Asian ideographs | Ideographs | 13,469 (86% of EACC 8) | 6,353 | 27,484 |
But please don't assume that the Unicode Standard and ISO/IEC 10646 will do everything for transcription:
This is not to say that you should reject these standards - I just want you to understand the reality.
The good news is that, with the addition of Sinhala, Ethiopic and Mongolian, all the major scripts of the world are now encoded. Version 3.0 of the Unicode Standard is to be published later this year, and the second edition of ISO/IEC 10646 is scheduled for next year.
Growth of the repertoire has not ended: various scripts for minority languages are still outstanding, more symbols could be added, and significant extinct scripts such as hieroglyphics and cuneiform are pending. (There may not be many libraries which collect and catalogue papyri and clay tablets, but the extinct scripts are significant for scholarship in general and certain museums in particular.)
A single font for even the current Unicode character repertoire would be very large, and it's more practical to have fonts only for the scripts your library has in its collections. What is more likely to occur as you catalogue is not lack of a script, but lack of a particular character, e.g., if the title of a work on mathematics includes a symbol that isn't in the Mathematical Operators block. So occasionally you can't transcribe 100% of what is on the source of information.
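Whether a given character is even defined in the repertoire can be checked mechanically. As a rough illustration (the function name and sample strings are my own), Python's unicodedata module, which exposes the Unicode Character Database, raises an error for code points that the installed Unicode version leaves unassigned:

```python
import unicodedata

def unassigned_chars(text):
    """Return the characters in `text` that have no name in the
    Unicode Character Database, i.e. code points that the local
    Unicode version does not define."""
    missing = []
    for ch in text:
        try:
            unicodedata.name(ch)
        except ValueError:  # unassigned (or unnamed) code point
            missing.append(ch)
    return missing

# U+222B INTEGRAL is in the Mathematical Operators block and is
# assigned; U+0378 lies in an unassigned area of the Greek block.
print(unassigned_chars("y = \u222bf(x)dx"))   # []
print(unassigned_chars("title with \u0378"))  # ['\u0378']
```

A cataloguing system could run such a check on transcribed data to flag the occasional character that cannot yet be encoded directly.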
But, you protest, I thought the Universal Character Set would have everything that I could possibly need! The response is no, for various reasons.
Two Unicode design principles are particularly significant in determining what should be encoded as a character: Characters, not glyphs and Unification across languages. In addition, the Unified Repertoire and Ordering of Han ideographs ("Unified Han"), developed by the Ideographic Rapporteur Group, has rules which determine uniqueness for ideographs.
Characters, not glyphs means that some high-level typographical aspects are not significant when it comes to determining the character repertoire. Examples of typographical aspects are:
Unification across languages means that:
These design principles and rules determine what is to be uniquely encoded. And as a result, not everything that appears on a source of information is eligible to be directly encoded as a defined character. This limitation on what can be encoded directly as defined characters is not a failure of the Unicode Standard. It comes about because of a different and more sophisticated vision of what should be encoded in a character set.
The original approach to the representation of text in machine-readable form was to give a unique code to each discrete mark on paper, although there was unification for generally accepted cases (the lower case forms of Latin letters a and g, for example). Character sets for East Asian languages assigned individual codes to different ways of writing what is fundamentally the same ideograph. Library character sets generally exhibit this "encode what you see" approach too, except for the use of non-spacing marks to encode accented Latin letters, where a letter with a diacritical mark is encoded as two characters. (Critics would say the letter is "broken apart.")
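The "letter broken apart" encoding is directly visible in Unicode, which supports both the precomposed letter and the base-letter-plus-combining-mark sequence, and declares the two canonically equivalent. A minimal Python illustration:

```python
import unicodedata

# "é" can be encoded two ways: as the single precomposed character
# U+00E9, or as "e" followed by the non-spacing mark U+0301
# COMBINING ACUTE ACCENT -- the two-character style familiar from
# library character sets.
precomposed = "\u00e9"   # é as one character
decomposed = "e\u0301"   # é as base letter + combining mark

print(len(precomposed), len(decomposed))   # 1 2

# Unicode treats the two sequences as canonically equivalent;
# normalization converts between them.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```

Systems comparing or searching such data must normalize first, since a byte-for-byte comparison would treat the two encodings as different strings.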
The Unicode Standard introduced a layered approach to the representation of text. "The design for a character set encoding must provide precisely the set of code elements that allows programmers to design applications capable of implementing a variety of text processes in the desired languages."9 One result is that the characters in encoded text do not necessarily correspond 1:1 with the elements of that text in eye-readable form.
The simplest type of text representation is plain text: a pure sequence of character codes. Unicode data is plain text. But to render exactly what is wanted, it may be necessary to use higher-level protocols, such as language identification or layout instructions, to produce fancy text or rich text. USMARC and UNIMARC also use only plain text, but their character sets may provide separate encodings for things that are unified in Unicode/ISO 10646.
So we need to consider these issues:
So this brings us to consider the issue of exactitude of transcription. How exact does transcription have to be? Why? What exceptions do we make (perhaps without conscious decision-making)? What "work-arounds" do we use when we don't have the necessary typographical facilities?
We need exactitude in transcription in order to represent the item being identified uniquely and so make it accessible. Notice, however, that we don't always transcribe the information from the item with 100% fidelity.
One reason for the lack of fidelity is that cataloguing rules or the interpretation of them by a cataloguing agency do not always require, and sometimes do not allow, specific data to be transcribed. Here's an example. The Hebrew language is normally written unvocalized, that is, without vowel points and other marks of pronunciation. But sometimes these pronunciation guides are printed on the source of information; for example, when the author or publisher wants a word to be pronounced in an uncommon way. The Library of Congress, in its guidelines for cataloguing Hebrew,10 builds on Rule 1.0G, Accents and other diacritical marks, and interprets it (incorrectly, in my opinion) as forbidding the transcription of vocalization marks that appear on the source of information.
One exception to exactitude is necessitated by lack of typographical facilities, a problem recognized in Rule 1.0E. The solution that this rule allows is the description of an unavailable textual element. This introduces an issue for intersystem searching - should the interpolation be ignored in searching, or treated as a "wild card" that matches anything, or…? The user cannot be expected to know the exact description written by the cataloguer.
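One conceivable machine treatment of a Rule 1.0E interpolation is exactly the "wild card" reading: let the bracketed description match anything in its position. The sketch below (the function name and sample strings are my own, and real interpolation conventions are richer than this) shows both the benefit and the cost of that choice:

```python
import re

def interpolation_to_regex(transcription):
    """Turn a transcription containing a bracketed cataloguer's
    description, e.g. "[integral sign] calculus", into a regular
    expression in which the interpolation matches any text."""
    # Split on bracketed interpolations, escape the literal parts,
    # and rejoin with ".*" so each interpolation acts as a wild card.
    parts = re.split(r"\[[^\]]*\]", transcription)
    return ".*".join(re.escape(p) for p in parts)

pattern = interpolation_to_regex("Introduction to [integral sign] calculus")

# The bracketed description now matches the actual symbol...
assert re.fullmatch(pattern, "Introduction to \u222b calculus")
# ...but also, inevitably, anything else in that position.
assert re.fullmatch(pattern, "Introduction to XYZ calculus")
```

The second assertion is the trade-off in miniature: a wild-card reading recovers recall at the price of precision, which is why the question deserves an explicit answer in the rules.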
There are also unwritten rules for exceptions to exactitude. Except for antiquarian and other precious books, we routinely ignore font features, calligraphy, etc. when transcribing, without any attempt to note such features. This is based on practicality, since for most modern works, distinctions at a very detailed level aren't needed.
When typographical facilities for a whole script are lacking, there are various options. When the language of cataloguing uses Latin script, the chosen solution is often romanization: transliteration or transcription into a Latin script form of the original text. Wellisch11 reported in 1976 that the LC romanization tables (now ALA/LC) were most widely used, followed by those of ISO. When the language of cataloguing is Russian or another language written in Cyrillic script, cyrillicization is sometimes done. But not all languages use an alphabet or syllabary, and other solutions are to translate the information into the local language, or maintain card catalogues by script.
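To make the conversion concrete, here is a deliberately tiny romanization sketch in Python. The transliteration table covers only a few Cyrillic letters and loosely follows ALA/LC-style values; real tables are far larger and often context-sensitive:

```python
# A minimal romanization sketch: a per-character transliteration
# table for a handful of Cyrillic letters, loosely in the style of
# the ALA/LC tables (real tables are far more complete).
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d",
    "е": "e", "ж": "zh", "з": "z", "и": "i", "к": "k",
    "л": "l", "м": "m", "н": "n", "о": "o", "п": "p",
    "р": "r", "с": "s", "т": "t", "у": "u", "х": "kh",
}

def romanize(text):
    """Transliterate character by character, passing through
    anything not covered by the table."""
    return "".join(CYRILLIC_TO_LATIN.get(ch.lower(), ch) for ch in text)

print(romanize("москва"))   # moskva
```

Note that a different scheme would map some of these letters differently (e.g. "х" as "h" rather than "kh"), which is precisely why a searcher who applies the wrong scheme fails to find the record.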
Access is impeded by all these alternatives. Where a library uses romanization or cyrillicization, the searcher must know that fact, know which conversion scheme is used for a particular language, and be able to apply that scheme correctly to create a search argument. A searcher may not know about the library's practice and use a completely different scheme. For translations, the searcher's translation may not match that of the cataloguer. Card catalogues, unless they have been published in book form, cannot be searched remotely.
These problems will be alleviated considerably through the introduction of Unicode/ISO 10646 into USMARC and UNIMARC. But the use of a greatly expanded script repertoire does not mean that everything may be transcribed exactly. I now want to look at situations where even Unicode/ISO 10646 won't bring about 100% fidelity.
Historically, a primary reason for exactitude in transcription was to provide a surrogate of the bibliographic entity with as much detail as possible. The detail was needed because we had no other way to present the item in a card or book catalogue.
Problems of exact transcription are usually pointed out for ideographs, but this is not exclusively the case. If you're cataloguing a sound recording, what do you do about the name symbol used by "the artist formerly known as Prince"?
One source of difficulty is mathematics, where 2-dimensional formulas must be forced into a 1-dimensional field. Sargent has described how to represent mathematical formulas using Unicode.
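The essence of any such linearization is that two-dimensional grouping must be re-expressed with brackets in a one-dimensional string. The sketch below (my own simplification, not Sargent's actual notation) flattens a built-up fraction into a single line of Unicode plain text:

```python
def linearize_fraction(numerator, denominator):
    """Render a two-dimensional fraction as linear Unicode text,
    parenthesizing multi-term operands so the grouping that the
    fraction bar expressed survives the flattening."""
    def wrap(expr):
        # Parenthesize if the operand itself contains an operator.
        return f"({expr})" if any(op in expr for op in "+-\u2212") else expr
    # U+2044 FRACTION SLASH marks the division.
    return f"{wrap(numerator)}\u2044{wrap(denominator)}"

print(linearize_fraction("a+b", "c"))   # (a+b)⁄c
```

The parentheses are the price of losing the second dimension: on paper the fraction bar itself groups "a+b", but in a one-dimensional field that grouping must be spelled out.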
Problems with ideographs arise either because the ideograph is not yet encoded, or because variant forms of an ideograph are represented by a single coded value (as noted by Zhang & Zeng).12 Unavailable ideographs include both truly unique ideographs (used for personal names) and those in common use in a particular environment but not yet in Unified Han (e.g., some of the government-sanctioned ideographs used in Hong Kong, or ideographs occurring in geographic names). In this situation:
When a particular typographic form has been unified with others, yet the cataloguer wants to use only that particular form, there are several possible solutions.
A general solution to the problem of inexact transcription in bibliographic records is to use hyperlinking. In a Web-based catalog, we can have a link to a picture (scanned image) of the actual source of information. The disadvantage of a scanned image is that it cannot be searched for a specific occurrence of a particular glyphic form, but this is an operation that is more likely to be applied to full text than to cataloging.
The editors of cataloguing rules should review the rules on transcription to determine whether changes are needed due to the new technical environment. The new technical environment includes not only use of Unicode/ISO 10646 but also the ability to search remote catalogues via Z39.50.
Those in charge of the various MARC formats have to work with cataloguers to determine whether it is necessary to re-evaluate the "plain text" of the current formats. It isn't just a case of declaring Unicode/ISO 10646 as an approved character set (as has been done for UNIMARC14) or specifying the necessary changes in detail (as is underway for both USMARC15 and UNIMARC). That is the first and essential step, but cataloguing requirements may call for something beyond the "plain text" of the Unicode Standard and ISO/IEC 10646. If this is a requirement, then the various MARC formats will need to specify a methodology to provide this.
The question that has to be answered is: Is cataloguing data "plain text" or does it need to be a little fancier?
1 Anglo-American Cataloguing Rules, prepared under the direction of the Joint Steering Committee for Revision of AACR2; edited by Michael Gorman and Paul W. Winkler. 2nd ed., 1988 revision. (Chicago: American Library Association, 1988).
2 The Unicode Standard, Version 2.1 consists of:
Unicode is a trademark of Unicode, Inc. and may be registered in some jurisdictions.
3 International Organization for Standardization. Information Technology -- Universal Multiple-Octet Coded Character Set (UCS), Part 1: Architecture and Basic Multilingual Plane, Geneva, 1993. (ISO/IEC 10646-1:1993).
4 RLG East Asian Studies Community. http://www.rlg.org/eas/index.html
5 USMARC Specifications for Record Structure, Character Sets, and Exchange Media, prepared by Network Development and MARC Standards Office, 1994 ed., Cataloging Distribution Service, Library of Congress, Washington, D.C., 1994.
USMARC Format for Bibliographic Data, including Guidelines for Content Designation, prepared by Network Development and MARC Standards Office, 1994 ed., Cataloging Distribution Service, Library of Congress, Washington, D.C., 1994- .
USMARC Format for Authority Data, including Guidelines for Content Designation, prepared by Network Development and MARC Standards Office, 1993 ed., Cataloging Distribution Service, Library of Congress, Washington, D.C., 1993- .
For additional USMARC documentation see the Library of Congress' Web site.
6 UNIMARC Manual: Bibliographic Format, B. P. Holt and S. H. McCallum, eds., 2d ed., Saur, Munich, 1994.
UNIMARC/Authorities: Universal Format for Authorities, Saur, Munich, 1991. (ISBN 3-598-10986-5)
7 Japanese Standards Association. Code of the Japanese Graphic Character Set for Information Interchange. [English translation of JIS X 0208-1983] Tokyo, 1987. (JIS X 0208-1983)
8 American National Standards Institute, East Asian Character Code for Bibliographic Use, Transaction, New Brunswick, NJ, 1990. (ANSI Z39.64-1989).
9 The Unicode Standard, Version 2.0, p. 2-2.
10 Library of Congress. Descriptive Cataloging Division. Hebraica Cataloging: a guide to ALA/LC Romanization and Descriptive Cataloging, prepared by Paul Maher (Descriptive Cataloging Division). Cataloging Distribution Service, Library of Congress, Washington, D.C, 1987.
11 Wellisch, Hans H., "Script Conversion Practices in the World's Libraries," International Library Review 8:55-84 (1976).
12 Zhang, Foster J. and Zeng, Marcia Lei, "Multiscript Information Processing on Crossroads: Demands for Shifting from Diverse Character Code Sets to the Unicode Standard in Library Applications" (paper at the 64th IFLA General Conference, 1998) http://archive.ifla.org/IV/ifla64/058-86e.htm
13 International Organization for Standardization. Information processing -- Text and office systems -- Standard Generalized Markup Language (SGML), Geneva, 1986. (ISO 8879:1986)
14 UNIMARC Manual: Bibliographic Format, 2d. ed., Update 2 (1998).
15 Unicode Identification and Encoding in USMARC Records, submitted by MARBI Unicode Encoding and Recognition Technical Issues Task Force, 1998. (MARBI Proposal No: 98-18) http://lcweb.loc.gov/marc/marbi/1998/98-18.html