64th IFLA General Conference
August 16 - August 21, 1998
Code Number: 058-86-E
Division Number: 0
Professional Group: Contributed Paper Session II
Joint Meeting with: -
Meeting Number: 86.
Simultaneous Interpretation: Yes
Multiscript information processing on crossroads: demands for shifting from diverse character code sets to the Unicode™ Standard in library applications

Foster J. Zhang
The Dialog Corporation, USA

Marcia Lei Zeng
Kent State University, USA
An essential component of any library application is an encoding methodology that allows computers to process characters and symbols used to represent language information in written form. For years the encoding mechanism was not developed under a unified umbrella nor did it reach various languages equally. Without a standard unified character code, users have to use different software and terminals to display or enter data in different languages, especially when dealing with more than a few scripts.
The development of the Unicode Standard is a milestone in international computing because it supports the creation of global software that can be easily adapted for local needs. It brings good news for library professionals. However, as observed by the authors, the implementation of the Unicode Standard has not received full attention or strong support from some library communities, such as the Chinese library communities in Asia (Mainland, Hong Kong, Taiwan, as well as other multilingual regions and countries in Asia), which present special and unique issues for multiscript information processing. It is the purpose of this paper to analyze and explain those issues to both librarians and the Unicode developers in order to encourage the shift from diverse character code sets to the Unicode Standard in library applications.
The authors of this paper will focus on what they perceive to be obstacles to using the Unicode Standard in library applications. While examples and observations are from current CJK (Chinese, Japanese, Korean) information processing practice, librarians from other regions, especially third-world countries, may find them interesting. It is the belief of the authors that Unicode is the best solution for truly multiscript processing for library applications; however, librarians worldwide need to work together with the Unicode Consortium and ISO for the further development and implementation of such a unified language character set.
Libraries, as information dissemination centers, have a long history of processing multilingual data and serving users who speak different languages. During the past two decades, library applications such as integrated library systems and online bibliographic databases have contributed significantly to global resource sharing and have enhanced library services for multicultural and multilingual populations.
An essential component of any library application is an encoding methodology that allows computers to process characters and symbols used to represent language information in written form. For years the encoding mechanism was not developed under a unified umbrella, nor did it reach various languages equally. The computer industry has not fully supported libraries serving multilingual populations. In the past, software developers concentrated on office automation or other computer applications that dealt with only one or two languages at a time. Librarians' pleas for systems that process multilingual data simultaneously received little support, and libraries were forced to build their own multiscript applications because commercial computer software did not support the variety of scripts needed. Additionally, even though various standards for individual scripts were developed, there was no international standard for a unified language character-coding table that would support general-purpose software and operating systems. Library professionals cannot solve such problems by themselves; yet, to many companies, the library is not a high-profit market.
The Internet and World Wide Web applications revolutionized the world of information exchange. Web client software and web search engines that handle multilingual data have pushed the demands for operating system support and international standards for a unique and comprehensive character set for all languages to a new high. A significant joint effort of developing international standards to meet such a need can be traced back to 1991 when the Unicode Consortium was formed with members including many major computer and high tech companies such as Apple Computer, Xerox, HP, IBM, etc. The Unicode Worldwide Character Standard is a character coding system designed to support the interchange, processing, and display of the written texts of the diverse languages of the modern world. In addition, it supports classical and historical texts of many written languages. It covers the principal written languages of the world, as well as technical symbols in common use. The Unicode Standard is the international standard used to encode text for computer processing. It is a subset of the International Standard ISO/IEC 10646, Universal Multiple-Octet Coded Character Set (footnote 1).
This development is a milestone in international computing because it supports the creation of global software that can be easily adapted for local needs. It brings good news for library professionals. However, as observed by the authors, the implementation of the Unicode Standard has not received full attention or strong support from some library communities, such as the Chinese library communities in Asia (Mainland, Hong Kong, Taiwan, as well as other multilingual regions and countries in Asia), which present special and unique issues for multiscript information processing. It is the purpose of this paper to analyze and explain those issues to both librarians and the Unicode developers in order to encourage the shift from diverse character code sets to the Unicode Standard in library applications.
The authors of this paper will focus on what they perceive to be obstacles to the use of the Unicode Standard in library applications. While examples and observations are from current CJK (Chinese, Japanese, Korean) information processing practice, librarians from other regions, especially third-world countries, may find them interesting. Meanwhile, efforts to enhance standards never stop and newer and enhanced standards may have offered solutions to the problems stated below by the time this paper is presented at the IFLA Conference.
The major reasons for the lukewarm reception of Unicode in parts of East Asia can be explained by the following technological factors, while unfamiliarity with the content of the Unicode Standard among the majority of librarians may underlie many of these obstacles:
1. No bibliographic records or MARC databases use Unicode
Currently, major MARC databases use either ASCII (including extended ASCII) or a local character set. For example, when dealing with CJK scripts, USMARC uses EACC (footnote 2) and CNMARC uses GB (Chinese national standard) codes. No national or regional bibliographic utility is willing to convert its MARC databases into a Unicode-based database or to generate records using Unicode. For instance, both the RLIN and OCLC bibliographic files contain over 30 million titles in over 360 languages. Over 1.5 million records in the RLIN database contain CJK, Cyrillic, Hebrew, and/or Arabic scripts (footnote 3). In OCLC's OLUC (Online Library Union Catalog), records in each of 45 languages numbered over 14,000 (footnote 4). But a plan to convert any of these databases to Unicode appears impossible due to the immense workload involved.
Bibliographic databases using Unicode can be created today, as demonstrated by products from DRA, VTLS and CGI (to name but three companies) and Carl/UnCover's experimental practice. However, what is still missing is probably the ability to exchange MARC records containing Unicode values. To exchange, the internal Unicode data has to be converted to records complying with the 8-bit USMARC or UNIMARC formats.
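The conversion step described above can be sketched briefly. The following Python fragment (a minimal sketch; the title string is illustrative, not from any actual record) shows how a field held internally as Unicode is serialized to UTF-8 bytes for exchange; mapping those characters onward to an 8-bit MARC character set such as EACC would require an additional conversion table not shown here.

```python
# Sketch: serialize an internally Unicode-held field to UTF-8 for exchange.
# The title string is illustrative, not from any actual record.
title = "中国图书馆学"  # six Han characters, held as Unicode code points

utf8_bytes = title.encode("utf-8")   # each of these characters takes 3 bytes
print(len(title), len(utf8_bytes))   # prints: 6 18

# Round-tripping back confirms no information is lost in the byte form.
assert utf8_bytes.decode("utf-8") == title
```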
2. No bibliographic standards have adopted Unicode until very recently
We cannot blame bibliographic utilities for not creating data in Unicode, because no MARC standards currently support Unicode. The most widely used format, USMARC, defined only some conventional language character sets, such as EACC, as part of its standard. Both U.S. and European organizations are now working on updating their specifications. For USMARC, MARBI has approved the first proposal submitted by one of its subcommittees to add Unicode; for UNIMARC, the CHASE group has published its recommendations on format changes to include Unicode. However, the authors have found no report on how Unicode data should be entered into those MARC formats. We would highly encourage librarians in the East Asian regions to participate in the discussions regarding this issue.
3. Few library systems on the market fully support Unicode
Some library system vendors, riding the wave of technology development, are currently developing Unicode versions of their systems (e.g., DRA, VTLS, CGI), while others are experimenting with such an approach (e.g., Carl/UnCover and Sirsi Corp.). However, it is also true that a few other vendors who developed CJK systems based on EACC/CCCII have continued to sell their existing software. In fact, none of the current installations of CJK systems uses Unicode. This has led to confusion among librarians in East Asia about whether it is feasible to switch to the Unicode CJK subset, and it has fostered hazy understanding of the relationship between the Unicode CJK subset and EACC.
4. The requirement for an ideographic repertoire that meets library needs
The biggest concern that librarians in Asia have regarding Unicode is that the number of Han characters is far from enough for processing library materials. This concern applies to all the CJK standards, which need to expand the number of characters to the maximum found in an authoritative dictionary such as the Kang-Xi Zi Dian (Kang Xi Dictionary). The Unified Repertoire and Ordering (URO), also known as "Unified Han," which appears in both the Unicode Standard and ISO/IEC 10646, includes about 20,000 ideographs (footnote 5). In comparison, some Chinese software packages already offer over 60,000 characters. A small set containing the most commonly used characters may be good enough for office automation software, but it falls far short of the number of characters libraries need to deal with rare books and full-text documents. For many librarians, revising a local standard looks much easier than requesting the extension of an international standard such as the Unicode Standard. In fact, ISO/IEC JTC1/SC2/WG2 is in the process of adding six thousand more characters to ISO/IEC 10646, and these will also become part of the Unicode Standard; however, many librarians are not aware of this.
Furthermore, in Asian countries, librarians have not voiced their needs strongly enough to pass important messages to the ISO SC2/WG2 Ideographic Rapporteur Group (IRG), which successfully developed Unified Han and carries on an ongoing task of identifying and collating additional ideographic characters (footnote 6). Librarians' needs can be demonstrated by two examples. First is the number of characters: many Chinese personal names contain characters that are rarely used in texts and are therefore not included in Unified Han. Second is the existence of various forms of the same character: in rare books cataloging, to record title page information exactly, a librarian often needs a variant of a character that is not considered a standard abstract form and is therefore not included in the Unicode Standard or ISO/IEC 10646, because of the rules governing the content of the Unified Repertoire and Ordering. (Here we are not talking about the glyph/font issue.) It is the authors' wish that a "character thesaurus" be made available so that a standard character can be used to retrieve all the materials that use various forms of that character, while the character can be displayed in the desired forms that match its appearance on title pages or in rare book full texts.
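The wished-for "character thesaurus" could behave roughly as in the following Python sketch. The two-entry variant table and the sample records are illustrative assumptions, not part of any real standard; a production table would need thousands of authoritative variant pairs.

```python
# Hypothetical "character thesaurus": map variant ideograph forms onto a
# standard form so that a search on the standard character retrieves
# records written with either form. The variant pairs are illustrative.
VARIANTS = {
    "敎": "教",  # variant form of jiao ('teach')
    "爲": "為",  # variant form of wei
}

def normalize(text: str) -> str:
    """Replace known variant forms with their standard counterparts."""
    return "".join(VARIANTS.get(ch, ch) for ch in text)

# Illustrative records: one title-page spelling uses the variant form.
records = ["敎育學", "教育学"]
query = "教"
hits = [r for r in records if query in normalize(r)]
print(len(hits))  # prints: 2 -- both spellings are retrieved
```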
Special script problems like this can be solved only when the CJK library communities become actively involved in the study and development of the Unified Han repertoire. Version 2.0 of the Unicode Standard provides an extension mechanism using surrogate code pairs to encode extremely rare characters, and other encoding strategies may be developed. "The task of identifying, collating, and verifying ideographs becomes progressively more difficult as more focus is given to historical and rarely used characters" (footnote 7). Librarians should play an important role in this task. They can take various approaches, such as working with the appropriate group within their national standards body, or communicating via ISO TC 46 or directly with ISO/IEC JTC1/SC2/WG2 and the IRG to make their needs known.
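The surrogate-pair mechanism mentioned above can be illustrated with a short calculation. The arithmetic follows the UTF-16 rules of the Unicode Standard; the chosen code point is merely an illustrative supplementary-plane value.

```python
# UTF-16 surrogate-pair arithmetic for a code point beyond U+FFFF.
# U+20000 is used purely as an illustrative supplementary-plane value.
cp = 0x20000

offset = cp - 0x10000                # 20-bit offset above the basic plane
high = 0xD800 + (offset >> 10)       # high (leading) surrogate
low = 0xDC00 + (offset & 0x3FF)      # low (trailing) surrogate
print(hex(high), hex(low))           # prints: 0xd840 0xdc00

# Python's UTF-16 codec produces exactly this pair (big-endian here).
assert chr(cp).encode("utf-16-be") == b"\xd8\x40\xdc\x00"
```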
5. Ordering of CJK characters for display remains a problem
The Unicode Standard points out that relying on the order of characters in a character set for sorting is usually inadequate for a culturally expected result. For Chinese characters (which occur in C, J, and K), sorting is even more difficult because the set combines several different national and regional standards and their different versions. In most cases a sort table is needed. In Unicode, the Unified Ideographs are predominantly ordered according to each character's radical and stroke count. If a different order is required, the application (or system) must maintain a separate "order weight" table that maps each unified ideograph to its order weight (footnote 8). Library system vendors would be required to develop methods that employ a sort table to sort displays based on Pinyin or radical/stroke order for Chinese, for example. This also means that systems using CJK ideographs need a special process for sorting.
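A minimal sketch of the "order weight" approach follows. The three Pinyin readings are standard, but the tiny table itself is illustrative; it simply shows why a separate table is needed when raw code point order does not match the culturally expected order.

```python
# Sketch of an "order weight" table: the default Unified Ideograph order
# is radical/stroke based, so a Pinyin display order needs an explicit map.
# The three readings are standard Pinyin; the tiny table is illustrative.
PINYIN_WEIGHT = {
    "北": "bei",    # as in Beijing
    "上": "shang",  # as in Shanghai
    "广": "guang",  # as in Guangzhou
}

chars = ["上", "广", "北"]
by_codepoint = sorted(chars)                               # raw code point order
by_pinyin = sorted(chars, key=lambda c: PINYIN_WEIGHT[c])  # culturally expected
print(by_codepoint, by_pinyin)
```

The two orders differ: code point order yields 上, 北, 广, while the weight table yields the Pinyin order 北, 广, 上.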
6. Lack of awareness of computer software and operating systems that support Unicode
One of the important rationales for CJK librarians to consider using Unicode is web applications and the support for those applications from Microsoft Windows and popular web browsers. Currently, both Netscape and Internet Explorer support Unicode/UTF-8 display, as do MS Office 97 and Windows NT. By using Unicode, all the information displayed on a library web site can be seen by all users in CJK-speaking regions without requiring installation of specific software, as long as a font or fonts with the appropriate character repertoire are installed. Without Unicode support, one needs specific software for each coding set; for instance, not long ago, one would have needed Twinbridge to read Big5-coded web pages and Richwin for web pages using GB code. The text generated by one software package could not be read by another; conversion would often fail or be incomplete; and format and font style would be lost during conversion. It must be noted that some multi-language add-on enabler software packages have a high crash rate on Windows operating systems or web browsers because of their own quality problems, such as incorrect use of control characters in code sequences and lack of compatibility. There is no doubt that Unicode-support features in operating systems, databases, and general applications provided by the computer industry will make it easier for companies to build systems for libraries that require multiscript support.
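The contrast described above can be sketched concretely. Assuming the standard codec names "big5", "gb2312", and "utf-8", a single UTF-8 stream carries mixed text that neither legacy code set can encode on its own; the sample string is illustrative.

```python
# One UTF-8 stream can carry traditional (Big5-era) and simplified (GB-era)
# text together; each legacy codec alone fails on the mixed string.
mixed = "圖書館 图书馆"  # traditional and simplified spellings side by side

for codec in ("big5", "gb2312"):
    try:
        mixed.encode(codec)
    except UnicodeEncodeError:
        print(codec, "cannot encode the full string")

data = mixed.encode("utf-8")          # a single encoding handles both
assert data.decode("utf-8") == mixed  # and round-trips losslessly
```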
7. No word segmentation implemented in library systems
This problem applies to text in a number of languages, e.g., Thai and Chinese, in which words are not written separately. For Chinese data, there is no natural way for a computer to tell where a word or phrase ends. Multilingual processing systems need an intelligent way to segment words within a paragraph automatically, so that a keyword index can be generated and better searches can be performed. So far, no CJK library system meets this very basic requirement.
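One common segmentation technique, forward maximum matching, can be sketched as follows. The tiny dictionary and sample text are illustrative assumptions; production systems would need large lexicons and better disambiguation than greedy matching provides.

```python
# Forward maximum matching: greedily take the longest dictionary word at
# each position. The dictionary and text below are illustrative only.
DICTIONARY = {"图书馆学", "图书馆", "图书", "馆", "学"}

def segment(text: str, max_len: int = 4) -> list:
    """Segment text by matching the longest dictionary word at each step."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no dictionary word fits.
            if size == 1 or candidate in DICTIONARY:
                words.append(candidate)
                i += size
                break
    return words

print(segment("图书馆学图书"))  # prints: ['图书馆学', '图书']
```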
In conclusion, without Unicode, users may have to use different software and terminals to display or enter data in different languages, particularly when dealing with more than a few scripts, especially non-Roman scripts. This may be acceptable for some computer applications, but it is certainly not acceptable for library users. We could go on adding scripts one at a time, but a better way is to build global software capable of handling all scripts. Information and human knowledge should not be separated by language; library systems should support users in retrieving information across languages, and users should be able to access such information online from anywhere in the world.
It is time for librarians worldwide to work together with the Unicode Consortium and the International Organization for Standardization for better implementation of a unified language character set. Library system vendors should develop, or keep progressing in the development of, systems based on the Unicode Standard. Vendors must also improve script-specific data handling, including sorting and word segmentation.
Presently, the most recent, and most comprehensive, encoding for textual information is the Unicode Standard. Unicode is the best solution for truly multiscript processing for library applications, but it needs librarians' input to develop a better character set. (Please refer to footnote 9 for the list of academic and library-related members of the Unicode Consortium.)
This paper represents only the authors' personal opinions and has no relation to the organizations with which the authors are affiliated.
- Unicode is a trademark of Unicode, Inc. and may be registered in some jurisdictions. The Unicode Standard, Version 2.0 is code-for-code equivalent with ISO/IEC 10646. As the UCS-2 subset of ISO 10646, the Unicode Standard's 65,536 code values are the first 65,536 code values of ISO 10646; these contain all of the characters currently defined by ISO 10646, and all other ISO 10646 code values are reserved for future expansion. ISO 10646's full code set is called the Universal Character Set, four-octet form (UCS-4).
- East Asian Character Code (EACC) is an American national standard that RLG developed in conjunction with the Library of Congress. It covers traditional and simplified Chinese characters as well as Japanese variants, the Japanese hiragana and katakana characters, and the Korean hangul characters. The base standard from which RLG worked, CCCII (Chinese Character Code for Information Interchange), is a three-byte standard published in Taiwan. RLG adapted CCCII and added characters from other standards:
- CCCGSII, Code of Chinese Character Graphics Set for Information Interchange, a national standard in China (GB2312-80),
- JIS, the Japanese Industrial Standard, and
- KIPS, the Korean Information Processing System
to create the RLIN East Asian Character Code (renamed East Asian Character Code when it was adopted as an American National Standard in 1988). This is the only East Asian character set that incorporates all the character graphics listed in the four major East Asian character sets noted above and internally links all their common variant forms. These links enable users of diverse linguistic backgrounds to search for one character and retrieve all its related forms. RLIN CJK was released in September 1983. OCLC adopted REACC and its character set, and released its implementation in 1986. In 1988, the RLIN East Asian Character Code was approved by the National Information Standards Organization as American National Standard Z39.64-1989.
(Source is partially from: http://www.rlg.org/jackphy.html)
- Eight files organized by bibliographic material type are accessible through the RLIN library and archival support system, and are available for searching as one combined "BIB" file through Eureka and Zephyr. The files contain millions of records in more than 365 languages. RLIN is the only online catalog to support all the scripts used in the LC-designated "JACKPHY" languages (Japanese, Arabic, Chinese, Korean, Persian, Hebrew, and Yiddish) plus Cyrillic.
- OCLC's WorldCat offers more than 36 million bibliographic records, representing 370 languages. As of June 30, 1997, records in each of 45 languages in OCLC's OLUC (Online Library Union Catalog) numbered over 14,000. Records in three languages (French, German, and Spanish) already exceeded 1.5 million; records in 12 languages numbered between 100,000 and 680,000; and records in the other 24 languages numbered between 14,000 and 90,000. (Source: OCLC Annual Report, 1997:10)
- The Unified Repertoire and Ordering (URO 2.0) is the result of unifying Han characters from disparate character standards. It successfully represents all of the standard Han ideographs with just over 21,000 unique characters, instead of the 121,000 code points that would result from simply combining the existing ideographic character standards set up by different countries and regions. The Unicode Standard uses this character set to represent Han characters in the Unicode codespace. The URO also appears in ISO/IEC 10646-1:1993 as the set of Unified CJK Ideograph characters. (Source: Han Unification.)
- The ISO SC2/WG2 Ideographic Rapporteur Group (IRG), formerly called the Chinese/Japanese/Korean Joint Research Group (CJK-JRG), is an international body of experts set up under the International Organization for Standardization. It is formed of representatives from China, Japan, Korea, the United States, Vietnam, Hong Kong, and Taiwan, who have worked together to identify, categorize, and set the order of the Han-based ideographs and who developed the Unified Repertoire and Ordering (URO 2.0) used in both the Unicode Standard and ISO/IEC 10646-1:1993. (Source: see note 5.)
- (Source: see note 5.)
- (Source: see note 5.) It should be noted that different cultural uses of Han ideographic characters have developed different conventions for ordering characters.
- The academic and library-related members of the Unicode Consortium
- The Research Libraries Group, Inc. (RLG)
- Columbia University
- Data Research Associates
- The Getty Information Institute
- Innovative Interfaces, Inc.
- Language Technologies Institute, Carnegie Mellon University
- New Mexico State University, Computing Research Laboratory
- OCLC, Inc.
- Royal Library, Sweden
- University of Washington
- VTLS, Inc.
The authors would like to thank an anonymous researcher for the most up-to-date reference information and her very constructive comments on improving the focus of this article, as well as her great help in editing our many manuscript versions.