   64th IFLA General Conference
   August 16 - August 21, 1998


Russian-Language Database of Universal Decimal Classification: Creation and Implementation in Library Automation

Yakov L. Shraiberg


Ekaterina M. Zaitseva

Russian National Public Library for Science and Technology
Moscow, Russian Federation


The paper analyses the problems of applying UDC in scientific and technical libraries and information centres as traditional catalogues are replaced by the electronic catalogue. It describes the UDC database created at the Russian National Public Library for Science and Technology in accordance with the forthcoming 4th edition of UDC. The paper discusses the current state of preparation of this database, as well as the problems of, and decisions on, its maintenance and use in classification practice and retrieval.


In Russia, the Universal Decimal Classification (UDC) is the mandatory language for the classification of scientific and technical publications. It serves as the basis of all classified catalogues of scientific and technical libraries, including the Russian National Public Library for Science and Technology (GPNTB), which is the leading scientific and technical library of Russia.

In 1997, after the decision to discontinue the traditional card catalogues, GPNTB faced the acute task of using UDC in the environment of the in-house electronic catalogue and automated library system, which was developed by the library's staff. To resolve this task, GPNTB has, firstly, to generate the UDC database in accordance with the current UDC Tables and, secondly, to describe formally the grammar rules of the traditional UDC language and of the UDC information retrieval language; to develop a translator that converts traditional UDC numbers (in the table, document, or query form) into their images in the UDC information retrieval language; and to apply this translator in the UDC database to the whole diversity of main numbers in the entries of the UDC Tables. The task and the proposed solutions are easier to understand against the current situation, which forms the background to all the processes and problems GPNTB now faces.

The classified catalogue is one of the traditional catalogues of GPNTB. It is based on the UDC Tables and has a service classification scheme. This catalogue was put into operation in 1959, three years before UDC was declared the mandatory classification language of scientific and technical libraries of the former USSR. UDC has kept this status to this day. In 1963, GPNTB became a theoretical and methodological UDC centre and the supervisor of the network of scientific and technical libraries. As a result, all scientific and technical libraries, many branch centres of scientific and technical information, and publishing companies remain GPNTB-oriented as far as the application of UDC is concerned.

UDC was developed regularly, both theoretically and methodologically, until 1988. In the years that followed, this work declined, and in 1992 it ceased. At the same time, the centralized translation of the Extensions and Corrections (E&C) issues into Russian and their distribution within the network of scientific and technical libraries also ceased. For these reasons, the updating of the UDC Tables of the classification services and the reclassification of the library's classified catalogues proceeded slowly and with defects. Two years ago we began to pin our hopes on a new Russian edition of the UDC Tables, whose purchase could change the situation drastically. Alas, these hopes will hardly come true in the near future: today, two years after the stipulated date, only the third class of this edition has been published (a class which, incidentally, is not of primary importance for GPNTB). One can only guess in which year all the old Tables (3rd edition, complete) will be replaced by the new ones. The situation is similar in other Russian scientific and technical libraries.

As for the electronic catalogue, its linguistic support includes two information retrieval languages: the language of the State Classification for Scientific and Technical Information and the language of keywords. As is known, UDC, with its traditional form of search image representation, cannot be regarded as an information retrieval language for electronic catalogues. Search images in the language of keywords can be obtained either by indexing documents directly in this language or by automatically translating search images from the language of subject headings into the language of keywords. For this reason, we expect no particular problems with the planned discontinuation of the GPNTB subject catalogue.

In keeping with the status of UDC, its numbers are included in the records of the electronic catalogue. UDC is also part of the linguistic support of the GPNTB automated system, though its role is limited to that of a mediator language, i.e., a language that ensures the exchange of information between the electronic catalogue of GPNTB and external (mainly foreign) systems.

Currently, GPNTB's databases cover the documents acquired within approximately the last 5 years. In the traditional catalogue, the same layer of documents receives no more than 50% of all subject queries. This value corresponds to the well-known law of query decline (the «half-life» of publications, J. Bernal, 1958).

In this situation, discontinuing the classified catalogue would mean discontinuing the use of UDC as a classification language. Indeed, what is the point of classifying documents in a language that cannot be used for retrieval in the electronic catalogue when the classified catalogue itself no longer exists? On the one hand, such an outcome is inadmissible (not so much because it contradicts the status of UDC as for practical reasons). On the other hand, preserving the traditional catalogue means preserving factors that increasingly hinder the automation of library routines and, consequently, the general efficiency of GPNTB.

In view of these obviously conflicting circumstances, GPNTB took into account the results of previous research, conducted the relevant information analysis, and formulated the task: specialists in linguistics have to work out tools for the timely and efficient updating of UDC (on the basis of E&C and of data supplied as additions to the classification scheme), for distributing the results of updating through the network of scientific and technical libraries, and for the mass use of UDC in electronic catalogues for the classification of documents and their retrieval.

To resolve this task, GPNTB purchased and examined the text files of all classes of the new Russian edition of the UDC Tables. The files were purchased from the Scientific and Technical Center "Rector" (licensed by the UDC Consortium). Having completed this work in September 1997, GPNTB adopted the following plan:

Stage 1:
To develop a formal structure for a maximally expanded entry of the UDC Table that is adequate in content to the table form of the entries and allows entries to be represented algorithmically in records for the CDS/ISIS environment.

Stage 2:
To develop and implement the technology of automatic marking of data elements of the entries from the initial files.

Stage 3:
To check visually the results of marking and to edit the structure of the marked files so that the initial entries correspond to the formal representation.

Stage 3a (parallel to Stage 3):
To develop an algorithm and a program for file-by-file transformation of marked entries to the respective fragments of a Small UDC Database (which reflects a limited number of table data elements).

Stage 4:
To load, file by file, the newly acquired fragments into the technological database in order to replenish the Small UDC Database, check these fragments visually in the CDS/ISIS environment, correct the mistakes in the initial files, load the fragments again, and add them to the final Small UDC Database.

Stage 4a (parallel to Stage 4):
To develop an algorithm and a program for generating a Large (full-element) UDC Database.

Stage 5:
To form, file by file, the Large UDC Database, check it visually, correct the mistakes in the CDS/ISIS environment, replenish the final Large UDC Database, and incorporate both Databases into the Automated System of Dictionary and Linguistic Processor Maintenance of Documentary Databases of the GPNTB's electronic catalogue (ASSO).

Stage 5a (parallel to Stage 5):
To develop formal grammar rules for the traditional and computer UDC information retrieval languages and, using these rules as a basis, to form an algorithm and a program for translating search images and statements from the traditional to the computer UDC information retrieval language.

Stage 6:
To form, in the records of the Large UDC Database, the fields of search images of entries in the UDC information retrieval language, to check them visually, and to correct them in accordance with the ASSO technology.

Stage 7:
To evaluate experimentally the efficiency of the UDC information retrieval language (information retrieval characteristics mainly).

Later, these operations will be continued along the following guidelines:

In this respect, we deal with both the Small and the Large UDC Databases. It was decided to create the Small Database first because it takes less time and can be put into use rapidly, providing the classification service with current (though incomplete) UDC entries. This, in turn, allowed librarians to use current table numbers for document classification and to start a full-scale reclassification of the classified catalogue, which had proceeded rather slowly for a number of years. In addition, the creation of the Small Database served as a reconnaissance in force, intended to reveal technological bottlenecks and drawbacks that could then be avoided in the Large Database, making its creation less laborious and time-consuming.

Noting that the Small UDC Database contains active entries only and that its records correspond one-to-one to the active file entries, we give a breakdown of the data elements of its record:

Main numbers are presented in the dictionary file with the differentiating prefix u=; reference numbers are presented with the prefix c=. All word forms in the dictionary file are given without prefixes.
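The prefix convention can be illustrated with a minimal sketch. The function name, its arguments, the sample numbers, and the upper-casing of word forms are assumptions made for illustration only; the paper does not describe the actual CDS/ISIS field selection used at GPNTB.

```python
import re


def dictionary_terms(main_number, reference_numbers, heading):
    """Build the dictionary-file terms for one record (hypothetical layout)."""
    # The main number carries the differentiating prefix u=.
    terms = [f"u={main_number}"]
    # Each reference number carries the prefix c=.
    terms += [f"c={ref}" for ref in reference_numbers]
    # Word forms of the heading are stored without any prefix
    # (upper-casing is an assumption of this sketch, not a stated rule).
    terms += [word.upper() for word in re.findall(r"\w+", heading)]
    return terms


# Illustrative values only.
print(dictionary_terms("621.3", ["537"], "Electrical engineering"))
# ['u=621.3', 'c=537', 'ELECTRICAL', 'ENGINEERING']
```

With such prefixes, a search for u=621.3 finds the entry whose main number is 621.3, while c=621.3 would find entries that merely refer to it.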

The Large (full-element) UDC Database includes the records of both active and passive (discarded) entries (about 140 thousand records are expected). A large active entry corresponds to two or more records: the first is the leading record, and the following ones are continued records.
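The division of a large entry into a leading record and continued records might look like the sketch below. The paper does not state the criterion for splitting or the record layout, so the size limit `max_lines`, the dictionary keys, and the function name are all hypothetical.

```python
def split_entry(number, lines, max_lines=3):
    """Split one large entry into a leading record and continued records.

    The limit max_lines is an assumed splitting criterion for this sketch.
    """
    # Cut the entry text into consecutive chunks of at most max_lines lines.
    chunks = [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]
    if not chunks:
        chunks = [[]]  # an empty entry still yields one leading record
    records = []
    for seq, chunk in enumerate(chunks):
        records.append({
            "number": number,  # every part keeps the main number of the entry
            "kind": "leading" if seq == 0 else "continued",
            "seq": seq,        # position of the part within the entry
            "text": chunk,
        })
    return records


parts = split_entry("54", ["line 1", "line 2", "line 3", "line 4"])
print([p["kind"] for p in parts])  # ['leading', 'continued']
```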

The elements which compose the leading record are as follows:

Data elements of continued records:

Data elements of the records of passive (discarded) entries:

According to the plan, the Small UDC Database (127.5 thousand records) was generated by the middle of December 1997. As expected, the most laborious part was the structural transformation of the initial entries. The key point in this part of the work was to eliminate the «perplexity» (entanglement) of adjacent entries in the table files. This included adding data elements, e.g., superordinate headings; dividing large entries into the leading and continued parts; and moving examples without numbers from the main headings to the corresponding extensions and examples with numbers to the methodological instructions. At this stage, the staff were unpleasantly surprised by numerous mistakes in the main numbers and many grammatical mistakes in the main headings and their extensions. Other mistakes included mismatches between some reference headings (outdated notations) and the corresponding main headings (current notations).

In the second half of December 1997, the Small Database was tested for retrieval purposes. This was to be followed by training the main users of the database, i.e., the staff of the classification service. However, the testing revealed that the headings contained an enormous number (thousands) of Russian word forms with Latin letters graphically similar to Cyrillic ones, and of Latin word forms with Cyrillic letters. Much time was spent on developing and introducing a special program that could detect and correct such mistakes, which could seriously hinder retrieval, in a virtually unattended way. In January 1998, the Small Database was tested a second time and proved ready for operation in the initial CDS/ISIS interface and in the library's ASSO interface.

A program is being developed to avoid similar mistakes in the Large Database. It is expected to be more complicated because the text situation in the Large Database is more complicated. In addition, GPNTB is refining a program for transforming files into the records of the Large Database without translating the traditional main numbers into the corresponding search images in the information retrieval language. This program is scheduled for May 1998. The final stage envisages the development of a program for translating traditional numbers into the information retrieval language. The Large UDC Database and the translators are expected to be finalized before the end of 1998.
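The paper does not define the UDC information retrieval language, so the translation itself cannot be shown. As a hedged illustration of the first step any such translator needs, the sketch below splits a traditional UDC number into its components. It covers only a small subset of UDC notation (digit runs, the connecting symbols +, /, : and ::, parenthesised auxiliaries, quoted time auxiliaries, and language auxiliaries introduced by =); the token names and the sample number are assumptions.

```python
import re

# Token patterns for a small subset of UDC notation (assumed for this sketch).
TOKEN = re.compile(r'''
    (?P<time>"[^"]*")       # auxiliary of time in quotation marks
  | (?P<paren>\([^)]*\))    # parenthesised auxiliary (form, place, etc.)
  | (?P<lang>=[\d.]+)       # auxiliary of language, introduced by =
  | (?P<conn>::|[+/:])      # connecting symbols
  | (?P<number>\d[\d.]*)    # run of digits and points
''', re.VERBOSE)


def tokenize(udc):
    """Split a UDC number into (kind, text) pairs, left to right."""
    tokens, pos = [], 0
    while pos < len(udc):
        match = TOKEN.match(udc, pos)
        if match is None:
            # Notation outside the subset handled here is rejected explicitly.
            raise ValueError(f"unexpected character {udc[pos]!r} at position {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for kind, text in tokenize('622:551.24(470)"19"'):
    print(kind, text)
```

A real translator would then map each component to its image in the retrieval language according to the formal grammar rules mentioned in Stage 5a.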

In conclusion, we would like to note that the above work is important because it allows us (1) to preserve UDC during the transition from traditional to electronic catalogues and, at the same time, to remove an obstacle to library automation, and (2) to modernize the technologies of UDC use and maintenance.