
63rd IFLA General Conference - Conference Programme and Proceedings - August 31 - September 5, 1997

Characteristics of Web Accessible Information

Edward T. O'Neill
OCLC Online Computer Library Center, Inc.
Dublin, Ohio 43017


ABSTRACT

The rapid growth of the World Wide Web has created vast new information sources that rival, and even exceed, those held by our great libraries. However, despite the growing popularity of the Web, little is known about what types of documents are available on it. This paper describes a methodology for collecting a representative sample of publicly accessible Web documents and suggests analyses that will benefit libraries and the Internet community in characterizing these information sources.


PAPER

In just five years, the World Wide Web has become an important worldwide source of scholarly literature and a major force in reshaping the way information is distributed and used. The resources available on the Web exceed, at least in number, those of our great libraries. Last year, Inktomi [1] estimated that there were a total of 50 million documents on the Web with an aggregate size of about half a terabyte. The sheer size of the Web, and its explosive growth, which shows no signs of slowing, leave little question regarding its importance to libraries and library patrons.


BACKGROUND

The World Wide Web (Web) is a form of Internet access. Using special browser software (such as Netscape or Internet Explorer), users can access a range of Internet services. Additionally, the Web has its own protocol, the Hypertext Transfer Protocol (HTTP), which permits the transmission of hypertext documents. The flexibility of the Web and its ability to deliver hypertext, graphics-oriented documents have made it the most visible segment of the Internet today.

Although the Web is well understood from a technical standpoint, very little is known about what types of information are available or about the collection of documents that constitutes the Web. One thing that is clear is that the Web is a collection of documents contributed by any author or publisher who operates a Web server. The Web has no selection policy comparable to that of libraries, where conscious decisions are made about which works will be acquired and maintained in the collection. In contrast to a library's clearly defined collection development policy, the Web looks more like the result of a fantastically successful gifts program: it includes the good, the bad, and the ugly.

From the content perspective, the Web remains something of a mystery. We know very little about the sources of its materials, the types of documents available, the authoritativeness of the documents, the languages in which information is available, the age or longevity of the documents, the scope of subjects covered, and other descriptive characteristics of an information collection. Few studies have been done on content, owing to the unpredictability of this area and the lack of overall guidance in the development of the Web. There is even wide disagreement on the size of the Web. General Magic [2], the source of statistics recently used by Time magazine [3], has estimated the number of Web sites to be 400,000, while Gray [4] estimated the number to be closer to 650,000.

Users can access the Web from their offices, schools, homes, and local libraries that provide public terminals for Web access. To continue providing patrons with high-quality reference assistance and usage support, libraries need reliable statistics describing the huge, invaluable information resource that the Web represents. As a preliminary step to the study described here, OCLC searched the Web and print sources for complete, documented, and reliable statistics describing the content of Web pages. No statistics were found that provided any useful information for member libraries and their patrons. This study is therefore being undertaken to rectify that situation by characterizing the contents of the Web and producing statistics useful to the library community.


WEB PAGES

The Web uses its own terminology to describe the storage, maintenance, and dissemination of its documents. The terms most important to the study described here are home page, static and interactive Web page, and Web site.

Entry to a Web site usually starts at the home page, which is roughly equivalent to the title page in the print environment. The home page commonly provides general information about the site and may also function as a table of contents.

Beyond the home page, the most fundamental bibliographic unit on the Web is the Web page (also called a Web document or HTTP file). A Web page is a distinct entity identified by a unique uniform resource locator (URL). There are two types of Web pages: static and interactive.

A static Web page is a document that can be read from top to bottom without leaving the document. Unless explicitly modified, the static Web page presents identical information to all viewers.

An interactive Web page is a customized document that uses external programs to perform specified functions. Interactive Web pages allow users to submit forms, query databases, format results, structure displays, and access password-protected areas of a site. A good example of an interactive site is the Delta Airlines Web site [5]. Rather than searching through tables of published airline schedules, users simply enter the relevant information needed to produce a customized document (i.e., a flight schedule).

A Web site is a collection of linked Web pages that reside on a particular server.


SAMPLING WEB PAGES

The vast size of the Web prohibits an exhaustive analysis of its content. The next best approach is to collect a sample of Web pages. This sample must be large enough to represent the diversity of information on the Web, yet small enough to be manageable. The sample must also be unbiased, permitting extrapolation to the Web as a whole.

The Web includes sites on Intranets behind firewalls, Web pages which impose a fee for access, Web pages that require prior authorization, and other instances of restricted access. Only Web pages which are publicly accessible without restrictions or fees will be included in the sample.

The study will use cluster sampling, in which the Web site is the primary sampling unit and the Web page is the subunit. Cluster sampling is well suited to sampling Web pages since no list of subunits is available. A random sample of Web sites will be drawn, and data will be collected from each of the Web pages found at each sampled site. The methodology for cluster sampling with clusters of unequal size is well documented by Cochran [6]. The IP (Internet Protocol) address will be used to identify Web sites; each site has a unique, 32-bit numeric IP address. The address is divided into four octets of 8 bits each, usually shown separated by dots (e.g., 132.174.1.5). Since each octet is 8 bits, it can range in value from 0 to 255, creating over 4 billion potential addresses in the total address space.
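As a minimal sketch of the address-generation step (assuming Python and its standard library, neither of which is specified in the paper), a uniform draw from the 32-bit address space can be converted to the dotted-octet form as follows:

    import random

    def random_ip_address(rng: random.Random) -> str:
        """Draw one address uniformly from the 32-bit space and format it
        as four dotted octets, each in the range 0-255."""
        value = rng.getrandbits(32)   # uniform over the 2**32 possible addresses
        octets = [(value >> shift) & 0xFF for shift in (24, 16, 8, 0)]
        return ".".join(str(octet) for octet in octets)

    # Example: draw a small frame of candidate addresses
    # (fixed seed so the draw is reproducible).
    rng = random.Random(1997)
    candidate_addresses = [random_ip_address(rng) for _ in range(10)]
    print(candidate_addresses)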

While every Web site has a unique IP address, not every IP address corresponds to a Web site. Many IP addresses are associated with other Internet services, such as e-mail or FTP, some sites are not publicly accessible, and some IP addresses have simply not yet been assigned. The small proportion of IP addresses currently associated with Web sites complicates data collection, but it does not affect the validity of the sample: each Web site has an equal chance of being selected. However, the number of Web sites in the resulting sample will be much smaller than the number of IP addresses selected.
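As an illustration of the scale involved (a back-of-the-envelope figure based on Gray's estimate cited above, not a result of the study), if there are roughly 650,000 Web sites among the $2^{32}$ possible addresses, the probability that a randomly drawn address hosts a Web site is approximately

$$p \approx \frac{650{,}000}{2^{32}} \approx 1.5 \times 10^{-4},$$

so on the order of $1/p \approx 6{,}600$ addresses must be probed, on average, for each Web site found.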

The sampling itself will be done in three phases. First, a random sample of IP addresses will be generated. Second, an automated program will attempt to connect to port 80 (the standard port for Web servers) at each IP address to determine whether the address serves a public Web site. Third, the contents of each sampled Web site will be harvested by downloading all HTML files from that site.
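The port 80 test in the second phase could be sketched along the following lines; again, this assumes Python and is illustrative only, since the actual harvesting software used in the study is not described in the paper:

    import socket

    def has_web_server(ip_address: str, timeout: float = 5.0) -> bool:
        """Return True if a TCP connection to port 80 succeeds, i.e. the
        address appears to accept HTTP connections."""
        try:
            with socket.create_connection((ip_address, 80), timeout=timeout):
                return True
        except OSError:
            return False

    # Keep only the sampled addresses that answer on port 80; the pages at
    # these addresses are downloaded as HTML in the third phase.
    candidate_addresses = ["132.174.1.5"]   # e.g., output of the first phase
    web_sites = [ip for ip in candidate_addresses if has_web_server(ip)]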

Data collection will begin in June 1997 and continue into the summer. The preliminary analysis is expected to be completed during the summer so that detailed results will be available prior to the conference. We anticipate that the analysis will provide accurate statistics on both the magnitude and characteristics of Web-accessible information.


ANALYSIS

Libraries and the Internet community need reliable statistics regarding the size of the Web and the content of the information on Web pages. These statistics must be based on a well-documented, valid methodology. At a minimum, statistics are needed for:

The nature of static and interactive Web pages demands that they be treated differently from one another. For example, while an estimate of the average size of a static page is meaningful, the average size of an interactive page is meaningless. Generally, the service provided by an interactive page is more important than its text. A small interactive page may be the equivalent of several volumes of tables, or may provide services for which there is no print equivalent.
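For the size estimates in particular, a natural choice would be the standard ratio estimator for cluster samples with unequal cluster sizes given by Cochran [6] (the paper does not state the exact estimators to be used; this is the textbook form). Writing $y_i$ for the total size of the static pages harvested from sampled site $i$ and $M_i$ for the number of static pages at that site, the estimated mean size per page over $n$ sampled sites is

$$\hat{\bar{Y}} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} M_i}.$$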

These estimates are more like those maintained by publishers than those collected by libraries in that they reflect what is published on the Web. To assess the nature of this published information, a categorization of information types is necessary. For the study, each Web page pulled for the sample will be categorized into one of the following five broad groups:

These categories are mutually exclusive. Initial pretesting showed the categories to be viable, although they have not yet been proven to be comprehensive. Most likely, additional categories or subcategories will need to be added to this list. These will be identified during the analysis.

Several other statistics will also be estimated from the sample data. These include:

Although the sample is explicitly limited to publicly accessible Web pages, we will still collect a significant amount of information about nonpublic Web pages. Except for Web sites on Intranets, many nonpublic pages are accessed through a gateway page. These gateway pages are usually public and will be included in the sample. They will provide sufficient information to estimate the amount of nonpublic material and to identify the common types of restricted pages.


CONCLUSIONS

The World Wide Web is an important and rapidly growing information resource. However, relatively little is known about the collection characteristics of Web pages, and reliable statistics are rare. The sampling procedure described in this paper is based on cluster sampling methodology and can be used to collect a representative sample of publicly accessible Web pages. The resulting analysis of the sample is expected to yield comprehensive and accurate statistics on both the size and the characteristics of Web-accessible information.


NOTES

1. Inktomi Corporation, "The Inktomi Technology Behind HotBot", A White Paper, 1996, (5-23-97)

2. Rutkowski, Tony, "Internet Trends," General Magic, February 1997, (5-14-97)

3. Wright, Robert. "The Man Who Invented the Web," Time, Vol. 149, No. 20, May 19, 1997, pp. 64-68.

4. Gray, Matthew. "Web Growth Data," March 19, 1997, (5-15-97)

5. Delta Airlines, May 9, 1997, (5-23-97).

6. Cochran, William G. Sampling Techniques, Third edition. New York: John Wiley & Sons, 1977.