THE LIBRARIANSHIP OF THE WEB: Options and Opportunities Managing Transitory Materials

Wallace Koehler Internet Consultant
Maryville, TN, USA
E-mail: willing@usit.net

The Internet as an information storage and communication medium lies somewhere between the broadcast media and the traditional print and electronic media. Unrecorded broadcast information is extremely ephemeral in that it ceases to exist at the moment it is created. Ownership and distribution are centralized at the point of publication. Print materials, on the other hand, have long lifetimes and continue to exist even when newer versions are created. Distribution and access are diffuse and localized. The WWW is more like broadcast media in that the publisher manages distribution. WWW material persists longer than broadcast, but is highly ephemeral since it disappears as it is superseded.

Non-recorded broadcast material by its very nature cannot be managed, stored, archived, etc. (Once recorded, the material becomes "traditional.") The WWW can be subjected to bibliographic control, but the approaches the librarian or information scientist must take necessarily differ from those taken to traditional materials. This paper addresses some of those differences, explores problems unique to the WWW, and offers recommendations based on experience developing WWW catalogs. Librarians perform a variety of services. These services include selection and collection, reference and retrieval, weeding, information dissemination, cataloging, and myriad other services. Management of Web resources creates new challenges for librarians, but challenges well within their traditional scope. One of the key challenges, if not the key challenge, that the Web brings to the information community is the requirement that bibliographic referents maintain their integrity in an ever-changing information environment.

 
1. INTRODUCTION

The world of information creation, dissemination, retrieval, and use is in constant flux. New technologies and new methodologies are being created to manage these new information technologies in libraries and other venues (Chen, 1996). The last two decades have been marked by remarkable changes in document creation. As late as the early 1970s, the electric typewriter was considered the state-of-the-art "word processor." Documents were edited with scissors and paste. Print was virtually the only means for formal transmission and communication of ideas. Libraries were physical places with physical collections of books and periodicals and "hardcopy" card catalogs. Other vehicles for communication were available, but many were in their infancy. The VCR was under development. The CD-ROM did not exist. The Internet was limited to a small number of universities and government agencies in the United States.

2. THE INFORMATION "REVOLUTIONS"

Information storage and dissemination is today in its third "revolution." The first began with the development of language, and it was and remains the most common form. It is centrally owned and distributed and is largely ephemeral. Interpersonal conversation, body language, and unrecorded broadcasts are all examples of the first revolution. It is centralized because the communication is owned and controlled solely or almost solely by the information creator and publisher. It is ephemeral because once uttered, the communication ceases to exist.

The second revolution began with the written word and with storage and redistribution of that information in something other than human memory. The highest form of storage and redistribution is the library. Communications are distributed and "permanent." They are distributed because copies reside in multiple repositories. They are permanent because the original and its copies are likely to persist for the foreseeable future.

There are those who suggest that the Internet and particularly the World Wide Web offer a new paradigm for information creation, storage, and dissemination. The Internet does not represent a new paradigm. This third revolution represents a form of information creation and management that lies in form between the first two. Ironically, and despite its high technology aspects, the Web as an information storage and communications medium is a less sophisticated and less complex concept than the second revolution.

The Web revolution is not even a revolution in the speed of communications. That began with the invention and dissemination of the telephone, telegraph, and wireless technologies. The WWW began as a centralized and highly ephemeral medium. If the WWW were to remain an ephemeral, centralized, and unrecorded medium, it would be of little interest to the library community.

3. THE EPHEMERAL CHARACTER OF THE WEB

Unlike the spoken word, Web documents are stored and recorded. They are easily accessed and reproduced. WWW documents also differ materially from "second revolution" media in that once they are edited, changed, replaced, recalled, superseded, or eliminated, they are gone. The Greek philosopher Heraclitus taught us that we could step into the same river but once. For all intents and purposes, the same can be said of Web documents. We can access the same document but once. As new "editions" are created, the old cease to exist. Both temporary and permanent solutions are under consideration. Temporary solutions include distributed caching, ranging from proxy to client caches. These raise many practical and ethical issues, including currency, copyright, and control (Tewksbury, 1998). Various formulae for archives have been offered as more permanent solutions. Both comprehensive Web archives (Kahle, 1997) and a wide variety of specialized collections (Feldman, 1997) have been recommended as a means to capture changing Web documents over time.

The transitory nature of the WWW is well recognized but inadequately documented. Chankhuntod et al. (1995) observed that WWW object lifetimes are on average 44 days and that text and graphic objects have mean lifetimes of 75 and 107 days, respectively.

Object lifetimes are less useful for librarians than Web document lifetimes, including those of both Web sites and Web pages. Koehler (1997) has reported a more complex lifecycle for Web sites and pages. Web pages and sites not only change, they also disappear. Moreover, Web documents can be classified according to publisher, size, object dominance, and other factors. These independent variables help predict Web document cycles. Koehler (1998) reports preliminary data that suggest that Web page and Web site half-lives are on the order of 1.5 and 2.4 years, respectively. The attrition rates for both Web pages and Web sites are shown in Figure 1. Both Web pages and sites exhibit intermittent behavior (they disappear for a variety of reasons but reappear). At any given time, about 5 percent of Web documents are gone but will return. He also charted Web page change rates and found that between 20 and 80 percent of Web pages undergo some type of change weekly. This is shown in Figure 2.
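To make the half-life figures concrete, the short sketch below converts a half-life into the fraction of documents expected to survive after a given time. It assumes simple exponential decay, which is an assumption of this illustration and not a model stated in the cited study; the function name `surviving_fraction` and the three-year horizon are likewise hypothetical.

```python
def surviving_fraction(years: float, half_life_years: float) -> float:
    """Fraction of documents still present after `years`.

    Assumes exponential decay (an illustrative assumption, not a
    finding of the cited study).
    """
    if half_life_years <= 0:
        raise ValueError("half_life_years must be positive")
    return 0.5 ** (years / half_life_years)


# Half-lives as reported in the text (years).
PAGE_HALF_LIFE = 1.5
SITE_HALF_LIFE = 2.4

if __name__ == "__main__":
    horizon = 3.0  # hypothetical planning horizon, in years
    for label, half_life in (("Web pages", PAGE_HALF_LIFE),
                             ("Web sites", SITE_HALF_LIFE)):
        share = surviving_fraction(horizon, half_life)
        print(f"{label}: {share:.0%} expected to remain after {horizon} years")
```

Under this assumption, a catalog that is never maintained would lose about three quarters of its Web page references within two page half-lives (three years).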

Cognizant of Web site changes, at least one digital Web library has developed selection criteria that consider URL stability as one factor. The Scholarly Societies Project at the University of Waterloo (1998) is a collection of URLs for professional organizations worldwide. They have found that "canonical" URLs are more stable than the non-canonical. Stability is defined as longevity. A canonical URL is one that contains the organization name without reference to other hosts or servers: e.g. www.orgname.org. They now have a preference for canonical URLs.
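The definition of a canonical URL given above can be approximated with a simple test. The sketch below is one possible heuristic, not the Scholarly Societies Project's own procedure: it treats a URL as canonical when the host is the organization name plus a single top-level domain (optionally prefixed by `www`) and the path points at the server root. The function name `is_canonical` and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def is_canonical(url: str, org_name: str) -> bool:
    """Heuristic check for URLs of the form www.orgname.org.

    Assumption: "canonical" means the host is <orgname>.<tld>, with an
    optional leading "www", and the path is the server root.
    """
    if "://" not in url:
        url = "//" + url  # let urlsplit recognize a bare host name
    parts = urlsplit(url)
    labels = (parts.hostname or "").lower().split(".")
    if labels and labels[0] == "www":
        labels = labels[1:]
    host_is_org = len(labels) == 2 and labels[0] == org_name.lower()
    at_root = parts.path in ("", "/")
    return host_is_org and at_root


if __name__ == "__main__":
    print(is_canonical("www.orgname.org", "orgname"))
    print(is_canonical("http://www.univ.edu/~orgname/", "orgname"))
```

This heuristic does not handle multi-part country domains (for example, hosts ending in `.org.uk`); a production catalog would need a fuller rule.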

Figure 1. Web site and web page attrition




4. WEB GROWTH

The WWW is not only experiencing document change; it is also increasing in size. Not only is the number of URLs, hosts, and server-level domains increasing at near geometric rates, a phenomenon documented in a number of quarters (e.g. NetCraft, 1998; NetWizards, 1998), but the size of existing Web pages and Web sites is also increasing (Koehler, 1997). The amount of WWW activity is also increasing; more packets are being sent and received than ever before (Matrix Information & Directory Services, 1998).
 
 


 

Figure 2. Web page weekly change




5. OWNERSHIP AND ACCESS

The WWW has been described as an anarchic, highly decentralized medium, which from one perspective it in fact is. From a librarian's perspective, however, the content of the Web should be considered highly centralized. It is centralized because its content is largely owned, stored, and controlled by its authors and publishers at single locations. For the most part, end users can access that information only at that single location. The question of ownership and access is one of the more interesting debates facing the librarian community. The WWW has exacerbated the debate. By definition, the native Web document cannot be owned by libraries, but can be accessed from them.

What is interesting and most pertinent for those of us in the library and information science community is the evolution of the WWW from a centralized, ephemeral medium to one that is more distributed, recorded, and permanent. Digital libraries, jumppages, and gateways have created inchoate forms of Internet bibliographic control (McDonnell, Koehler, and Carroll, 1997). Archives and caches are establishing recorded Web document histories, distributed collections, and, to a very limited degree, permanence. There is a wide variety of efforts by others to provide pointers to Internet information (e.g., URx, PICS, DOI, and Dublin Core) and to develop historical archived collections. These include the work of OCLC, the US Library of Congress, SOSIG/DESIRE, the World Wide Web Consortium, the Coalition for Networked Information, and many others. Some are more sophisticated than others. The Dublin Core is perhaps the least complex and least structured in that it envisions Web authors assigning pre-coordinate terms to Web documents. The W3C PICS labels allow Web authors to assign pre-coordinate alphanumeric codes to Web documents, or others to provide post-coordinate codes to local or network caches. OCLC's NetFirst is a post-coordinate search engine with both descriptors and Dewey Decimal Codes.

Unfortunately, these classification schemes, as sophisticated as they may be, fail in one very important respect. They do not manage change well. Thus, as Web documents change or disappear, the classifications either lag those changes or fail to reflect them. One of the key challenges, if not the key challenge, that the Web brings to the information community is the requirement that bibliographic referents maintain their integrity in an ever-changing information environment.

6. BIBLIOGRAPHIC CONTROL OF THE WEB

Librarians perform a variety of services. These services include selection and collection, reference and retrieval, weeding, information dissemination, cataloging, and myriad other services. Management of Web resources creates new challenges for librarians, but challenges well within their traditional scope.

What librarians do may not change; how and when they do it, however, does. First, the Web requires librarians (and others) to gain new and sometimes complex computing and online skills. These skills may range from the management of LANs and installation of TCP/IP stacks to the manipulation of a wide variety of constantly changing search engines.

Librarians excel at bringing order to chaos. Without question, the WWW as it is now arrayed is true chaos. There are librarians who despair of ever organizing information and metadata for the WWW so that it can be effectively and efficiently used (Ardito, 1998). The library community is concerned with mechanisms to bring order to the WWW, including addressing, indexing, and cataloging schemes. A number of information scientists are examining patterns among URLs, top-level domain names, site structures, page types, hypertext link types, and other indexable markers to identify content, quality, timeliness, and other factors (Chu, 1997; McDonnell, Koehler, and Carroll, 1998; Koehler, 1997a; Urgo, 1997).

Librarians are also concerned with developing effective catalogs of Web material. It is clear that traditional methods no longer hold simply because of the degree of change and the labor-intensive nature of those methods. Catalog records and indexes must be updated as Web materials change if the WWW is to be subject to bibliographic control. These bibliographic changes can only be managed through automated systems that significantly reduce human inputs.

The process has two major steps: (1) the creation of the bibliographic collection and (2) its maintenance. The first step, the creation step, is fairly complex. Web collections, like all other collections, must meet the needs and requirements of the end users for whom they are designed.

The maintenance phase is more straightforward. The WWW must be searched again periodically to harvest previously undiscovered yet relevant materials, eliminate those that have become irrelevant, and capture new material.
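The maintenance step can be illustrated with a minimal sketch of an automated pass over an existing catalog. It is not the system described in the cited work: it assumes the catalog stores one content digest per URL and that a caller supplies a `fetch` function returning the current content, or `None` when the document cannot be retrieved. All names here (`digest`, `maintenance_pass`, the status labels) are hypothetical.

```python
import hashlib
from typing import Callable, Dict, Optional


def digest(content: bytes) -> str:
    """Return a fingerprint of a document's content."""
    return hashlib.sha256(content).hexdigest()


def maintenance_pass(
    catalog: Dict[str, str],
    fetch: Callable[[str], Optional[bytes]],
) -> Dict[str, str]:
    """Compare each cataloged URL with its current content.

    `catalog` maps URL -> digest recorded at the last pass and is
    updated in place for changed documents. Returns URL -> status,
    where status is 'unchanged', 'changed', or 'missing'.
    """
    report: Dict[str, str] = {}
    for url, old_digest in list(catalog.items()):
        content = fetch(url)
        if content is None:
            # Keep the record: a missing document may reappear later.
            report[url] = "missing"
        elif digest(content) == old_digest:
            report[url] = "unchanged"
        else:
            report[url] = "changed"
            catalog[url] = digest(content)  # record the new version
    return report


if __name__ == "__main__":
    # An in-memory stand-in for the Web, so the sketch runs offline.
    web = {"http://example.org/a": b"same text",
           "http://example.org/b": b"revised text"}
    catalog = {"http://example.org/a": digest(b"same text"),
               "http://example.org/b": digest(b"original text"),
               "http://example.org/c": digest(b"vanished text")}
    for url, status in maintenance_pass(catalog, web.get).items():
        print(status, url)
```

Because Web documents behave intermittently, the sketch flags missing documents instead of deleting them; a cataloger might weed a record only after it has been missing across several passes. Records marked as changed would still need review to decide whether the catalog description remains accurate.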

7. CONCLUSION

World Wide Web materials are different from the "traditional," which include both print and other electronic media. There are at least seven significant differences between the two:

• First, WWW material is ephemeral and impermanent while traditional media are far more permanent.

• Second, WWW material is inconstant. It changes. Once published, traditional media do not change (or at least, not often).

• Third, authority and quality issues are, at least for the moment, vague on the WWW but clearer for traditional materials.

• Fourth, metadata are often uncertain or vague and often ignored on the Web. Systems are far more developed and accepted for traditional materials.

• Fifth, Web materials point to as well as incorporate other native Web materials both as bibliographic references and as integral parts. Traditional materials more often utilize bibliographic surrogates.

• Sixth, a collection of Web documents requires little or no physical maintenance. Libraries of traditional media require much.

• Seventh, catalog maintenance of Web documents requires much attention, traditional materials far less.
 
 

We continue to work toward an understanding of the dynamics of the WWW as a medium for information transfer and use. I offer the following preliminary observations for consideration.

• The medium is no more the message than pen and paper are Hamlet or the word processor with which this was written is the study. Yet, first editions are collectors’ items. There is beauty in presentation.

• We should not become too preoccupied with the technology, but rather with the role of that technology and the implications of that technology for communicating ideas. We need to be aware of our traditional and new roles as information scientists and librarians. We are concerned with providing access to the intellectual content of what we are charged to preserve. We are not changing what we do but how we do it.

• We need to conceive of Web sites and Web pages in ways other than as surrogates for the traditional. A Web page is not a page in the traditional sense of the word. Web pages consist of a series of Web objects. A Web page is like one of Alexander Calder’s mobiles. Objects hang from a central object. Each object performs an individual task. At their base is usually, but not necessarily, a text object. A text object is a Web object that is created and edited using a specific set of tools based on a specific set of standards. The base object can be a graphic object. That too is a Web object created and edited using a specific set of standards. A text object and a graphic object can carry the same message. Objects are attached to the base object. Objects can be added, eliminated, or changed at will. Manipulation of the objects can and does affect the message as much as editing the base object. If the page is built well, it has balance.

• The Web is different from traditional presentation media. Traditional implies that everything is "fixed" and immutable once published. A Web document is fluid. Once changed, the old ceases to exist. In the traditional world, the changes become new editions.

• All storage media are transitory. Some media are more transitory than others.

• Is it worth the trouble to try to develop bibliographic access to the WWW? Should we just fall back and rely on the search engines with their inadequate indexes to do a poor job of it?
 
 

We should not bemoan the fact that the WWW is inherently different from traditional media. Nor should we throw our hands up and declare the new reality too complex to manage. We are not facing a paradigm shift. We are adding a new to information management, communication, and processing. Librarians and information scientists have not lost their mission. That mission has been expanded and, I believe, enhanced by the Internet.

REFERENCES

Ardito, S. (1998). The Internet: Beginning or End of Organized Information? Searcher, 6 (1): 52-7.

Chankhuntod, A., P. Danzig, C. Neerdeals, M. Schwartz, and K. Worrell (1995). A Hierarchical Internet Object Cache. http://excalabar.usc.edu/cache.html. Updated November 6, 1995.

Chen, C. (1996). Global Digital Library: Technology is ready, how about content? In Proceedings of NIT (New Information Technology) '96: 9th International Conference, Pretoria, South Africa, November 11-14, 1996. West Newton: MicroUse Information.

Chu, H. (1997). Hyperlinks: How well do they represent the intellectual content of digital collections? Digital collections: Implications for users, funders, developers and maintainers. In Proceedings of the American Society for Information Science. Medford, NJ: Information Today. pp. 361-69.

Feldman, S. (1997). "It was here a minute ago!": Archiving the Net. Searcher, 5 (9): 52-64.

Kahle, B. (1997). Preserving the Internet. Scientific American, 276 (3): 82-3.

Koehler, W. (1997a). An end user's view of mining the Web: Focused and satisfied Internet search and retrieval strategies. In Proceedings of the Internet Society Meeting, The Internet: Global Frontiers, CD-ROM and http://www.isoc.org/isoc/whatis/conferences/inet97/proceedings/D3/D3_3.HTM.

Koehler, W. (1997b). Web Site and Web Page Persistence and Change: A Longitudinal Study. MS Thesis, The University of Tennessee, Knoxville, TN.

Koehler, W. (1998). An analysis of Web page and Web site constancy and permanence. Journal of the American Society for Information Science, forthcoming.

McDonnell, J., W. Koehler, and B. Carroll (1997). Automating the dynamic development and maintenance of a distributed digital collection. Digital collections: Implications for users, funders, developers and maintainers. In Proceedings of the American Society for Information Science. Medford, NJ: Information Today. pp. 244-59.

McDonnell, J., W. Koehler, and B. Carroll (1998). Cataloging challenges in an area studies Virtual Library Catalog (ASVLC). Journal of Internet Cataloging, 1 (3), forthcoming.

Matrix Information & Directory Services. (1998). http://www.mids.org.

NetCraft. (1998). http://www.netcraft.co.uk/Survey.

NetWizards. (1998). http://www.nw.com.

Tewksbury, R. (1998). Is the Internet heading for a cache crunch? On the Internet, 4 (1): 16-22.

University of Waterloo, Scholarly Societies Project. (1998). URL-Stability Index for the Scholarly Societies Project. http://lib.waterloo.ca/society/URL_stability_index.html.

Urgo, M. (1997). Analyzing company Web sites. InfoManage, 4 (3): 7.