FULL-TEXT CD-ROM LIBRARIES FOR INTERNATIONAL DEVELOPMENT

Bibliotecas de Texto-Completo en CD-ROM
para el Desarrollo Internacional

Matilda Butler

Knowledge Access International
Mountain View, CA 94043, USA

Keywords: Full-Text Retrieval, CD-ROM, International Development, Cooperation, Document Delivery, Developing Countries, CGIAR, Agricultural Information.

Abstract: Most CD-ROM products currently in use in libraries around the world are bibliographic databases. However, the future impact of this new information technology on international development will be dominated by full-text collections that provide entire libraries on CD-ROM. These library-on-a-disk products will range in size from a few books to hundreds and even thousands of books.

This paper describes a 25-disc collection being produced by Knowledge Access International and Saztec International for the Consultative Group for International Agricultural Research (CGIAR) of the World Bank. This full-text CD-ROM library, known as the Compact Agricultural Library, consists of 5,500 books and reports and represents one of the largest integrated CD-ROM collections under production.

Resumen: La mayoría de los productos CD-ROM en uso en las bibliotecas alrededor del mundo son bases de datos bibliográficas. Sin embargo, el impacto mayor de esta nueva tecnología en el desarrollo internacional será la provisión del contenido textual de la obra, el artículo, de bibliotecas enteras.

Esta ponencia describe una colección de 25 discos producidos por Knowledge Access International y Saztec International para el Consultative Group for International Agricultural Research (CGIAR). Esta biblioteca en CD-ROM, conocida como Compact Agricultural Library, consiste de 5,500 libros e informes, siendo así una de las más grandes colecciones integradas en CD-ROM actualmente en producción.

 
 
1. INTRODUCTION

CD-ROM, since its inception, has been of considerable interest throughout the world. The benefits of CD-ROM over paper include:

• Cost Savings

• Improved Access

• Expanded Distribution

The benefits of CD-ROM over online include:

• Cost Savings

• Access to More Primary Information

• Greater Reliability

These benefits are significant both to users of information who want better access at a lower cost and to owners of information who want to distribute their information more widely.

2. CD-ROM AND INTERNATIONAL DEVELOPMENT

The development of CD-ROM originally mirrored the growth and content of online. Bibliographic databases, the backbone of online, were the first to move to CD-ROM. Then CD-ROM quickly overtook and passed the online services in providing access to many forms of information within a unified retrieval system. Structured records, full text, monochrome images, color images, numeric data, and audio can all be retrieved and displayed interchangeably.

The advent of bookshelf CD-ROM collections has brought these forms of information together on one disc. The wise decision to implement most CD-ROM applications on personal computers rather than minicomputers or mainframes promises to make bookshelf collections available around the world at low cost.

These two waves of CD-ROMs -- bibliographic and full text -- can now be put to the service of international development. The first wave, bibliographic, is significant because international access to online resources has been difficult and expensive. However, even when available locally, bibliographic collections often provide information about resources that cannot be obtained. The second wave, full-text, is significant because it makes the primary literature available inexpensively at any location -- even rural areas that lack major library resources.

The following list exemplifies some of the bibliographic databases on CD-ROM that are now available for international development.

Agricola

AGRIS (International Information System for the Agricultural Sciences and Technology)

CAB Abstracts

Aquatic Science and Fisheries Abstracts

Energy Library

KIT Abstracts (agriculture, environment, and geography)

Popline (family planning & population, biomedicine)

Selected Water Resources

Sesame (agriculture)

The next list exemplifies some of the primary literature, referred to as full-text information, on CD-ROM that is now available for international development.

AIDS Information and Education

Encyclopedias (3 editions)

Food, Agriculture and Science

Languages of the World

Le Monde En Chiffres (economic data)

Maps (3 editions)

The Pesticides Disc

Women: Partners in Development

World Weather Disc

As bookshelf collections increase in size from a few books to hundreds and thousands of books, the problems of managing multiple-disc retrieval are added to the problems of managing different forms of information on one disc. In a case study, this paper describes a 25-disc collection being produced by Knowledge Access International and Saztec International for the Consultative Group for International Agricultural Research (CGIAR) of the World Bank. This collection, known as the Compact Agricultural Library, consists of more than 5,000 books and represents one of the largest integrated CD-ROM collections under production. The agricultural research centers served by the CGIAR, located principally in developing countries, are eager to acquire an entire working library that occupies less than a foot of shelf space and is electronically searchable throughout. In print format, the collection occupies more than 100 feet of shelf space. The text and image contents of these discs can be printed, saved on magnetic disk, telecommunicated to field stations, or faxed. Local document collections can be added to the Compact Agricultural Library in the same format.

Although the adoption of CD-ROM versions of bibliographic databases such as AGRICOLA, AGRIS, and SESAME is growing rapidly because of cost savings and searching power, the number of such databases is limited. Yet each of these bibliographic databases caps a literature large enough to generate hundreds and sometimes thousands of bookshelf CD-ROMs containing the primary documents. Therefore the still-developing science of producing bookshelf CD-ROM collections has become an important challenge for publishers of this new medium.

3. MAKING INFORMATION MORE USABLE

More than 50 years ago I.A. Richards wrote, "Books are machines to think with." His abiding concern was the fidelity with which meaning could be conveyed from writer to reader. Over a span of five hundred years, books have become the principal medium for conveying meaning through writing. Writers, editors, and publishers draw not only on their own knowledge but also on well-tested practices to create books that convey meaning effectively.

Electronic information began to supplement printed information when online retrieval services were established in the 1970s. Beginning in the mid-1980s, optical disks in the compact disc read-only memory (CD-ROM) format extended the shift from print to electronic information. One or both of these forms of information are now available in thousands of research libraries, universities, governmental agencies, and corporations around the world.

At first, almost all electronic information consisted of bibliographic files, financial quotations, and other brief records with little need or opportunity for editorial improvement. Until recently, the limited amount of full text offered by the online services has been typified by news wire dispatches that supersede each other hourly. Not even typographical errors are consistently corrected in most electronic information.

Two perspectives seem to account for the poor editorial preparation and presentation of electronic information. The first perspective is that electronic information is too ephemeral to merit proper editing. The second perspective is that "This is not the real information; this is only an index or summary of the real information, which is found in books and journals."

However, many information archiving and publishing programs are now focusing on optical disks rather than printed publications because of significant cost differences. A one-ounce CD-ROM can store more information than a thousand pounds (100,000 sheets) of paper and is inexpensive to "print" and mail.

"The real information" now is electronic in many cases, and electronic publications require much better editorial preparation and presentation than they have received in the past.

The problems of publishing bookshelf CD-ROM collections arise not only because the quantities of information are large but also because we want to enhance the usability of the information at the time of production. The information could be transferred to CD-ROM without enhanced features at lower cost. In the least expensive case, digitized page images rather than computer-searchable text pages could be transferred to CD-ROM for only a fourth of the conversion cost of the text pages. Although page images require more storage space and hence more discs, it can be demonstrated that the increased cost of mastering and replicating the discs still yields a net savings of more than half over full text conversion. An important factor in this cost comparison is that an average of 30% of all text pages need to be imaged anyway because of illustrations and other inconvertible material such as equations.

However, an image-only collection would not provide the enhanced features that are becoming the hallmark of bookshelf collections. Almost all enhancements that are added to a bookshelf collection at the time of production involve computer-readable text.

Generally, at the present time, page images are used only to capture illustrations and inconvertible material. All other text is converted to computer-readable form, which is viewed as essential both for precise retrieval and for flexible use of the retrieved information. Computer-readable text can be manipulated, printed, and transferred to other programs like word processors in many ways.

The goal of making information more usable, combined with the remarkable ability of CD-ROM to assimilate different types of information, also creates opportunities and problems for design with respect to audio, color images, and the inclusion of information that is related to the full text but usually appears elsewhere, such as bibliographic records and primary data.

Color images have not been incorporated in many bookshelf collections because of a lack of display standards for high-resolution color. Super VGA has been proposed as a compromise standard, combining wide adoption with relatively high resolution. However, when color images are needed at all, as in medicine, science, art, or marketing, both resolution and color accuracy are usually of the essence. Excellent color displays on standard personal computers will be commonplace in the near future.

The historic relationship between electronic bibliographies and electronic full text is being reversed. Initially, full text was added to bibliographic records as an additional field, as in the case of the MAGAZINE INDEX online. Now bibliographic records are being added to full-text databases because the cataloguing information and subject descriptors of a good bibliographic record are often more helpful than the forematter of a book itself. The user should not have to know that bibliographic records are fielded while full text is not. The retrieval system should take advantage of these differences without drawing attention to them.

Some books are based on extensive primary data in the form of survey data, laboratory findings, field notes, economic analyses, etc. It has never been practical in the past to publish more than a fraction of the primary data that books address. In CD-ROM publishing, however, primary data is not a problem, even if it occupies ten or a hundred times the space of its resulting book. In fact, it will probably need little processing and will go on the disc more cheaply per megabyte than the book.

Thus CD-ROM's versatility creates many opportunities to enhance the information and make it more usable. In turn, each enhancement brings initial problems of design and implementation. Since most enhancements focus on the most confusing part of a bookshelf collection, the text itself, the remainder of this paper will discuss the ways in which text requires enhancement and how the enhancements are done.

4. THE DESCRIPTIVE PROPERTIES OF FULL TEXT VERSUS FIELDED INFORMATION

The fielded information found in bibliographies and directories has two important advantages over full-text information. First, the field labels guide the user in choosing an attribute for searching such as date, title, or geographical location. Second, the field contents are usually entered with some care to ensure accuracy and consistency; often a controlled vocabulary governs all of the possible entries for a field.

In comparison, full text usually lacks zones where information has a predictable meaning, nor are most writers as disciplined as bibliographers or directory editors in ensuring that terms are used consistently on page 5 and page 500. The literalness and all-inclusiveness of electronic searching only magnifies the ambiguity that a reader of the printed text must overcome through the use of context and prior knowledge.

In computer searching, homonyms are retrieved as false positives, while synonyms that the user forgot to enter are unretrieved and thus become false negatives. Many ambiguities are so specific to a text that an editor can scarcely anticipate them all. For example, on one recent CD-ROM a frequent reference to "pre-1970" was allowed to remain unedited, causing those sections to be retrieved incorrectly in a search of the year 1970.
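The "pre-1970" false positive can be reproduced with a naive word index. The sketch below is only an illustration: the two tokenizers, the function names, and the sample sections are hypothetical, and the paper does not say how the retrieval system on that disc actually split its text.

```python
import re

def naive_tokens(text):
    # Split on anything that is not a letter or digit, so "pre-1970"
    # becomes the two tokens "pre" and "1970".
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]

def hyphen_aware_tokens(text):
    # Keep hyphenated compounds whole, so "pre-1970" stays one token.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

def search(sections, term, tokenizer):
    # Return the ids of sections whose token list contains the term.
    return [sid for sid, text in sections.items() if term in tokenizer(text)]

# Hypothetical sample sections, for illustration only.
SECTIONS = {
    "s1": "Rainfall records for 1970 show a severe drought.",
    "s2": "Most pre-1970 varieties were susceptible to rust.",
}

naive_hits = search(SECTIONS, "1970", naive_tokens)         # s1 and s2
aware_hits = search(SECTIONS, "1970", hyphen_aware_tokens)  # s1 only
```

With the naive tokenizer the section about pre-1970 varieties is retrieved for the year 1970; keeping the compound whole removes that false positive, at the price of requiring the user to search "pre-1970" explicitly.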

Most fielded databases such as bibliographies and directories have an editorial team that is responsible for the consistency of the entire database. Full-text bookshelf collections cannot receive the same editorial attention for two reasons. First, the books already exist and may have originated in many sites. Second, the size of the collection may exceed any time budget that a publisher can consider giving to it. For example, if each of the 450,000 source pages for the Compact Agricultural Library were to receive only one minute of editorial review, the time budget would be 7,500 hours. With a work year of 240 8-hour days, four person-years would be required to finish the task.
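The time-budget arithmetic can be checked with a few lines of code. This is a minimal sketch; the function name and parameter names are illustrative, while the figures (450,000 pages, one minute per page, 240 working days of 8 hours) come from the paragraph above.

```python
def editorial_budget(pages, minutes_per_page, hours_per_day=8, days_per_year=240):
    # Total editing time in hours, and the same time in person-years.
    total_hours = pages * minutes_per_page / 60
    person_years = total_hours / (hours_per_day * days_per_year)
    return total_hours, person_years

hours, years = editorial_budget(450_000, 1)
print(f"{hours:.0f} hours, {years:.2f} person-years")
```

One minute per page comes to 7,500 hours, or about 3.9 person-years. Calling the same function with three minutes per page gives roughly 11.7 person-years, which shows how quickly any per-page editorial task outgrows a realistic budget at this scale.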

The vocabulary within a bookshelf collection often ranges from the popular to the rigorously technical. As a consequence, synonyms and related terms are scattered not only on one vocabulary plane but on several. In one collection pertaining to pharmaceuticals, "faintness" was related to "dizziness" on the same plane but also related to "syncope" across planes.

An international bibliothek of libros can deliver the coup de grace to editorial efforts toward consistent vocabulary. For example, in the recent CD-ROM on Acid Rain: Canadian Government Documents, the English and French halves of the collection must be searched separately. There is no satisfactory dictionary for the automatic translation of search terms. A phrase such as "hydraulic ram" often becomes the equivalent of "water goat" in another language.

National vocabulary differences within the same language are less apparent but will upset a search just as effectively. Agronomists in two English-speaking countries might entitle the same report, Crop rotation of peanuts and corn or Intercropping of ground nuts and maize. Good bibliographic records would rescue the search in this example, because a thesaurus of subject headings would conform each pair of synonyms to one term or the other.
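The way a thesaurus of subject headings conforms synonyms to one preferred term can be sketched as follows. The thesaurus table and the function name are hypothetical and cover only the two crop names in the example above; a real subject-heading thesaurus would be far larger and would be maintained by cataloguers.

```python
import re

# Hypothetical thesaurus: each variant maps to one preferred subject term.
PREFERRED = {
    "peanuts": "groundnuts",
    "peanut": "groundnuts",
    "ground nuts": "groundnuts",
    "groundnuts": "groundnuts",
    "corn": "maize",
    "maize": "maize",
}

# One alternation, longest variants first, so "ground nuts" is matched whole.
_VARIANTS = sorted(PREFERRED, key=len, reverse=True)
_PATTERN = re.compile(r"\b(" + "|".join(re.escape(v) for v in _VARIANTS) + r")\b")

def subject_terms(title):
    # Return the set of preferred terms for the variants found in the title.
    return {PREFERRED[m.group(1)] for m in _PATTERN.finditer(title.lower())}

a = subject_terms("Crop rotation of peanuts and corn")
b = subject_terms("Intercropping of ground nuts and maize")
```

Both titles yield the same pair of subject terms, so a search on either national vocabulary would reach both reports once the terms are conformed.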

Of course, terminological inconsistency is a problem in all databases, as an online searcher can quickly verify by examining the index around any search term (using, for example, the expand command in DIALOG). Variant terms, misspellings, and bad grammar are common. In a fielded database, field labels and controlled vocabularies help to minimize the effects of these problems. In a full-text collection, the computer should be able to overlook inconsistencies and resolve ambiguities in the same way humans do, by bringing prior knowledge and context to bear on the problem. As it happens, prior knowledge of relationships among topics in the text can be embodied in hypertext markup, and semantic networks can be derived from the entire verbal context through statistical analysis. These are two immediate frontiers of research on full-text description.
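The paper does not specify which statistical analysis would produce a semantic network. As one simple possibility, assumed here purely for illustration, terms can be linked by how often they occur in the same section. The function name, vocabulary, and sample sections below are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(sections, vocabulary):
    # Count, for each pair of vocabulary terms, the number of sections
    # in which both terms appear. Pairs are stored in alphabetical order.
    counts = Counter()
    for text in sections:
        present = sorted({w for w in text.lower().split() if w in vocabulary})
        for a, b in combinations(present, 2):
            counts[(a, b)] += 1
    return counts

# Hypothetical sample data.
SECTIONS = [
    "maize yield under drought",
    "drought tolerance in maize and sorghum",
    "sorghum storage pests",
]
VOCAB = {"maize", "drought", "sorghum", "pests"}

links = cooccurrence(SECTIONS, VOCAB)
```

Pairs with high counts, such as ("drought", "maize") here, become the strong edges of the network; pairs that never share a section have a count of zero and no edge.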

5. THE USES AND LIMITS OF EDITORIAL MARKUP

In print publishing, editorial markup is concerned with how the text will appear on a two-dimensional printed page. The major issues are type font and graphic layout. In electronic publishing, there are generally fewer choices in typography but many more choices in layout.

Electronic text is not two-dimensional. "Pages" are not what they seem, because there is probably hidden material at the left of each page as well as unhidden but often unnoticed material from columns 80 through 239 on the right of each page -- the normal display window uses only the left 76 columns of text, not counting hidden columns for tags in the left margin. The top and the bottom of each long text file may also serve as locations for information that is not in the text itself but would be helpful to the user, such as a topical overview of the database at the top and a glossary of terms at the bottom. Overlays on displayed pages further remove them from the realm of two dimensions.

However, the greatest asset of electronic text can also be its greatest liability. All pages are equidistant from each other in terms of searching or displaying speed. Pages that deal with a common topic, even though they are separated by thousands of other pages in the text file, will be seen together often, in direct succession as if they were physically assembled in a mini-book for the occasion. In a bookshelf CD-ROM collection, a retrieval with 200 "hits" from 50 books, which can be screened at the rate of one hit or page every few seconds, can be compared with the effort of removing 50 printed books from the library shelf, bookmarking the relevant pages in each book, and trying to read each book in the context of what is being learned from the other books.

The disadvantage of reviewing 200 hits from 50 books in 10 minutes is the disorientation that comes from knowing too little about each book.

After other editorial preparation is complete, the markup of a full-text database for CD-ROM publication typically affects five features of the database:

• Left-margin field tags for field-specific searching. Image calls that assign images to specific text locations are also flagged here.

• Right-margin heading markers, hypertext markers, instructions, synonyms, etc. The abundant space between columns 80 and 239 is largely "out of sight, out of mind" to the user, although it can be viewed by shifting right with the TAB key or right arrow. It can hold, for example, two complete foreign-language versions of the 76-column text on the left that the user normally sees. More commonly the right margin is used for any editorial material that will enhance searching or understanding the text but should not be inserted in the text itself.

• Topical overviews of the database, or other introductions to the database, as the first content for the user to see at the top of the database. Unlike the editorial introduction to a book, new material at the top of a CD-ROM database can contain hypertext markers for direct jumps to sections thousands of pages away. Thus the topical overview can be dynamic and can take the user directly to each section that corresponds to a topic heading at the top of the database.

• Glossaries and other helpful reference material at the bottom of the database. The editor can construct a "safety net" at the bottom of the database so that search terms that "fall through" the text index (because they lack occurrences in the text, and synonyms have not been posted) may find a hit in the glossary, where the user may be able to identify a comparable term and go to its occurrences in the main text via hypertext.

• The layout of each 22-line screen, especially where the CD-ROM display file should look different from a previous printed or online version. A CD-ROM display file is often a continuous scroll from top to bottom without page breaks, yet unlike the one-directional scroll of an online file it can be "paged" forward and backward very quickly in large or small jumps. The editor's test is whether any 22-line segment the user is looking at is as self-explanatory as it can be.
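The glossary "safety net" described in the list above can be sketched as a two-step lookup. This is a minimal illustration under stated assumptions: the index and glossary are modelled as plain dictionaries, the function name is hypothetical, and the sample entries reuse the "faintness" and "syncope" example from earlier in the paper with invented page numbers.

```python
def search_with_safety_net(term, text_index, glossary):
    # Step 1: look for the term in the main text index.
    key = term.lower()
    if key in text_index:
        return ("text", text_index[key])
    # Step 2: the term "fell through"; try the glossary, which points
    # to a comparable term that does occur in the main text.
    if key in glossary:
        comparable = glossary[key]
        return ("glossary", (comparable, text_index.get(comparable, [])))
    return ("none", [])

# Hypothetical data: term -> page numbers, and glossary entry -> comparable term.
TEXT_INDEX = {"syncope": [1041, 2210], "dizziness": [87]}
GLOSSARY = {"faintness": "syncope"}

direct = search_with_safety_net("dizziness", TEXT_INDEX, GLOSSARY)
caught = search_with_safety_net("faintness", TEXT_INDEX, GLOSSARY)
```

A search for "faintness" finds nothing in the text index but is caught by the glossary, which hands back "syncope" and its occurrences, the jump the paper describes as being made via hypertext.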

If a full-text database is viewed as an endless page that can be 10 million lines long, which equals 450 megabytes and can be contained along with its 200-megabyte indexes on one CD-ROM, then the face of that page as well as its four margins are all available for editorial markup as described above. However, no budget for a bookshelf CD-ROM collection can afford 10 million lines of markup plus the open-ended list of enhancements that can be added in the four margins. Careful markup of headings, judicious use of hypertext markers, and perhaps some topical markup based on database analysis are as much markup as an average project can afford.

6. THE USES AND LIMITS OF HYPERTEXT

It might be possible to produce a large bookshelf CD-ROM collection without using hypertext, but the results would be disappointing for the user. Hypertext is defined as an ability to jump between related sections of the text, whether within the same book or between books. Hypertext does not replace the searching power of the retrieval system. For example, it cannot carry out a search using Boolean logic such as "(term1 AND term2) NOT term3".
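For contrast with hypertext jumps, a Boolean search of the form "(term1 AND term2) NOT term3" reduces to set operations on an inverted index. The sketch below is a generic illustration, not the retrieval system used for the collection; the function names and sample sections are hypothetical.

```python
def build_index(sections):
    # Inverted index: word -> set of section ids containing that word.
    index = {}
    for sid, text in sections.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(sid)
    return index

def and_not(index, term1, term2, term3):
    # (term1 AND term2) NOT term3, as set intersection and difference.
    both = index.get(term1, set()) & index.get(term2, set())
    return both - index.get(term3, set())

# Hypothetical sample sections.
SECTIONS = {
    1: "rice irrigation in lowland fields",
    2: "rice irrigation and fertilizer trials",
    3: "wheat irrigation schedules",
}

index = build_index(SECTIONS)
hits = and_not(index, "rice", "irrigation", "fertilizer")
```

Only section 1 satisfies the query. A hypertext link, by comparison, connects one fixed location to another and cannot combine conditions in this way.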

The purpose of hypertext is to lead the user along paths through the database that connect related sections. The sections may be connected either by naturally occurring terms or by markers that have been added by editors. In the future, electronically sophisticated authors will add their own hypertext markers to the text, just as they now add cross-reference page numbers for the book reader.

Some of the many uses of hypertext include:

• Linking two or more cross-referenced sections so that a user can jump from one to the other and back again.

• Linking a general introduction to each of the detailed topics that it introduces, no matter how far away they are in the text.

• Linking the main text to tables, figures, or supplementary appendix material.

• Linking a bibliographic record to the corresponding full text.

• Linking Table of Contents and Index entries to text locations.

• Linking text terms to a glossary.

• Linking footnote references directly to footnotes.

For the publisher, a good implementation of hypertext has four features: (1) it is informal, so that any favorite markup system will work; (2) it is forgiving, so that links can be added or changed in the last stages of text preparation; (3) it uses naturally occurring terms or symbols in the text whenever possible, in order to minimize editorial markup; (4) the links are WYSIWYG (what you see is what you get); any necessary markup occurs without programming, and the markers can be seen and proofread along with the text.

When naturally occurring terms are used to forge the links between sections, the hypertext capability is called intrinsic, since the occurrences of a term in multiple sections of the text create the links automatically. Apart from the usual problems of consistency in spelling (e.g., "behavior" vs. "behaviour") that can be solved by a computer pass through the database prior to indexing, the naturally occurring terms require no editorial attention. However, cross-references and references from a book's index are examples of natural links that may need some editing to ensure that page numbers or section numbers are serving as unambiguous markers.
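Intrinsic linking, including the spelling pass that precedes indexing, can be sketched as follows. The spelling table, function names, and section identifiers are hypothetical, and a production system would also exclude common words such as "of" and "the" before creating links.

```python
# Hypothetical table of spelling variants, applied before indexing.
SPELLING = {"behaviour": "behavior"}

def normalize(word):
    # Lowercase, strip surrounding punctuation, and unify spelling variants.
    w = word.lower().strip('.,;:"()')
    return SPELLING.get(w, w)

def intrinsic_links(sections):
    # Map each term to the ordered list of sections in which it occurs.
    occurrences = {}
    for sid, text in sections.items():
        for word in text.split():
            ids = occurrences.setdefault(normalize(word), [])
            if sid not in ids:
                ids.append(sid)
    # A term links sections only if it occurs in two or more of them.
    return {term: ids for term, ids in occurrences.items() if len(ids) > 1}

# Hypothetical sample sections from different books.
SECTIONS = {
    "bookA/ch2": "Feeding behaviour of the stem borer.",
    "bookB/ch7": "Stem borer behavior under drought.",
    "bookC/ch1": "Soil salinity survey.",
}

links = intrinsic_links(SECTIONS)
```

After normalization, "behaviour" and "behavior" resolve to one term that links the two books with no editorial markup at all, which is the sense in which these links are created automatically.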

Extrinsic links, based on new markers, create exciting but also effortful hypertext applications. In simple applications, an editor chooses markers (sometimes called "hot buttons") and places them at the origin and destination of each link. It is then possible for the user to jump between pages, sections, or even different books that are not linked by naturally occurring terms. In a bookshelf CD-ROM application that combines popular and technical terminology, this may be the only way to enable jumps between related sections in different books, since the books do not share a common vocabulary. The retrieval system, limited by its literalness, cannot recognize relatedness in dissimilar vocabularies.

The use of hypertext for conceptual "mapping" is not as easy as the advertisements for Hypercard and other simplified hypertext tools imply. At a minimum, the effort is the same as creating an extensive book index. First, it is necessary to know the content thoroughly so that the links are correct. Second, it is necessary to recognize concepts when they are cited in many different ways and to judge when a link would be digressive. Third, since one person sees links that another person misses, because of different cognitive frames of reference, it is necessary for more than one person to work on the markup.

Careful hypertext markup of this kind cannot progress much faster than three to five minutes per page. For the small applications described in advertisements, it is possible to allocate 3,000 minutes or 50 hours to 1,000 pages -- about the same time that an extensive book index requires. However, when the CD-ROM will contain 450,000 pages, as in the case of the Compact Agricultural Library, the time budget expands to more than eleven person-years. New source material may accumulate faster than the original material can be marked up.

Therefore, although hypertext has become an indispensable feature in bookshelf CD-ROM collections, the common-sense rules for its application in large databases are still being developed and tested. There is an obvious lesson in the fact that naturally occurring terms create hypertext links at no cost, while editorial insertion of "hot buttons" can be very costly.

7. MANAGING AND ENHANCING BOOKSHELF COLLECTIONS: THE COMPACT AGRICULTURAL LIBRARY

Knowledge Access International is developing the Compact International Agricultural Research Library (CIARL) for the World Bank Consultative Group for International Agricultural Research. The information was developed at 20 centers located in 16 countries, as shown in the following:
 

Center   Full name                                                         Location
AVRDC    Asian Vegetable Research and Development Center                   Taiwan, Republic of China
CIAT     Centro Internacional de Agricultura Tropical                      Cali, Colombia
CIMMYT   Centro Internacional de Mejoramiento de Maíz y Trigo              Mexico D.F., Mexico
CIP      Centro Internacional de la Papa                                   Lima, Peru
IBPGR    International Board for Plant Genetic Resources                   Rome, Italy
IBSRAM   International Board for Soil Research and Management              Bangkok, Thailand
ICARDA   International Center for Agricultural Research in the Dry Areas   Aleppo, Syria
ICIMOD   International Centre for Integrated Mountain Development          Kathmandu, Nepal
ICIPE    International Centre of Insect Physiology and Ecology             Nairobi, Kenya
ICLARM   International Center for Living Aquatic Resources Management      Manila, Philippines
ICRAF    International Council for Research in Agroforestry                Nairobi, Kenya
ICRISAT  International Crops Research Institute for the Semi-Arid Tropics  Hyderabad, India
IFDC     International Fertilizer Development Center                       Muscle Shoals, Alabama, USA
IFPRI    International Food Policy Research Institute                      Washington, D.C., USA
IITA     International Institute of Tropical Agriculture                   Ibadan, Nigeria
ILCA     International Livestock Centre for Africa                         Addis Ababa, Ethiopia
ILRAD    International Laboratory for Research on Animal Diseases          Nairobi, Kenya
IRRI     International Rice Research Institute                             Manila, Philippines
ISNAR    International Service for National Agricultural Research          The Hague, Netherlands
WARDA    West Africa Rice Development Association                          Bouaké, Ivory Coast

Certain facts and assumptions play a more important role in the design of exceptionally large collections than in the design of small collections. With approximately 25 discs in an integrated collection, it matters that:

• Few user sites have more than one CD-ROM drive. Disc swapping in the single drive is unavoidable but can be minimized through proper design.

• Since the necessary set of index files and auxiliary files is about 50% as large as the source file, the largest source file that will result in a one-disc index is about 1.3 gigabytes. On a one-drive system, a one-disc index is the key to good performance. If the text of a collection is larger than 1.3 gigabytes, it should be divided into two or more subcollections.

• Careful assignment of books to each subcollection is the first step in minimizing disc swapping.

• Any thought of hand editing, except to designate section breaks and headings for the electronic Table of Contents, becomes quixotic when the collection approaches half a million pages. A useful rule of thumb is that it takes 1 person-year to devote 1 minute to each of 100,000 pages. A book-level analysis of content is more feasible. However, that procedure, whatever it is, must be repeated 5,000+ times.
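The index-size constraint in the list above can be expressed as a short calculation. This is a sketch under one stated assumption: a usable disc capacity of about 650 megabytes, which is not given in the paper but is what the 50% index ratio and the 1.3-gigabyte limit imply. The constant and function names are illustrative.

```python
import math

DISC_CAPACITY_MB = 650  # assumed capacity, implied by 50% of 1.3 gigabytes
INDEX_RATIO = 0.5       # index and auxiliary files relative to the source text

def max_source_mb(disc_capacity_mb=DISC_CAPACITY_MB, index_ratio=INDEX_RATIO):
    # Largest source file whose index still fits on a single disc.
    return disc_capacity_mb / index_ratio

def subcollections_needed(source_mb, disc_capacity_mb=DISC_CAPACITY_MB,
                          index_ratio=INDEX_RATIO):
    # Number of subcollections so that each one has a one-disc index.
    limit = max_source_mb(disc_capacity_mb, index_ratio)
    return max(1, math.ceil(source_mb / limit))

limit = max_source_mb()
parts = subcollections_needed(3000)
```

Under these assumptions the limit is 1,300 megabytes of source text per subcollection, and a hypothetical 3,000-megabyte collection would need three subcollections.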

Of course, all problems of design and implementation are more than offset by the great opportunity of producing the Compact Agricultural Library, which is made possible through the generosity of CGIAR's international donors. This collection will be a laboratory for many ideas in the production of multiple-disc sets. Better procedures for designing and implementing large bookshelf CD-ROM collections will result from the project.

Disc layout for the Compact Agricultural Library is not merely a question of organizing text vs. images. There are actually six types of content to be organized interdependently on the discs:

• Full text book pages

• Catalogue records for all books

• Monochrome images

• Color images

• Indexes and auxiliary files created for the retrieval system

• A Syntopicon of subject-matter clusters to help identify relevant books

The first five of these already exist or will be created automatically during CD-ROM production. The sixth type of content, the Syntopicon, will result from a semantic network analysis.

The catalogue records will be processed in two ways and stored on the discs in two formats. First, they will be processed as part of the full-text collection, with hypertext links to the books they represent. In this format, the catalogue records will be stored on every disc for constant accessibility. They will occupy less than 1 percent of the space on any disc.

Second, the catalogue records will be processed in a fully fielded format. From this set of files, librarians and researchers can create a variety of special outputs, ranging from local catalog cards to book catalogs and other bibliographic lists with full control over which fields are printed in which positions. This version of the catalog will probably be stored on one disc only.

The topic clusters of the Syntopicon will provide users with an overview of the subcollections of books. The Syntopicon will be placed on every disc, so that the user can refer to it from anywhere.

The goal of the Syntopicon is to allay the confusion that a user will feel in approaching a set of 25 discs containing more than 5,000 books. If the collection were far smaller, a master Table of Contents could provide the desired overview. However, with minimum description of 10 lines per book, the master Table of Contents would be 50,000 lines long, which is the length of four 300-page books.

Until now, the largest integrated collections of full text and images have been in the range of three to five discs.* Thus, as we jump to 25 discs, we are in effect designing a B-747 on the experience base of a DC-3. The new design cannot be a DC-3 stretched to five times its original size. New design principles -- new ways of thinking about electronic information in all its variety -- are essential.

Still, bookshelf CD-ROM collections of the future will dwarf our present efforts. Design principles that emerge in the development of the Compact Agricultural Library should, if possible, provide some guidance for 100-disc and 500-disc collections as well. Worldwide information production in research, government, and business totals billions of pages per year, and paper technology is no longer adequate for managing it. CD-ROM is certainly not the ultimate medium for electronic storage and distribution of information, yet all future electronic libraries will be based on evolving principles of information design to which our present efforts should contribute.