Amanda Siewling Lo, Yu-Huei Lainer & Ching-chih Chen
Graduate School of Library & Information Science
Simmons College, Boston, MA 02115, USA
Abstract: Since the introduction of CD-ROM in libraries in January 1985, CD-ROM technology has had a substantial impact on library and information services. Until recent months, librarians have been mainly consumers of CD products, almost all of which have been produced by profit-making companies and organizations. Technology has developed so rapidly that desktop CD-ROM publishing is now possible. Motivated to provide a real demonstration at NIT '91, two master's students at the Graduate School of Library and Information Science, Simmons College, elected an independent study course under Ching-chih Chen on CD-ROM desktop publishing in the fall of 1991. In a very short two-month period -- mid-September to mid-November 1991 -- we were able to produce a CD-ROM containing the full text and images of all four conference Proceedings in the NIT series. With the contribution of Knowledge Access International, KAware™ Disk Publisher was used to analyze, prepare, and index both the text and image files, and KAware2™ Retrieval System was used to provide full-text and image retrieval capabilities. The replication of the CD-ROMs was made possible through the courtesy of Disc Manufacturing, Inc. (DMI).
This paper briefly outlines the process of our CD-ROM desktop publishing activity, discusses both our excitement and our frustration with it, and offers some practical tips for those who are contemplating producing CD-ROMs themselves.
1. INTRODUCTION
During the last few years, we have witnessed incredibly fast growth of electronic CD-ROM publishing. Currently there are over 1,500 CD-ROM titles commercially available. Accordingly, the use of CD-ROM titles in libraries and information centers has also grown rapidly (Chen, 1991). Yet, until recently, cost has prohibited individual organizations from considering CD-ROM publishing themselves. Thus few would be interested in creating a CD-ROM.
Recent advances in both PC hardware and software, particularly the availability of low-cost desktop disk publishers for the microcomputers generally found in offices, have made it possible for librarians and information professionals to consider desktop CD-ROM publishing themselves. The potential of this development for libraries and information centers around the world can be easily imagined. CD-ROM has many advantages that address the major problems of cost, access, and distribution faced by librarians and information professionals. Its large storage capacity, its transportability, and its readiness for information access make it an ideal medium for information sharing and distribution. Likewise, CD-ROM's possibilities for enhancing information access and sharing in libraries and information centers in developing and less developed countries are indeed great!
Two students in the Master's Program in Library and Information Science at Simmons College took an intensive course with Professor Chen in the summer of 1991 and were eager to learn the ins and outs of creating a CD-ROM themselves. With NIT '91 being organized by their professor, and knowing that most of the previous Proceedings of the series of conferences on new information technology were produced by desktop publishing, the two students considered this a perfect opportunity to embark on such an ambitious and useful project. Instead of learning the process more or less in theory, they chose to gain hands-on experience by producing this CD-ROM, which combines the contents of all four conferences organized by Chen:
• The Second Pacific Conference on New Information Technology, Singapore, May 29-31, 1989;
• The Third International Conference on New Information Technology, Guadalajara, Mexico, November 26-28, 1990; and
• The Fourth International Conference on New Information Technology, Budapest, December 2-4, 1991.
The idea was conceived in mid-September 1991, and the actual process started immediately the following week.
2. THE PLANNING PROCESS
The planning phase is the most critical phase of any project. Chen and Wiggins (1990), in their multimedia product introducing CD-ROM technology and products (Fig. 1), illustrate rather clearly the production chain of CD publishing (Figs. 2-3). From start to finish, CD-ROM publishing, whether desktop or otherwise, involves the following steps:
• Data generation, or database development
• Data preparation
- Data conversion
- File preparation
- Indexing
• Assembly and production
- Assembling processed files
- Testing
- Premastering and mastering
- Replication
• Distribution and sales
In order to ensure the success of our project, we needed to have every step of the above under close control. In addition, it was also important to consider numerous logistic matters during the planning process. These include:
• How to budget our project?
• Where to go for possible help and/or contribution?
• What are the necessary and desirable hardware and software for CD-ROM desktop publishing?
• How to choose a Publisher and retrieval software?
• What are the environmental limitations?
Since NIT is a series of non-profit conferences organized by Chen as her contribution to international library and information work, financial limitations were potentially one of the biggest problems facing us. Fortunately, several software companies contributed to our effort. Since time did not permit us to compare these CD desktop publishing packages, Knowledge Access International's KAware™ Disk Publisher was chosen. Furthermore, our financial worry was totally eliminated in mid-process when Disc Manufacturing, Inc. (DMI), which had acquired the mastering and replication facilities of Philip Dupont Optical Inc. (PDO), graciously agreed to produce our CD-ROM free of charge as a contribution to international development.
As expected, almost all currently available CD desktop "Publishers" run on an IBM PC or 100% compatible with at least 640 KB of RAM.
2.1. Data Generation and Database Development
This is the first step in the process of electronic publishing, in which we determine what kind of information is to be published. Information sources may be textual, fielded, or image. The activities in this phase are basically information gathering and electronic management. The former determines the formats to be chosen for inclusion, and the latter considers how to turn all the desired information into electronic form and how to manage it.
Potentially, our CD-ROM could include all three types of information. But, aside from information on participants, which can come in fielded database format, the material consists predominantly of full-text files, with each Proceedings containing an average of about 45 images. When an article includes images, leaving them out is very unsatisfactory. Thus, the decision was made from the beginning to include both textual and image information.
A complete inventory was taken of the number of articles in each Proceedings and of the number of images in the three published Proceedings. A running list of the articles of each Conference Proceedings was created, and the type of source, such as printed copy or electronic file, was clearly marked. When electronic files were available, the size of each file was recorded, and the files were carefully checked for system compatibility (i.e., IBM PC or Mac files) and for the word processing software used. Word processing software commonly used on the IBM PC and the Macintosh includes WordStar Professional, WordPerfect 5.0, Microsoft Word and Word 4.0, MacWrite, and WordPerfect for Macintosh. This inventory should not be overlooked: it is a very important step towards consistent conversion of data, since all word-processed files have to be converted to ASCII. The papers for the Proceedings of the 4th Conference reached the editor at random intervals, some as late as a week before the meeting dates, so we processed each one as it arrived.
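The running inventory described above can be sketched as a small data structure. This is only an illustration in Python, not the tool used in the project; the record fields, the function names, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InventoryEntry:
    """One article in the running list (all field names are illustrative)."""
    proceedings: str                      # which conference the paper belongs to
    article: str                          # short label for the paper
    source: str                           # "print" or "electronic"
    platform: Optional[str] = None        # "IBM PC" or "Macintosh", if electronic
    word_processor: Optional[str] = None  # software that produced the file
    size_kb: Optional[int] = None         # recorded file size, if electronic


def needs_keying_or_scanning(entries: List[InventoryEntry]) -> List[InventoryEntry]:
    """Papers available only on paper must be keyed, scanned, or OCRed."""
    return [e for e in entries if e.source == "print"]


def needs_ascii_conversion(entries: List[InventoryEntry]) -> List[InventoryEntry]:
    """Every word-processed file must be converted to ASCII before indexing."""
    return [e for e in entries if e.source == "electronic"]


# Hypothetical sample entries, for illustration only.
inventory = [
    InventoryEntry("NIT '91", "paper-01", "electronic", "Macintosh", "MacWrite", 42),
    InventoryEntry("NIT '91", "paper-02", "print"),
    InventoryEntry("NIT '90", "paper-03", "electronic", "IBM PC", "WordPerfect 5.0", 57),
]

for entry in needs_ascii_conversion(inventory):
    print(entry.article, entry.platform, entry.word_processor)
```

Keeping the platform and word processor on each record makes it easy to batch files that go through the same conversion route.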
2.2. Data Preparation
2.2.1. Data Conversion
The existing information for the first three Proceedings was available in both printed and electronic forms. For the printed information, all three popular ways of conversion to an electronic format -- keyboarding, imaging (scanning), and optical character recognition -- were used. These activities applied heavily to the materials of the First Conference and also to the new materials of the Budapest Conference, since those papers arrived as facsimiles instead of electronic files on floppy disks.
Since most of our data files available in electronic format were Macintosh files, we encountered difficulties in converting both text and images from the Macintosh disks.
• Word processing - A good amount of time was spent exploring several PC-based word processing packages to ensure that files saved as ASCII truly had all control character codes removed. This seemingly easy task turned out to be very tedious, and at times frustrating and boring. But the process is a must!
• Images - The difficulties are even greater in image file conversion, as the KAware™ Disk Publisher recognizes only image files in Tag Image File Format (TIFF), including the TIFF1 and TIFF2 formats. Although many Macintosh and IBM scan and paint/draw programs save files as TIFF, these files may not be supported by KAware™ Disk Publisher.
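The check for leftover control codes in a file saved as ASCII can be automated. The following Python sketch is an assumption about what "clean" means here (printable ASCII plus tab, line feed, and carriage return); it is not part of KAware™, and the function names are hypothetical.

```python
# Bytes treated as acceptable in a plain ASCII text file (an assumption):
# printable characters 0x20-0x7E plus tab, line feed, and carriage return.
ALLOWED_CONTROL = {0x09, 0x0A, 0x0D}


def is_plain(byte: int) -> bool:
    return 0x20 <= byte <= 0x7E or byte in ALLOWED_CONTROL


def find_stray_bytes(data: bytes):
    """Return (offset, byte value) for every byte that is not plain ASCII text."""
    return [(offset, byte) for offset, byte in enumerate(data) if not is_plain(byte)]


def strip_stray_bytes(data: bytes) -> bytes:
    """Drop every byte that is not plain ASCII text."""
    return bytes(byte for byte in data if is_plain(byte))


# A made-up fragment with two formatting codes and a trailing Ctrl-Z.
sample = b"New \x02Information\x02 Technology\r\n\x1a"
print(find_stray_bytes(sample))
print(strip_stray_bytes(sample))
```

Reporting the offsets first, rather than stripping blindly, is the safer habit: a stray byte may be an accented character or a symbol that should be retyped rather than discarded.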
2.2.2. File Preparation
This is one of the most essential steps in CD-ROM data preparation. Once the available information and/or data are assembled and converted to electronic format, we have to index, format, and organize the information files in logical formats that describe the disc layout. To be ready for indexing, all files have to be prepared and analyzed, first individually and then in merged form.
Before the KAware™ Disk Publisher is run to index the source file so that it becomes fully retrievable, the source files and images have to go through a very thorough process that includes tagging and the preparation of a special word list, a phrase list, and stop words. In our project, each conference proceedings contributes one single file. Each was analyzed, prepared, and indexed separately before we merged the four conference proceedings (in WordPerfect) and ran the merged file through the disk publisher in one go.
The preliminary file preparation using the KAware™ Disk Publisher involved TAGGING and MARKING the files. Figure 4 shows a sample tagged file with markers. The markers were there essentially so that headings such as AUTHOR, TITLE, and ABSTRACT would show up in the TABLE OF CONTENTS in the KAware2™ Retrieval System. Once a file is tagged, it can be run through the TEXT ANALYZER of KAware™ Disk Publisher to check for any irregularities or problem pages and lines that could cripple the final stage, TEXT INDEXING. The second stage prepares the file for indexing. Figure 5 shows how problems, if any, are specified in the text analyzer report.
2.2.3. Image File Preparation
Image files have to be compressed using F3 (Other Utility) in the KAware™ Disk Publisher. The compressed images are created and saved as *.CPR. The compressed image files become retrievable in KAware2™ after the text file has been tagged with an image call. Figure 6 shows that the images are tagged with file names, such as IM=91CHEN05.CPR (this tag is not visible to the user), and that an image can be viewed by pressing ALT-I. As mentioned in Section 2.2.1, not all image files with a "TIFF" header or extension are automatically compatible with, and compressible in, KADI (the KAware™ Disk Publisher).
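A file that carries a TIFF extension can at least be screened for a valid TIFF header before it is handed to the compressor. The sketch below, in Python, checks only the first four bytes: the byte-order mark ("II" or "MM") followed by the number 42. Passing this check says nothing about whether KAware™ will accept the file; it only weeds out files that are not TIFF at all. The function name is hypothetical.

```python
import struct
from typing import Optional


def tiff_byte_order(header: bytes) -> Optional[str]:
    """Return 'little' or 'big' for a valid TIFF header, or None otherwise.

    A TIFF file starts with b"II" (Intel, little-endian) or b"MM"
    (Motorola, big-endian), followed by the 16-bit number 42 stored
    in that byte order.
    """
    if len(header) < 4:
        return None
    mark = header[:2]
    if mark == b"II":
        fmt, order = "<H", "little"
    elif mark == b"MM":
        fmt, order = ">H", "big"
    else:
        return None
    (magic,) = struct.unpack(fmt, header[2:4])
    return order if magic == 42 else None


print(tiff_byte_order(b"II*\x00\x08\x00\x00\x00"))  # Intel byte order
print(tiff_byte_order(b"MM\x00*\x00\x00\x00\x08"))  # Motorola byte order
print(tiff_byte_order(b"GIF89a"))                   # not a TIFF file
```

In practice the first four bytes would be read from each candidate file, for example with `open(path, "rb").read(4)`.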
Our project required us to manage 184 image files in both line art and gray tones, in sizes ranging from 80 KB to 300 KB. An IBM-compatible scan kit was the solution to these problems of TIFF compression. The low-cost Marstek ScanKit comes with a 4.5 by 4.5 hand-held scanner and scanning software. The scanning software allowed us to scan images of up to 4.5" by 4.5" and to save them in various formats (.TIFF, .PIX, TIF 1D, etc.). In our case, we finally chose "TIF 1D" after having experimented with several others. This format seemed to work with KADI. After the scanned images are saved as TIF 1D with the Marstek ScanKit, the images (*.TIF) are compressed by the IMAGE COMPRESSOR, resulting in the corresponding *.CPR files. The next step is to use the File Mover to MOVE the *.CPR files to a subdirectory AA under the root directory C:\. The *.CPR files must reside in a subdirectory of the directory in which the source file is kept. This MOVE is similar to the COPY command in DOS. Figure 7 shows how KAware™ Disk Publisher prompts you to move your compressed image files.
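With some 180 images, it is easy to tag an image call in the text and forget to move the matching *.CPR file. The following Python sketch cross-checks the two. It assumes that every image call looks like the IM=91CHEN05.CPR example above and that the images sit in a named subdirectory of the source file's directory; the regular expression, the function names, and any file names used with it are hypothetical.

```python
import re
from pathlib import Path

# Assumed form of an image call, based on the single example IM=91CHEN05.CPR.
IMAGE_CALL = re.compile(r"IM=([A-Za-z0-9_]+\.CPR)", re.IGNORECASE)


def image_calls(text: str):
    """Return the sorted, de-duplicated image file names called in the text."""
    return sorted({name.upper() for name in IMAGE_CALL.findall(text)})


def missing_images(source_file: Path, image_subdir: str):
    """List image calls in the source file that have no file in the subdirectory."""
    text = source_file.read_text(encoding="ascii", errors="replace")
    image_dir = source_file.parent / image_subdir
    present = set()
    if image_dir.is_dir():
        present = {path.name.upper() for path in image_dir.iterdir()}
    return [name for name in image_calls(text) if name not in present]


print(image_calls("See Figure 1. IM=91CHEN05.CPR ... IM=91CHEN06.CPR"))
```

Calling `missing_images(Path("C:/NIT.TXT"), "AA")`, with a made-up source file name, would list every image call that has no compressed file in the AA subdirectory.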
It is worth noting that since our scanner is a low-cost one, the resulting scanned images are not of the best quality. Furthermore, time limitations prevented us from processing the images further. Nevertheless, we are happy to have been able to provide the users of our disc with image retrieval capability, which also includes zooming at 4 levels.
2.2.4. Indexing
Once the file is ready and the necessary "helper" files, such as file definition, stopword, and phrase list, have been prepared, we are ready to index the file with the Publisher. Figure 8 shows part of the steps involved in indexing in KAware™:
• Parsing each field to find all words except stopwords
• Sorting these words
• Finding all unique words
• Indexing the words
• Creating the display file
• Creating a template file that supplies the KAware2™ Retrieval System with information about the file.
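The first four steps in the list above (parse, sort, find unique words, index) follow the classic pattern for building an inverted index. The Python sketch below illustrates that general pattern only; it is not KAware™'s actual algorithm or file format, and the tokenizing rule and function name are assumptions.

```python
import re
from collections import defaultdict


def build_index(lines, stopwords):
    """Map each non-stopword to the sorted list of line numbers containing it."""
    stop = {word.lower() for word in stopwords}

    # Step 1: parse every line and keep (word, line number) for non-stopwords.
    occurrences = []
    for line_no, line in enumerate(lines, start=1):
        for word in re.findall(r"[A-Za-z0-9]+", line):
            word = word.lower()
            if word not in stop:
                occurrences.append((word, line_no))

    # Step 2: sort, so that all occurrences of a word are adjacent.
    occurrences.sort()

    # Steps 3 and 4: collapse to unique words and record where each occurs.
    index = defaultdict(list)
    for word, line_no in occurrences:
        postings = index[word]
        if not postings or postings[-1] != line_no:
            postings.append(line_no)
    return dict(index)


lines = ["CD-ROM publishing in libraries", "Desktop publishing of the CD-ROM"]
index = build_index(lines, stopwords=["in", "of", "the"])
for word, postings in index.items():
    print(word, postings)
```

The number of keys in the resulting dictionary corresponds to the "unique words" count that the Publisher reports at the end of an indexing run.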
As shown in Figure 8, the complete NIT file of four conference proceedings consists of 84,466 text lines, which yield 44,561 unique words. The total size of the NIT file is about 4 MB, and the indexing time was 48 minutes and 43 seconds.
In addition to the text file, the NIT CD-ROM also contains about 180 image files. Although we used a rather low-cost hand-held scanner, the resolution could be set as high as 400 dpi. However, the originals of many images, specifically those of the Bangkok Conference, were of poor quality, so the scanned images are not as satisfactory as we would like. The file size of each image varies from about 80 KB to 350 KB, and the size of the image files before compression totals over 40 MB. After compression, the size was drastically reduced to about 9 MB.
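The figures reported above imply some useful rules of thumb. The short Python calculation below uses only the rounded numbers given in the text, so the results are approximate; the variable names are ours, and the payload total deliberately leaves out the index and retrieval files, whose sizes are not reported.

```python
# Indexing throughput, from 84,466 lines in 48 minutes 43 seconds.
text_lines = 84_466
indexing_seconds = 48 * 60 + 43
lines_per_second = text_lines / indexing_seconds

# Image compression, from "over 40 MB" before and "about 9 MB" after.
images_before_mb = 40
images_after_mb = 9
compression_ratio = images_before_mb / images_after_mb
space_saved = 1 - images_after_mb / images_before_mb

# Text plus compressed images only (index files are not included).
text_mb = 4
payload_mb = text_mb + images_after_mb

print(f"indexing time: {indexing_seconds} s, about {lines_per_second:.1f} lines/s")
print(f"compression: about {compression_ratio:.1f}:1, saving about {space_saved:.0%}")
print(f"text + compressed images: about {payload_mb} MB")
```

Because the uncompressed total is given as "over 40 MB," the compression ratio computed here is best read as a lower bound.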
2.2.5. Full-text Retrieval
Once we have gone through the analysis, preparation, and indexing processes accurately, we are ready to conduct full-text retrieval using the KAware2™ Retrieval System. Figure 9 illustrates how the user can view the indexed file. Pressing F3 [Select] displays a list of searchable fields, such as TABLE OF CONTENTS, CONFERENCE, TITLE, and AUTHOR. By choosing TABLE OF CONTENTS, the user can scroll through a new menu that lists the articles by TITLE, AUTHOR, and ABSTRACT. The full text of the desired article is displayed when any selectable entry is chosen.
2.3. Assembly and Production
2.3.1. Assembling Processed Files
When all the files are ready to be sent for premastering, we have to consider alternative storage media for downloading the files from the hard disk of the IBM PS/2.
The Pipeline System loaned to us by Knowledge Access was not configured for use on our PS/2. We attempted to install it on a 386, but there were still numerous boot-up problems, which could not be solved quickly even by KAI support staff. Due to a lack of time, we decided to download to floppies. Having become resigned to some of the technical difficulties of installation, and having overcome others, we had the files created by the KAware™ Disk Publisher ready to be shipped to DMI for premastering/mastering and replication. Prior to shipping, we also needed to do the following:
• Prepare a complete list of the files necessary for the distribution of the product and for the manufacturing facility, DMI;
• Test the backup and restore process (downloading from hard disk to 1.2 MB floppies and vice versa);
• Test the application with the KAware2™ Retrieval System using the complete merged file.
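Planning the floppy set ahead of time helps when testing the backup and restore process. The Python sketch below assigns whole files to disks with a simple first-fit-decreasing rule. It is an illustration only: the capacity constant is a deliberately conservative assumption for a 1.2 MB floppy, the file names and sizes are made up, and a file larger than one disk (such as the roughly 4 MB text file) cannot be handled this way and needs a backup utility that spans disks.

```python
# Conservative usable capacity assumed for one 1.2 MB floppy, in bytes.
FLOPPY_BYTES = 1_200_000


def pack_onto_floppies(sizes, capacity=FLOPPY_BYTES):
    """Assign whole files to disks, largest first (first-fit decreasing).

    sizes maps file name -> size in bytes. Returns a list of disks,
    each a list of file names in the order they were placed.
    """
    disks = []  # each item is [free_bytes, [file names]]
    for name, size in sorted(sizes.items(), key=lambda item: item[1], reverse=True):
        if size > capacity:
            raise ValueError(f"{name} is larger than one disk and must be split")
        for disk in disks:
            if disk[0] >= size:
                disk[0] -= size
                disk[1].append(name)
                break
        else:
            # No existing disk has room, so start a new one.
            disks.append([capacity - size, [name]])
    return [names for _, names in disks]


# Hypothetical file sizes, for illustration only.
files = {"A.CPR": 700_000, "B.CPR": 600_000, "C.CPR": 500_000, "D.CPR": 400_000}
for number, names in enumerate(pack_onto_floppies(files), start=1):
    print(f"disk {number}: {names}")
```

A written disk-by-disk list of this kind also doubles as the file manifest sent to the manufacturing facility.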
After many days of long and hard work, we finally shipped our processed files via Federal Express to DMI in Wilmington, Delaware, on November 14, 1991. The files were loaded at DMI and simulated on their production hardware. A test disc was prepared by the DMI staff on November 18, 1991, and tested by us the next day (see Figure 10). Once we had verified that there were no errors, the 9-track tape was forwarded to DMI's plant in Huntsville, Alabama, on the same day. Our hard work finally paid off when the newly pressed CD-ROMs in their proper casing arrived as scheduled, in time for us to bring them to NIT '91 in Budapest on November 28, 1991!
2.4. Distribution
We are proud to say that our desktop-published CD-ROM, completed within two months of the beginning of planning, is ready for free distribution to participants at this conference.
3. CONCLUSION
This has not been an easy project, since both the technology and the process are so new and few people have expertise in testing this technology on a low-end desktop platform. Although the software vendor was most generous in providing technical support whenever needed, most of the time the difficulties and problems -- whether in image processing, hardware configuration, or file merging -- were new to the vendor's technical support staff as well. Thus, we managed to solve the difficulties by persistence and trial and error. Along the way, we tested and abandoned many alternative solutions to our problems. We have learned a great deal, and have marvelled at the potential this application can bring to libraries, information centers, and archives. With our limited resources, we have accomplished "the impossible," and so can you! Take advantage of the tools available to you, and let them work for you!
ACKNOWLEDGMENTS
A project of this intensive nature cannot be completed without the help and contribution of many people and organizations. We wish to thank Knowledge Access International and its President, Dr. Matilda Butler, for their permission to use KAware™ Disk Publisher and KAware2™ Retrieval System with our product. We would like to thank Fred Cash and Sue Davy of Knowledge Access International for spending many hours over the phone discussing technical problems with us. The generous donation from DMI and its marketing director, Mr. Rusky Caper, was essential for the free distribution of our CD-ROMs to NIT '91 participants. J. Philips Busk of DMI provided us with invaluable technical support, and Wendy Buyea and Calyton Summer sacrificed part of their Saturday for us. We are indeed fortunate to have worked with these professionals at DMI. We appreciate the support given by Hugh Blackmer of Simmons College. Finally, we would like to thank Lora Lopes for supervising the manuscript process of not only NIT '91 but also part of the other three conferences. Her work and that of two student assistants, Jenn Desormeau and Magi Rami, are very much appreciated.
REFERENCES
Butler, Matilda and William Paisley. How to Make Your Own CD-ROM. Mountain View, CA: Knowledge Access International, Inc., 1990.
Chen, Ching-chih. Optical Discs in Libraries: Use & Trends. Medford, NJ: Learned Information, Inc., 1991.
Chen, Ching-chih and Rae Jean N. Wiggins. HyperIntro to CD ROM Technology and Products. Newton, MA: MicroUse Information, Inc., 1990. [An electronic product on floppy discs for Macintosh using HyperCard front-end.]