Matilda Butler
Knowledge Access International, Inc.
Mountain View, CA 94043, USA
Abstract: The introduction of CD-Recorders, premastering software, and inexpensive CD-ROM authoring software, combined with the rapid growth in the sales of CD-ROM drives, has significantly expanded the opportunities to create CD-ROM products for information distribution. Libraries and other information-using organizations have been a major market for CD-ROM titles. However, these organizations have rarely produced their own discs. Now the production of discs can be accomplished by an organization of any size and in any country.
A case study of a recent development project illustrates how it is possible to go from printed books to CD-ROM discs in 30 days. The case study focuses on the procedures necessary and is designed to be used as a planning guide. The 30 day timeline as well as typical costs are included.
CD-ROMs are changing the way information is accessed and used around the world. CD-ROM access is faster and better than print access. Online access in many countries is cost-prohibitive while distributed CD-ROM access is affordable. Purchasing titles on CD-ROM is a relatively simple matter. There are several excellent directories that describe available products. When the same information has been released by two or more companies, there are usually reviews in popular CD-ROM magazines that can assist in the selection process.
While buying and using CD-ROMs can now be viewed as a fairly simple matter, creating your own CD-ROMs is still regarded as a complex matter. There are many decisions and steps involved. Fortunately, it is a technology available to all organizations, independent of their size, throughout the world. However, without adequate planning tools, many organizations continue to delay their decision to create their own CD-ROM titles.
Much that has been written underscores the belief that creating a CD-ROM product is a daunting task. After all, it involves new technology, new terminology, and new technical skills. You may ask, "Can I do it at a reasonable cost and within a reasonable amount of time?" The answer is "Yes". This paper describes how you can go from printed pages to retrievable information on CD-ROM in just 30 days.
This planning guide is intended to take you through all the necessary steps. The outlined procedures are generic and apply independent of which software and hardware you choose. The examples draw on experiences with a number of customers using our product line; however, you may choose other software and/or hardware solutions in following these steps.
2. IS IT TIME TO GET STARTED?
The first decision you'll face is "Should I develop a CD-ROM?" For the past several years, would-be CD-ROM producers and would-be CD-ROM users have faced a dilemma: there were not enough CD-ROM titles to justify the investment in a CD-ROM drive, and not a large enough installed base of CD-ROM drives to justify the investment in creating a CD-ROM title.
Now, these factors have changed. Why? There are three major changes. First, the installed base of CD-ROM drives is significant and continuing to grow. Access times have fallen and transfer rates have risen, while costs have dropped. These changes in the technology have made CD-ROM a much more favorable alternative to end users. Second, there is now a substantial amount of information on CD-ROM with more being released each month. Depending on the directory you consult, there are now about 4,000 commercial titles. There are at least twice that number of corporate and government non-commercial applications. Third, CDs can now be recorded right on your desktop. This means that you can use CD-ROM even when you anticipate a very limited distribution. You can record the CDs without going to a manufacturing facility for mastering and replication. With the widely supported double-speed CD-Recorders you can produce a 600+ megabyte disc every 36 minutes. Blank recordable media have dropped in price so that you now pay less than 3 cents per megabyte.
CD-ROM is becoming the most cost-effective distribution technology available to information owners. On a single 4.7" disc, it is economically efficient to store text only or text and images.
If your organization is considering CD-ROM distribution, how do you plan this transition? How can you make it happen? Let's assume that it has become your responsibility to see to it that information is distributed on CD-ROM. Furthermore, assume you are operating on a tight schedule. These are a few questions you're probably considering:
• Where do I start?
• What resources do I need that I don't already have?
• Should I make this an in-house project? Contract out the whole job? Or, implement some components and contract out others?
• What is all of this going to cost?
• Can I make just a few copies of the CD in-house?
• Where do I get the large number of replicates I'll need?
Following are the steps to convert an existing print product to CD-ROM:
2. Choose premastering software and hardware (optional)
3. Design product
4. Convert printed text to electronic text with mark-up as needed for chosen authoring system
5. Scan images in text
6. Design CD-ROM artwork and create film positives
7. Index electronic version of documentation
8. Test with retrieval software, making changes if desired
9. Write and print CD-ROM user manual
10. Premaster software and documentation and simulate performance
11. Produce writable CDs for corporate approval
12. Subcontract mastering and replication of CD-ROMs
13. Purchase retrieval system licenses
14. Receive CD-ROM replicates and begin distribution
3. DAY 1: USER MANUALS IN HAND
Recently, a major COBOL software tools developer, MicroFocus, came to Knowledge Access with a project that had tough timing requirements. The product is called COBOL Workbench. The job began when MicroFocus walked in the door with 15 printed manuals and the question, "Can we have 450 CD-ROMs containing software and documentation in 30 days?"
Although our company has helped many organizations produce CD-ROMs, we knew this would be a challenge. We already had authoring software: KAware Disk Publisher with KAware3/Windows Retrieval System; premastering software: Personal Premastering System; hardware for indexing and premastering: IBM PC compatible with 1.2-gigabyte hard drive capable of an 800 Kb sustained transfer rate; and hardware for creating "one-off" CD discs: both the Philips and Kodak CD-Recorders.
If you have not chosen the indexing and retrieval software, then you need to allocate a month for that step. During that month you would request demo disks, evaluate software, and purchase an authoring package. If you do not have the necessary hardware, then you would make that purchase within the same 30 day period. In our case study, with decisions already made on the software and hardware, the first step was to get an electronic version of the 15 manuals. We faced three conversion options:
1). Use existing electronic files
2). Use Optical Character Recognition (OCR) to scan and convert the printed pages
3). Keyboard the text.
Although it was tempting to use existing electronic files, this doesn't always turn out to be the best solution. Some text files contain embedded typesetting codes that are difficult to remove. Other text files don't reflect final changes made after "blue-line" proofs, where a quick cut and paste often seems to be a more expedient solution than going back to the source file. In this case, the text files were created by many authors. There was no single, consistent, electronic file. Writing custom programs to conform the text files would have been expensive and time-consuming. And even then, it would have been necessary to add hypertext links, fields, image calls, and other CD-ROM specific features.
The second option was to use Optical Character Recognition (OCR). The feasibility of using OCR for data conversions depends primarily on the suitability of the source materials and the amount of reliability needed. The particular type font, within-page type font changes, inclusion of tabular materials, and number of different type sizes can all "fool" the OCR software and can potentially lead to many errors.
Typically, OCR, with corrections, provides between 95% and 99% accuracy. This sounds acceptable until you consider what this means per page. A typical page has about 2,000 characters. Even with 99% accuracy, there would be 100 errors per 10,000 characters. This means 20 errors per page. At the 95% level of accuracy, there would be an average of 100 errors per page.
The third option, and the one chosen for this project, was to keyboard all 15 manuals and to scan the images. A high accuracy level of 99.975% is achieved through double keying. Even at that highly accurate rate there can be 2.5 errors per 10,000 characters, or about one error every two pages. Compare this to the 20 to 100 errors per page that OCR can yield and it is easy to understand the decision to keyboard the text.
Many organizations consider converting to Standard Generalized Markup Language (SGML) at the time of keyboarding text. SGML, whose development was supported by the Association of American Publishers, is a language for marking text for a variety of purposes, including typesetting and disk publishing. A well-designed SGML scheme would enable the publisher to mark text just once for multiple uses. Since typesetting requires more codes than disk publishing, the procedure of moving SGML-marked text to disk usually involves suppressing or modifying some of the codes. SGML significantly increases the cost of conversion (doubling to tripling the cost depending on the complexity of the mark-up) and is frequently not the best choice for a publisher's goal.
For this product, we chose a standard left-tag format. This is an inexpensive format to implement while providing field-specific searching power (e.g. a search just in the major header field rather than all fields) and type font and type color control (e.g. the major header field will be displayed and printed in red 14 point bold Helvetica while the text field will be displayed and printed in blue 12 point regular Times New Roman).
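The per-page figures above follow directly from the accuracy rate and the page size. As a minimal sketch (the function name is ours, and the 2,000-character page is the typical figure quoted above), you can repeat the arithmetic for your own material:

```python
def errors_per_page(accuracy_pct, chars_per_page=2000):
    """Expected number of wrong characters on one page at a given accuracy."""
    error_rate = (100.0 - accuracy_pct) / 100.0
    return error_rate * chars_per_page


# Accuracy levels quoted in the text for OCR and for double keying.
for label, accuracy in [
    ("OCR, low end", 95.0),
    ("OCR, high end", 99.0),
    ("Double keying", 99.975),
]:
    print(f"{label:14s} {accuracy:7.3f}%  {errors_per_page(accuracy):6.1f} errors/page")
```

Changing chars_per_page to match a dense reference page or a sparse manual page shows how sensitive the comparison is to your own source material.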
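To make the idea of left-tagged text concrete, here is a small sketch of how such a file could be read and searched by field. The tag names (CH, MH, TX), the colon separator, and the sample lines are hypothetical illustrations only; they are not the actual tags or file layout used in the project or required by any particular authoring package.

```python
# Hypothetical left-tagged sample: each line starts with a field tag.
SAMPLE = """\
CH: Getting Started
MH: Installing the Workbench
TX: Insert the disc and run the installation program.
TX: Then restart your computer.
MH: Compiling a Program
TX: Open the editor and load a source file.
"""


def parse_left_tagged(text, sep=":"):
    """Return a list of (tag, value) pairs from lines of the form 'TAG: value'."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        tag, found, value = line.partition(sep)
        if found:
            records.append((tag.strip(), value.strip()))
    return records


def search_field(records, tag, word):
    """Field-specific search: values of one tag that contain the word."""
    word = word.lower()
    return [value for t, value in records if t == tag and word in value.lower()]


records = parse_left_tagged(SAMPLE)
# Search only the major header field, ignoring matches in body text.
print(search_field(records, "MH", "program"))
```

Because every value carries its tag, the same records could also drive display rules, such as one font and color for each field.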
4. DAY 2: PURCHASE AUTHORING SOFTWARE; PURCHASE PREMASTERING SOFTWARE AND HARDWARE
MicroFocus had already evaluated the authoring and retrieval software. Although they were relying on our company to assist them in getting the first disc published, they immediately purchased the authoring software. This gave them about two weeks to learn the software while the data conversion step was taking the printed manuals and creating the electronic text. If they had chosen to do the indexing themselves, they would have been ready to use the CD-ROM authoring software by the time the electronic text and image files were ready.
Similarly, they will take over all future production steps and therefore purchased our Personal Premastering System with the Philips CDD-521, the double speed CD-Recorder that creates a 600+ MB disc each 36 minutes. By purchasing the software and hardware at the beginning of the 30 day period, there was time to learn the product before needing to use it.
5. DAY 2-5: PURCHASE ORDER ARRIVES AND TECHNICAL SPECIFICATION BEGINS
The Purchase Order arrived on the second day and the project was underway with the clock already running. Digital Services Limited, Mt. Pocono, PA, was chosen as the data conversion company based on their relevant experience with this type of conversion and their low cost. There are many conversion companies. The keyboarding usually is done in the Philippines, Jamaica, or India even though the company is headquartered in the U.S.
Developing a technical specification for the conversion is next on the list. This step involves an analysis of each printed volume to determine:
2. Fields (Chapter, Major Heading, Minor Heading, Page Number, Text, Glossary, Index, Appendix Image)
3. Section Markers (Table of Contents automatically created by software based on markers inserted in the text during conversion)
4. Images (Definition of what would be scanned as an image; development of naming convention for images; decision on scanning resolution)
5. Tabular material
6. Hypertext links
Once the printed technical specification was complete, MicroFocus was consulted on many of the issues and changes were made, as appropriate.
The technical specification isn't really complete until there is an initial conversion test. Because of time constraints, only 50 text pages and 10 images were converted using the spec. Adjustments and clarifications were made and a final version of the technical specification was produced. This became a 15 page document with an additional 20 pages of examples. If time permits, we recommend a larger sample to ensure that most of the possible variations have been found.
6. DAY 6-20: DATA CONVERSION TAKES PLACE IN JAMAICA; ARTWORK DEVELOPMENT TAKES PLACE IN PALO ALTO
Data Conversion. With a sign-off on the technical specification, the full data conversion began. In this case, the technical team that handled the coordination and relayed questions was in the U.S. The keyboarding was done in Jamaica. All text was double-keyed and then a program compared the two files. Errors were identified and corrected.
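The comparison step in double keying can be sketched in a few lines. This is an illustration of the general technique, not the conversion company's actual program; it assumes both operators keyed the same number of lines, and the COBOL-style sample lines are invented.

```python
import difflib


def compare_keyings(first, second):
    """Return (line_number, first_version, second_version) for each differing line."""
    mismatches = []
    for number, (a, b) in enumerate(zip(first, second), start=1):
        if a != b:
            mismatches.append((number, a, b))
    return mismatches


def mark_differences(a, b):
    """Show where two versions of one line disagree, character by character."""
    matcher = difflib.SequenceMatcher(None, a, b)
    return [
        (op, a[i1:i2], b[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


# Invented sample: the second operator typed the letter O instead of zero.
first_pass = ["MOVE ZERO TO TOTAL.", "PERFORM 100-READ."]
second_pass = ["MOVE ZERO TO TOTAL.", "PERFORM 1OO-READ."]

for number, a, b in compare_keyings(first_pass, second_pass):
    print(f"line {number}: {mark_differences(a, b)}")
```

The program only flags disagreements; a person still has to check each one against the printed page to decide which keying is correct.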
Disc Art Development. In preparation for the manufacturing step, a graphic design artist developed the artwork. There are specifications for artwork including number of colors, placement of text, and size of text. It is always best to choose the manufacturing facility and get their specifications before spending money on the artwork. In this case, we chose DMI. DMI has two manufacturing plants: Anaheim, CA and Huntsville, AL. As a California company, we usually send our work to Anaheim. However, by coordinating the artwork early, we found that the client had a unique ink need that could only be met by the Huntsville facility. Always allocate adequate time for the artwork. Most facilities, including 3M, Sony, and Metatec, prefer to receive artwork one week before the premastered files. There are manufacturing facilities in many countries including Australia, Austria, Canada, France, Germany, Greece, Japan, Netherlands, Sweden, United Kingdom and United States. Costs for their mastering and replication services are primarily determined by the turnaround time. If you need discs back in 3 days the fees are higher than if you can wait 15 days. Costs may also vary depending on the number of colors used in your artwork as well as your need for such additional services as shrink-wrapping and insertion of a documentation booklet.
7. DAY 21-24: INDEXING, RETRIEVAL, AND PERSONALIZATION OF PRODUCT; QUALITY CONTROL; PREPARATION OF USER MANUAL
Indexing. On day 21, floppy disks containing the converted 15 manuals and an Exabyte tape containing the 1,000 images arrived. We copied the files onto our production PC: a 486 with a 220 MB hard drive and a 1.2 GB hard drive. The KAware Disk Publisher and KAware3 Retrieval System were already loaded. Using utilities in the Disk Publisher, we:
2. Created a tag file (list of fields that were to be separately indexed)
3. Used File Preparer to make global changes in the file (e.g. \ had been used to mark the beginning of each section, not realizing that the same character is used in many of the sample COBOL programs listed in the manuals)
4. Modified the Word Rule file that contains parsing rules (e.g. the dash was added to the list of continuation characters so that pagination could be used as hypertext links)
5. Reviewed the results of the word frequency list to determine additional words to add to the Stopword list (e.g. added COBOL to the list of 85 most frequent words such as "a", "an", "the" that are not indexed)
6. Indexed the text file (a complete inverted index is created so that the retrieval system knows the location and frequency per field of each word and image)
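The indexing steps above (word rules, a stopword list, and an inverted index that records location and frequency per field) can be illustrated with a short sketch. This is not the KAware indexer; the field names, sample sections, stopword subset, and the word pattern (which keeps the dash inside a word, in the spirit of the continuation-character rule) are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Illustrative subset of a stopword list, with COBOL added as in the case study.
STOPWORDS = {"a", "an", "the", "cobol"}

# Assumed word rule: letters, digits, and the dash form one word (e.g. "4-12").
WORD_PATTERN = re.compile(r"[a-z0-9-]+")


def build_index(sections, stopwords=STOPWORDS):
    """sections: list of dicts mapping field name -> text.
    Returns index[word][(section_number, field)] = frequency."""
    index = defaultdict(lambda: defaultdict(int))
    for number, section in enumerate(sections):
        for field, text in section.items():
            for word in WORD_PATTERN.findall(text.lower()):
                if word not in stopwords:
                    index[word][(number, field)] += 1
    return index


def search(index, word, field=None):
    """Return {(section, field): hits}, optionally restricted to one field."""
    postings = index.get(word.lower(), {})
    return {loc: n for loc, n in postings.items() if field is None or loc[1] == field}


sections = [
    {"major_heading": "Compiling a COBOL Program",
     "text": "The compiler reads the program and reports errors in the program."},
    {"major_heading": "Debugging",
     "text": "Run the program under the debugger; see page 4-12."},
]
index = build_index(sections)
print(search(index, "program"))
print(search(index, "program", field="major_heading"))
```

Keeping the frequency per field is what allows a retrieval system to report the number of hits per section and to limit a search to headings only.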
In this case, the client was nearby and came over to view the product. In looking at the Table of Contents, it was thought there was too much detail. Therefore, a new section was created at the top of the source file that just indicated the names of the 15 books. Each marked section then provided hypertext links to the beginning of the book and to the book's own Table of Contents. When the file was reindexed, the newly created Table of Contents gave users easy access to the individual books. Then once the user was in a book, there was the detailed Table of Contents, useful both for orientation and for choosing sections.
If you are doing the disc yourself, you may also find that some decisions made don't turn out exactly as you intended. The indexer program typically indexes 15 MB an hour. This makes it easy and fast to revise text and reindex until you are satisfied with the results.
Personalization. Once there is agreement on the core of the product, it is time to consider the various elements that help to make it a unique product to be identified with the organization. Depending on the indexing and retrieval package you have chosen, there will be different amounts of personalization possible. In KAware3 for Windows, there are five major elements: logo, icon, installation, type fonts, initial configuration.
ICON. KAware3 provides an icon to start the program. However, most organizations want to create their own using a utility like Image Editor. Depending on the complexity of the design and the graphic "talents" on the staff, this will either be done in-house or by a graphic artist.
LOGO. The opening logo reflecting the organization and/or product name can be created and installed. The program comes with a default logo so that it is not necessary to create a custom one. However, most organizations will either do this step in a utility like PC Paint or have a graphic artist prepare the logo.
INSTALL. KAware3 comes with an installation program that needs to be modified by the publisher. The product name should be inserted and the names of the individual files to be downloaded added. Even with the faster CD-ROM drives, most publishers download the retrieval system and a few other files that help to speed the searching.
INITIALIZATION. There are several features that the publisher can control. In this case, MicroFocus did not want the default opening to display the Search Dialog box where the user is asked "What do you want to search for?" They felt their users would want to start with Table of Contents access rather than word or phrase access. The placement of windows is also under the control of the publisher. For this product, the decision was made to display the Table of Contents on the top left, images on the top right, documentation text in the middle, and the summary of retrieved sections and number of "hits" (frequency of retrieved words per section) at the bottom. Other publishers may want quite different configurations. And, of course, users are able to change these configurations to match their own uses of the product. However, it is important for the publisher to create the presentation of the product for first-time users.
TYPE FONTS, SIZES, COLORS. And finally, the publisher has control over the look of the text on the screen. KAware3 uses a lookup file created by the publisher to control the look of the screen. This approach achieves many of the same results as an SGML file without the expense. The file contains:
2. Type font (e.g. Times Roman)
3. Type size (e.g. 24, 12)
4. Style (e.g. Bold, Italic)
5. Presence or absence of underlining
6. RGB values for the foreground color
7. RGB values for the background color
In KAware2 for DOS, there are 4 ways to personalize the product: opening screen, help, install, and configuration.
HELP. The help files are provided for publisher modification. This makes it easy to add information about your organization and/or about the product. A utility is also provided to convert the help file to a random access file that is used by the retrieval system.
INSTALL. A sample install program is provided with instructions for modifying it to be used with your CD-ROM product.
CONFIGURATION. As the publisher, you can control the default options for many of the features. As users become experienced, they can then change these options to match their searching style.
2. Check each image both for quality and for presence in correct directory
3. Accuracy of Installation program
4. Appropriateness of Help system (modifiable by publisher)
8. DAY 25-26: PREMASTER; PRODUCE "CHECK DISCS"; CORPORATE TESTING AND ACCEPTANCE
Premastering. When the product was ready for final approval, the files were premastered. Premastering software takes all the indexed files and puts them into ISO 9660, the standard file format for CD-ROM. In this case, only the indexed files and the retrieval software were premastered onto a large hard disk. MicroFocus took over the final premastering using the hardware and software purchased from us. Since they needed to add their COBOL software files to the disc, the in-house premastering approach meant they had control over the final steps.
On our production machine, we use a utility to "split" our 1.2 GB drive into a DOS partition and a CD-ROM partition. Files are premastered from the DOS partition to the CD-ROM partition. While the files are still on the hard disk, we do additional testing and simulation. The premastering software makes it easy to simulate the performance of the product on CD-ROM drives of different speeds. The premastering software also provides data on the frequency of access of each file so that their location can be modified to enhance speed of operation.
Produce "Check Discs". When we were satisfied with the premastered files, we produced two "check discs". These are recordable CDs that can be created in-house. Until recently, this was an expensive (about $300) and time consuming (3-7 days) step. Now, within 30 minutes you can produce your own CD that can be put in any CD-ROM drive and used. The blank recordable media retail for $25-$30. In an effort to keep the entry point for CD production as low as possible, we're selling the Kodak media for $17 each.
It is the introduction of the CD-Recorder that makes it possible to use CDs even for low-volume distribution of text and images. Previously you had to need at least 100 CD-ROMs to prorate the cost of the master and replicates. The economics have changed so that it is possible to make a few copies. Now libraries and smaller organizations can economically share information resources even if only 5 or 10 copies are needed.
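A rough way to see where the economics change is to compare the cost of recording every copy in-house with the cost of mastering plus replication. The sketch below uses only media, mastering, and replication prices quoted in this paper; it deliberately ignores staff time, recorder purchase, artwork, and licensing, so treat the result as a planning aid rather than a quotation. The function name is ours.

```python
import math


def break_even_copies(mastering_fee, replicate_cost, blank_disc_cost):
    """Smallest run size at which mastering plus replication costs no more
    than recording every copy in-house on blank media."""
    saving_per_disc = blank_disc_cost - replicate_cost
    if saving_per_disc <= 0:
        return None  # recording in-house is never the more expensive route
    return math.ceil(mastering_fee / saving_per_disc)


# Low-end prices: $1,000 mastering, $1.80 per replicate, $17 blank disc.
low_end = break_even_copies(1000, 1.80, 17)
# High-end prices: $2,800 mastering, $2.50 per replicate, $25 blank disc.
high_end = break_even_copies(2800, 2.50, 25)
print(f"Break-even run size: about {low_end} to {high_end} copies")
```

Below the break-even run size the CD-Recorder is the cheaper route on media cost alone; above it, the one-time mastering fee is spread over enough replicates to pay for itself.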
Our company recently introduced a CD-Recorder control unit called the CD-Maker. Designed and manufactured by our Italian partner, Alea Sistemi, the CD-Maker lets you produce CDs without a PC. The controller has a flat touch panel for entering commands such as "Make another copy? Y/N." The CD-Maker means that any staff member can create multiple CDs for small production runs or for classified or corporate information that cannot be sent to a manufacturing facility. This equipment is also appropriate in countries where there are no CD-ROM manufacturing facilities.
Corporate Testing and Acceptance of CD-ROM and User Manual. The "check disc" was given to MicroFocus to evaluate the product and determine if they wanted any additional changes. Several departments reviewed both the CD and the printed documentation. At this point it was still easy and inexpensive to make changes.
MicroFocus made the final "check disc" that included the COBOL software files as well as the indexed technical documentation and images along with the retrieval software. We sent the revised user manual to the printer.
9. DAY 27-29: MASTER AND REPLICATES PRODUCED USING 2-DAY TURN
It's day 27 and the rush is on to get the replicates. DMI had been selected to produce the multiple copies of the CD. Because of schedule constraints, the master and 450 replicates were produced using a two-day turnaround time. The manufacturing facility uses the "check disc" to create a glass master in a clean-room environment. The master stamps the inexpensive replicates. The replicates are then printed using the artwork supplied by the publisher. And finally, the discs are placed in jewel cases, packed, and shipped. We used overnight Federal Express service to meet the deadline.
10. DAY 30: REPLICATES AND USER MANUALS ARRIVE; DISTRIBUTION BEGINS
Day 30, 9:40 AM. The 450 replicates arrive. At 11 AM, the printer delivers the user manuals. MicroFocus was called and sent over a truck and staff to pick them up. By the end of the afternoon, they had the first copies in the mail.
Costs
The steps described above show how relatively easy it is to create a CD-ROM product. However, the next critical decision in the planning process is to determine if the costs of the project are affordable.
Costs for CD-ROM publishing will vary, of course. Some costs are non-recurring, such as the authoring system; other costs are recurring, such as replicates. Some costs are optional, such as the purchase of a CD-Recorder, while others are required, such as artwork. Still other costs depend on whether the project is done in-house or involves a contractor. Time requirements influence some costs, such as mastering and replicates. In spite of all these variables, we have provided ranges for each of 15 costs to assist in your planning. Because Knowledge Access's goal is to make CD-ROM production available and affordable to as many organizations as possible, we price our products and services at the low end of the range. However, there are other more expensive products that may better serve your needs and may still be considered affordable for your organization.
 1. Purchase CD-ROM authoring software                          $1,000-$35,000
 2. Purchase premastering software/hardware                     $5,000-$12,000
 3. Product design including technical specifications           $2,500-$10,000
 4. Convert printed documentation text to electronic text
    with mark-up as needed for chosen authoring system          $2-$12 per 1,000 characters
 5. Scan images                                                 $0.75-$2.50 per image
 6. Design CD-ROM artwork and create film positives             $300-$1,000
 7. Index electronic text, load images into correct
    subdirectories, quality control images                      $5,000-$25,000
 8. Test with retrieval software, making changes if desired     $500
 9. Develop and print user manual                               $2,000-$10,000
10. Premaster indexed files; simulate performance               $200-$1,000
11. Produce writable CD(s) for approval                         $100-$500
12. License retrieval software                                  $5-$75 per disc
13. Subcontract mastering                                       $1,000-$2,800
14. Subcontract replication of CD-ROMs                          $1.80-$2.50 each
15. Mail CD-ROM disc/manual                                     $1.67 postage each
    (without manuals)                                           $0.98 postage each
For any particular CD-ROM product, many of these items are not necessary. For instance, if you already have an electronic file, then you don't have costs associated with 4 and 5. If you create the product in-house then you won't have outside costs associated with 3, 7, and 8. If you purchase premastering software and hardware, then you don't have costs associated with 10.
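For planning, the 15 cost ranges can be combined into a low and a high project total. The sketch below is one possible way to do that; the item numbers and dollar ranges come from the list above, while the function, its parameters, and the 3,000,000-character figure in the usage example are our own assumptions (the case study does not state a character count). Postage uses the with-manual rate.

```python
# (low, high) dollar ranges keyed by item number from the cost list.
FIXED = {
    1: (1000, 35000), 2: (5000, 12000), 3: (2500, 10000), 6: (300, 1000),
    7: (5000, 25000), 8: (500, 500), 9: (2000, 10000), 10: (200, 1000),
    11: (100, 500), 13: (1000, 2800),
}
PER_1000_CHARS = {4: (2.00, 12.00)}
PER_IMAGE = {5: (0.75, 2.50)}
PER_DISC = {12: (5.00, 75.00), 14: (1.80, 2.50), 15: (1.67, 1.67)}


def estimate(characters, images, discs, skip=()):
    """Return (low, high) project totals, leaving out item numbers in skip."""
    low = high = 0.0

    def add(table, quantity):
        nonlocal low, high
        for item, (lo, hi) in table.items():
            if item not in skip:
                low += lo * quantity
                high += hi * quantity

    add(FIXED, 1)
    add(PER_1000_CHARS, characters / 1000)
    add(PER_IMAGE, images)
    add(PER_DISC, discs)
    return low, high


# Hypothetical project: character count is assumed; item 10 is skipped
# because premastering is done with purchased software and hardware.
low, high = estimate(characters=3_000_000, images=1000, discs=450, skip={10})
print(f"Estimated total: ${low:,.0f} to ${high:,.0f}")
```

Passing a different skip set reproduces the cases described above, for example skip={4, 5} when an electronic file already exists.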
CONCLUSION
Choosing to publish on CD-ROM used to be complicated, expensive, and unusual. Now that has changed. Inexpensive authoring software takes over many of the steps. Low-cost data conversion means that even materials not available in an electronic form can be distributed on CD. In-house use of the authoring software means products can be created with few outside costs. Well-priced premastering software and CD-Recorders bring additional control to the process. The acceptance of CD-ROM technology and the boom in the purchase of CD-ROM drives mean users of your information are ready for this new form of distribution.
The savings of electronic over print distribution are well documented. Whatever you decide to do today, remember that the CD-ROM tide is swelling. Learning all you can about it now will lead to reducing your costs and broadening the use of your information.