DOCUMENT IMAGING: A STEP TOWARD A PAPERLESS OFFICE

Wilfred W. Fong

School of Library and Information Science
University of Wisconsin-Milwaukee
Milwaukee, WI 53201, USA
E-Mail: WFONG@CONVEX.CSD.UWM.EDU

Although image processing technology has been available for many years, only recently has it drawn much attention in office automation. Rapid improvements in the technology and falling costs have made image processing systems more readily accessible. An imaging system scans paper documents and converts the information into digitized images, then compresses and stores the images on computerized storage devices such as optical disks. This type of imaging technology can reduce a roomful of filing cabinets to several optical disks. It also allows multiple users to access the same document simultaneously.

In general, paper documents are scanned in order of receipt or by category. Documents can be batched and scanned at one or more locations, local or remote. Most document scanners have a resolution of 200-400 dots per inch. High-speed scanners operate at up to 100 pages per minute, but with double-sided documents the rate drops to 55 pages per minute. The scanned image pages/files are then individually or selectively indexed and integrated into existing records or files. Several index terms may be entered for each scanned page. The first is a sequential index number, while the others link the image to its "parent" file, making retrieval possible through multiple paths. Because the images are now stored on magnetic media or optical disks, the integrity and content of the document are also preserved permanently. The quality of the scanned images is then examined. If an image passes the inspection, the paper copy is shredded or recycled, and the filing space is freed for other uses (Sherron, Jan/Feb 1992, p. 35). This technology brings the "real" office automation -- the paperless office -- closer to reality.

A document image processing system consists of scanners/optical character recognition (OCR), image compression/decompression boards, mass storage devices, laser printers, and indexing/retrieval software (see Figure 1).

Scanners/OCR are the primary input devices for an image system. Desktop scanners with a resolution of 300-400 dots per inch (dpi) and the capability of handling multiple gray scales and colors are becoming popular. There are two types of scanners: single-sheet and batch scanners. Single-sheet scanning handles one sheet of paper at a time, while batch scanning involves pre-batching pages with batch separators. High-speed scanners usually have batch scanning features. With OCR software, scanners can capture virtually any typeface in sizes of 6 points or larger, even with special effects such as italics, boldface and subscripts. In addition, a document can be scanned as an image format file. This capability allows drawings and halftones to be processed.

Image compression/decompression boards provide image enhancement, compression and decompression of scanned documents or images, and conversion from analog signals to digital bit streams. One of the problems in image processing is the size of a scanned image. A page of a text document scanned into an image format file is almost 100 times larger than the same page stored as a text format file. One double-spaced typed page may take about 3 kilobytes (KB) as text, but the same page scanned as a 300 dpi image requires nearly 60 KB of storage. The ratio depends on the size of the text image. The bigger the image and the higher the resolution, the larger the image file (Roth, 1990, pp. 8-9).
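The relationship between page size, resolution and file size can be sketched with a little arithmetic. The example below is only an illustration: it assumes a letter-size page (8.5 x 11 inches) scanned at one bit per pixel, neither of which is stated in the text, and the function name is hypothetical. Under those assumptions the uncompressed bitmap comes to roughly 1 MB, so a stored size near 60 KB would be consistent with the compression step described above.

```python
def raw_image_bytes(width_in, height_in, dpi, bits_per_pixel=1):
    """Uncompressed size, in bytes, of a scanned page (hypothetical helper)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Assumed page: letter size, 300 dpi, black-and-white (1 bit per pixel).
raw = raw_image_bytes(8.5, 11, 300)

text_bytes = 3 * 1024      # about 3 KB as a text file (figure from the text)
stored_bytes = 60 * 1024   # about 60 KB as a stored image (figure from the text)

raw_to_text = raw / text_bytes          # how much larger the raw bitmap is than the text
compression_ratio = raw / stored_bytes  # compression needed to reach about 60 KB

print(f"raw bitmap: {raw:,} bytes")
print(f"raw bitmap vs. text file: {raw_to_text:.0f}x")
print(f"implied compression ratio: {compression_ratio:.1f}:1")
```

Doubling the resolution quadruples the pixel count, which is why file size grows so quickly with scan quality.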

Due to the size of an image file, document image processing systems require mass storage devices, including magnetic and digital tape drives, hard disk drives, or optical disk drives. Although magnetic and digital tapes are portable and economical, their access speed is slow because they are read sequentially. For example, if an image is near the end of the tape, the drive has to wind through the entire tape before it can retrieve the image. This type of storage is ideal for archival materials that are seldom accessed. Hard disk drives are another storage medium. Because most hard disks are internal and their capacity cannot easily be expanded, a removable system such as the Bernoulli Box remedies this problem. Bernoulli disks, developed by Iomega Corporation, are portable, about the size of a 5.25-inch floppy disk but 0.3 inches thicker, and can store up to 90MB. The average seek time of a Bernoulli disk drive is 27 msec. Optical disks, such as write-once read-many (WORM) disks and compact disk-read only memory (CD-ROM), are another alternative for mass storage. A 12-inch WORM disk can hold about 2.5 gigabytes of characters or about 50,000 page images, and the 5.25-inch disk holds about 20,000 page images. A 5.25-inch CD-ROM can store about 600MB of characters. Archival storage on WORM or CD-ROM optical disks is the most cost-effective alternative for a growing number of information management environments. In situations where access to stored records is frequent, where quick retrieval is necessary or where records can be purged after several years, optical disk storage often provides the most efficient solution at the lowest cost. In addition, optical disks offer greater security than magnetic tape storage and enormous space savings over both tape and paper formats. Once information is written on the disk, it is unalterable.

However, with an average access time of 400 msec, an optical disk is considerably slower than a SCSI hard disk at 17 msec or a Bernoulli disk at 27 msec. Optical disks with a faster average access time of 256 msec are now available on the market (Santelli & Hurley, 1990, p. 60).
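The capacity figures quoted for the 12-inch WORM disk imply an average stored size per page image, which can in turn be used to estimate how many page images other media might hold. The sketch below is illustrative only: it assumes decimal units (1 GB = 10^9 bytes), assumes that every page image is the same size, and uses hypothetical helper names. The Bernoulli and CD-ROM page counts it produces are estimates derived from those assumptions, not figures from the cited sources.

```python
GB = 10**9  # decimal units are an assumption
MB = 10**6


def bytes_per_page(capacity_bytes, pages):
    """Average storage per page image implied by a capacity and a page count."""
    return capacity_bytes / pages


def pages_that_fit(capacity_bytes, page_bytes):
    """Whole page images that fit on a medium of the given capacity."""
    return int(capacity_bytes // page_bytes)


# 12-inch WORM disk: about 2.5 GB or about 50,000 page images (from the text).
page_bytes = bytes_per_page(2.5 * GB, 50_000)

# Estimates for other media at the same average page size.
bernoulli_pages = pages_that_fit(90 * MB, page_bytes)
cdrom_pages = pages_that_fit(600 * MB, page_bytes)

print(f"implied size per page image: {page_bytes / 1000:.0f} KB")
print(f"90MB Bernoulli disk: about {bernoulli_pages:,} page images")
print(f"600MB CD-ROM: about {cdrom_pages:,} page images")
```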

Another important element of an imaging system is its output devices. One such device is the high-resolution, large-screen monitor, which can display from 200 to 400 pixels per inch and can zoom, pan, and rotate the image. Images can also be produced on paper using high-resolution laser printers. Using modern communication technology, images can even be transmitted directly to a remote location by fax or through international networks such as the Internet or Bitnet.

Retrieval software serves as an intermediary between the end-user and the hardware. The entire process, from scanning the document/image to saving the scanned images, is simple. However, retrieving such images demands an efficient means of organizing the stored material. Image databases differ from regular textual databases in many ways: content, organization, storage, coding and abstraction, query languages, search strategies, and output (Roth, 1990, p. 11). An image database is a system in which a large amount of image data and related information are stored in an integrated manner with a management system (Tamura & Yokoya, 1984). The related information is a textual description of an image. Searching through an image database using the entire image or a portion of it is not feasible given the speed of PC microprocessors. To expedite the searching process, one has to use the text associated with the images. When a document is to be stored on a document image system, the user enters one or more search terms or criteria into the retrieval software system. These terms or criteria are then used when retrieving images.
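The idea of retrieving images through their associated text, rather than through the image data itself, can be sketched as a small in-memory index. This is a minimal illustration under stated assumptions, not the design of any particular retrieval product: the class name, the method names and the storage locations are all hypothetical. Each stored image receives a sequential number, and every search term entered by the user points back to the images filed under it.

```python
from collections import defaultdict


class ImageIndex:
    """Toy text index over stored page images (hypothetical design)."""

    def __init__(self):
        self._postings = defaultdict(set)  # search term -> set of image numbers
        self._locations = {}               # image number -> where the image is stored
        self._next_id = 1

    def add(self, location, terms):
        """File an image under the given search terms; return its sequential number."""
        image_id = self._next_id
        self._next_id += 1
        self._locations[image_id] = location
        for term in terms:
            self._postings[term.lower()].add(image_id)
        return image_id

    def search(self, *terms):
        """Return the numbers of images filed under every one of the terms."""
        if not terms:
            return []
        matches = [self._postings.get(t.lower(), set()) for t in terms]
        return sorted(set.intersection(*matches))

    def location(self, image_id):
        return self._locations[image_id]


index = ImageIndex()
first = index.add("disk1/0001", ["invoice", "Acme", "1992"])
second = index.add("disk1/0002", ["memo", "Acme"])
third = index.add("disk2/0001", ["invoice", "Baker", "1992"])

print(index.search("acme"))             # images filed under one term
print(index.search("invoice", "1992"))  # images filed under both terms
```

Only the short text terms are searched; the image itself is fetched by its stored location once a match is found.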

Assigning an accession point is like indexing, which is often quite labor-intensive. Automated indexing approaches have been used to expedite the process. In one approach, a document is first scanned using optical character recognition (OCR) to convert the text into an ASCII file. Indexing terms, or accession points, are then generated automatically and placed in the appropriate database fields for future retrieval. Another automated method is bar-coding, which is used to encode file folder or document indexing information. Bar codes can represent any meaningful number, such as an invoice number, employee identification number or account number (Thiel, Summer 1992, pp. 46-48).
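The OCR-based approach can be illustrated with a simple term extractor that works on text already converted to ASCII. This is only a sketch of one possible method, assuming that frequent non-trivial words make reasonable index terms; the stop-word list, the function name and the sample text are hypothetical, and commercial systems may generate accession points quite differently.

```python
import re
from collections import Counter

# A deliberately tiny stop-word list, chosen for this example only.
STOP_WORDS = frozenset(
    {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "by", "with"}
)


def extract_index_terms(ocr_text, max_terms=5, min_length=3):
    """Pick candidate index terms from OCR output by word frequency."""
    words = re.findall(r"[a-z0-9]+", ocr_text.lower())
    counts = Counter(
        w for w in words if len(w) >= min_length and w not in STOP_WORDS
    )
    # Most frequent first; ties are broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:max_terms]]


sample = (
    "Invoice 10472 for the purchase of optical disks. "
    "Invoice due on receipt of the disks."
)
terms = extract_index_terms(sample, max_terms=3)
print(terms)
```

The extracted terms would then be placed in the database fields used for retrieval, in place of terms typed in by hand.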

Another automated indexing method based on artificial intelligence is the PixTex Electronic Filing System produced by Excalibur Technologies Corporation in Virginia, USA. PixTex looks for patterns in binary data instead of exact matches in text. It then builds an index based on the binary representation of the ASCII text, not on the ASCII text itself. This mapping approach produces indexes only one-third the size of the original data. The images can then be retrieved by using free-form, associative, and open syntax clues (Thiel, Summer 1992, p. 46).

In 1989, the International Organization for Standardization (ISO) published the International Standard "Office Document Architecture (ODA) and Interchange Format." The ODA, known as ISO 8613, is an attempt to standardize the storage and retrieval of image databases. The documents envisaged under this standard are primarily, but not exclusively, those encountered in an office environment, such as letters, memos, reports, manuscripts, contracts or invoices. The receiver of such a document should be able to render the document on a printer or a computer display in the form specified by the sender, but should also be able to process the document, for example, to modify its content or its visual appearance. The imaging process is the final processing step for ODA documents described in the Standard. Since the imaging process is obviously very hardware dependent, the ODA Standard contains only a few specifications on how the imaging process should be performed, but it does include the rules for determining the content of basic layout objects during the imaging process (Appelt, 1991). Unfortunately, most document image system vendors have not yet adopted this standard.

A document image management system is designed to scan, index, retrieve, store, and print scanned documents. There are three types of document image systems:

• Desktop systems are designed for single-user applications. They handle small document volumes and are more economically priced.

• Departmental systems handle between five and one hundred simultaneous users and hundreds of thousands of documents. Sometimes referred to as mid-size systems, they are designed for work groups or corporate departments.

• Enterprise-wide systems are designed to accommodate all of a large organization's documents and users. These can accommodate millions of documents and thousands of users (O'Connor, July/August 1991, p. 230). This type of system is usually operated on a network and equipped with high-speed scanners.

To select the correct document image system, it is important to conduct a thorough study of the applications. First, the main objectives for the system should be determined. This step establishes precisely what the information management needs of the organization are (Santelli & Hurley, 1990, p. 46). The benefits of implementing the system, such as faster access to information, access to more information, reductions in storage space, better document security and improvements in document retention, should be evaluated carefully (O'Connor, July/August 1991, p. 231).

Second, potential users of the system must be identified. Questions such as who will be using what on the system must be answered. In other words, the type of information to be released and the security level of access for each user have to be included in the feasibility study.

Third, the type and size of documents to be included in the document image system, and the total document volume or the number of document/image pages to be stored on the system, need to be determined. In this step, an appropriate storage medium is selected based on the number of documents/images that need to be stored.

Fourth, the retrieval keys of the scanned documents have to be designed. Every document image system allows users to assign keys, or accession points, to each document. The administration has to determine which accession points are to be used and how the documents should be organized on the system. In some cases, full-text indexing might be required if the users want to search by entering any word or group of words to find the documents that contain those words (O'Connor, July/August 1991, p. 231). Access to the scanned documents/images depends on the retrieval software system and is sometimes limited by the features and functions available in that software. A comparison study must be conducted to examine various retrieval software programs and to see which one best fits the needs of the applications.

After the four steps above have been completed in a thorough study and analysis of implementing a document image system, the final step is to contact and consider image system vendors. There are four types of document image system vendors. The first type is micrographics vendors, which are primarily oriented toward storage and have their roots in the micrographics and paper-storage industries. Examples of micrographics companies are Eastman Kodak, 3M, Bell & Howell Document Management Products and Tab Products. The second category is the manufacturers of dedicated storage and document image systems. They produce storage units, image servers and workstations as well as document image retrieval software systems. These companies usually have stand-alone systems designed for single-user applications. The third group includes vendors that are integrating optical disk storage and document image applications into the PC environment. This group is usually involved in the development of retrieval and indexing software for document image systems to be used with existing computer hardware. Examples of this group are value-added resellers or consulting firms specializing in document imaging. The final type is mainframe computer manufacturers such as IBM and Wang, which offer applications to take advantage of their installed bases of office automation and computer applications. These manufacturers concentrate their product development on enterprise-wide document imaging systems (Wohl, January 1991, p. 32; Sherron, Jan/Feb 1992, p. 37).

Imaging systems offer a number of advantages to organizations with large quantities of paper records that require certain processing steps and/or actions. The benefits of implementing imaging systems include greatly reduced storage space needs, speed of retrieval, secure and organized storage of documents, multiple-user access to documents, and the ability to integrate imaging with existing computer applications and functions (Sherron, Jan/Feb 1992, p. 34). With images stored on optical disks, there is no way to delete or change an image. This protects the documents against alteration. Once an image is retrieved, the user can transmit it electronically via electronic mail systems or faxes. The ability to integrate optical disk-based documents with other computer-based information is a major advantage of document image systems. Integration supports online access from a workstation to other computer services, such as databases stored on mainframes or minicomputers. The images can then be incorporated with related database, word processing or spreadsheet files simultaneously in a multi-window screen presentation (Wohl, January 1991, p. 34).

In order to evaluate the cost-effectiveness of document image systems, the cost of paper storage should be studied (see Figure 2). For example, office space in one of the World Trade Center Towers in New York City costs approximately $35 per square foot a year. A letter-size file cabinet occupies about six square feet. The annual cost for just one file cabinet totals $2,520. In Midtown New York, the same file cabinet costs about $1,800 per year. It will cost approximately $1,130 per year in Boston at $15.70 per square foot, and about $1,080 annually in Columbus, Ohio at $15 per square foot. In Houston, Texas, the file cabinet space will cost approximately $916 (Gilder, May 1992, p. 42). For an office with an average of 6 file cabinets and 25% annual growth in paper filings, the cost of maintaining paper storage space will increase dramatically every year.
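The growth in paper storage cost can be projected with a short calculation. The sketch below rests on an assumption that should be checked against the original source: the annual totals quoted above (for example, $2,520 at $35 and $1,080 at $15) are reproduced by multiplying the rate by six square feet and then by 12, so the code treats each rate as a monthly charge. The function names and the four-year horizon are hypothetical and chosen only for illustration.

```python
CABINET_SQ_FT = 6  # floor space for one letter-size file cabinet (from the text)


def annual_cabinet_cost(rate_per_sq_ft, cabinets=1.0):
    """Yearly floor-space cost; the rate is ASSUMED to be charged 12 times a year."""
    return rate_per_sq_ft * CABINET_SQ_FT * 12 * cabinets


def project_costs(rate_per_sq_ft, start_cabinets, growth, years):
    """Yearly cost when the number of cabinets grows by `growth` each year."""
    costs = []
    cabinets = float(start_cabinets)
    for _ in range(years):
        costs.append(round(annual_cabinet_cost(rate_per_sq_ft, cabinets), 2))
        cabinets *= 1 + growth
    return costs


wtc = annual_cabinet_cost(35)        # matches the $2,520 quoted in the text
boston = annual_cabinet_cost(15.70)  # matches the roughly $1,130 quoted in the text

# Six cabinets growing 25% a year at the Columbus rate of $15.
projection = project_costs(15, 6, 0.25, 4)

print(f"World Trade Center, one cabinet: ${wtc:,.0f}")
print(f"Boston, one cabinet: ${boston:,.0f}")
for year, cost in enumerate(projection, start=1):
    print(f"Columbus, year {year}: ${cost:,.2f}")
```

Under these assumptions the cost of floor space alone nearly doubles within four years.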

The cost of an image system depends on the functions and features available on the system. A basic imaging system with a PC, scanner, retrieval software, optical-disk drive, and laser printer may cost about $20,000. With the networking option and an additional five workstations, the entire system will cost approximately $100,000. The price will rise as more options are added. Such options include a jukebox of optical disks, high-speed scanners with double-sided scanning capability, high-resolution image display monitors and high-resolution laser printers. Furthermore, most local area networks are not designed for the transmission of images due to the size of image files. A networked document imaging system must be installed on a separate network. This will also increase the setup cost of a document imaging system. As Sherron observes, "Costs are high because seldom is there a suitable off-the-shelf design: usually the hardware configuration has to be custom designed for the user's application" (Sherron, Jan/Feb 1992, p. 37).

What is a "Paperless Office"? A simple definition of the term is an office with no paper. It is an environment that makes heavy use of information technology to automate office operations. Since the introduction of computer technology, the term "office automation" has surfaced among many office information managers. The phenomenon has grown rapidly, especially since personal computer technology was introduced into the office environment. Word processing software has replaced many typewriters; electronic mail has replaced inter-office memos; database programs have replaced the Rolodex; and the list goes on and on.

If we want to reach the "paperless" stage, we need to know how an office operates. Using a document imaging system can greatly affect the workflow as well as the distribution of information and the personnel structure of an organization. A careful and thorough study should be conducted to examine and define the objectives of such an application and to identify the requirements to meet them. As Doswell stated, discussing office automation without the organizational context is like describing computer technology without the technological context (Doswell, 1990).

Different offices have different operational and management strategies; however, the basic office procedures fall within a spectrum of standard operations. Hines combined both technical and human factors into a formula to compute the success of an office operation:

Technical success implies the successful use of information technology in an office environment. This requires not only that the hardware and software work, but also that they be used effectively. Effectiveness is the human factor, which calls for continuing training and education of the users.

The technology is advancing rapidly, which makes powerful computer hardware more accessible and affordable. In less than five years, computer prices have dropped drastically. A 386 PC now sells in the same price range as a PC-XT did several years ago. Intel, the maker of the PC microprocessor, has been improving the power of the 86 family of processors. The 486 chip is smaller and more powerful than its predecessor, which makes portable PCs lighter to carry and more powerful to work with. Software programs follow the same pattern. Programs are becoming more and more powerful and contain more and more features. Sometimes, a program contains more options than a normal user will ever need.

With the development of the Fiber Distributed Data Interface (FDDI), the flow of information, especially image files, on a network will be greatly enhanced. FDDI is a fiber-optic token-passing ring system that can transfer data at a rate of 100Mbit/sec, compared with the 16Mbit/sec limit of coaxial cable. It also offers multivendor interoperability and interconnection. In addition, research and development in the areas of artificial intelligence (AI) have greatly improved the office automation environment. AI includes expert systems, natural-language interfaces, voice recognition, and machine vision. With 486- and 586-level microprocessors, even more sophisticated AI applications will become available at the desktop level. Although imaging technology still lacks the hardware and software standards necessary to render optical storage a complete replacement for traditional magnetic storage devices, with the growing popularity of erasable optical disks, a standard will surface very soon.
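The effect of the higher data rate on image traffic can be estimated with simple arithmetic. The sketch below is a rough illustration only: it takes the 60 KB page image size mentioned earlier in this paper, assumes that the full nominal data rate is available to a single transfer, and ignores protocol overhead and contention from other users, so real transfer times would be longer. The function name is hypothetical.

```python
def transfer_ms(size_bytes, rate_mbit_per_s):
    """Idealized time, in milliseconds, to send a file at a nominal data rate."""
    bits = size_bytes * 8
    return bits / (rate_mbit_per_s * 1_000_000) * 1000


image_bytes = 60 * 1024  # one compressed page image of about 60 KB

fddi_ms = transfer_ms(image_bytes, 100)   # FDDI at 100Mbit/sec
slower_ms = transfer_ms(image_bytes, 16)  # network limited to 16Mbit/sec

print(f"100Mbit/sec: {fddi_ms:.1f} msec per page image")
print(f"16Mbit/sec:  {slower_ms:.1f} msec per page image")
print(f"speed-up:    {slower_ms / fddi_ms:.2f}x")
```

The difference per page is small in absolute terms, but it accumulates when many users retrieve multi-page documents at the same time.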

A remark by Golden and Eisenberger best summarizes the implementation of document imaging systems: "although things will improve, nothing happens overnight" (Golden & Eisenberger, Winter 1990).

Appelt, Wolfgang. (1991). Document Architecture in Open Systems: The ODA Standard. New York: Springer-Verlag. p. 10.

Doswell, Andrew. (1990). Office Automation: Context, Experience and Future. Chichester, England: John Wiley and Sons. pp. 4-5.

Gilder, Jules H. (May 1992). "How Databases Work," Document Image Management Systems, 1: 42.

Golden, Cynthia & Eisenberger, Doreit. (Winter 1990). "The Effect of Relational Database Technology on Administrative Computing at Carnegie Mellon University," CAUSE/EFFECT, p. 53.

Hines, Douglas V. (1985). Office Automation. New York: John Wiley and Sons. p. 12.

O'Connor, Mary Ann. (July/August 1991). "Implementing Optical Storage: How to Select a Document Image Management System," Document Image Automation, pp. 230-231.

Roth, Judith Paris. (1990). "Electronic Image Information" in Judith Paris Roth, ed. Converting Information for WORM Optical Storage: A Case Study Approach. Westport, CT: Meckler Corporation. pp. 8-9, 11.

Santelli, Gerard A. & Hurley, James J. (1990). "Optical Disk Technology: Selection and Conversion" in Judith Paris Roth, ed. Converting Information for WORM Optical Storage: A Case Study Approach. Westport, CT: Meckler Corporation. pp. 46, 60.

Sherron, Gene. (January/February 1992). "Imaging Systems: An Overview for Management," EDUCOM Review, pp. 34-35, 37.

Tamura, H. & Yokoya, N. (1984). "Image Database Systems: A Survey," Pattern Recognition, 17 (1): 29-43.

Thiel, Thomas. (Summer 1992). "Automated Indexing of Document Image Management Systems," Document Image Automation, 12: 46-48.

Wohl, Amy D. (January 1991). "Large-scale Optical-disk Systems Offer Big Benefits," Today's Office, 25: 32-34.