IMPLEMENTATION AND EVALUATION OF AN ARCHIVAL IMAGING PROJECT
 

Marilyn Lutz

Raymond H. Fogler Library
University of Maine System
Orono, ME 04469-5729, USA

E-mail: lutz@maine.maine.edu


Keywords: Digital Technology, Multimedia Technology, Visual Information Age, Information Access, Scanning Technology, Image Capture, Electronic Imaging, Archives Management, University of Maine, Network Access, Internet, Tagged Image File Format, TIFF, JPEG, Joint Photographic Experts Group, OCR, Optical Character Recognition, Optix Network System, Conversion, UNIX, Centris 650, AMC, Archives and Manuscripts Control, NLS, Natural Language Search, Full-text Indexing.

Abstract: This paper gives an overview of an Archive Imaging Project that aims to preserve unique resources from deterioration and to provide Internet access to them by applying optical scanning technology to capture text and images in digital form. Topics covered include a definition of the process, strategic planning, system selection criteria, lessons learned in the conversion process, retrieval using natural language searching, and linking the image database to the online catalog holdings. The project served as a realistic laboratory for investigating the technological problems associated with creating surrogate documents in digitized form, full text retrieval, display, user interface design and delivery, as well as the appropriateness of electronic imaging technology for preservation of and access to archival resources and its impact on library service.
 

1. INTRODUCTION

In October 1992, the University of Maine System Libraries began a project to use digital image technology to preserve a wide array of archival materials and to disseminate document surrogates via statewide and national networks. Three prominent archival collections on Maine and the Canadian Maritimes (the Maine Folklife Center and Special Collections at Fogler Library on the University of Maine Orono campus, and the Acadian Archives/Archives acadiennes on the Fort Kent campus) are being converted to bitmapped images stored on optical disks and, where possible, to ASCII text files searchable with full text retrieval software. These collections represent the results of research in Maine folklore and folklife, and much of the information is unavailable from any other source.

Network access to the image database is provided via a client/server architecture that enables the transmission of the images to common computer platforms, including the IBM PC, the Apple Macintosh, and the UNIX workstation. An electronic link between the image database and bibliographic records in the library public access catalog extends access to any device (including dumb terminals) that connects to the Library gateway via local Ethernet or the Internet to search the holdings records of the archives. Those surrogate documents which could be converted to ASCII text are linked to the bibliographic records and may be retrieved from the image database and displayed on public access terminals.

2. BACKGROUND ON ARCHIVES IMAGING PROJECT

The archives imaging project grew out of an interest among the University librarians in facilitating access to and use of local collections via the network. Their directives in this era of downsizing and budget cuts were to rethink and redesign information services. In this climate, it is not surprising that untapped resources like archival collections would become a focus of attention, and that the use of optical scanning technology and full text retrieval software would receive serious consideration among the options for improving access to and preserving Maine’s historical resources.

The archival collections are diverse in nature but highly complementary, and each archive was exploring the possibilities of improving access to the materials entrusted to it through the use of computer technology. Up to this point, access to these unique collections meant traveling to the archive to use finding aids or other collection guides which substituted for full cataloging, and often even that level of access was restricted because of the frailty of the documents. Small and large organizations alike face problems of control, cataloging, preservation and access to historical documents scattered across wide geographic areas (Kenney, Friedman & Poucher, 1993). Optical disk technology provides a viable, sensible solution to these recurrent problems. Using the power of an electronic imaging system, with full text indexing, in a distributed networked environment enables the library to create facsimiles of the documents and finding aids and to revolutionize service for these collections.

3. PROJECT GOALS

The purpose of the project was to broaden access to unique archival resources and preserve them from deterioration by applying optical scanning technology to capture text and images in digital form. With enhanced access to preserved materials through surrogate documents and finding aids, researchers can use the resources of archives without physically going to them. From an information management perspective, it was an opportunity to explore the appropriateness of digitization for archive resource management and the impact of integrating such services into the library’s existing environment. The project focused on the use of digital technology for preservation and access to archival resources, and investigated the technological problems associated with storage, full text retrieval, image display, user interface design, and delivery.

4. PROJECT DEFINITIONS

Digital image technology is defined as the electronic copying of bit mapped (binary) images at a scanning resolution of 200 dots per inch (dpi). The files were encoded in Aldus/Microsoft TIFF Version 6.0 (Tagged Image File Format), one of the most popular file formats for document images. TIFF provides a means for labeling a file so that application software can decipher the file, thus making it possible to exchange information.

Several different methods of compression and decompression are encompassed within the TIFF definition, including formats for compressing gray-scale and color images. TIFF also allows for annotations to be attached to an image without changing the original image, and meets the Internet Engineering Task Force standard definition for the exchange of images on the Internet.

Image files, because of their size, must be compressed prior to storage and transmission. Image compression/decompression algorithms vary based on the content of the files (black and white, gray scale, color), and the techniques used to shrink the file size. Initially, text files (black and white) were compressed using facsimile-compatible CCITT (International Telegraph and Telephone Consultative Committee) Group 4 compression, and gray scale images (32 levels) were compressed using LZW (Lempel-Ziv-Welch). JPEG (Joint Photographic Experts Group) is used consistently now, as its compression ratio allows for greater economy in storage and faster network delivery of gray scale and color documents (it cuts delivery times in half).
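The need for compression can be seen from a rough size calculation. The sketch below is illustrative only: the 200 dpi resolution comes from this paper, while the 8.5 x 11 inch page size, the choice of 8 bits per pixel for gray scale storage, and the function name are assumptions made for the example.

```python
def raw_image_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Return the uncompressed size in bytes of a scanned page.

    width_in, height_in: page dimensions in inches (assumed values)
    dpi: scanning resolution in dots per inch
    bits_per_pixel: 1 for bilevel text, 8 for a typical gray scale file
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    # Round up to whole bytes.
    return (total_bits + 7) // 8


# A letter-size page (assumed 8.5 x 11 inches) scanned at 200 dpi.
bilevel = raw_image_bytes(8.5, 11, 200, 1)
gray = raw_image_bytes(8.5, 11, 200, 8)
print(f"bilevel page: {bilevel:,} bytes")
print(f"gray scale page: {gray:,} bytes")
```

Under these assumptions a single bilevel page occupies roughly half a megabyte before compression, and a gray scale page eight times that, which is why compression precedes both storage and transmission.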

Text contained in an image is not converted to alphanumeric form (for textual interpretation or indexing purposes) at the time of scanning. Instead, OCR (Optical Character Recognition) is applied afterward to examine the raster image data and recognize letters and numbers on the page. This process generates an ASCII text file, which is then indexed using full text retrieval software.

5. SELECTION OF MATERIALS

The Maine Folklife Center is a repository for manuscripts, tape recordings, transcripts of recordings, and photographs which document the folklore, folklife, social and cultural history of Maine and the Canadian Maritime Provinces. The collection comprises 2,500 accessions (an estimated 125,000 pages of text and 7,500 photographs). About one half of the typed transcripts of oral recordings and accompanying photographs have been digitized to date.

A core group of the most prominent accessions in the holdings of Fogler Library's manuscript collection was selected, including primary resource materials on Maine-related local history for which collection guides existed, and vertical file documents of regional and national importance. The latter is an eclectic collection of singular, small-size accessions which vary widely in content, date, and format, but offer interesting visual items (broadsides, letters, logs, and diaries related to U.S. political figures, literature, and the U.S. Civil War). About 40 accessions have been added to the image database, and conversion is ongoing.

The Acadian Archives/Archives acadiennes is a repository for information on the culture and history of the Upper St. John Valley, with particular attention to the Acadian presence.

6. PLANNING FOR AN OPTICAL IMAGING SYSTEM

Planning for an imaging system means collecting information and advice from vendors and authorities in the field with experience of each aspect of the process (scanning, OCR, optical storage, and full text retrieval). The process is complicated by the recent proliferation of equipment and software for what is perceived as a "cutting edge technology" market with a multi-million dollar future. Imaging systems are expensive and require a commitment of time and money, not only for the initial capitalization of the system, but also for ongoing staff to maintain the system and plan its future development. Before a system can be implemented, decisions must be made on database contents, storage requirements, system architecture, site preparation, database structures to support retrieval, licensing for concurrent users, and retrospective conversion (Folen and Stackpole, 1993). A good strategic plan should include the following considerations:

Determining storage requirements: Identify documents for conversion and estimate annual growth.

Developing basic system architecture: Determine the number and types of hardware components, hardware storage requirements and media to use, network requirements (how many people want access to the system and how fast), and special needs (custom software); determine tradeoffs in system architecture (which process is most important: image quality, retrieval or speed of delivery).

Planning the site: Locate space for the system and address site requirements (power, wiring, lighting).

Defining indexing structures for retrieval and access: Determine extent of structured indexing needed, and design user interface.

Implementing the system: Install the system; configure the operating system, database management system and network; define users and security; initialize mass storage devices; establish system usage procedures; train staff and users.

Converting backfile of existing documents: Develop work flow for the logistics of conversion; determine staff needs; conduct document inventory analysis (what type of documents will be converted? how many documents will be scanned? what is the condition of the documents? are there any existing index records?); establish quality control methods to ensure the integrity of scanned images and index elements entered into the database.

Allocating sufficient budget: Establish an ongoing budget sufficient to cover costs of maintenance of equipment and software, system upgrades, expanding concurrent user licenses, and staff to continue document conversion.
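The storage requirements consideration above lends itself to a simple projection. The following Python sketch is illustrative only: the page count (125,000) and the 1-GB optical disk capacity are taken from this paper, while the average compressed page size of 50 KB, the growth rates, and the function names are assumptions chosen for the example.

```python
import math


def projected_storage_gb(initial_items, avg_item_kb, annual_growth_rate, years):
    """Estimate storage in GB (decimal) after compound annual growth.

    avg_item_kb is an assumed average compressed size per item, in KB.
    """
    items = initial_items * (1 + annual_growth_rate) ** years
    return items * avg_item_kb / 1_000_000  # KB -> GB


def disks_needed(total_gb, disk_capacity_gb=1.0):
    """Number of optical disks required, rounding up to whole disks."""
    return math.ceil(total_gb / disk_capacity_gb)


# 125,000 pages at an assumed 50 KB each, with no growth.
baseline_gb = projected_storage_gb(125_000, 50, 0.0, 0)
print(baseline_gb, disks_needed(baseline_gb))

# The same collection growing by an assumed 10% per year for 5 years.
future_gb = projected_storage_gb(125_000, 50, 0.10, 5)
print(round(future_gb, 2), disks_needed(future_gb))
```

With these assumed figures the text backfile alone would need 6.25 GB, or seven 1-GB disks. Gray scale photographs would have to be added as a separate line with their own, larger, average size.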

7. SYSTEM SELECTION

There are many well-qualified vendors in the field who can provide systems tailored to specific requirements and budgets (Datapro Reports, 1991). Selection of equipment and software must be a carefully weighed decision, as the major components of the system become the platform for future developments. Sufficient memory to support the processing and retrieval of image files, particularly gray scale or color photographs, is as crucial to the system as the software, but it is frequently sized incorrectly.

The Optix Network System from Blue Ridge Technologies is an off-the-shelf document imaging system which uses Macintosh hardware and proprietary multi-user software, and is available in stand-alone and customized networked configurations. Optix uses optical scanning methodology to create digitized images and ASCII text files which can be retrieved by structured headings or by full text searching using the optional NLS (Natural Language Search) component. Among the most important system characteristics influencing our decision to buy were the client/server architecture, distributed processing, the UNIX operating system, integrated software components that included both OCR and text searching options, and user licensing.

Optix applications run under AUX/System 7 UNIX with Berkeley extensions and may be configured to operate multiple tasks. The system software integrates Informix SQL as the database manager, Text Pert as the OCR software engine, and Conquest developed by Johns Hopkins University as the search engine for full text retrieval (natural language searching). The Optix scanner interface scans and compresses documents at the same time, and supports software-based CCITT Group 4, LZW and JPEG compression/decompression algorithms.

The Archive project configuration uses two high resolution scanning workstations (Centris 650 with 24 MB RAM) linked by SCSI interface to high speed Fujitsu 3090G scanners. The workstations, located in Fogler Library (Orono campus) and Acadian Archives (Fort Kent campus), are used to scan, OCR, compress, decompress, view, annotate, edit, index, retrieve and print documents. Viewing or display stations are located in the three archives, and are standard Macintosh computers (Centris 610 with 16 MB RAM) capable of viewing ASCII text and bit mapped images.

The scanning workstations and display stations are networked to the nucleus of the system, an image and text server (a Quadra 950) and an optical jukebox (NKK optical jukebox with 56 GB capacity), across fiber optic local area networks connected by T1 links. Optical storage (5.25 inch optical disks with 1-GB capacity) is front-ended by 2 GB of magnetic storage used for caching and for storage of ASCII text files. Both the server and the jukebox, along with a tape drive for backup, are located in the computer center on the flagship campus in Orono.

8. CONVERSION PROCESS

With materials to be scanned in hand, the imaging process generally involved five basic steps.

Image capture: Documents are placed on a flatbed scanner, which converts them to digital form in much the same way that a facsimile machine does, creating an electronic image. Given the nature of the archival materials, documents could not be batched into an automatic document feeder. The scanning operation is prompted by menu-driven cues to set dpi (dots per inch), page size, and type of image (bilevel, gray-scale, halftone). Since such images consist of very large amounts of data, the scanning workstation electronically "compresses" the document to a fraction of its original size. The quality control check of scanned images is performed page by page on the scanning workstation. Before being stored, the image is indexed using structured headings derived from the bibliographic records in the online catalog. A worksheet for each accession records pertinent scan data such as number of pages scanned, length of time, scanner settings, and comments on unusual enclosures. The number of pages, accessions, and photographs scanned, and the time required, are entered in a daily statistics log.

In order to make available surrogates of primary research material, we found it essential to establish scanning conventions which duplicate the way a researcher examines an original source. Scanning guidelines were written for handling double-sided pages; folded articles or brochures; loose, unnumbered enclosures; oversize documents; envelopes with contents; photographic contact sheets; and secondary source material.

Storage: Compressed and indexed documents are stored on either optical disk or magnetic disk. Optical disks are better suited to the long-term archival storage of large amounts of information which do not require constant retrieval. Magnetic storage, with its rapid response time, is used when a document is small and must be called up often and quickly. Images (bit mapped files) are stored on optical disk. Their counterpart ASCII files are stored on magnetic disks.

OCR process: Optical character recognition (OCR) converts the bit mapped image into ASCII text files which can then be indexed by full text retrieval software. Documents are batched and processed on the workstation at night. The quality control check of the converted text files involves assigning an editing level to the document, depending on the error rate of the OCR process. Documents are edited to eliminate OCR errors, especially those affecting natural language search indexing. Structured index headings are also assigned at this time.

We determined that many of the typed transcripts of oral recordings could not be converted using OCR technology. An alternative plan was devised to scan printouts of the bibliographic records of these documents. Creating these record place holders broadens the access to the image files, providing as much access as is currently possible, although only a fraction of the access we had anticipated with full text retrieval software.

Create full text index: The edited text file is placed in a job queue for full text indexing. The system may be configured to index continuously or at scheduled intervals. Text indexing is spot-checked.

Link bibliographic and image databases: Custom software was developed to link bibliographic records of archives holdings in the library online catalog with the respective text file documents in the image database. To accomplish the link, system assigned object IDs for image and text documents are entered in a formatted field 538 of the MARC bibliographic record. Users searching the online catalog may retrieve the corresponding text document from the image database and display the text.
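The linking step can be pictured as a small parsing task. The sketch below is hypothetical throughout: the paper states only that object IDs are entered in a formatted field 538, so the "IMG=" and "TXT=" labels, the field layout, and the function name are assumptions invented for illustration and do not describe the project's custom software.

```python
import re

# Hypothetical layout for the formatted 538 field, e.g. "IMG=1042; TXT=1043".
OBJECT_ID_PATTERN = re.compile(r"\b(IMG|TXT)=(\d+)")


def parse_link_field(field_538):
    """Extract image and text object IDs from an assumed 538 field layout.

    Returns a dict with keys "image" and "text"; a value is None when the
    record carries no ID of that kind (e.g. no OCR text was produced).
    """
    ids = {"image": None, "text": None}
    for kind, value in OBJECT_ID_PATTERN.findall(field_538):
        key = "image" if kind == "IMG" else "text"
        ids[key] = int(value)
    return ids


link = parse_link_field("IMG=1042; TXT=1043")
if link["text"] is not None:
    # A catalog session would request this object from the image server.
    print("display text object", link["text"])
```

Keeping the image and text IDs separate reflects the fact that some documents have an image but no usable ASCII counterpart.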

9. RETRIEVAL

Indexing bibliographic and image databases

Indexing fields for the image database (structured headings) are derived from bibliographic records in the online catalog. Early on, the decision was made to create a separate bibliographic file accessible through the online catalog. Collection level records using the Archives and Manuscripts Control format (AMC format) are being created, a fairly common practice among archives. The AMC format seemed particularly tailored to the task of describing these collections, with provision for extensive notes, such as provenance, biographical/historical and summary fields; detailed physical description, such as medium, organization and arrangement fields; the use of hierarchical place names (field 752), which allows for access by esoteric place names; and the definition of host item entry (field 773), linking collection level names and component piece names. Records are enhanced with one additional field borrowed from the Sound Recordings Format, namely the participant or performer note (field 511). This field is used to record names of informants, although it is indexed with personal names. The bibliographic records may be searched by the full complement of MARC field indexes (name, title, subject, French subject, geographic location, genre, accession shelflist) and by keyword.

For subject access, the Library of Congress subject headings provide the broadest level of subject access, which is refined with a local thesaurus of terms created by our catalogers to suit the collections in hand. We are also exploring the use of the Art and Architecture Thesaurus which has proved especially helpful with the assignment of genre terms.

Index pointers in the image database are a subset of the bibliographic indexes. Catalogers have chosen to "optimize" or prioritize three key fields for which Informix creates a database term index: name, subject and genre. These index terms are given a higher retrieval priority than terms entered in other indexed fields (archive, accession, number, dates) which receive a standard database hookup (slower response).

Natural Language Searching of full texts

NLS, or Natural Language Search, is the software component that extends access to the text documents beyond the formatted MARC fields. NLS is full text retrieval software which allows the user to enter a query in "plain English" (a word, phrase, or sentence); the system searches by meaning, returning hits ranked by relevance to the meaning of the query. Parsing techniques are used to index and retrieve information based on concepts. Document priority is based upon a weighting of the fit of the query to the actual document text. Weighting is determined by the quality of the match of the query to the word senses in the document. Boolean searches and wildcards are also possible. Users can search the image database by direct indexes, natural language, or a combined search to retrieve an image file, an ASCII text file or the linked text-image file.

Users are able to customize the text search portion of their queries by setting the expansion level of the query. This means a query may be expanded to also search for documents which include such things as word variants, synonyms, and words with related meanings. A sliding scale is provided which allows the user to adjust expansion to the desired level.
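The sliding scale can be thought of as selecting how many tiers of related terms are added to a query. The sketch below is a simplified stand-in and not the NLS implementation: the three tiers follow the order named above (word variants, synonyms, related meanings), but the tiny hand-built vocabulary, the level numbering, and the function name are assumptions made for illustration.

```python
# Hypothetical, hand-built expansion tables; a real system would draw on
# a full dictionary and semantic network.
EXPANSIONS = {
    "variants": {"log": ["logs", "logging"]},
    "synonyms": {"log": ["timber"]},
    "related": {"log": ["lumber", "sawmill"]},
}
LEVELS = ["variants", "synonyms", "related"]


def expand_query(terms, level):
    """Expand query terms up to the given level (0 = no expansion)."""
    level = max(0, min(level, len(LEVELS)))  # clamp the slider position
    expanded = []
    for term in terms:
        key = term.lower()
        candidates = [key]
        for tier in LEVELS[:level]:
            candidates.extend(EXPANSIONS[tier].get(key, []))
        for candidate in candidates:
            if candidate not in expanded:  # keep order, drop duplicates
                expanded.append(candidate)
    return expanded


print(expand_query(["log"], 0))
print(expand_query(["log"], 3))
```

A low setting favors precision by matching only the words as typed, while a high setting favors recall at the cost of more loosely related hits.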

Gray-scale photographs are not yet indexed separately, but are retrieved as part of a document with any search. A number of contrived thematic groups of photographs will be retrieved with subject access.

Network Access

Implementation plans for network access to the image database include the distribution of client software for Apple Macintosh computers (System 7) and MS-DOS based personal computers (386-based or higher running Windows). System requirements include a minimum of 10 MB RAM, an Ethernet board, and Telnet software in order to connect to the image server, search the database, and view the contents of individual documents.

ASCII text versions of documents may be retrieved and viewed from public access terminals (dumb terminals) or personal computers connected to the library catalog and searching the archive bibliographic database. Records for archival documents are linked to corresponding images and ASCII text files. Once a document is selected, choosing the option to "display text" will retrieve the ASCII file of the document and display the text on the dumb terminal. User commands will be limited to scrolling forward, scrolling backward, or printing.

10. EVALUATION

For all of the documents in this project, scanning produced a digitized facsimile that was comparable to the original document. This was especially true for most machine-produced documents (typescript, offset printing, and letterpress printing), and scanning also effectively captured most handwritten documents. However, a scanning system will not automatically produce a continuous stream of perfect quality images. The workflow of the scanning technician must include a quality control check as a primary focus to ensure that images are clear, contrast is correct, the size scanned does not waste storage, and the dpi (dots per inch) setting is appropriate for the original source. The choice of compression algorithm affects the quality of the image displayed, storage, and network transmission. The mode of access for most users and speed of delivery influenced our choice of JPEG over LZW, and the scan setting of 200 dpi. While scanners offer a choice between flatbed and batch feeds, the type of document can rule out the use of batch feed (i.e., primary materials cannot be batch processed), and this factor affects the time needed to do the job.

The type of document will also determine the suitability of OCR technology for the project. If the original documents are typescripts created on a wide variety of manual typewriters and paper grades, OCR algorithms cannot be relied on to produce a good quality ASCII version of the digitized text. It would be more cost effective to key in the documents from scratch than to edit an OCR text with less than 75% accuracy. Text editing is the most time consuming of the database tasks. If the goal is to create text files adequate for full text retrieval indexing, then depending on the type of document database, multiple levels of editing must be defined to address such common OCR issues as errors due to touching characters and noise, page layout errors, insertion of messages, etc.
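The 75% rule of thumb can be applied to a sample page by comparing the OCR output with a hand-corrected version. The sketch below is one possible way to do this and is not the project's procedure: measuring accuracy as one minus the character edit distance divided by the reference length is an assumption, as are the function names and the sample strings.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(ocr_text, reference_text):
    """Assumed measure: 1 - (edit distance / length of corrected text)."""
    if not reference_text:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, reference_text)
    return max(0.0, 1.0 - errors / len(reference_text))


def recommend_action(accuracy, rekey_threshold=0.75):
    """Below the threshold, rekeying is assumed cheaper than editing."""
    return "rekey" if accuracy < rekey_threshold else "edit"


sample_ocr = "Tbe 1ogging camp was c1osed"
sample_ref = "The logging camp was closed"
accuracy = character_accuracy(sample_ocr, sample_ref)
print(round(accuracy, 3), recommend_action(accuracy))
```

Checking a few representative pages per accession in this way would give an early indication of which editing level, if any, is worth assigning.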

Memory and high resolution bit-mapped displays are critical to imaging applications.

Standards for compression and image formats are critical to network access.

The results of the project to date have been informative, though inconclusive. Client use of the image database and the impact of the availability of both bibliographic and image databases on archive services must be thoroughly evaluated. With the distribution of client software, users will be asked to evaluate the service in terms of user interface design, image quality, search capabilities and performance characteristics. The general assessment is that digital image technology offers a promising means for reformatting a variety of archival and manuscript material. The participating archives plan to continue their collaborative efforts to preserve and make available surrogates of original research materials through the use of digital technology.
 
 

REFERENCES

Datapro Reports on Document Imaging Systems, Volumes 1-8. Delran, NJ: McGraw-Hill, Inc., 1991.

Folen, Doris R. and Laurie E. Stackpole, "Optical Storage and Retrieval of Library Material," Information Technology and Libraries, 12 (2): 181-191 (June 1993).

Kenney, Anne R.; Friedman, Michael A.; and Poucher, Sue A. Preserving Archival Material through Digital Technology, Final Report. Sponsored by Cornell University on behalf of the Eleven Comprehensive Research Libraries. Ithaca, NY: Cornell University Library, 1993.