ELECTRONIC FILES AND TEXTUAL ANALYSIS

Donald E. Riggs

Graduate Library
University of Michigan
Ann Arbor, MI, USA

Keywords: Electronic texts, textual analysis, UMLibText, humanities, Standard Generalized Markup Language (SGML), X-Windows, IBM RS/6000, UNIX, University of Michigan, Oxford English Dictionary.

Abstract: The advent of electronic texts and textual analysis has created a target of opportunity for improving scholarship. UMLibText, the University of Michigan Library's system for distributed textual analysis, illustrates the benefits of this new endeavor. Textual analysis adds value, especially in the humanities. To ensure continuous improvement in electronic texts and textual analysis, an evaluation process is encouraged.

1. INTRODUCTION

Within the past few years we have witnessed new dimensions in the use of computers in storing, retrieving, and analyzing texts. Computer-aided analysis of texts is becoming very important in specific disciplines (e.g., the humanities). Traces of electronic text files and textual analysis can be found in nearly all disciplines; however, the primary focus in this paper will be on the humanities.

It is worth establishing at the outset that full-text retrieval and textual analysis do not serve the same function. Full-text retrieval enhances access to collections that exist in print format; textual analysis, an electronic-dependent endeavor, focuses instead on the study of the text itself. Specific features of the text are examined and analyzed. Textual analysis can, for example, investigate the occurrence of words, the co-occurrence of words with the same or similar meanings, or various themes in a text. There are numerous types of textual analysis, just as there are various types of activities reflected in the printed texts of the humanities.
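As a simple illustration of the kind of computation involved, the following sketch counts word occurrences and the co-occurrence of word pairs within a small window; it is not drawn from any system described here, and the function names and window size are this sketch's own choices.

```python
from collections import Counter

def cooccurrences(text, window=5):
    """Count pairs of words appearing within `window` words of each other."""
    words = text.lower().split()
    pairs = Counter()
    for i, w in enumerate(words):
        # Pair each word with the next few words that follow it.
        for other in words[i + 1 : i + window]:
            pairs[tuple(sorted((w, other)))] += 1
    return pairs

sample = "the quick fox saw the slow fox"
word_counts = Counter(sample.lower().split())
print(word_counts["fox"])                        # 2
print(cooccurrences(sample)[("fox", "the")])     # 3
```

Real textual-analysis packages refine this idea with lemmatization and stop-word handling, but the underlying counting is the same.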

At least three representative types of textual analysis have been identified:

• Authoring studies -- This type has been identified as the most challenging endeavor in textual analysis. Using the computer to trace distinctive traits of authorship has not been as easy as anticipated. Authenticating an author's work, comparing writing traits of different age groups, and drawing comparisons between two or more authors are examples of authoring studies.

• Thematic analysis -- Following or discovering themes in literature is perhaps one of the more common types of textual analysis. Computers do very well in detecting the use of concepts or themes in a given work.

• Linguistic analysis -- Phonological changes in a text can assist the historian or sociologist. And the linguist will find the frequent occurrence of a single character (e.g., the letter "g") or of a diphthong (i.e., a sequence of two vowel sounds within the same syllable) to be of great value in research (Price-Wilkin, 1991).
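A minimal sketch of the linguistic case: counting the relative frequency of a single letter, or of a two-letter vowel sequence, over a text. The function names below are invented for illustration.

```python
from collections import Counter

def letter_frequencies(text):
    """Relative frequency of each alphabetic character in a text."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {c: n / total for c, n in counts.items()}

def count_digraph(text, digraph):
    """Count occurrences of a two-letter sequence, e.g. a vowel pair."""
    return text.lower().count(digraph)

sample = "Singing in the gloaming"
print(letter_frequencies(sample)["g"])   # 0.2 (4 of 20 letters)
print(count_digraph(sample, "oa"))       # 1
```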

2. UMLIBTEXT

There are not many libraries offering electronic access to textual analysis. As an example of a library that has begun this service, a brief overview of UMLibText (the University of Michigan Library's local system for distributed textual analysis) is presented.

UMLibText's genesis dates to 1986, but the project did not enter the implementation stage until 1989. The project developed from work done in consultation with campus scholars. The system can be accessed via the campus network or the Internet, but one must be affiliated with the University of Michigan to gain access. Michigan's faculty, students, and staff may use UMLibText without charge. Those with X-Windows can view graphical images of pages; standard communications packages (e.g., PC-TIE and Versaterm) allow searches to be observed in concordance-like views.

The introductory screen to the system is shown in Figure 1.

UMLibText offers over 300 medieval, modern, and philosophical English-language titles; the entire corpus of Old English language and literature (some 3,000 texts and fragments); the second edition of the Oxford English Dictionary (OED); and the Michigan Early Modern English materials. Plans are underway to add the Patrologia Latina and the more than 4,000 volumes of verse represented in the New Cambridge Bibliography of English Literature (600-1900). And other materials will be added as deemed appropriate.

Learning early that textual analysis is still an evolving endeavor and that conditions are not yet ideal (e.g., recently developed standards for the description of text have not been properly applied to previous textual analysis projects), the University of Michigan Library established three goals:

• To provide a software and hardware platform adequate to serve a broad spectrum of textual analysis needs.

• To facilitate remote, multi-user access.

• To begin a local collection of text in a standardized format suitable for a number of applications (Price-Wilkin, 1991).

Based on the above goals, agreements with several sources of text were secured and the UNIX software package Pat was selected. Developed at the University of Waterloo during its work with the Oxford English Dictionary, Pat has a powerful search function capable of accommodating large amounts of structured and unstructured text in fairly generic ways; its other important attributes include speed and the ability to work with generically tagged texts.
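Pat's internal design belongs to the University of Waterloo; as a rough, hypothetical analogue of the idea behind such fast substring searching -- an index of every suffix position in the text, consulted by binary search -- consider:

```python
import bisect

def build_suffix_index(text):
    """Sort all suffix start positions of the text lexicographically.
    (A toy analogue of a suffix index; real implementations avoid
    materializing the suffix strings.)"""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_all(text, index, pattern):
    """Return every position where `pattern` occurs, via binary search."""
    suffixes = [text[i:] for i in index]  # materialized here for clarity only
    lo = bisect.bisect_left(suffixes, pattern)
    hi = bisect.bisect_right(suffixes, pattern + "\uffff")
    return sorted(index[lo:hi])

text = "to be or not to be"
idx = build_suffix_index(text)
print(find_all(text, idx, "be"))   # [3, 16]
```

Because every occurrence of a pattern is the prefix of some suffix, one sorted index serves arbitrary substring queries; this is why such systems can search very large texts in seconds.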

Figure 1. The UMLibText Opening Screen




One of the biggest challenges for UMLibText has been the condition of the texts. They have been coded in a multitude of different ways; for example, page breaks, italics, and dramatic divisions/verse have all been marked differently. Publishers and researchers are giving more attention to Standard Generalized Markup Language (SGML). However, the number of texts that comply with SGML remains small, and consequently UMLibText staff spend much time formatting texts for use with the system. SGML has the potential to support the transfer of texts between different machines and software packages, and to greatly facilitate text analysis projects. Other important characteristics of SGML regarding textual analysis include:

• Structural features of text

• Typographical features of text

• Distinguishing a title from a poem's contents

• Delimiting of page breaks and pre-existing pagination

• Segments of texts (whether verses, riddles, lines, or paragraphs)

• Relative order of composition

• Date of composition

• Kind or genre of the work (play, sonnet, commentary, or miscellaneous)

• Line numbers (corresponding to printed edition)

• Amphibious and split verse lines

• Act and scene

• Prologues and epilogues (Lowry, 1992).
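SGML itself requires a full SGML parser, but an XML-conformant fragment (XML being a later, simplified derivative of SGML) can illustrate how such generic tagging makes these structural features machine-readable. The tag and attribute names below are invented for illustration, not drawn from any DTD actually used by UMLibText:

```python
import xml.etree.ElementTree as ET

# A hypothetical, XML-conformant fragment in the spirit of SGML encoding:
# the names (div, l, n, act, scene) are illustrative only.
play = """
<div type="scene" act="1" scene="2">
  <l n="1">Line one of the scene,</l>
  <l n="2">and line two follows it.</l>
</div>
"""

root = ET.fromstring(play)
print(root.get("act"), root.get("scene"))   # act and scene, as data
for line in root.findall("l"):
    print(line.get("n"), line.text)         # line numbers keyed to an edition
```

Because the act, scene, and line numbers are explicit markup rather than typography, any conforming software can extract or search on them without guessing at layout.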


UMLibText runs on a library-supported IBM RS/6000 workstation-class computer. Users log several hundred sessions from their homes and classrooms, an indication that UMLibText is an important and heavily used resource. In 1991, the 300 registered users of UMLibText logged an impressive 3,817 accesses to the system. Access has been further improved by the addition of Lector, an X-Windows-based display software package. Like Pat, Lector was developed in conjunction with the OED. The software allows users with X-Windows displays to view results containing a wide array of fonts at very high resolutions, in a format closely approximating the original page. UMLibText provides remarkable response time; it is a rare search or display that takes more than a few seconds, even when the body of text exceeds seven million words and the search word is as common as "a," "an," or "the" (Price-Wilkin, 1991).

UMLibText permits various types of searches: by word, word stem, character, and phrase. A user can type a character string with the desired spelling to match a stem and its variant endings; suppress the matching of variant endings by including a space within the quotes; or search for a particular phrase by typing the entire phrase in quotes.

Based on the use of standards and "open systems," UMLibText can support a diverse computing environment. X-Windows can be supported on virtually every platform from IBM-compatibles to Macs to NeXTs to UNIX workstations. Using SGML formatted text, the various commercial offerings can be incorporated with very little effort and, concurrently, be offered to researchers to use with other systems that offer different facilities. Purchasing a site license for Lector and Pat, software packages supported on a wide range of UNIX workstations, makes it possible to distribute the software to users on campus workstations (Price-Wilkin, 1991).

UMLibText is only one example of a local system for distributed textual analysis. It is, however, a system that works. Enhancements and refinements continue to be made to it. Only a thumbnail sketch of the system has been given here.
 
 

3. ADDED VALUE

Libraries do excellent work in providing service. Most of their service role remains dependent on the provision of books and journals in the traditional print format. One could describe the traditional library as a dormant, inert organization. This description is undoubtedly debatable, but the fact is that many people who do not fully understand libraries hold this perception. The application of technology is making libraries more visible in local communities, on campuses, and in society. The replacement of the card catalog with the online public access catalog (OPAC) is a good case in point of a value-added service: while the card catalog normally reflected only about 60% of a library's holdings, the OPAC has the potential of containing at least a bibliographic record of the library's complete holdings.

Textual analysis systems have provided a giant step forward in adding value to the services offered by libraries. And we can expect greater excitement about textual analysis and the computer-aided study of texts. The following illustrates this enthusiasm:

"Edited texts available in computer-readable text, data and image archives have become "active," "powerful," "dynamic," and "volatile" not just by virtue of the speed with which we exchange them over computer networks. The software that we use to access these texts transforms them before our senses. The text and its translations, the copy text and its variants, or the book and the images, texts, and sounds to which it is linked hypertextually by way of commentary, indebtedness and influence are superim-posed on one another; we experience a kind of multiple vision. Just as remarkable are

the many style- and content-analysis programs that may now co-exist with the text or data through which they may be transformed into frequency tables, distribution graphs, and

the like" (Lancashire, 1989, p. v).

4. EVALUATION

One has to temper the enthusiasm about electronic texts with a dose of reality. Since electronic texts and textual analysis are still evolving phenomena, they must be carefully evaluated in order to instill a sense of continuous improvement in future products/services. Lowry (1992) has identified some major characteristics that should be considered when assessing electronic texts:

1. Quality of text. Which printed edition is being reproduced by the electronic text? Are important parts of the text omitted in the electronic version? Has the electronic text been carefully proofread to match the source text? Is the editor of the electronic text a scholar in the respective discipline?

2. Software. Is the software adequate to handle the capacity of the text? Is it sufficiently flexible to accommodate the unique characteristics of the text (e.g., critical notes)? Does the software require special encoding?

3. Markup. What, if any, are the requirements for special markup of the electronic text? Is the Standard Generalized Markup Language (SGML) used?

4. Medium. In what format is the text distributed (e.g., CD-ROM, diskette, magnetic tape, or online via dial-up access)? Which microcomputer platform is required?

5. Documentation. What is the quality of the documentation necessary to verify the above items (1-4)? Does documentation exist in print and/or online format?

6. Price. How much does the electronic text and/or software cost? Can one buy/lease site or network licenses? Can the products/services be obtained from more than one vendor?
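Lowry's criteria lend themselves to a simple checklist; the following sketch (with invented field names, not drawn from Lowry) shows one way a selector might record an assessment:

```python
# Lowry's six criteria, encoded as a checklist one might run against a
# candidate electronic text; the keys and helper below are this sketch's own.
CRITERIA = {
    "quality":       "Faithful to a named printed edition, proofread, scholarly editor",
    "software":      "Handles the text's size and special features (e.g., critical notes)",
    "markup":        "Special markup requirements documented; SGML used?",
    "medium":        "Distribution format and required platform identified",
    "documentation": "Print and/or online documentation verifying the above",
    "price":         "Cost, licensing options, and number of vendors",
}

def unmet(assessment):
    """Return the criteria a given assessment leaves unsatisfied."""
    return [c for c in CRITERIA if not assessment.get(c, False)]

sample = {"quality": True, "software": True, "markup": False}
print(unmet(sample))  # ['markup', 'medium', 'documentation', 'price']
```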

5. CONCLUSION

We can expect electronic texts and textual analysis to flourish in the next few years. This activity will be especially noticeable in the humanities. Libraries have an opportunity to take a transformational leadership role in this area. And it is recommended that librarians work closely with scholars in the respective disciplines in the formulation of electronic texts and their analyses. This partnership is crucial toward ensuring credibility, accuracy, scholarship, and utility.

Lowry (1992) predicates the following three assumptions about the development of electronic texts:

• Rapid growth in electronic primary source publishing is imminent.

• Electronic texts and related computer-based research tools will become increasingly important for scholarship and teaching in the humanities.

• The library has a natural and central role to play in collecting, preserving, and providing service and instruction for these new resources.

The technology is now available for sophisticated textual analysis and its application. Is the library community ready to take advantage of this opportunity to improve scholarship? Based on the evidence witnessed thus far, we are becoming well positioned to offer this important value-added service. To paraphrase Alvin Toffler, "we are crouching at the starting line of a new and different race into the library's future."
 
 

REFERENCES

Lancashire, Ian. (1989). Foreword in The Dynamic Text Guide. Toronto: Centre for Computing in the Humanities.

Lowry, Anita. (1992). "Electronic Texts in English and American Literature," Library Trends 40 (4): 704-23.

Price-Wilkin, John. (1991). "Text Files in Libraries: Present Foundations and Future Directions," Library Hi Tech 35 (3): 7-44.