Procedures and Problems
Description of Collection
The project began with an investigation of the Integration and Implementation Sciences (IIS) collection, including its users, their activities, its literature, and related texts. This data provided an overview of the collection, its setting, and its potential users. The overview also established the need for a subject-specific thesaurus.
Annotated Bibliography
Once the need for a thesaurus was established, an annotated bibliography was developed. Thus began a lengthy literature review of texts and articles on thesaurus construction, indexing languages, faceted classification and classification theory, and classification and the internet. All of these areas provided valuable input and perspective to the project. The practical texts on thesaurus construction proved invaluable, while the texts on faceted classification were instrumental in designing the classification scheme for thesaurus terms. It is intended that the annotated bibliography will continue to be of value not only to the maintainers of the thesaurus but also to anyone interested in this subject.
Key concept gathering
The IIS key concepts were established by reviewing and extracting words and phrases from IIS collection sample resources, including the chief paper of Dr. Gabriele Bammer. This proved to be a somewhat overwhelming exercise, as approximately 600 words and phrases were generated. Although a long list is not an obvious problem in itself, it made it necessary to develop a strategy for slimming down the list while preserving the IIS-specific terms. The first pass over this list eliminated 150 redundant and obvious terms, leaving a list of only 450 terms!
Thesauri review and selection
The first stage of evaluation began with a review of 10 existing thesauri. Comparing the concepts against existing thesauri provided a clearer picture of which terms were unique to IIS, and produced a side-by-side comparison of how well these thesauri matched the initial concepts. The goal was to find a thesaurus that could serve as a foundation thesaurus for IIS. A few problems relating to selection were encountered during this review process. The evaluation produced two good candidates: ERIC and CSA. ERIC covers the most terms and is publicly available, while CSA is strong in areas where ERIC is not, but is not openly available to the public. Ultimately, ERIC was selected because it covered a wide array of areas and because of its availability.
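The side-by-side comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure the project actually used: the function name `coverage`, the placeholder names "Thesaurus A" and "Thesaurus B", and all sample terms are assumptions made for the example, and matching is a simple case-insensitive exact comparison.

```python
def coverage(concepts, thesaurus_terms):
    """Return (share of concepts found in the thesaurus, list of unmatched concepts)."""
    if not concepts:
        return 0.0, []
    # Normalize case and surrounding whitespace before comparing.
    normalized = {t.strip().lower() for t in thesaurus_terms}
    unmatched = [c for c in concepts if c.strip().lower() not in normalized]
    matched_count = len(concepts) - len(unmatched)
    return matched_count / len(concepts), unmatched


# Hypothetical sample data; the real concept list held roughly 450 terms.
concepts = ["Research", "Methods", "Integration", "Stakeholders"]
candidates = {
    "Thesaurus A": ["research", "methods", "theories"],
    "Thesaurus B": ["integration", "research"],
}

for name, terms in candidates.items():
    share, missing = coverage(concepts, terms)
    print(f"{name}: {share:.0%} of concepts covered; unmatched: {missing}")
```

The unmatched list is the more useful output here: concepts that no candidate thesaurus covers are the likeliest IIS-specific terms.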
Classification design (or, How we made a classification scheme)
After ERIC was selected as the foundation thesaurus, a few crucial design points needed to be decided. Would the ERIC thesaurus be imported into the new system, with the IIS terms built off of it? This design approach required a better understanding of the ERIC classification. A considerable amount of time was spent reviewing both the online and print ERIC thesauri, and it was concluded that although ERIC offers 41 categories for refining searches online (each term is assigned to a category), these categories do not constitute the top terms of a hierarchical classification. Terms within a category often referenced broader terms (BTs) in another category entirely. This conclusion was both confusing and discouraging. The intent had been to fit IIS terms neatly into the ERIC classification scheme, but how could the scheme accommodate the IIS terms? To complicate matters further, there was no obvious cost-effective way to obtain or import the ERIC thesaurus of approximately 6,300 terms. An email was sent to ERIC requesting access to an interoperable electronic version of the thesaurus database, but there was no response. Manual data entry was not an option.
This discovery presented two problems. How could the IIS terms be accessed if they did not reside within the ERIC thesaurus? And, since the IIS terms were not being housed in the ERIC thesaurus, what type of classification scheme could be used? As luck would have it, the IIS terms could easily stand alone in a thesaurus that employed its own classification scheme, separate from ERIC. The IIS thesaurus would act as a supplemental thesaurus to ERIC. Choosing a classification scheme is not as easy as one would hope. Other classification systems were reviewed, such as Bliss, CRG (Classification Research Group), and Ranganathan. In the end, it was decided that IIS concepts would be reviewed with the guidance of Ranganathan's PMEST (Personality, Matter, Energy, Space, Time) model. While this exercise proved fruitful, it was not easy. Lengthy discussions about terms that could logically fall under more than one facet frequently stalled this step. Rather than creating a poly-hierarchical scheme, these relationships were accommodated by multiple related term (RT) references. Ultimately, it was concluded that five facets could encompass the field of IIS: activities, agents, conditions, domains, and locations. While this is not a perfect representation of PMEST, it is quite close.
Nuts and bolts - establishing the thesaurus terms
There are several kinds of thesaurus construction software available. Some are quite expensive and consist of proprietary code, while others are open-source and free. For this project, Tim Craven's TheW32 software was chosen, primarily because it was free and because its documentation is very good. Before any terms were entered into the thesaurus software, terms that were too narrow, peripheral to IIS, or general knowledge were eliminated from the 450 IIS terms. Terms that we were able to find in ERIC were also eliminated. This process was much harder than it initially appeared. Some of the terms found in ERIC had scope notes that connoted a meaning other than the one relevant to IIS. This created a philosophical exercise in determining whether to use the ERIC term with an additional scope note, or to enter an identical term with a different scope note ("re-scoping") describing its relevance to IIS.
The elimination exercise produced approximately 200 terms, including entry vocabulary and ERIC terms retained for hierarchical reference. The thesaurus section on vocabulary control gives a more in-depth description of this process. All of these terms were assigned to one of the five facets (activities, agents, conditions, domains, and locations) established in the classification design section. The facets were then sub-divided as follows, grouping by ERIC terms when appropriate:
ACTIVITIES
Practice
Goal based activities
Process based activities
Research based activities
Methods (ERIC)
Research (ERIC)
Theories (ERIC)
AGENTS
Groups
People
CONDITIONS
Individual and group conditions
Problems
Processes & structures
Social conditions
DOMAINS
Integration and implementation sciences
Intellectual disciplines
Sectors
LOCATIONS
Making relationships (and giving terms meaning)
The list of terms arranged by facet provided an excellent framework for establishing relationships. As terms were entered into the TheW32 thesaurus construction software, their scope notes were entered and the classification hierarchy was established at the same time, by assigning broader terms (BTs) and narrower terms (NTs). In some cases, it was necessary to include ERIC terms to provide BTs or sibling terms for more specific IIS terms, so that the appropriate hierarchy was retained. This was not obvious when data entry started, but the need became increasingly apparent, especially under the Activities facet, as grouping patterns were found under the established sub-divisions.
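The BT/NT bookkeeping described above can be illustrated with a small data structure. This is a sketch under stated assumptions, not a description of how TheW32 stores its data: the `Term` class, the `add_term` helper, and the term "Example IIS method" are all hypothetical. It shows an ERIC term ("Methods") being included only so that a more specific term has a broader term to attach to.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    label: str
    scope_note: str = ""
    broader: set = field(default_factory=set)   # BT labels
    narrower: set = field(default_factory=set)  # NT labels


def add_term(thesaurus, label, scope_note="", broader=None):
    """Add or update a term; if a BT is given, record the reciprocal NT as well."""
    term = thesaurus.setdefault(label, Term(label))
    if scope_note:
        term.scope_note = scope_note
    if broader is not None:
        # Create the broader term on demand so the hierarchy stays connected.
        parent = thesaurus.setdefault(broader, Term(broader))
        term.broader.add(broader)
        parent.narrower.add(label)
    return term


thesaurus = {}
add_term(thesaurus, "Methods", scope_note="ERIC term kept for hierarchical reference.")
add_term(thesaurus, "Example IIS method",
         scope_note="Hypothetical IIS-specific term.", broader="Methods")

print(sorted(thesaurus["Methods"].narrower))
```

Recording both directions in one step keeps every BT paired with its reciprocal NT, which is the consistency that entering scope notes and hierarchy together is meant to protect.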
Another problem, or rather hiccup, encountered during this step was efficient workload distribution. We thought that working simultaneously on five separate thesauri (one per facet) would be far more productive and would facilitate error checking. This proved true in the end, but it assumed that the five thesauri could be merged easily. After some technical challenges, the merge succeeded with little loss other than the re-ordering of the facet headings into alphabetical order. The original intention was for the facet headings to follow an order similar to Ranganathan's PMEST:
DOMAINS, AGENTS, LOCATIONS, CONDITIONS, ACTIVITIES. One other problem that arose from merging after data entry was the inability to map cross-facet RTs beforehand. These relationships had to be established after all the thesauri were merged.
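The merge-then-link sequence can be sketched as follows. This is an illustrative model only and does not reflect TheW32's merge mechanism: the functions `merge_facets` and `add_related`, the dictionary layout, and the sample RT link between "Groups" and "Social conditions" are assumptions. The sketch keeps an explicit facet order so that the intended PMEST-like sequence is not lost to alphabetical sorting, and it adds cross-facet RTs only after the merge, when both terms exist in one thesaurus.

```python
# Intended display order of the facets, as given in the text.
FACET_ORDER = ["DOMAINS", "AGENTS", "LOCATIONS", "CONDITIONS", "ACTIVITIES"]


def merge_facets(facet_thesauri):
    """Merge per-facet term dictionaries into one, remembering each term's facet."""
    merged = {}
    for facet in FACET_ORDER:
        for term, record in facet_thesauri.get(facet, {}).items():
            if term in merged:
                raise ValueError(f"term appears in more than one facet: {term}")
            merged[term] = {
                "facet": facet,
                "BT": set(record.get("BT", ())),
                "RT": set(record.get("RT", ())),
            }
    return merged


def add_related(merged, a, b):
    """Add a reciprocal RT link; both terms must already be in the merged thesaurus."""
    for label in (a, b):
        if label not in merged:
            raise KeyError(f"unknown term: {label}")
    merged[a]["RT"].add(b)
    merged[b]["RT"].add(a)


# Hypothetical per-facet input, using sub-division names from the list above.
facet_thesauri = {
    "AGENTS": {"Groups": {}, "People": {}},
    "CONDITIONS": {"Social conditions": {}, "Problems": {}},
    "DOMAINS": {"Sectors": {}},
}

merged = merge_facets(facet_thesauri)
add_related(merged, "Groups", "Social conditions")  # hypothetical cross-facet RT
print([(term, rec["facet"]) for term, rec in merged.items()])
```

Raising an error on duplicate terms is one possible design choice; a project that re-scopes terms across facets might instead need to reconcile the duplicates by hand.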
Evaluation
The evaluation consisted of two steps: establishing rules and guidelines through preliminary testing, and gathering and correlating the test data. In theory and in practice, thesaurus evaluation is an ongoing process: language is constantly in flux, so updating one's thesaurus is a necessity.
Indexing and Testing
By indexing IIS citations, it was possible to check for concepts not represented in either ERIC or the IIS thesaurus. The original plan was to index a healthy chunk of IIS citations directly into either a central IIS database or citation management program, but timing issues for the development of the central database and scope issues within the semester framework prevented the proper implementation of this step.