A STUDY OF CHINESE LANGUAGE NEWS CLUSTERING BY TERM DOCUMENT ANALYSIS
Ching-Chun Hsieh & Shwu-May Chen
Institute of Information Science
Academia Sinica
Nankang, Taipei, Taiwan
Abstract: In this study, more than 3,000 news items were sampled. These items, drawn from one year's editions of the China Times, represent five major journalistic categories. Each major category is divided into 10 subcategories, so the document space contains 50 subcategories. This category information is known before the term-document matrix is formed.
Terms are collected in two ways: by trained professionals, and by existing word-segmentation programs that use various techniques. The terms collected by the human experts serve as a reference for evaluating the system.
Once the term-document matrix has been established, term relationships are studied first. This involves eliminating terms with low discrimination capability, grouping similar terms, establishing a term set for each major category and subcategory, and so on. The centroid of each category is also calculated. These centroids are used to cluster incoming news into the appropriate categories.
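The centroid step can be illustrated with a minimal sketch. It assumes documents are already segmented into terms, uses raw term counts as weights, and uses cosine similarity to pick the nearest centroid. The abstract does not state the weighting or similarity measure actually used, so these choices, and all function names below, are illustrative assumptions only.

```python
import math
from collections import Counter, defaultdict


def term_vector(terms):
    """Count term occurrences in one pre-segmented document."""
    return Counter(terms)


def centroid(vectors):
    """Average the term-count vectors belonging to one category."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    n = len(vectors)
    return {term: count / n for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors (assumed measure)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def train_centroids(labelled_docs):
    """Build one centroid per category from (category, terms) pairs."""
    by_category = defaultdict(list)
    for category, terms in labelled_docs:
        by_category[category].append(term_vector(terms))
    return {cat: centroid(vecs) for cat, vecs in by_category.items()}


def assign(terms, centroids):
    """Assign an incoming document to the category with the nearest centroid."""
    vec = term_vector(terms)
    return max(centroids, key=lambda cat: cosine(vec, centroids[cat]))


# Toy data with hypothetical English terms standing in for segmented Chinese terms.
training = [
    ("sports", ["game", "team", "score"]),
    ("sports", ["team", "coach"]),
    ("finance", ["stock", "market", "price"]),
    ("finance", ["market", "bank"]),
]
centroids = train_centroids(training)
print(assign(["team", "score", "win"], centroids))
```

In a setting like the one described, the same routine could be run once over the five major categories and again over the subcategories within the chosen category, although the abstract does not say whether assignment is hierarchical.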
A fully automated news clustering system and a human-assisted clustering system have thus been implemented. Half of the sampled news items are reserved for testing the system. The study shows that the designed system is promising: a correct-assignment rate of over 93% is readily achieved. Detailed results will be presented, and discussion is welcome.