Decision support systems
Classification Codes: 9130: Experimental/theoretical
2600: Management science/operations research
5220: Information technology management
Copyright M. E. Sharpe Inc. Winter 1999/2000
WITH THE SUDDEN EMERGENCE AND PROLIFERATION OF INTERNET SERVICES, the information overload problem has become more pressing than ever. Researchers in the field of information and knowledge management have started to seek assistance from the information retrieval and artificial intelligence communities, who have much to offer concerning advanced information indexing, searching, and classification techniques. Previous research has strongly suggested the Kohonen Self-Organizing Map (SOM) algorithm as an ideal candidate for classifying textual documents. The Kohonen SOM provides an intuitively appealing organization of input data. Documents are classified according to their content, and conceptual regions are formed and named on a two-dimensional grid. Kohonen SOM output also exhibits two distinctive characteristics that are appealing for cognitive and visual reasons: First, related topics/regions are clustered closely (the proximity hypothesis); second, larger regions represent more important aspects of a data collection (the size hypothesis). The graphical display of SOM maps prompted us to experiment with these features in a study intended to validate both the proximity and the size hypotheses. These hypotheses, if verified, would have significant implications for designing an effective and graphically appealing human-computer interface for textual analysis.
The next section summarizes techniques (statistical or neural network based) for document classification and presents a framework developed for applying the Kohonen SOM algorithm in document and concept clustering. We describe the hypotheses, experimental procedures and results used to validate the proximity hypothesis and then present the experimental results for the size hypothesis.
Document Clustering Techniques
CLASSIFICATION OF TEXTUAL DOCUMENTS REQUIRES GROUPING similar concepts/terms by category or topic. There are two approaches to cluster analysis: the statistical approach and the neural network approach. In this section, we provide only a brief summary of the conventional statistical approach and a more detailed review of the newer parallel, neural network approach because our techniques are based on a neural network algorithm.
In the serial, statistical approach, automatic document classification involves determining a document representation structure and a method for determining similarities between documents. The hierarchical clustering of documents can be done divisively or agglomeratively. Divisive clustering breaks one complete cluster into smaller pieces. In agglomerative clustering, similarities between individual documents are used as a starting point, and a gluing operation is carried out to form larger groups. Stepp described conceptual clustering as the new frontier in artificial intelligence. Algorithms for clustering involve co-occurrence of feature values, discovering conjunctive features among the attributes rather than variations in the value taken by a single attribute, and clumping concepts on the basis of the most commonly occurring relations in the data. With these techniques, classes of similar objects are found essentially by pairwise comparisons among all the data elements. These clustering algorithms are serial in nature in that pairwise comparisons are made one at a time and the classification structure is created in a serial order.
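The agglomerative procedure described above can be illustrated with a minimal Python sketch. It assumes that each document is a set of index terms, that similarity is the Jaccard coefficient, and that clusters are merged by single linkage; these choices and all function names are illustrative assumptions, not the method of any study cited here.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two sets of index terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def agglomerate(docs, target_clusters):
    """Single-link agglomerative clustering over sets of terms.

    Starts with one cluster per document and repeatedly glues together
    the two clusters that contain the most similar pair of documents.
    """
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > target_clusters:
        best = None
        # Serial pairwise comparison of every pair of clusters.
        for p, q in combinations(range(len(clusters)), 2):
            sim = max(jaccard(docs[i], docs[j])
                      for i in clusters[p] for j in clusters[q])
            if best is None or sim > best[0]:
                best = (sim, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]  # the "gluing" operation
        del clusters[q]
    return clusters

# Made-up documents for illustration only.
docs = [
    {"digital", "library"},
    {"digital", "library", "retrieval"},
    {"wireless", "network"},
    {"mobile", "network"},
]
result = agglomerate(docs, 2)
```

The nested loop makes the serial character of the approach visible: every merge requires a fresh pass over all remaining cluster pairs.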
The neural network approach, on the other hand, addresses clustering and classification problems by means of a connectionist approach. Algorithms based on neural networks are parallel in that multiple connections among the nodes allow for independent, parallel comparisons. Neural network techniques can be classified as supervised and unsupervised. In supervised learning, a set of training examples is presented, one by one, to the network. The network then calculates outputs based on its current input. The resulting output is then compared with a desired output for that particular input example. The network weights are then adjusted to reduce any error. In unsupervised learning, network models are first presented with an input vector from the set of possible network inputs. The network learning rule adjusts the weights so that input examples are grouped into classes based on their statistical properties. Doszkocs, Reggia, and Lin provide an excellent overview of connectionist models in information retrieval including artificial networks, spreading activation models, associative networks, and parallel distributed processing. Chen provides an up-to-date review of various machine learning techniques, neural networks, and genetic algorithms for intelligent information retrieval applications.
Among unsupervised learning methods, the Kohonen SOM has been strongly suggested as an ideal candidate for clustering of textual documents. Kohonen based his neural network on the associative neural properties of the brain. The network contains two layers of nodes: an input layer and a mapping layer in the shape of a two-dimensional grid. The input layer acts as a distribution layer. The number of nodes in the input layer is equal to the number of features associated with the input. Each node of the mapping layer has as many features as there are input nodes. Thus, the input layer and each node of the mapping layer can be represented as a vector whose length equals the number of input features. The network is fully connected in that every mapping node is connected to every input node. The topology of the Kohonen SOM network is shown in figure 1.
Several recent studies adapted the SOM approach to textual analysis and classification. Ritter and Kohonen applied the Kohonen SOM to textual analysis in an attempt to detect the logical similarity between words from the statistics of their contexts. Miikkulainen developed DISCERN (Distributed Script processing and Episodic memoRy Network) as a prototype of a subsymbolic natural language processing system based on the Kohonen SOM. Lin, Soergel, and Marchionini used the Kohonen SOM for classifying documents for information retrieval. The documents were classified according to their content, and conceptual regions were formed and named on a two-dimensional grid. Lin's work first demonstrated the feasibility of using the Kohonen algorithm for classification of textual documents. Orwig and Chen adopted a scalable SOM algorithm to classify electronic brainstorming outputs and Internet homepages [2, 8]. The scalability was achieved using the Scalable SOM (SSOM) technique developed by Roussinov and Chen. The SSOM data structure and algorithm took advantage of the sparsity of coordinates in the document input vectors and reduced the SOM computational complexity by several orders of magnitude, thus making large-scale textual categorization tasks possible. Kaski, Honkela, Lagus, and Kohonen reported WEBSOM, an SOM-based text classifier for clustering postings to the Usenet newsgroups. The basic WEBSOM architecture consists of two hierarchically interrelated SOMs. A word category map is created first to describe relations of words based on their averaged short contexts. In the second stage, the text of a document is mapped onto the word category map previously created, and a histogram of the hits on it is formed. The document map is then obtained using the histograms as the fingerprints of the textual documents. WEBSOM also provides automatic labeling as exhibited in Chen's work, but the lack of distinct region boundaries could limit its use.
A Framework for Document Classification
The Kohonen SOM algorithm for classifying textual documents requires outputs from automatic indexing or the noun-phrase extraction process, which contain index terms for the documents and a list of terms in decreasing order of frequency for the entire collection. Based on the indexing terms identified, each document is then represented by a binary term vector of 1s and 0s. The number of 1s in each document vector is equal to the number of terms in the document, and each vector position corresponds to one unique term.
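As a minimal sketch of this representation, the Python fragment below builds binary document vectors from per-document term lists. The function name `build_term_vectors`, the `vector_size` cutoff, and the sample terms are assumptions made for illustration; the indexing and noun-phrasing steps themselves are not shown.

```python
from collections import Counter

def build_term_vectors(doc_terms, vector_size):
    """Build binary document vectors from per-document index terms.

    doc_terms   -- list of term lists, one per document (indexer output)
    vector_size -- number of most frequent collection terms to keep
    """
    # Collection-wide term frequencies, most frequent first.
    freq = Counter(term for terms in doc_terms for term in terms)
    vocabulary = [term for term, _ in freq.most_common(vector_size)]
    position = {term: i for i, term in enumerate(vocabulary)}

    vectors = []
    for terms in doc_terms:
        vec = [0] * len(vocabulary)
        for term in terms:
            if term in position:
                vec[position[term]] = 1  # presence only, not a count
        vectors.append(vec)
    return vocabulary, vectors

# Made-up index terms for three documents.
doc_terms = [
    ["digital library", "information retrieval"],
    ["wireless network", "mobile network"],
    ["digital library", "intelligent agents"],
]
vocabulary, vectors = build_term_vectors(doc_terms, vector_size=3)
```

Truncating the vocabulary to the most frequent terms mirrors the fixed vector sizes reported later for the two test collections; terms outside the cutoff simply do not appear in any vector.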
We chose a 20 × 10 grid map for displaying SOM outputs, based on what would fit on an output screen. We used a hexagonal neighborhood area that considers six surrounding nodes to be a node's immediate neighborhood. Finally, we used the bubble adjustment method, which is an adjustment of the weights of neighboring nodes based upon the decreasing gain term. In the initial training phase, we used a gain term adjustment of 0.05, and a neighborhood size of 10. In the fine-tuning phase we used a small gain term adjustment of 0.01, and a smaller neighborhood size of 3.
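The hexagonal neighborhood and the bubble adjustment can be sketched in Python as follows. The sketch assumes an "odd-r" offset layout for the hexagonal grid (odd rows shifted half a cell to the right), a grid of 10 rows by 20 columns, and that "bubble" means every node inside the neighborhood receives the same gain. The text does not state these details, so they are labeled assumptions, as are the function names.

```python
def hex_neighbors(row, col, rows, cols):
    """Immediate (up to six) neighbors of a node on a hexagonal grid.

    Assumes an odd-r offset layout; nodes on the map border have fewer
    than six neighbors.
    """
    if row % 2 == 0:
        deltas = [(0, -1), (0, 1), (-1, -1), (-1, 0), (1, -1), (1, 0)]
    else:
        deltas = [(0, -1), (0, 1), (-1, 0), (-1, 1), (1, 0), (1, 1)]
    return [(row + dr, col + dc) for dr, dc in deltas
            if 0 <= row + dr < rows and 0 <= col + dc < cols]

def bubble_update(weights, x, winner, radius, gain, rows, cols):
    """Move every node within `radius` hex steps of the winner toward x."""
    # Breadth-first expansion over the grid to collect the neighborhood.
    frontier, seen = [winner], {winner}
    for _ in range(radius):
        next_frontier = []
        for r, c in frontier:
            for node in hex_neighbors(r, c, rows, cols):
                if node not in seen:
                    seen.add(node)
                    next_frontier.append(node)
        frontier = next_frontier
    # Bubble adjustment: the same gain for every node in the neighborhood.
    for r, c in seen:
        w = weights[r][c]
        for i in range(len(x)):
            w[i] += gain * (x[i] - w[i])
    return seen

rows, cols = 10, 20
weights = [[[0.0, 0.0] for _ in range(cols)] for _ in range(rows)]
updated = bubble_update(weights, [1.0, 0.0], winner=(5, 5),
                        radius=2, gain=0.05, rows=rows, cols=cols)
```

With the reported settings, the training phase would correspond to `gain=0.05, radius=10` and the fine-tuning phase to `gain=0.01, radius=3`, under the assumption that "neighborhood size" is a radius measured in grid steps.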
After the training and tuning phases, the SOM visualization consisted of running the same input file against the trained map and reporting the map grid location that is the closest in Euclidean distance to each input. Each document (vector) and each term (represented as a unit vector) were thus mapped to a node and also to a region (of the same nodes) on the map. The nodes were then labeled so nodes with the same labels could form regions. The SOM algorithm we adopted, as opposed to the original Kohonen SOM, is summarized below:
1. Initialize input nodes, output nodes, and connection weights: Represent each document (or image) as an input vector of N keywords (or image features) and create a two-dimensional map (grid) of M output nodes (e.g., a 20 × 10 map of 200 nodes). Initialize weights from N input nodes to M output nodes to small random values.
2. Present each document or image in order: Represent each document by a vector of N features and present to the system.
3. Compute distances to all nodes: Compute distance $d_j$ between the input and each output node $j$ using
$$d_j = \sum_{i=1}^{N} \left( x_i(t) - w_{ij}(t) \right)^2$$
where $x_i(t)$ is the input to node $i$ at time $t$ and $w_{ij}(t)$ is the weight from input node $i$ to output node $j$ at time $t$.
4. Select winning node $j^*$ and update weights to node $j^*$ and neighbors: Select winning node $j^*$ as that output node with minimum $d_j$. Update weights for node $j^*$ and its neighbors to reduce their distances (between input nodes and output nodes). (See  for the algorithmic detail of neighborhood adjustment.)
5. Label regions in map: After the network is trained through repeated presentation of all inputs, submit unit input vectors of single terms to the trained network and assign the winning node the name of input feature. Neighboring nodes that contain the same feature then form a concept or topic region. The resulting map thus represents regions of important terms or image patterns (the more important a concept, the larger a region) and the assignment of similar documents or images to each region.
6. Apply the above steps recursively for large regions: For each map region that contains more than k (e.g., 100) documents or images, conduct a recursive procedure to generate another self-organizing map until each region contains no more than k documents or images.
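Steps 1 through 5 can be sketched in plain Python as shown below. This is a simplified illustration under stated assumptions, not the implementation used in the study: it uses a square (Chebyshev-distance) neighborhood instead of the hexagonal one, a linearly decreasing gain and radius, a squared Euclidean distance for winner selection, and it omits the recursive step 6. All function names and the sample data are hypothetical.

```python
import random

def distance(w, x):
    """Step 3: squared Euclidean distance between input x and weights w."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))

def winner(weights, x):
    """Step 4 (selection): the node with minimum distance to x."""
    return min(weights, key=lambda node: distance(weights[node], x))

def train_som(docs, rows, cols, iterations, gain0=0.05, radius0=3, seed=0):
    """Train a small SOM on binary document vectors (steps 1 to 4)."""
    rng = random.Random(seed)
    n = len(docs[0])
    # Step 1: small random initial weights for each of the rows*cols nodes.
    weights = {(r, c): [rng.uniform(0.0, 0.1) for _ in range(n)]
               for r in range(rows) for c in range(cols)}
    for t in range(iterations):
        # Step 2: present documents in order, cycling through the collection.
        x = docs[t % len(docs)]
        win = winner(weights, x)
        # Gain and radius shrink linearly over time (an assumed schedule).
        frac = 1.0 - t / iterations
        gain = gain0 * frac
        radius = round(radius0 * frac)
        # Step 4 (update): pull the winner and its neighbors toward x.
        for (r, c), w in weights.items():
            if max(abs(r - win[0]), abs(c - win[1])) <= radius:
                for i in range(n):
                    w[i] += gain * (x[i] - w[i])
    return weights

def label_nodes(weights, terms):
    """Step 5: present each term as a unit vector and label its winner."""
    labels = {}
    for i, term in enumerate(terms):
        unit = [1 if k == i else 0 for k in range(len(terms))]
        labels.setdefault(winner(weights, unit), []).append(term)
    return labels

# Made-up vocabulary and binary document vectors.
terms = ["digital library", "information retrieval",
         "wireless network", "mobile network"]
docs = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
weights = train_som(docs, rows=4, cols=4, iterations=200)
labels = label_nodes(weights, terms)
doc_nodes = [winner(weights, d) for d in docs]
```

After labeling, adjacent nodes that carry the same term would be merged into a concept region, and `doc_nodes` gives the node (and hence the region) to which each document is assigned.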
IN ORDER TO VALIDATE THE PROXIMITY HYPOTHESIS FOR SOM MAPS, we recently designed and conducted a user study aimed at answering the following questions: Can the SOM really cluster related topics together? Can the results be systematically validated using human beings as judges? Specifically, do the term associations suggested by the SOM match the associations that human subjects expect to see?
To evaluate the term associations produced by the SOM, the SOM maps were compared with maps generated at random, which provided region associations created by chance without the treatment of the Kohonen SOM algorithm. Concept precision and recall were used as the measurements. The respective null and alternative hypotheses were:
$H_0$: SOM performs no better than a map generated randomly in terms of precision and recall;
$H_1$: Otherwise (SOM does perform better).
Two data collections were used in order to compare results across different domains and types of documents. EBS was a set of electronic brainstorming output containing 206 comments. Each comment was one to four lines of textual description regarding the future of collaborative systems. This collection is a good example of a small data collection that focuses on a single topic. ITO was a collection of 586 project summaries for project proposals that have been awarded by the Information Technology Office (ITO) in the Defense Advanced Research Projects Agency (DARPA). These documents contained three to four pages of structured textual descriptions about the title, performer, objective, approach, and accomplishments of research projects. While the project summaries tended to be consistent in size, the range of topics within the ITO collection was much greater than within the EBS collection, covering everything from digital libraries to intelligent agents to IP multicasting. ITO provides a good example of a rich technical domain with a wide range of hardware and software topics.
Automatic indexing and noun phrasing were applied to the EBS and ITO collections. For the EBS collection, 1,104 concept terms were identified, while 4,258 terms were identified for ITO. Table 1 lists the twenty most frequently appearing terms in the ITO collection, along with their term frequency. For the EBS collection, we used 500 training iterations and 5,000 tuning iterations. The vector size (the number of terms used to form document vectors) was 100. It took five minutes to generate the EBS map on a mid-sized DEC Alpha 3000/600 server. For the ITO collection, the training and tuning cycles were 1,800 and 4,800, respectively. The vector size was 1,000. It took twenty minutes to generate the map on a DEC 2100 server.
The SOM output for the EBS collection is presented and discussed in . The SOM map for the ITO collection is shown in figure 2. The map contains forty-five regions, with thirty-nine unique concepts. An alphabetical list view showing the unique concept regions in a sidebar is provided as an alternative to the two-dimensional grid. For easier map visualization, clicking items from the list will make their corresponding SOM regions blink. The numbers on the map correspond to the number of documents that are classified into a particular concept region. The concept regions can be clicked to view the documents directly. The initial outputs and computational characteristics for the ITO maps are interesting. We observed that many of the larger concept regions appeared to be meaningful and to relate to each other (e.g., DIGITAL LIBRARY and INFORMATION RETRIEVAL form neighboring regions in the middle of the map, while MOBILE NETWORK and WIRELESS NETWORK are on the top-right region of the map).
Subjects and Experimental Procedures
The experiment involved thirty human subjects, mainly graduate students from the MIS and ECE departments at the University of Arizona. Subjects chosen possessed prior knowledge and training in collaborative computing and advanced artificial intelligence/software engineering topics so they could evaluate both the EBS and the ITO collections.
We first sampled regions from the SOM maps. For each region sampled, the number of its neighbors and the neighbors' respective region labels were recorded. We then asked the human subjects to select from a list of all region labels the same number of concepts as the SOM had found most relevant to the same concept we sampled. The subjects were asked to perform a total of nine tasks, four for the EBS collection and five for the ITO collection. The user interface for the experiment was a Java applet, shown in figure 3, which allowed users to drag and drop terms from the term list to single out concept terms most related to the head concept. Subjects completed the experiment in fifteen to thirty minutes.
The respective performances of the SOM and the random maps were computed using concept recall and precision as measurements determined by the following equations, in which X represented terms suggested by either the SOM or the random map and Y represented terms suggested by the human subjects:
$$\text{Concept recall} = \frac{|X \cap Y|}{|Y|}, \qquad \text{Concept precision} = \frac{|X \cap Y|}{|X|}$$
In our study, since the number of terms suggested by the human subject equaled the number of terms suggested by the SOM maps or the random maps, the concept precision and recall scores were effectively the same.
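A small Python sketch makes the equality concrete. It assumes the standard definitions, precision = |X ∩ Y| / |X| and recall = |X ∩ Y| / |Y|; the function name and the sample terms are made up for illustration and are not data from the experiment.

```python
def concept_precision_recall(suggested, judged):
    """Concept precision and recall for one head concept.

    suggested -- terms proposed by the SOM (or random) map, i.e. X
    judged    -- terms chosen by the human subject, i.e. Y
    """
    x, y = set(suggested), set(judged)
    overlap = len(x & y)
    precision = overlap / len(x) if x else 0.0
    recall = overlap / len(y) if y else 0.0
    return precision, recall

# Hypothetical task: the map lists three neighbors, so the subject
# is asked to pick three terms as well.
som_neighbors = ["information retrieval", "intelligent agents",
                 "user interface"]
subject_choice = ["information retrieval", "speech recognition",
                  "intelligent agents"]
precision, recall = concept_precision_recall(som_neighbors, subject_choice)
```

Because both lists have the same length, the two denominators coincide and the two scores are identical (here, two of three terms overlap).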
Tables 2 and 3 contain the results of statistical analysis (both ANOVA and paired t-tests) of the comparison of the recall/precision levels of SOM and the random maps as judged by the subjects. For the EBS collection, SOM achieved a respectable 32.29 percent precision and recall, versus 18.12 percent for the random map. For the ITO dataset, SOM had 26.85 percent precision and recall, versus 6.59 percent for the random map. In both cases, the reported p-values were 0.0000 (meaning that they were less than 0.00005). With such small p-values, we have strong evidence that the mean difference is greater than zero. We rejected our null hypothesis and concluded that SOM does make a difference in terms of the ability to cluster similar concepts/regions together. In summary, the results from our evaluation were encouraging. In light of the cognitive demand and the cumbersome nature of classifying/clustering textual documents, we believe this research has established the Kohonen SOM as a promising and visually appealing neural-network-based textual analysis and mining technique.
THE SIZE HYPOTHESIS SUGGESTS THAT LARGER REGIONS REPRESENT more important topics in the data collection. We expect that as more documents are assigned to a region, the size of the region will increase. The statistical test we performed is therefore aimed at answering the following question: Is there sufficient evidence to conclude that a significant positive relationship exists between the number of documents in a region (X) and the corresponding region size (Y)?
In this study, the relationship between X and Y is assumed to be linear. Let $b_1$ be the population slope. The null and alternative hypotheses are:
$$H_0: b_1 = 0 \qquad H_1: b_1 > 0$$
Test Collections and Procedures
We used SOM maps generated for both the EBS and ITO collections to validate the size hypothesis. The size and number of documents contained within each region on the map were recorded and tabulated. From the SOM map for EBS collection, we recorded twenty-eight data entries (region size and document number pairs). From the SOM map for ITO collection, we recorded forty-two data entries in the data collection and discarded one outlier. We then performed a linear regression procedure on the data and conducted a one-tailed test on the slope of the regression line.
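The regression procedure can be sketched in Python as follows. The sketch computes the least-squares slope and its t statistic for the null hypothesis of a zero slope, using the textbook n − 2 degrees of freedom; the function name and the data points are made up for illustration and are not the values recorded from the EBS or ITO maps.

```python
import math

def slope_t_test(xs, ys):
    """Least-squares slope b1 and its t statistic for H0: b1 = 0.

    xs -- number of documents in each region
    ys -- size of each region (number of map nodes)
    Returns (b1, t, degrees_of_freedom).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Residual sum of squares and standard error of the slope.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    se_b1 = math.sqrt(sse / (n - 2) / sxx)
    return b1, b1 / se_b1, n - 2

# Hypothetical (documents, region size) pairs.
documents = [1, 2, 3, 4, 5]
region_size = [2, 3, 5, 4, 6]
b1, t, df = slope_t_test(documents, region_size)
```

For a one-tailed test, the resulting t value would be compared with the upper critical value of the t distribution for the given degrees of freedom; a large positive t supports a positive slope.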
Tests of Significance
The regression test statistics are shown in Tables 4 and 5, respectively. For the EBS collection, the regression output gives the test statistic t = 4.82 and a reported p-value of 0. With 27 degrees of freedom, this value of t is highly significant, giving us evidence that $b_1 > 0$. Similarly, for the ITO dataset, t = 6.76 and the reported p-value is 0. The t value is also highly significant, with 40 degrees of freedom. This in turn implies that the number of documents in a region is a useful predictor of the corresponding region size for these documents.
Conclusions and Research Directions
THIS RESEARCH SOUGHT TO VALIDATE THE PROPERTIES OF THE KOHONEN SOM in the domain of textual classification. In evaluation of the proximity hypothesis, we compared SOM's clustering of concept terms with a random map topology and used human beings as judges. The results were encouraging; the SOM was able to cluster related topics (i.e., similar head concepts on the SOM map) together, thus allowing users of the data collection to identify concept clusters quickly. For example, the DARPA ITO program officers were able to identify several research streams that they have been supporting. These include projects with an emphasis on issues related to networking such as Multicast Routing, Mobile Network, and Wireless Network, or projects geared toward information retrieval and analysis, such as Digital Library, Intelligent Agents, and Speech Recognition related research.
For the size hypothesis, our results suggest that there is a positive correlation between the size of SOM regions and their relative importance in the data collection. This property is significant in that, when a user is presented with an SOM map, his or her attention is immediately drawn to the larger regions on the map. If the larger regions on an SOM map indeed represent the major issues (judged by the concentration of documents in the SOM regions) in the data collection, SOM can then be used as a tool to provide graphical summaries for large textual data collections. For example, among the regions on the ITO SOM map, larger regions such as Quality of Services (QoS), User Interface, and Complex Systems were usually among the first few topics to be observed by the viewers.
More work is needed to extend this research to validate such properties in 3D SOM, in which the SOM outputs are mapped to 3D cubes as opposed to 2D nodes. Another possible extension involves evaluation of the multilayered SOM, which uses the divide-and-conquer strategy to provide scalable organization for large-scale textual collections. We believe this research has established the Kohonen SOM algorithm as an intuitively appealing and promising neural-network-based textual classification technique for addressing part of the longstanding problem of "information overload."
Acknowledgment: This project is supported by the following grants: NSF/DARPA/NASA Digital Library Initiative, IRI-9411318, 1994-1998 (B. Schatz and H. Chen, "Building the Interspace: Digital Library Infrastructure for a University Engineering Community"), DARPA Information Management Program, N66001-97-C-8535, 1997-2000 (B. Schatz and H. Chen, "The Interspace Prototype: An Analysis Environment for Semantic Interoperability"), and DARPA ITO N66001-99-1-8907 (J.R Nunamaker, "Dialog Architecture for Collaboration"). We would also like to thank Olivia Sheng, Ron Larsen, and Allen Sears for comments and insights that have guided us in our research.