• An Algorithmic Approach to Concept Exploration in a Large Knowledge Network (Automatic Thesaurus Consultation): Symbolic Branch-and-Bound Search vs. Connectionist Hopfield Net Activation

      Chen, Hsinchun; Ng, Tobun Dorbin (Wiley Periodicals, Inc, 1995-06)
      This paper presents a framework for knowledge discovery and concept exploration. In order to enhance the concept exploration capability of knowledge-based systems and to alleviate the limitations of the manual browsing approach, we have developed two spreading activation-based algorithms for concept exploration in large, heterogeneous networks of concepts (e.g., multiple thesauri). One algorithm, which is based on the symbolic AI paradigm, performs a conventional branch-and-bound search on a semantic net representation to identify other highly relevant concepts (a serial, optimal search process). The second algorithm, which is based on the neural network approach, executes the Hopfield net parallel relaxation and convergence process to identify "convergent" concepts for some initial queries (a parallel, heuristic search process). Both algorithms can be adopted for automatic, multiple-thesauri consultation. We tested these two algorithms on a large text-based knowledge network of about 13,000 nodes (terms) and 80,000 directed links in the area of computing technologies. This knowledge network was created from two external thesauri and one automatically generated thesaurus. We conducted experiments to compare the behaviors and performances of the two algorithms with the hypertext-like browsing process. Our experiment revealed that manual browsing achieved higher term recall but lower term precision in comparison to the algorithmic systems. However, it was also a much more laborious and cognitively demanding process. In document retrieval, there were no statistically significant differences in document recall and precision between the algorithms and the manual browsing process. In light of the effort required by the manual browsing process, our proposed algorithmic approach presents a viable option for efficiently traversing large-scale, multiple thesauri (knowledge network).
    • Alleviating Search Uncertainty through Concept Associations: Automatic Indexing, Co-Occurrence Analysis, and Parallel Computing

      Chen, Hsinchun; Martinez, Joanne; Kirchhoff, Amy; Ng, Tobun Dorbin; Schatz, Bruce R. (Wiley Periodicals, Inc, 1998)
      In this article, we report research on an algorithmic approach to alleviating search uncertainty in a large information space. Grounded on object filtering, automatic indexing, and co-occurrence analysis, we performed a large-scale experiment using a parallel supercomputer (SGI Power Challenge) to analyze 400,000+ abstracts in an INSPEC computer engineering collection. Two system-generated thesauri, one based on a combined object filtering and automatic indexing method, and the other based on automatic indexing only, were compared with the human-generated INSPEC subject thesaurus. Our user evaluation revealed that the system-generated thesauri were better than the INSPEC thesaurus in concept recall, but in concept precision the 3 thesauri were comparable. Our analysis also revealed that the terms suggested by the 3 thesauri were complementary and could be used to significantly increase "variety" in search terms and thereby reduce search uncertainty.
    • A Concept Space Approach to Addressing the Vocabulary Problem in Scientific Information Retrieval: An Experiment on the Worm Community System

      Chen, Hsinchun; Martinez, Joanne; Ng, Tobun Dorbin; Schatz, Bruce R. (Wiley Periodicals, Inc, 1997-01)
      This research presents an algorithmic approach to addressing the vocabulary problem in scientific information retrieval and information sharing, using the molecular biology domain as an example. We first present a literature review of cognitive studies related to the vocabulary problem and vocabulary-based search aids (thesauri) and then discuss techniques for building robust and domain-specific thesauri to assist in cross-domain scientific information retrieval. Using a variation of the automatic thesaurus generation techniques, which we refer to as the concept space approach, we recently conducted an experiment in the molecular biology domain in which we created a C. elegans worm thesaurus of 7,657 worm-specific terms and a Drosophila fly thesaurus of 15,626 terms. About 30% of these terms overlapped, which created vocabulary paths from one subject domain to the other. Based on a cognitive study of term association involving four biologists, we found that a large percentage (59.6-85.6%) of the terms suggested by the subjects were identified in the conjoined fly-worm thesaurus. However, we found only a small percentage (8.4-18.1%) of the associations suggested by the subjects in the thesaurus. In a follow-up document retrieval study involving eight fly biologists, an actual worm database (Worm Community System), and the conjoined fly-worm thesaurus, subjects were able to find more relevant documents (an increase from about 9 documents to 20) and to improve the document recall level (from 32.41 to 65.28%) when using the thesaurus, although the precision level did not improve significantly. Implications of adopting the concept space approach for addressing the vocabulary problem in Internet and digital libraries applications are also discussed.
    • Concept-based searching and browsing: a geoscience experiment

      Hauck, Roslin V.; Sewell, Robin R.; Ng, Tobun Dorbin; Chen, Hsinchun (Wiley Periodicals, Inc, 2001)
      In the recent literature, we have seen the expansion of information retrieval techniques to include a variety of different collections of information. Collections can have certain characteristics that can lead to different results for the various classification techniques. In addition, the ways and reasons that users explore each collection can affect the success of the information retrieval technique. The focus of this research was to extend the application of our statistical and neural network techniques to the domain of geological science information retrieval. For this study, a test bed of 22,636 geoscience abstracts was obtained through the NSF/DARPA/NASA funded Alexandria Digital Library Initiative project at the University of California at Santa Barbara. This collection was analyzed using algorithms previously developed by our research group: concept space algorithm for searching and a Kohonen self-organizing map (SOM) algorithm for browsing. Included in this paper are discussions of our techniques, user evaluations and lessons learned.
    • Creating a Large-Scale Digital Library for Georeferenced Information

      Zhu, Bin; Ramsey, Marshall C.; Ng, Tobun Dorbin; Chen, Hsinchun; Schatz, Bruce R. (1999-07)
      Digital libraries with multimedia geographic content present special challenges and opportunities in today's networked information environment. One of the most challenging research issues for geospatial collections is to develop techniques to support fuzzy, concept-based, geographic information retrieval. Based on an artificial intelligence approach, this project presents a Geospatial Knowledge Representation System (GKRS) prototype that integrates multiple knowledge sources (textual, image, and numerical) to support concept-based geographic information retrieval. Based on semantic network and neural network representations, GKRS loosely couples different knowledge sources and adopts spreading activation algorithms for concept-based knowledge inferencing. Both textual analysis and image processing techniques have been employed to create textual and visual geographical knowledge structures. This paper suggests a framework for developing a complete GKRS-based system and describes in detail the prototype system that has been developed so far.
    • Exploring the use of concept spaces to improve medical information retrieval

      Houston, Andrea L.; Chen, Hsinchun; Schatz, Bruce R.; Hubbard, Susan M.; Sewell, Robin R.; Ng, Tobun Dorbin (Elsevier, 2000)
      This research investigated the application of techniques successfully used in previous information retrieval research to the more challenging area of medical informatics. It was performed on a biomedical document collection testbed, CANCERLIT, provided by the National Cancer Institute (NCI), which contains information on all types of cancer therapy. The quality or usefulness of terms suggested by three different thesauri, one based on MeSH terms, one based solely on terms from the document collection, and one based on the Unified Medical Language System (UMLS) Metathesaurus, was explored with the ultimate goal of improving CANCERLIT information search and retrieval. Researchers affiliated with the University of Arizona Cancer Center evaluated lists of related terms suggested by different thesauri for 12 different directed searches in the CANCERLIT testbed. The preliminary results indicated that among the thesauri, there were no statistically significant differences in either term recall or precision. Surprisingly, there was almost no overlap of relevant terms suggested by the different thesauri for a given search. This suggests that recall could be significantly improved by using a combined thesaurus approach.
    • Federated Search of Scientific Literature

      Schatz, Bruce R.; Mischo, William; Cole, Timothy; Bishop, Ann Peterson; Harum, Susan; Johnson, Eric H.; Neumann, Laura; Chen, Hsinchun; Ng, Tobun Dorbin (IEEE, 1999-02)
      The Digital Libraries Initiative (DLI) project at the University of Illinois at Urbana-Champaign (UIUC) was one of six sponsored by the NSF, DARPA, and NASA from 1994 through 1998. Our goal was to develop widely usable Web technology to effectively search technical documents on the Internet. We concentrated on building the experimental Illinois DLI Testbed with tens of thousands of full-text journal articles from physics, engineering, and computer science, and on making these articles available over the Internet before they are available in print. Our DLI Testbed used document structure to provide federated search across publisher collections, by merging diverse tags from multiple publishers into a single uniform collection. Our sociology research evaluated the usage of the DLI Testbed by more than a thousand UIUC faculty and students. Our technology research moved beyond document structure to document semantics, testing contextual indexing of document content on millions of documents.
    • Federated Search of Scientific Literatures: A Retrospective on the Illinois Digital Library Project

      Schatz, Bruce R.; Mischo, William; Cole, Timothy; Bishop, Ann Peterson; Harum, Susan; Johnson, Eric H.; Neumann, Laura; Chen, Hsinchun; Ng, Tobun Dorbin; Harum, S.; et al. (UIUC, 2000)
      The NSF/DARPA/NASA Digital Libraries Initiative (DLI) project at the University of Illinois at Urbana-Champaign (UIUC), 1994-1998, had the goal of developing widely usable Web technology to effectively search technical documents on the Internet. The DLI testbed focused on using the document structure to provide federated searches across publisher collections. Our sociology research included the evaluation of its effectiveness under use by over 1,000 UIUC faculty and students, a user community an order of magnitude bigger than the last generation of research projects centered on searching scientific literature. Our technology research developed indexing of the contents of text documents to enable a federated search across multiple sources, testing this on millions of documents for semantic federation. This article will discuss the achievements and difficulties we experienced over the past four years.
    • Generating, Integrating, and Activating Thesauri for Concept-based Document Retrieval

      Chen, Hsinchun; Lynch, K.J.; Basu, K.; Ng, Tobun Dorbin (IEEE, 1993-04)
      This Blackboard-based design uses a neural-net spreading-activation algorithm to traverse multiple thesauri. Guided by heuristics, the algorithm activates related terms in the thesauri and converges on the most pertinent concepts.
    • Medical Data Mining on the Internet: Research on a Cancer Information System

      Houston, Andrea L.; Chen, Hsinchun; Hubbard, Susan M.; Schatz, Bruce R.; Ng, Tobun Dorbin; Sewell, Robin R.; Tolle, Kristin M. (Kluwer, 1999)
      This paper discusses several data mining algorithms and techniques that we have developed at the University of Arizona Artificial Intelligence Lab. We have implemented these algorithms and techniques into several prototypes, one of which focuses on medical information developed in cooperation with the National Cancer Institute (NCI) and the University of Illinois at Urbana-Champaign. We propose an architecture for medical knowledge information systems that will permit data mining across several medical information sources and discuss a suite of data mining tools that we are developing to assist NCI in improving public access to and use of their existing vast cancer information collections.
    • A Parallel Computing Approach to Creating Engineering Concept Spaces for Retrieval: The Illinois Digital Library Initiative Project

      Chen, Hsinchun; Schatz, Bruce R.; Ng, Tobun Dorbin; Martinez, Joanne; Kirchhoff, Amy; Lin, Chienting (IEEE, 1996-08)
      This research presents preliminary results generated from the semantic retrieval research component of the Illinois Digital Library Initiative (DLI) project. Using a variation of the automatic thesaurus generation techniques, to which we refer as the concept space approach, we aimed to create graphs of domain-specific concepts (terms) and their weighted co-occurrence relationships for all major engineering domains. Merging these concept spaces and providing traversal paths across different concept spaces could potentially help alleviate the vocabulary (difference) problem evident in large-scale information retrieval. We have experimented previously with such a technique for a smaller molecular biology domain (Worm Community System, with 10+ MBs of document collection) with encouraging results.
    • Support Concept-based Multimedia Information Retrieval: A Knowledge Management Approach

      Zhu, Bin; Ramsey, Marshall C.; Chen, Hsinchun; Hauck, Roslin V.; Ng, Tobun Dorbin; Schatz, Bruce R. (1999)
      Identified as an important management concept five years ago (Gamer 1999), knowledge management (KM) aims to enable organizations to capture, organize, and access their intellectual assets. This paper proposes a prototype system that applies a knowledge management approach to support concept-based multimedia information retrieval by integrating various information analysis and image processing techniques. The proposed system uses geographical information as its testbed and aims to provide flexibility to users in terms of specifying their information needs and to facilitate parallel extraction of information in different formats (i.e., text, image). Our testbed selection is based not only on the fact that geographical information has become an important resource supporting organization decision making, but also on the diversity of its information media and the fuzziness of geo-spatial queries. We hope that the proposed system will improve the accessibility of geographical information in different media and provide an example of integrating various information and multimedia techniques to support concept-based cross-media information retrieval.
    • Using Backpropagation Networks for the Estimation of Aqueous Activity Coefficients of Aromatic Organic Compounds

      Chow, Hsiao-Hui; Chen, Hsinchun; Ng, Tobun Dorbin; Myrdal, P.; Yalkowsky, S.H. (1995-07)
      This research examined the applicability of using a neural network approach to the estimation of aqueous activity coefficients of aromatic organic compounds from fragmented structural information. A set of 95 compounds was used to train the neural network, and the trained network was tested on a set of 31 compounds. A comparison was made between the results and those obtained using multiple linear regression analysis. With the proper selection of neural network parameters, the backpropagation network provided a more accurate prediction of the aqueous activity coefficients for testing data than did regression analysis. This research indicates that neural networks have the potential to become a useful analytical technique for quantitative prediction of structure-activity relationships.