• An Algorithmic Approach to Concept Exploration in a Large Knowledge Network (Automatic Thesaurus Consultation): Symbolic Branch-and-Bound Search vs. Connectionist Hopfield Net Activation

      Chen, Hsinchun; Ng, Tobun Dorbin (Wiley Periodicals, Inc, 1995-06)
      This paper presents a framework for knowledge discovery and concept exploration. In order to enhance the concept exploration capability of knowledge-based systems and to alleviate the limitations of the manual browsing approach, we have developed two spreading activation-based algorithms for concept exploration in large, heterogeneous networks of concepts (e.g., multiple thesauri). One algorithm, which is based on the symbolic AI paradigm, performs a conventional branch-and-bound search on a semantic net representation to identify other highly relevant concepts (a serial, optimal search process). The second algorithm, which is based on the neural network approach, executes the Hopfield net parallel relaxation and convergence process to identify “convergent” concepts for some initial queries (a parallel, heuristic search process). Both algorithms can be adopted for automatic, multiple-thesauri consultation. We tested these two algorithms on a large text-based knowledge network of about 13,000 nodes (terms) and 80,000 directed links in the area of computing technologies. This knowledge network was created from two external thesauri and one automatically generated thesaurus. We conducted experiments to compare the behaviors and performances of the two algorithms with the hypertext-like browsing process. Our experiment revealed that manual browsing achieved higher term recall but lower term precision in comparison to the algorithmic systems. However, it was also a much more laborious and cognitively demanding process. In document retrieval, there were no statistically significant differences in document recall and precision between the algorithms and the manual browsing process. In light of the effort required by the manual browsing process, our proposed algorithmic approach presents a viable option for efficiently traversing large-scale, multiple thesauri (knowledge networks).
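The Hopfield net relaxation described in this abstract can be sketched briefly. The following is a minimal illustration of the idea only, assuming a sigmoid transfer function, a single fixed threshold, and query nodes clamped to full activation; the function name and parameters are illustrative, not the paper's exact formulation.

```python
import numpy as np

def hopfield_activate(weights, query_idx, theta=0.5, max_iter=50, eps=1e-4):
    """Parallel relaxation over a weighted concept network (sketch).

    weights: (n, n) matrix of directed link weights between concepts.
    query_idx: indices of the initial query terms, clamped to 1.0.
    Returns final activation levels; concepts whose activation exceeds
    `theta` would be offered as "convergent" suggestions.
    """
    n = weights.shape[0]
    act = np.zeros(n)
    act[list(query_idx)] = 1.0
    for _ in range(max_iter):
        # every node updates in parallel from its neighbors' activations
        net = weights.T @ act
        new_act = 1.0 / (1.0 + np.exp(-(net - theta)))  # sigmoid transfer
        new_act[list(query_idx)] = 1.0                  # keep queries clamped
        if np.abs(new_act - act).sum() < eps:           # converged
            act = new_act
            break
        act = new_act
    return act
```

Each iteration updates all nodes simultaneously from the weighted activations of their neighbors; the loop stops when total change falls below `eps`, which is the parallel, heuristic character the abstract contrasts with the serial branch-and-bound search.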
    • Automatic Thesaurus Generation for an Electronic Community System

      Chen, Hsinchun; Schatz, Bruce R.; Yim, Tak; Fye, David (Wiley Periodicals, Inc, 1995-04)
      This research reports an algorithmic approach to the automatic generation of thesauri for electronic community systems. The techniques used included term filtering, automatic indexing, and cluster analysis. The testbed for our research was the Worm Community System, which contains a comprehensive library of specialized community data and literature, currently in use by molecular biologists who study the nematode worm C. elegans. The resulting worm thesaurus included 2709 researchers’ names, 798 gene names, 20 experimental methods, and 4302 subject descriptors. On average, each term had about 90 weighted neighboring terms indicating relevant concepts. The thesaurus was developed as an online search aid. We tested the worm thesaurus in an experiment with six worm researchers of varying degrees of expertise and background. The experiment showed that the thesaurus was an excellent “memory-jogging” device and that it supported learning and serendipitous browsing. Despite some occurrences of obvious noise, the system was useful in suggesting relevant concepts for the researchers’ queries and it helped improve concept recall. With a simple browsing interface, an automatic thesaurus can become a useful tool for online search and can assist researchers in exploring and traversing a dynamic and complex electronic community system.
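The cluster-analysis stage of thesaurus generation can be illustrated with a much-simplified co-occurrence sketch. The actual technique weights terms by frequency and document distribution; here, plain co-occurrence counts normalized by each term's document frequency stand in, and all names are illustrative.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_thesaurus(docs, top_k=5):
    """Toy thesaurus from term co-occurrence (simplified sketch).

    docs: list of per-document index-term lists.
    Returns {term: [(neighbor, weight), ...]}, where weight is the
    asymmetric association count(a, b) / doc_freq(a), so a rare term
    associates more strongly with a common one than vice versa.
    """
    doc_freq = Counter()
    co = Counter()
    for terms in docs:
        uniq = set(terms)
        doc_freq.update(uniq)
        for a, b in combinations(sorted(uniq), 2):
            co[(a, b)] += 1
            co[(b, a)] += 1
    thesaurus = {}
    for (a, b), count in co.items():
        thesaurus.setdefault(a, []).append((b, count / doc_freq[a]))
    for term in thesaurus:                       # keep strongest neighbors
        thesaurus[term].sort(key=lambda pair: -pair[1])
        thesaurus[term] = thesaurus[term][:top_k]
    return thesaurus
```

The "about 90 weighted neighboring terms" per term in the worm thesaurus correspond to the ranked neighbor lists this kind of analysis produces, at far larger scale.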
    • Comparing noun phrasing techniques for use with medical digital library tools

      Tolle, Kristin M.; Chen, Hsinchun (EBSCO, 2000-02)
      In an effort to assist medical researchers and professionals in accessing information necessary for their work, the AI Lab at the University of Arizona is investigating the use of a natural language processing (NLP) technique called noun phrasing. The goal of this research is to determine whether noun phrasing could be a viable technique to include in medical information retrieval applications. Four noun phrase generation tools were evaluated as to their ability to isolate noun phrases from medical journal abstracts. Tests were conducted using the National Cancer Institute's CANCERLIT database. The NLP tools evaluated were Massachusetts Institute of Technology's (MIT's) Chopper, The University of Arizona's Automatic Indexer, Lingsoft's NPtool, and The University of Arizona's AZ Noun Phraser. In addition, the National Library of Medicine's SPECIALIST Lexicon was incorporated into two versions of the AZ Noun Phraser to be evaluated against the other tools as well as a nonaugmented version of the AZ Noun Phraser. Using the metrics relative subject recall and precision, our results show that, with the exception of Chopper, the phrasing tools were fairly comparable in recall and precision. It was also shown that augmenting the AZ Noun Phraser by including the SPECIALIST Lexicon from the National Library of Medicine resulted in improved recall and precision.
    • A Concept Space Approach to Addressing the Vocabulary Problem in Scientific Information Retrieval: An Experiment on the Worm Community System

      Chen, Hsinchun; Martinez, Joanne; Ng, Tobun Dorbin; Schatz, Bruce R. (Wiley Periodicals, Inc, 1997-01)
      This research presents an algorithmic approach to addressing the vocabulary problem in scientific information retrieval and information sharing, using the molecular biology domain as an example. We first present a literature review of cognitive studies related to the vocabulary problem and vocabulary-based search aids (thesauri) and then discuss techniques for building robust and domain-specific thesauri to assist in cross-domain scientific information retrieval. Using a variation of the automatic thesaurus generation techniques, which we refer to as the concept space approach, we recently conducted an experiment in the molecular biology domain in which we created a C. elegans worm thesaurus of 7,657 worm-specific terms and a Drosophila fly thesaurus of 15,626 terms. About 30% of these terms overlapped, which created vocabulary paths from one subject domain to the other. Based on a cognitive study of term association involving four biologists, we found that a large percentage (59.6-85.6%) of the terms suggested by the subjects were identified in the conjoined fly-worm thesaurus. However, we found only a small percentage (8.4-18.1%) of the associations suggested by the subjects in the thesaurus. In a follow-up document retrieval study involving eight fly biologists, an actual worm database (Worm Community System), and the conjoined fly-worm thesaurus, subjects were able to find more relevant documents (an increase from about 9 documents to 20) and to improve the document recall level (from 32.41% to 65.28%) when using the thesaurus, although the precision level did not improve significantly. Implications of adopting the concept space approach for addressing the vocabulary problem in Internet and digital library applications are also discussed.
    • Concept-based searching and browsing: a geoscience experiment

      Hauck, Roslin V.; Sewell, Robin R.; Ng, Tobun Dorbin; Chen, Hsinchun (Wiley Periodicals, Inc, 2001)
      In the recent literature, we have seen the expansion of information retrieval techniques to include a variety of different collections of information. Collections can have certain characteristics that can lead to different results for the various classification techniques. In addition, the ways and reasons that users explore each collection can affect the success of the information retrieval technique. The focus of this research was to extend the application of our statistical and neural network techniques to the domain of geological science information retrieval. For this study, a test bed of 22,636 geoscience abstracts was obtained through the NSF/DARPA/NASA funded Alexandria Digital Library Initiative project at the University of California at Santa Barbara. This collection was analyzed using algorithms previously developed by our research group: concept space algorithm for searching and a Kohonen self-organizing map (SOM) algorithm for browsing. Included in this paper are discussions of our techniques, user evaluations and lessons learned.
    • Expertise and the perception of shape in information

      Dillon, Andrew; Schaap, Dille; Kraft, Donald H. (Wiley, 1996-10)
      This item is not the definitive copy. Please use the following citation when referencing this material: Dillon, A. and Shaap, D. (1996) Expertise and the perception of structure in discourse. Journal of the American Society for Information Science, 47(10), 786-788. Abstract: Ability to navigate an information space may be influenced by the presence or absence of certain embedded cues that users have learned to recognize. Experimental results are presented which indicate that experienced readers of certain academic journals are more capable than inexperienced readers in locating themselves in an information space in the absence of explicit structural cues.
    • Extending SGML to accommodate database functions: A Methodological Overview

      Sengupta, Arjit; Dillon, Andrew; Kraft, Donald H. (Wiley, 1997-07)
      A method for augmenting an SGML document repository system with database functionality is presented. SGML (ISO 8879, 1986) has been widely accepted as a standard language for writing text with added structural information that gives the text greater applicability. Recently there has been a trend to use this structural information as metadata in databases. The complex structure of documents, however, makes it difficult to directly map the structural information in documents to database structures. In particular, the flat nature of relational databases makes it extremely difficult to model documents that are inherently hierarchical in nature. Consequently, documents are modeled in object-oriented databases (Abiteboul, Cluet, & Milo, 1993), and object-relational databases (Holst, 1995), in which SGML documents are mapped into the corresponding database models and are later reconstructed as necessary. However, this mapping strategy is not natural and can potentially cause loss of information in the original SGML documents. Moreover, interfaces for building queries for current document databases are mostly built on form-based query techniques and do not use the “look and feel” of the documents. This article introduces an implementation method for a complex-object modeling technique specifically for SGML documents and describes interface techniques tailored for text databases. Some of the concepts for a Structured Document Database Management System (SDDBMS) specifically designed for SGML documents are described. A small survey of some current products is also presented to demonstrate the need for such a system.
    • Extending theory for user-centered information systems: Diagnosing and learning from error in complex statistical data.

      Robbin, Alice; Frost-Kumpf, Lee (John Wiley & Sons, Inc., 1997-02)
      Utilization of complex statistical data has come at great cost to individual researchers, the information community, and to the national information infrastructure. Dissatisfaction with the traditional approach to information system design and information services provision, and, by implication, the theoretical bases on which these systems and services have been developed has led librarians and information scientists to propose that information is a user construct and therefore system designs should place greater emphasis on user-centered approaches. This article extends Dervin's and Morris's theoretical framework for designing effective information services by synthesizing and integrating theory and research derived from multiple approaches in the social and behavioral sciences. These theoretical frameworks are applied to develop general design strategies and principles for information systems and services that rely on complex statistical data. The focus of this article is on factors that contribute to error in the production of high quality scientific output and on failures of communication during the process of data production and data utilization. Such insights provide useful frameworks to diagnose, communicate, and learn from error. Strategies to design systems that support communicative competence and cognitive competence emphasize the utilization of information systems in a user-centered learning environment. This includes viewing cognition as a generative process and recognizing the continuing interdependence and active involvement of experts, novices, and technological gatekeepers.
    • From Translation to Navigation of Different Discourses: A Model of Search Term Selection during the Pre-Online Stage of the Search Process

      Iivonen, Mirja; Sonnenwald, Diane H. (John Wiley and Sons, Inc., 1998-04)
      We propose a model of search term selection process based on our empirical study of professional searchers during the pre-online stage of the search process. The model characterizes the selection of search terms as the navigation of different discourses. Discourse refers to the way of talking and thinking about a certain topic; there often exists multiple, diverse discourses on the same topic. When selecting search terms, searchers appear to navigate a variety of discourses, i.e., they view the topic of a client's search request from the perspective of multiple discourse communities, and evaluate and synthesize differences and similarities among those discourses when selecting search terms. Six discourses emerged as sources of search terms in our study. These discourses are controlled vocabularies, documents and domains, the practice of indexing, clients' search requests, databases and the searchers' own search experience. Data further suggest that searchers navigate these discourses dynamically and have preferences for certain discourses. Conceptualizing the selection of search terms as a meeting place of different discourses provides new insights into the complex nature of the search term selection process. It emphasizes the multiplicity and complexity of the sources of search terms, the dynamic nature of the search term selection process, and the complex analysis and synthesis of differences and similarities among sources of search terms. It suggests that searchers may need to understand fundamental aspects of multiple discourses in order to select search terms.
    • Genres and the Web - is the home page the first digital genre?

      Dillon, Andrew; Grushowski, Barbara; Kraft, Donald H. (Wiley, 2000-01)
      Genre conventions emerge across discourse communities over time to support the communication of ideas and information in socially and cognitively compatible forms. Digital genres frequently borrow heavily from the paper world even though the media are very different. This research sought to identify the existence and form of a truly digital genre. Preliminary results from a survey of user perceptions of the form and content of web home pages reveal a significant correlation between commonly found elements on such home pages and user preferences and expectations of type. Results suggest that the personal home page has rapidly evolved into a recognizable form with stable, user-preferred elements and thus can be considered the first truly digital genre.
    • A graphical self-organizing approach to classifying electronic meeting output

      Orwig, Richard E.; Chen, Hsinchun; Nunamaker, Jay F. (Wiley Periodicals, Inc, 1997-02)
      This article describes research in the application of a Kohonen Self-Organizing Map (SOM) to the problem of classification of electronic brainstorming output and an evaluation of the results. This research builds upon previous work in automating the meeting classification process using a Hopfield neural network. An evaluation comparing the Kohonen output with Hopfield and human expert output on the same data set found that the Kohonen SOM performed as well as a human expert in representing term association in the meeting output and outperformed the Hopfield neural network algorithm. Recall of consensus meeting concepts and topics using the Kohonen algorithm was equivalent to that of the human expert.
    • A Machine Learning Approach to Inductive Query by Examples: An Experiment Using Relevance Feedback, ID3, Genetic Algorithms, and Simulated Annealing

      Chen, Hsinchun; Shankaranarayanan, Ganesan; She, Linlin; Iyer, Anand (Wiley Periodicals, Inc, 1998-06)
      Information retrieval using probabilistic techniques has attracted significant attention on the part of researchers in information and computer science over the past few decades. In the 1980s, knowledge-based techniques also made an impressive contribution to “intelligent” information retrieval and indexing. More recently, information science researchers have turned to other newer inductive learning techniques including symbolic learning, genetic algorithms, and simulated annealing. These newer techniques, which are grounded in diverse paradigms, have provided great opportunities for researchers to enhance the information processing and retrieval capabilities of current information systems. In this article, we first provide an overview of these newer techniques and their use in information retrieval research. In order to familiarize readers with the techniques, we present three promising methods: the symbolic ID3 algorithm, evolution-based genetic algorithms, and simulated annealing. We discuss their knowledge representations and algorithms in the unique context of information retrieval. An experiment using an 8,000-record COMPEN database was performed to examine the performances of these inductive query-by-example techniques in comparison with the performance of the conventional relevance feedback method. The machine learning techniques were shown to be able to help identify new documents which are similar to documents initially suggested by users, and documents which contain similar concepts to each other. Genetic algorithms, in particular, were found to outperform relevance feedback in both document recall and precision. We believe these inductive machine learning techniques hold promise for the ability to analyze users’ preferred documents (or records), identify users’ underlying information needs, and suggest search alternatives for database management systems and Internet applications.
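The inductive query-by-example idea, with a genetic algorithm as the learner, can be illustrated with a toy sketch. The representation (a binary term-inclusion vector), the fitness function (mean Jaccard overlap with the user's example documents), and all parameters here are illustrative choices, not the paper's configuration.

```python
import random

def ga_query(relevant_docs, vocab, pop_size=30, gens=40, seed=42):
    """Evolve a query (set of terms) that best matches example documents.

    relevant_docs: lists of index terms from user-suggested documents.
    vocab: candidate query terms; chromosomes are binary inclusion vectors.
    """
    rng = random.Random(seed)
    doc_sets = [set(d) for d in relevant_docs]

    def fitness(chrom):
        terms = {t for t, bit in zip(vocab, chrom) if bit}
        if not terms:
            return 0.0
        # mean Jaccard similarity to the example documents
        return sum(len(terms & d) / len(terms | d)
                   for d in doc_sets) / len(doc_sets)

    pop = [[rng.randint(0, 1) for _ in vocab] for _ in range(pop_size)]
    for _ in range(gens):
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[: pop_size // 2]           # keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(vocab))      # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.1:                  # occasional mutation
                i = rng.randrange(len(vocab))
                child[i] ^= 1
            children.append(child)
        pop = parents + children
    best = max(pop, key=fitness)
    return {t for t, bit in zip(vocab, best) if bit}
```

The evolved term set plays the role of a refined query: terms shared across the user's example documents survive selection, while irrelevant terms are bred out.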
    • Machine Learning for Information Retrieval: Neural Networks, Symbolic Learning, and Genetic Algorithms

      Chen, Hsinchun (Wiley Periodicals, Inc, 1995-04)
      Information retrieval using probabilistic techniques has attracted significant attention on the part of researchers in information and computer science over the past few decades. In the 1980s, knowledge-based techniques also made an impressive contribution to “intelligent” information retrieval and indexing. More recently, information science researchers have turned to other newer artificial-intelligence-based inductive learning techniques including neural networks, symbolic learning, and genetic algorithms. These newer techniques, which are grounded in diverse paradigms, have provided great opportunities for researchers to enhance the information processing and retrieval capabilities of current information storage and retrieval systems. In this article, we first provide an overview of these newer techniques and their use in information science research. To familiarize readers with these techniques, we present three popular methods: the connectionist Hopfield network; the symbolic ID3/ID5R; and evolution-based genetic algorithms. We discuss their knowledge representations and algorithms in the context of information retrieval. Sample implementation and testing results from our own research are also provided for each technique. We believe these techniques are promising in their ability to analyze user queries, identify users’ information needs, and suggest alternatives for search. With proper user-system interactions, these methods can greatly complement the prevailing full-text, keyword-based, probabilistic, and knowledge-based techniques.
    • Raising Reliability of Web Search Tool Research through Replication and Chaos Theory

      Nicholson, Scott (2000)
      Because the World Wide Web is a dynamic collection of information, the Web search tools (or "search engines") that index the Web are dynamic. Traditional information retrieval evaluation techniques may not provide reliable results when applied to the Web search tools. This study is the result of ten replications of the classic 1996 Ding and Marchionini Web search tool research. It explores the effects that replication can have on transforming unreliable results from one iteration into replicable and therefore reliable results after multiple iterations.
    • Seeking explanation in theory: Reflections on the social practices of organizations that distribute public use microdata files for research purposes

      Robbin, Alice; Koball, Heather (2001-11)
      Public concern about personal privacy has recently focused on issues of Internet data security and personal information as big business. The scientific discourse about information privacy focuses on the cross-pressures of maintaining confidentiality and ensuring access in the context of the production of statistical data for public policy and social research and the associated technical solutions for releasing statistical data. This article reports some of the key findings from a small-scale survey of organizational practices to limit disclosure of confidential information prior to publishing public use microdata files, and illustrates how the rules for preserving confidentiality were applied in practice. Explanation for the apparent deficits and wide variations in the extent of knowledge about statistical disclosure limitation (SDL) methods is located in theories of organizational life and communities of practice. The article concludes with suggestions for improving communication between communities of practice to enhance the knowledge base of those responsible for producing public use microdata files.
    • Spatial semantics: How users derive shape from information space

      Dillon, Andrew; Kraft, Donald H. (2000)
      This is a preprint of a paper published (with a slightly different title: Spatial semantics and individual differences in the perception of shape in information space) in the Journal of the American Society for Information Science, 51(6), 521-528. Abstract: User problems with large information spaces multiply in complexity when we enter the digital domain. Virtual information environments can offer 3-D representations, reconfigurations and access to large databases that can overwhelm many users' abilities to filter and represent. As a result, users frequently experience disorientation in navigating large digital spaces to locate and use information. To date, the research response has been predominantly based on the analysis of visual navigational aids that might support users' bottom-up processing of the spatial display. In the present paper an emerging alternative is considered that places greater emphasis on the top-down application of semantic knowledge by the user gleaned from their experiences within the socio-cognitive context of information production and consumption. A distinction between spatial and semantic cues is introduced and existing empirical data are reviewed that highlight the differential reliance on spatial or semantic information as domain expertise of the user increases. The conclusion is reached that interfaces for shaping information should be built on an increasing analysis of users' semantic processing.
    • Validating a Geographic Image Retrieval System

      Zhu, Bin; Chen, Hsinchun (Wiley Periodicals, Inc, 2000)
      This paper summarizes a prototype geographical image retrieval system that demonstrates how to integrate image processing and information analysis techniques to support large-scale content-based image retrieval. By using an image as its interface, the prototype system addresses a troublesome aspect of traditional retrieval models, which require users to have complete knowledge of the low-level features of an image. In addition we describe an experiment to validate the performance of this image retrieval system against that of human subjects in an effort to address the scarcity of research evaluating performance of an algorithm against that of human beings. The results of the experiment indicate that the system could do as well as human subjects in accomplishing the tasks of similarity analysis and image categorization. We also found that under some circumstances texture features of an image are insufficient to represent a geographic image. We believe, however, that our image retrieval system provides a promising approach to integrating image processing techniques and information retrieval algorithms.
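The similarity analysis that the image retrieval system is validated on can be illustrated with a simple feature-vector comparison. This is an illustrative stand-in, not the paper's implementation: it assumes each image has already been reduced to a texture feature vector (e.g., filter-bank energies) and ranks a catalog by cosine similarity to the query image.

```python
import numpy as np

def texture_similarity(fa, fb):
    """Cosine similarity between two image texture feature vectors."""
    fa, fb = np.asarray(fa, float), np.asarray(fb, float)
    return float(fa @ fb / (np.linalg.norm(fa) * np.linalg.norm(fb)))

def rank_images(query_features, catalog):
    """Rank (image_id, features) pairs by similarity to the query image,
    most similar first -- the query-by-image interaction the paper uses
    in place of asking users for low-level feature knowledge."""
    return sorted(catalog,
                  key=lambda item: -texture_similarity(query_features,
                                                       item[1]))
```

The paper's finding that texture features alone are sometimes insufficient for geographic images corresponds, in this sketch, to cases where visually distinct images yield similar feature vectors.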
    • Who's Zooming Whom? Attunement to animation in the interface

      Chui, Michael; Dillon, Andrew; Kraft, Donald H. (Wiley, 1997-01)
      A number of references in the Human-Computer Interaction literature make the common-sense suggestion that the animated zooming effect accompanying the opening or closing of a folder in the Apple Macintosh graphical user interface aids in a user's perception of which window corresponds to which folder. We examine this claim empirically using two controlled experiments. Although we did not find a statistically significant overall difference resulting from the presence or absence of the zooming effect, a post hoc analysis revealed a highly significant interaction between the experience of users with the Macintosh user interface and the zooming effect. This individual difference suggests that users become attuned to the informational content of the zooming effect with experience.