• Applying Associative Retrieval Techniques to Alleviate the Sparsity Problem in Collaborative Filtering

      Huang, Zan; Chen, Hsinchun; Zeng, Daniel (ACM, 2004-01)
      Recommender systems are being widely applied in many application settings to suggest products, services, and information items to potential consumers. Collaborative filtering, the most successful recommendation approach, makes recommendations based on past transactions and feedback from consumers sharing similar interests. A major problem limiting the usefulness of collaborative filtering is the sparsity problem, which refers to a situation in which transactional or feedback data is sparse and insufficient to identify similarities in consumer interests. In this article, we propose to deal with this sparsity problem by applying an associative retrieval framework and related spreading activation algorithms to explore transitive associations among consumers through their past transactions and feedback. Such transitive associations are a valuable source of information to help infer consumer interests and can be explored to deal with the sparsity problem. To evaluate the effectiveness of our approach, we have conducted an experimental study using a data set from an online bookstore. We experimented with three spreading activation algorithms including a constrained Leaky Capacitor algorithm, a branch-and-bound serial symbolic search algorithm, and a Hopfield net parallel relaxation search algorithm. These algorithms were compared with several collaborative filtering approaches that do not consider the transitive associations: a simple graph search approach, two variations of the user-based approach, and an item-based approach. 
Our experimental results indicate that spreading activation-based approaches significantly outperformed the other collaborative filtering methods as measured by recommendation precision, recall, the F-measure, and the rank score. We also observed an over-activation effect of the spreading activation approach: incorporating transitive associations with past transactional data that is not sparse may “dilute” the data used to infer user preferences and lead to degradation in recommendation performance.
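The core idea above is that activation spreads from a user through purchased items to other users and on to their items, surfacing transitive associations that direct user-user similarity would miss when data is sparse. The following is a minimal sketch of that general idea in Python, not the paper's constrained Leaky Capacitor, branch-and-bound, or Hopfield net algorithms; the purchase data and decay parameter are hypothetical.

```python
# Sketch: spreading activation over a user-item bipartite graph.
# Hypothetical illustration, not the algorithms evaluated in the paper.

def spread_activation(purchases, start_user, steps=3, decay=0.5):
    """Propagate activation from start_user through items to other users' items."""
    # Invert the purchase map: item -> set of users who bought it.
    item_users = {}
    for user, items in purchases.items():
        for it in items:
            item_users.setdefault(it, set()).add(user)

    # Activation levels keyed by (node_kind, node_name).
    activation = {("user", start_user): 1.0}
    for _ in range(steps):
        next_act = dict(activation)
        for (kind, node), level in activation.items():
            # A user node activates its items; an item node activates its buyers.
            neighbors = purchases[node] if kind == "user" else item_users[node]
            next_kind = "item" if kind == "user" else "user"
            for nb in neighbors:
                key = (next_kind, nb)
                next_act[key] = next_act.get(key, 0.0) + decay * level
        activation = next_act

    # Recommend activated items the start user has not already purchased.
    seen = set(purchases[start_user])
    scores = {n: a for (k, n), a in activation.items()
              if k == "item" and n not in seen}
    return sorted(scores, key=scores.get, reverse=True)

purchases = {
    "alice": ["book1", "book2"],
    "bob":   ["book2", "book3"],
    "carol": ["book3", "book4"],
}
# alice and bob share book2, so bob's book3 is reachable transitively.
print(spread_activation(purchases, "alice"))  # → ['book3']
```

With sparse data, alice and bob share only one item, yet book3 still receives activation via the transitive path alice → book2 → bob → book3; the over-activation effect noted above corresponds to long paths injecting weakly related items when the data is already dense.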
    • Building an Infrastructure for Law Enforcement Information Sharing and Collaboration: Design Issues and Challenges

      Chau, Michael; Atabakhsh, Homa; Zeng, Daniel; Chen, Hsinchun (2001)
With the exponential growth of the Internet, information can be shared among government agencies more easily than before. However, this also poses some design issues and challenges. This article reports on our experience in building an infrastructure for information sharing and collaboration in the law enforcement domain. Based on our user requirement studies with the Tucson Police Department, three main design challenges are identified and discussed in detail. Based on our findings, we propose an infrastructure to address these issues. The proposed design consists of three modules, namely (1) Security and Confidentiality Management Module, (2) Information Access and Monitoring Module, and (3) Collaboration Module. A prototype system will be deployed and tested at the Tucson Police Department. We anticipate that our studies can provide useful insights for other digital government research projects.
    • CI Spider: a tool for competitive intelligence on the Web

      Chen, Hsinchun; Chau, Michael; Zeng, Daniel (Elsevier, 2002)
Competitive Intelligence (CI) aims to monitor a firm’s external environment for information relevant to its decision-making process. As an excellent information source, the Internet provides significant opportunities for CI professionals, as well as the problem of information overload. Internet search engines have been widely used to facilitate information search on the Internet. However, many problems hinder their effective use in CI research. In this paper, we introduce the Competitive Intelligence Spider, or CI Spider, designed to address some of the problems associated with using Internet search engines in the context of competitive intelligence. CI Spider performs real-time collection of Web pages from sites specified by the user and applies indexing and categorization analysis to the documents collected, thus providing the user with an up-to-date, comprehensive view of the Web sites of interest. In this paper, we report on the design of the CI Spider system and on a user study of CI Spider, which compares CI Spider with two alternative focused information gathering methods: Lycos search constrained by Internet domain, and manual within-site browsing and searching. Our study indicates that CI Spider achieves higher precision and recall rates than Lycos. CI Spider also outperforms both Lycos and within-site browsing and searching with respect to ease of use. We conclude that strong evidence supports the potentially significant value of applying the CI Spider approach in CI applications.
    • CopLink: Managing Law Enforcement Data And Knowledge

      Chen, Hsinchun; Zeng, Daniel; Atabakhsh, Homa; Wyzga, Wojciech; Schroeder, Jennifer (ACM, 2003-01)
In response to the September 11 terrorist attacks, major government efforts to modernize federal law enforcement authorities’ intelligence collection and processing capabilities have been initiated. At the state and local levels, crime and police report data has been rapidly migrating from paper records to automated records management systems in recent years, making it increasingly accessible. However, despite the increasing availability of data, many challenges continue to hinder effective use of law enforcement data and knowledge, in turn limiting the crime-fighting capabilities of related government agencies. For instance, most local police have database systems used by their own personnel, but lack an efficient means of sharing information with other agencies. More importantly, the tools necessary to retrieve, filter, integrate, and intelligently present relevant information have not yet been sufficiently refined. According to senior Justice Department officials quoted on MSNBC, Sept. 26, 2001, there is “justifiable skepticism about the FBI’s ability to handle massive amounts of information,” and recent anti-terrorism initiatives will create more data overload problems. As part of nationwide, ongoing digital government initiatives, COPLINK is an integrated information and knowledge management environment aimed at meeting some of these challenges.
    • Design and evaluation of a multi-agent collaborative Web mining system

      Chau, Michael; Zeng, Daniel; Chen, Hsinchun; Huang, Michael; Hendriawan, David (Elsevier, 2003-04)
Most existing Web search tools work only with individual users and do not help a user benefit from the previous search experiences of others. In this paper, we present the Collaborative Spider, a multi-agent system designed to provide post-retrieval analysis and enable across-user collaboration in Web search and mining. This system allows the user to annotate search sessions and share them with other users. We also report a user study designed to evaluate the effectiveness of this system. Our experimental findings show that subjects’ search performance was degraded, compared to individual search scenarios in which users had no access to previous searches, when they had access to a limited number (e.g., 1 or 2) of earlier search sessions done by other users. However, search performance improved significantly when subjects had access to more search sessions. This indicates that the gain from collaboration through collaborative Web searching and analysis does not outweigh the overhead of browsing and comprehending other users’ past searches until a certain number of shared sessions has been reached. In this paper, we also catalog and analyze several different types of user collaboration behavior observed in the context of Web mining.
    • MetaSpider: Meta-Searching and Categorization on the Web

      Chen, Hsinchun; Fan, Haiyan; Chau, Michael; Zeng, Daniel (Wiley Periodicals, Inc, 2001)
It has become increasingly difficult to locate relevant information on the Web, even with the help of Web search engines. Two approaches to addressing the low precision and poor presentation of search results of current search tools are studied: meta-search and document categorization. Meta-search engines improve precision by selecting and integrating search results from generic or domain-specific Web search engines or other resources. Document categorization promises better organization and presentation of retrieved results. This article introduces MetaSpider, a meta-search engine that has real-time indexing and categorizing functions. We report in this paper the major components of MetaSpider and discuss related technical approaches. Initial results of a user evaluation study comparing MetaSpider, NorthernLight, and MetaCrawler in terms of clustering performance and of time and effort expended show that MetaSpider performed best in precision rate, but disclose no statistically significant differences in recall rate and time requirements. Our experimental study also reveals that MetaSpider exhibited a higher level of automation than the other two systems and facilitated efficient searching by providing the user with an organized, comprehensive view of the retrieved documents.
    • Testing a Cancer Meta Spider

      Chen, Hsinchun; Fan, Haiyan; Chau, Michael; Zeng, Daniel (Elsevier, 2003)
As in many other applications, the rapid proliferation and unrestricted Web-based publishing of health-related content have made finding pertinent and useful healthcare information increasingly difficult. Although the development of healthcare information retrieval systems such as medical search engines and peer-reviewed medical Web directories has helped alleviate this information and cognitive overload problem, the effectiveness of these systems has been limited by low search precision, poor presentation of search results, and the user search effort required. To address these challenges, we have developed a domain-specific meta-search tool called Cancer Spider. By leveraging post-retrieval document clustering techniques, this system aids users in querying multiple medical data sources to gain an overview of the retrieved documents and in locating high-quality answers to a wide spectrum of health questions. The system presents the retrieved documents to users in two different views: (1) Web pages organized by a list of key phrases, and (2) Web pages clustered into regions discussing different topics on a two-dimensional map (self-organizing map). In this paper, we present the major components of the Cancer Spider system and a user evaluation study designed to assess the effectiveness and efficiency of our approach. Initial results comparing Cancer Spider with NLM Gateway, a premium medical search site, have shown that the two achieved comparable performance as measured by precision, recall, and F-measure. Cancer Spider required less user searching time, fewer documents to browse, and less user effort.