The Illinois Digital Library Initiative Project:
Federating Repositories and Semantic Research

Hsinchun Chen
Professor, Management Information Systems
Director, Artificial Intelligence Lab
Management Information Systems Department, University of Arizona
Tucson, Arizona 85721, email: hchen@bpa.arizona.edu, http://ai.bpa.arizona.edu

Introduction

In this era of the Internet and distributed, multimedia computing, new and emerging classes of information systems applications have swept into the lives of office workers and people in general. From digital libraries, multimedia systems, geographic information systems, and collaborative computing to electronic commerce, virtual reality, and electronic video arts and games, these applications have created tremendous opportunities for information and computer science researchers and practitioners.

As applications become more pervasive, pressing, and diverse, several well-known information retrieval (IR) problems have become even more urgent. Information overload, a result of the ease of information creation and transmission via the Internet and the WWW, has become more troublesome (e.g., even stockbrokers and elementary school students, heavily exposed to various WWW search engines, are versed in such IR terminology as recall and precision). Significant variations in database formats and structures, the richness of information media (text, audio, and video), and an abundance of multilingual information content have also created severe information interoperability problems: structural interoperability, media interoperability, and multilingual interoperability.

Federal Initiatives: Digital Libraries and Others

The Information Infrastructure Technology and Applications (IITA) Working Group, the highest level of the country's National Information Infrastructure (NII) technical committee, held an invitational workshop in May 1995 to define a research agenda for digital libraries. (See http://Walrus.Stanford.EDU/diglib/pub/reports/iita-dlw/main.html)

The participants’ shared vision is an entire Net of distributed repositories, where objects of any type can be searched within and across different indexed collections [11]. In the short term, technologies must be developed to search across these repositories transparently, handling any variations in protocols and formats (i.e., addressing structural interoperability [8]). In the long term, technologies also must be developed to handle variations in content and meanings transparently. These requirements are steps along the way toward matching the concepts being explored by users with objects indexed in collections [10].

The ultimate goal, as described in the IITA report, is the Grand Challenge of Digital Libraries:

deep semantic interoperability - the ability of a user to access, consistently and coherently, similar (though autonomously defined and managed) classes of digital objects and services, distributed across heterogeneous repositories, with federating or mediating software compensating for site-by-site variations...Achieving this will require breakthroughs in description as well as retrieval, object interchange and object retrieval protocols. Issues here include the definition and use of metadata and its capture or computation from objects (both textual and multimedia), the use of computed descriptions of objects, federation and integration of heterogeneous repositories with disparate semantics, clustering and automatic hierarchical organization of information, and algorithms for automatic rating, ranking, and evaluation of information quality, genre, and other properties.

Attention to semantic interoperability has prompted several NSF/DARPA/NASA funded large-scale digital library initiative (DLI) projects to explore various artificial intelligence, statistical, and pattern recognition techniques, e.g., concept spaces and category maps in the Illinois project [12], text tiling and word sense disambiguation in the Berkeley project [14], voice recognition in the CMU project [13], and image segmentation and clustering in the UCSB project [6].

The ubiquity of online information as perceived by US leaders (e.g., "Information President" Clinton and "Information Vice President" Gore) as well as the general public, together with recognition of the importance of turning information into knowledge, has continued to push information and computer science researchers toward developing scalable artificial intelligence techniques for other emerging information systems applications.

In the Santa Fe Workshop on Distributed Knowledge Work Environments: Digital Libraries, held in March, 1997, a panel of digital library researchers and practitioners suggested three areas of research for the planned Digital Library Initiative-2 (DLI-2): system-centered issues, collection-centered issues, and user-centered issues. Scalability, interoperability, adaptability and durability, and support for collaboration are the four key research directions under system-centered issues. System interoperability, syntactic (structural) interoperability, linguistic interoperability, temporal interoperability, and semantic interoperability are recognized by leading researchers as the most challenging and rewarding research areas. (See http://www.si.umich.edu/SantaFe/)

In a new NSF Knowledge Networking (KN) initiative, a group of domain scientists and information systems researchers was invited to a Workshop on Distributed Heterogeneous Knowledge Networks at Boulder, Colorado, in May, 1997. Scalable techniques to improve semantic bandwidth and knowledge bandwidth are considered among the priority research areas, as described in the KN report (see http://www.scd.ucar.edu/info/KDI/):

The Knowledge Networking (KN) initiative focuses on the integration of knowledge from different sources and domains across space and time. Modern computing and communications systems provide the infrastructure to send bits anywhere, anytime, in mass quantities - radical connectivity. But connectivity alone cannot assure (1) useful communication across disciplines, languages, cultures; (2) appropriate processing and integration of knowledge from different sources, domains, and non-text media; (3) efficacious activity and arrangements for teams, organizations, classrooms, or communities, working together over distance and time; or (4) deepening understanding of the ethical, legal, and social implications of new developments in connectivity, but not interactivity and integration. KN research aims to move beyond connectivity to achieve new levels of interactivity, increasing the semantic bandwidth, knowledge bandwidth, activity bandwidth, and cultural bandwidth among people, organizations, and communities.

The Illinois DLI Project: Federating Repositories of Scientific Literature

The Illinois DLI Project, one of six projects funded by the NSF/DARPA/NASA DLI, consists of two major components: (1) a production testbed based in a real library (SGML publisher stream deployed at the University of Illinois at Urbana-Champaign, UIUC) and (2) fundamental technology research for semantic interoperability (semantic indexes across subjects and media developed at the University of Arizona). We summarize our testbed effort in this section. Readers can find more details in [12].

The Illinois DLI production testbed was developed in the Grainger Engineering Library at UIUC. It supports full SGML federated structure search on an experimental Web-based interface. The initial rollout was available at the UIUC campus in October 1997 and has been integrated with the library information services. The testbed consists of materials from 5 publishers, 55 engineering journals, and 40,000 full-text articles. The primary partners of the project include: American Institute of Physics, American Physical Society, American Astronomical Society, American Society of Civil Engineers, American Society of Mechanical Engineers, American Society of Agricultural Engineers, American Institute of Aeronautics and Astronautics, Institute of Electrical and Electronics Engineers, Institution of Electrical Engineers, and IEEE Computer Society. The testbed was implemented using SoftQuad (SGML rendering) and OpenText (full-text search), both commercial software packages.

The production testbed has been evaluated since October 1997. Six hundred UIUC user subjects enrolled in introductory computer science classes have used the system, and their feedback has been collected and analyzed. We expect to have collected usage data for about 1500 subjects by the end of the project. Usage data consist of session observations and transaction logs. The Illinois DLI project developers and evaluators have worked together very closely on needs assessment and usability studies.

After 4 years of research effort, we believe the testbed successes include:

However, we also have experienced many testbed difficulties:

As the project comes to its end, several future directions are being explored:

Semantic Issues for Digital Libraries

In addition to testbed development effort, significant research has been conducted at all six DLI projects in the area of semantic retrieval and analysis for digital libraries. Among the semantic indexing and analysis techniques that are considered scalable and domain independent, the following classes of algorithms and methods have been examined and subjected to experimentation in various digital library, multimedia database, and information science applications:

The most fundamental techniques in IR involve identifying key features in objects. For example, automatic indexing and natural language processing (e.g., noun phrase extraction or object type tagging) are frequently used to automatically extract meaningful keywords or phrases from texts [9]. Texture, color, or shape-based indexing and segmentation techniques are often used to identify images [6]. For audio and video applications, voice recognition, speech recognition, and scene segmentation techniques can be used to identify meaningful descriptors in audio or video streams [13].
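As a rough illustration of automatic indexing for text, the following minimal sketch extracts the most frequent non-stop-word terms from an abstract. The stop-word list, the function name `index_terms`, and the sample abstract are all assumptions made for this example; production indexers use far larger stop lists, stemming, and phrase detection.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}

def index_terms(text, top_k=5):
    """Return the top_k most frequent non-stop-word terms as (term, count) pairs."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(top_k)

# Made-up abstract used only to exercise the function.
abstract = ("Digital libraries support retrieval of digital documents. "
            "Retrieval in digital libraries is a research problem.")
terms = index_terms(abstract, top_k=3)
print(terms)
```

Single-word frequency counting is only the simplest form of indexing; the noun phrase extraction mentioned above would additionally group adjacent words into candidate phrases.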

Several classes of techniques have been used for semantic analysis of texts or multimedia objects. Symbolic machine learning (e.g., ID3, version space), graph-based clustering and classification (e.g., Ward's hierarchical clustering), statistics-based multivariate analyses (e.g., latent semantic indexing, multi-dimensional scaling, regressions), artificial neural network-based computing (e.g., backpropagation networks, Kohonen self-organizing maps), and evolution-based programming (e.g., genetic algorithms) are among the popular techniques [1]. In this information age, we believe such techniques will serve as good alternatives for processing, analyzing, and summarizing large amounts of diverse and rapidly changing multimedia information.
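To make one of these techniques concrete, here is a minimal sketch of a Kohonen self-organizing map trained on toy document vectors. It is a generic, one-dimensional textbook variant, not the algorithm used by any of the DLI projects; the function names, learning-rate schedule, neighborhood radius, and sample vectors are all assumptions chosen for brevity.

```python
import math
import random

def best_matching_unit(nodes, vector):
    """Index of the node whose weights are closest (squared Euclidean) to vector."""
    return min(
        range(len(nodes)),
        key=lambda i: sum((w - x) ** 2 for w, x in zip(nodes[i], vector)),
    )

def train_som(vectors, n_nodes=4, epochs=50, lr0=0.5, radius0=1.5, seed=0):
    """Train a one-dimensional SOM and return the list of node weight vectors."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # Random initial weights in [0, 1); other initializations are possible.
    nodes = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs           # decays linearly from 1 toward 0
        lr = lr0 * frac                       # shrinking learning rate
        radius = max(radius0 * frac, 0.5)     # shrinking neighborhood radius
        for vector in vectors:
            bmu = best_matching_unit(nodes, vector)
            for i, node in enumerate(nodes):
                # Gaussian neighborhood: nodes near the winner move the most.
                influence = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    node[d] += lr * influence * (vector[d] - node[d])
    return nodes

# Toy term-weight vectors for four documents (made-up values).
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
som = train_som(docs)
placement = [best_matching_unit(som, d) for d in docs]
print(placement)
```

After training, documents with similar vectors typically land on the same or neighboring nodes, which is the property that makes the map usable for clustering and browsing.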

The results from a semantic analysis process could be represented in the form of semantic networks, decision rules, or predicate logic. Many researchers have attempted to integrate such results with existing human-created knowledge structures such as ontologies, subject headings, or thesauri [7]. Spreading activation based inferencing methods are often used to traverse various large-scale knowledge structures [3].
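The idea of spreading activation can be sketched as follows. This is a simplified, generic version in which activation decays at each hop and a term keeps the strongest activation it receives; it is not the branch-and-bound or Hopfield net procedure compared in [3]. The function name, the decay factor, and the small term network with its link weights are assumptions for illustration only.

```python
def spread_activation(graph, seeds, decay=0.5, steps=2):
    """Propagate activation from seed terms through weighted links.

    graph maps a term to {neighbor: link_weight}; each hop multiplies the
    activation by the link weight and by the decay factor.
    """
    activation = {term: 1.0 for term in seeds}
    frontier = dict(activation)
    for _ in range(steps):
        next_frontier = {}
        for term, level in frontier.items():
            for neighbor, weight in graph.get(term, {}).items():
                gain = level * weight * decay
                # Keep only the strongest activation reaching each term.
                if gain > activation.get(neighbor, 0.0):
                    activation[neighbor] = gain
                    next_frontier[neighbor] = gain
        frontier = next_frontier
    return activation

# Hypothetical term network with made-up link weights.
network = {
    "information retrieval": {"indexing": 0.8, "digital library": 0.6},
    "indexing": {"thesaurus": 0.5},
    "digital library": {"thesaurus": 0.9},
}
result = spread_activation(network, ["information retrieval"])
for term, level in sorted(result.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {level:.3f}")
```

Terms reached through several strong links end up with higher activation, so sorting by activation gives a ranked list of related terms to suggest to the user.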

One of the major trends in almost all emerging information systems applications is the focus on user-friendly, graphical, and seamless HCI. The Web-based browsers for texts, images, and videos have raised user expectations of rendering and manipulation of information. Recent advances in development languages and platforms such as Java, OpenGL, and VRML and the availability of advanced graphical workstations at affordable prices have also made information visualization a promising area for research [5]. Several of the digital library research teams including Arizona/Illinois, Xerox PARC, Berkeley, and Stanford, are pushing the boundary of visualization techniques for dynamic displays of large-scale information collections.

The Illinois DLI Project: Semantic Research

The Illinois DLI project, through partnership with the Artificial Intelligence Lab at the University of Arizona, has conducted research in the semantic retrieval and analysis areas. In particular, natural language processing, statistical analysis, neural network clustering, and information visualization techniques have been developed for different subject domains and media types.

Key results from these semantic research components include:

Statistical and neural network clustering proves useful and feasible, both for interactive use and for large-scale collections. Specifically, the AI Lab has developed a noun phrasing technique for concept extraction, a concept space technique for building an automatic thesaurus, and a self-organizing map (SOM) algorithm for building category maps. More details are provided below.

Two large-scale semantic indexing simulations were performed in 1996 and 1997, respectively. We analyzed 400,000 Inspec abstracts and 4,000,000 Compendex abstracts to generate about 1,000 engineering-specific concept spaces (automatic thesauri) using the NCSA supercomputers (Convex Exemplar and SGI Origin 2000). Results of such computations could be used for semantic retrieval and vocabulary switching across domains [4].
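To give a feel for how an automatic thesaurus can be derived from a collection, the sketch below estimates term associations from document-level co-occurrence. It assumes a deliberately simplified asymmetric weight (the fraction of documents containing term a that also contain term b); this is an illustration only and not the weighting scheme used to build the concept spaces reported in [4]. The function name and the sample documents are hypothetical.

```python
from collections import defaultdict
from itertools import permutations

def cooccurrence_weights(docs):
    """Map (a, b) to the share of documents containing a that also contain b."""
    doc_freq = defaultdict(int)    # number of documents containing each term
    pair_freq = defaultdict(int)   # number of documents containing both terms
    for terms in docs:
        unique = set(terms)
        for term in unique:
            doc_freq[term] += 1
        for a, b in permutations(unique, 2):
            pair_freq[(a, b)] += 1
    return {(a, b): n / doc_freq[a] for (a, b), n in pair_freq.items()}

# Made-up documents, each reduced to its list of index terms.
sample_docs = [
    ["information retrieval", "indexing"],
    ["information retrieval", "thesaurus"],
    ["information retrieval", "indexing", "clustering"],
    ["clustering"],
]
weights = cooccurrence_weights(sample_docs)
print(weights[("indexing", "information retrieval")])
print(weights[("information retrieval", "indexing")])
```

The weight is asymmetric on purpose: a narrow term such as "indexing" may always co-occur with "information retrieval", while the broader term co-occurs with many other terms as well.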

In this section we present an example of selected semantic retrieval and analysis techniques developed by the University of Arizona Artificial Intelligence Lab (AI Lab) for the Illinois DLI project. For detailed technical discussion, readers are referred to [2] and [4].

A textual semantic analysis pyramid was developed by the University of Arizona AI Lab to assist in semantic indexing, analysis, and visualization of textual documents. The pyramid, as depicted in Figure 1, consists of four layers of techniques, from bottom to top: noun phrasing indexing, concept association, automatic categorization, and advanced visualization.


Figure 1: A textual semantic analysis pyramid


Figure 2: Tagged noun phrases


Figure 3: Associated terms for "information retrieval"


Figure 4: Category map


Figure 5: VRML interface for category map

Discussions

The techniques discussed above were developed in the context of the Illinois DLI project, especially for the engineering domain. The techniques appear scalable and promising. We are currently in the process of fine-tuning these techniques for collections of different sizes and domains. Significant semantic research effort has been funded continuously by a multi-year DARPA project (1997-2000).

Acknowledgment

This project was funded primarily by:

Bibliography

1 H. Chen.
Machine learning for information retrieval: neural networks, symbolic learning, and genetic algorithms.
Journal of the American Society for Information Science, 46(3):194-216, April 1995.
 
2 H. Chen, A. L. Houston, R. R. Sewell, and B. R. Schatz.
Internet browsing and searching: User evaluations of category map and concept space techniques.
Journal of the American Society for Information Science, 49(7):582-603, May 1998.
 
3 H. Chen and D. T. Ng.
An algorithmic approach to concept exploration in a large knowledge network (automatic thesaurus consultation): symbolic branch-and-bound vs. connectionist Hopfield net activation.
Journal of the American Society for Information Science, 46(5):348-369, June 1995.
 
4 H. Chen, B. R. Schatz, T. D. Ng, J. P. Martinez, A. J. Kirchhoff, and C. Lin.
A parallel computing approach to creating engineering concept spaces for semantic retrieval: The Illinois Digital Library Initiative Project.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8):771-782, August 1996.
 
5 T. DeFanti and M. Brown.
Visualization: expanding scientific and engineering research opportunities.
IEEE Computer Society Press, NY, NY, 1990.
 
6 B. S. Manjunath and W. Y. Ma.
Texture features for browsing and retrieval of image data.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8):837-841, August 1996.
 
7 A. T. McCray and W. T. Hole.
The scope and structure of the first version of the UMLS semantic network.
In Proceedings of the Fourteenth Annual Symposium on Computer Applications in Medical Care, pages 126-130, Los Alamitos, CA, November 4-7 1990. Institute of Electrical and Electronics Engineers.
 
8 A. Paepcke, S. B. Cousins, H. Garcia-Molina, S. W. Hassan, S. P. Ketchpel, M. Roscheisen, and T. Winograd.
Using distributed objects for digital library interoperability.
IEEE COMPUTER, 29(5):61-69, May 1996.
 
9 G. Salton.
Automatic Text Processing.
Addison-Wesley Publishing Company, Inc., Reading, MA, 1989.
 
10 B. R. Schatz.
Information retrieval in digital libraries: bring search to the net.
Science, 275:327-334, January 17 1997.
 
11 B. R. Schatz and H. Chen.
Building large-scale digital libraries.
IEEE COMPUTER, 29(5):22-27, May 1996.
 
12 B. R. Schatz, B. Mischo, T. Cole, J. Hardin, A. Bishop, and H. Chen.
Federating repositories of scientific literature.
IEEE COMPUTER, 29(5):28-36, May 1996.
 
13 H. D. Wactlar, T. Kanade, M. A. Smith, and S. M. Stevens.
Intelligent access to digital video: Informedia project.
IEEE COMPUTER, 29(5):46-53, May 1996.
 
14 R. Wilensky.
Toward work-centered digital information services.
IEEE COMPUTER, 29(5):37-45, May 1996.
