Combining schema and instance information for integrating heterogeneous databases: An analytical approach and empirical evaluation
MetadataShow full item record
PublisherThe University of Arizona.
RightsCopyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
AbstractCritical to semantic integration of heterogeneous data sources, determining the semantic correspondences among the data sources is a very complex and resource-consuming task and demands automated support. In this dissertation, we propose a comprehensive approach to detecting both schema-level and instance-level semantic correspondences from heterogeneous data sources. Semantic correspondences on the two levels are identified alternately and incrementally in an iterative procedure. Statistical cluster analysis methods and the Self-Organizing Map (SOM) neural network method are used first to identify similar schema elements (i.e., relations and attributes). Based on the identified schema-level correspondences, classification techniques drawn from statistical pattern recognition, machine learning, and artificial neural networks are then used to identify matching tuples. Multiple classifiers are combined in various ways, such as bagging, boosting, concatenating, and stacking, to improve classification accuracy. Statistical analysis techniques, such as correlation and regression, are then applied to a preliminary integrated data set to evaluate the relationships among schema elements more accurately. Improved schema-level correspondences are fed back into the identification of instance-level correspondences, resulting in a loop in the overall procedure. Empirical evaluation using real-world and simulated data that has been performed is described to demonstrate the utility of the proposed multi-level, multi-technique approach to detecting semantic correspondences from heterogeneous data sources.
Degree ProgramGraduate College