::: Automating the Integration of Heterogeneous Databases :::

Grant amount: $300,000 annually for three years
Research team: Dr. Eduard Hovy, USC/ISI; Dr. Andrew Philpot, USC/ISI; Dr. Jose Luis Ambite, USC/ISI

Quote: "We are excited about the opportunity to bring new techniques to bear on what has been for the EPA (and other government agencies) a thorny problem for many years. If we can help them streamline the data conversion and reporting process--from individual Air Quality Management districts throughout California to Sacramento, and from there (and other states) to the Federal EPA in North Carolina--we will not only ease their burden but also open the doors to, possibly, a network of air quality data streams that reaches worldwide. We find both the technical challenge and the social contribution of this work to be very satisfying."

Eduard Hovy, USC/ISI

Abstract:
AUTOMATING THE INTEGRATION OF HETEROGENEOUS DATABASES
Due to the wide range of geographic scales and complex tasks the Government must administer, its data is split in many different ways and is collected at different times by different agencies. The resulting massive data heterogeneity means one cannot effectively locate, share, or compare data across sources, let alone achieve computational data interoperability.

To date, all approaches to wrap data collections, or even to create mappings across comparable datasets, require manual effort. Despite some promising work, the automated creation of such mappings is still in its infancy, since equivalences and differences manifest themselves at all levels, from individual data values through metadata to the explanatory text surrounding the data collection as a whole. More general methods are required to effectively address this problem.

Viewing the data mapping problem as a variant of the cross-language mapping problem of Machine Translation (MT), we propose to employ the new statistical algorithms developed since 1990 in the MT community to discover correspondences across comparable datasets at all levels. If our automatically learned mappings are effective, we should be able to significantly reduce the amount of manual labor required in database wrapping.
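
As an illustration of the kind of statistical alignment the MT analogy suggests, the sketch below runs an IBM Model 1 style expectation-maximization loop over paired records, treating each record as a "sentence" and each data value as a "word". It is a minimal sketch under stated assumptions, not the project's implementation: the function names, the input format, and the sample air quality values are all hypothetical.

```python
from collections import defaultdict


def learn_value_mappings(row_pairs, iterations=10):
    """Estimate P(target_value | source_value) with IBM Model 1 style EM.

    row_pairs is a list of (source_values, target_values) tuples, where both
    elements are lists of strings describing the same record in two databases.
    This input format is an assumption made for the illustration.
    """
    targets = {t for _, tgt in row_pairs for t in tgt}
    uniform = 1.0 / len(targets)
    t_prob = defaultdict(lambda: uniform)  # start from a uniform table

    for _ in range(iterations):
        count = defaultdict(float)  # expected (source, target) co-alignments
        total = defaultdict(float)  # expected alignments per source value
        for src, tgt in row_pairs:
            for t in tgt:
                # E-step: share one unit of credit for t among the source
                # values of the same record, in proportion to current beliefs.
                norm = sum(t_prob[(s, t)] for s in src)
                for s in src:
                    frac = t_prob[(s, t)] / norm
                    count[(s, t)] += frac
                    total[s] += frac
        # M-step: renormalize the expected counts into probabilities.
        t_prob = defaultdict(
            float, {(s, t): c / total[s] for (s, t), c in count.items()}
        )
    return t_prob


def best_match(t_prob, source_value):
    """Return the target value with the highest learned probability."""
    candidates = {t: p for (s, t), p in t_prob.items() if s == source_value}
    return max(candidates, key=candidates.get)


# Invented sample records: a district-style vocabulary on the left and a
# state-style vocabulary on the right.
pairs = [
    (["ozone", "ppm"], ["O3", "parts_per_million"]),
    (["ozone", "ug/m3"], ["O3", "micrograms_per_m3"]),
    (["pm10", "ug/m3"], ["PM-10", "micrograms_per_m3"]),
]
mappings = learn_value_mappings(pairs)
for source_value in ["ozone", "ppm", "ug/m3", "pm10"]:
    print(source_value, "->", best_match(mappings, source_value))
```

No record in the sample states which value corresponds to which; the correspondences emerge only from co-occurrence across records, which is the property that makes this family of techniques attractive for reducing manual mapping effort.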

To evaluate our work, we will collaborate with research partners at UC San Diego's Supercomputer Center and at Washington University in St. Louis, who are submitting proposals linked with ours. Both of these groups are building networks to support data integration. After converting our learned mappings into the formats used by these groups we will measure the effectiveness of our methods in reducing or eliminating human involvement and speeding up the incorporation of new sources into the networks.

We will work with two sets of domain data. Air quality data will be provided by EPA staff at the California Air Resources Board in Sacramento, who periodically integrate data from some 35 regional Air Quality Management Districts throughout California into a single California-wide database, and pass this along to the Federal EPA in North Carolina. Fire emissions data will be provided by a different set of EPA offices, the USDA/Forest Service, and the Department of Interior.

Intellectual merit of the proposed work: This work will apply emerging statistical techniques for machine translation (MT) to the problem of automating database schema integration. In MT, the techniques align words and word sequences across languages. This research will adapt and extend the techniques to consider not only data values (the analogue of words) but also data format/orthography, metadata information, and associated textual information (metadata descriptions, footnotes, etc.) in the alignment process, and to perform alignment learning at three levels: individual data cell level, set of cells (column) level, and multi-column level. Multi-level alignment has not been attempted in MT before. These powerful learning techniques have never been applied to metadata schema integration and/or database alignment or wrapping.

Broader impacts resulting from the proposed work:

1. In general: To the extent this work succeeds, it has the potential to significantly reduce the amount of human work involved in creating single-point access to multiple heterogeneous databases. This problem is faced by thousands of large enterprises with numerous data collections, from Government agencies at all levels to the chemical and automotive industries to startup companies that link together and integrate websites.
By automatically postulating mappings across databases/metadata, the proposed algorithms will enable the database wrapper builder (whether fully manual or semi-automated) to work more quickly and effectively. This work will also help with the creation of metadata standards.

2. In particular: We will provide our results to our partner agencies in the EPA so that they can transform their data at will. After the first year, working with our partners at the Federal EPA, we will also map appropriate data collections from other US states and from other countries (such as Mexico).
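
Returning to the column-level alignment described under intellectual merit, the sketch below shows one simple way that data values and data format/orthography could both contribute to a proposed column mapping. It deliberately swaps in a plain overlap heuristic (Jaccard similarity) rather than the statistical MT learning the proposal describes, and every function name, column name, weight, and sample value in it is hypothetical.

```python
import re


def format_signature(value):
    """Reduce a value to a coarse shape, e.g. 'LA-001' becomes 'A-9'."""
    shape = re.sub(r"[0-9]", "9", value)      # every digit becomes 9
    shape = re.sub(r"[A-Za-z]", "A", shape)   # every letter becomes A
    return re.sub(r"(.)\1+", r"\1", shape)    # squeeze repeated characters


def jaccard(a, b):
    """Size of the intersection over size of the union of two collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def column_similarity(col_a, col_b, value_weight=0.5):
    """Blend shared-value evidence with shared-format evidence."""
    value_score = jaccard(col_a, col_b)
    format_score = jaccard(
        map(format_signature, col_a), map(format_signature, col_b)
    )
    return value_weight * value_score + (1 - value_weight) * format_score


def propose_column_mappings(table_a, table_b):
    """For each column of table_a, propose the most similar table_b column."""
    return {
        name_a: max(
            table_b, key=lambda name_b: column_similarity(col_a, table_b[name_b])
        )
        for name_a, col_a in table_a.items()
    }


# Invented sample tables standing in for a district database and a
# statewide database with different column names and column order.
district = {
    "site": ["LA-001", "SD-014"],
    "o3_ppm": ["0.071", "0.064"],
    "date": ["2001-07-04", "2001-07-05"],
}
state = {
    "SAMPLE_DATE": ["2001-07-04", "2001-08-11"],
    "STATION_ID": ["LA-001", "SF-220"],
    "OZONE": ["0.058", "0.071"],
}
proposed = propose_column_mappings(district, state)
print(proposed)
```

A mapping proposed this way would still go to a human wrapper builder for confirmation; the aim is to postulate candidates, not to replace review.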

 

This site was built and is maintained by the
Information Sciences Institute at the University of Southern California.