::: Automating the Integration of Heterogeneous Databases :::
Grant amount: $300,000 annually for three years
Research team: Dr. Eduard Hovy, USC/ISI; Dr. Andrew Philpot, USC/ISI; Dr. Jose Luis Ambite, USC/ISI
Quote: "We are excited about the opportunity to bring new techniques to bear on what has been for the EPA (and other government agencies) a thorny problem for many years. If we can help them streamline the data conversion and reporting process--from individual Air Quality Management districts throughout California to Sacramento, and from there (and other states) to the Federal EPA in North Carolina--we will not only ease their burden but also open the doors to, possibly, a network of air quality data streams that reaches worldwide. We find both the technical challenge and the social contribution of this work to be very satisfying."
Abstract:
Due to the wide range of geographic scales and complex tasks the Government must administer, its
data is split in many different ways and is collected at different times by different agencies. The
resulting massive data heterogeneity means one cannot effectively locate, share, or compare data
across sources, let alone achieve computational data interoperability.
To date, all approaches to wrapping data collections, or even to creating mappings across comparable
datasets, require manual effort. Despite some promising work, the automated creation of such
mappings is still in its infancy, since equivalences and differences manifest themselves at all levels,
from individual data values through metadata to the explanatory text surrounding the data collection as
a whole. More general methods are required to effectively address this problem.
Viewing the data mapping problem as a variant of the cross-language mapping problem of Machine
Translation (MT), we propose to employ the new statistical algorithms developed since 1990 in the
MT community to discover correspondences across comparable datasets at all levels. If our
automatically learned mappings are effective, we should be able to significantly reduce the amount of
manual labor required in database wrapping.
To evaluate our work, we will collaborate with research partners at UC San Diego's Supercomputer
Center and at Washington University in St. Louis, who are submitting proposals linked with ours.
Both of these groups are building networks to support data integration. After converting our learned
mappings into the formats used by these groups, we will measure the effectiveness of our methods in
reducing or eliminating human involvement and speeding up the incorporation of new sources into the
networks.
We will work with two sets of domain data. Air quality data will be provided by EPA staff at the
California Air Resources Board in Sacramento, who periodically integrate data from some 35 regional
Air Quality Management Districts throughout California into a single California-wide database, and
pass this along to the Federal EPA in North Carolina. Fire emissions data will be provided by a
different set of EPA offices, the USDA/Forest Service, and the Department of Interior.
Intellectual merit of the proposed work: This work will apply emerging statistical techniques for
machine translation (MT) to the problem of automating database schema integration. In MT, the
techniques align words and word sequences across languages. This research will adapt and extend the
techniques to consider not only data values (the analogue of words) but also data format/orthography,
metadata information, and associated textual information (metadata descriptions, footnotes, etc.) in the
alignment process, and to perform alignment learning at three levels: individual data cell level, set of
cells (column) level, and multi-column level. Multi-level alignment has not been attempted in MT
before. These powerful learning techniques have never been applied to metadata schema integration
and/or database alignment or wrapping.
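As an illustration of the cell-level analogy only, the following minimal sketch applies IBM Model 1, a well-known statistical alignment model from the MT literature, to toy "parallel" records from two databases. It is not the project's algorithm, which the abstract does not specify. The record contents, the function names train_ibm_model1 and best_match, and the assumption that paired records describe the same observation are all hypothetical.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=20):
    """Estimate t(target_value | source_value) with EM, in the style of
    IBM Model 1 (the NULL source token is omitted for brevity)."""
    src_vocab = {s for src, _ in pairs for s in src}
    tgt_vocab = {t for _, tgt in pairs for t in tgt}
    # Start from a uniform distribution over target values.
    t = {(tg, s): 1.0 / len(tgt_vocab) for s in src_vocab for tg in tgt_vocab}

    for _ in range(iterations):
        count = defaultdict(float)  # expected co-alignment counts
        total = defaultdict(float)  # expected counts per source value
        # E-step: distribute each target value over the source values
        # of the same record pair, in proportion to the current t.
        for src, tgt in pairs:
            for tg in tgt:
                norm = sum(t[(tg, s)] for s in src)
                for s in src:
                    delta = t[(tg, s)] / norm
                    count[(tg, s)] += delta
                    total[s] += delta
        # M-step: renormalize the expected counts into probabilities.
        for (tg, s) in t:
            t[(tg, s)] = count[(tg, s)] / total[s] if total[s] else 0.0
    return t


def best_match(t, src_value):
    """Return the target value with the highest learned probability."""
    candidates = {tg: p for (tg, s), p in t.items() if s == src_value}
    return max(candidates, key=candidates.get)


# Hypothetical toy data: each pair holds the cell values of one observation
# as recorded by a district database (left) and a state database (right).
pairs = [
    (["O3", "Sacramento"], ["ozone", "SAC"]),
    (["O3", "Fresno"], ["ozone", "FRS"]),
    (["PM10", "Fresno"], ["particulate", "FRS"]),
]

table = train_ibm_model1(pairs)
for value in ["O3", "PM10", "Sacramento", "Fresno"]:
    print(value, "->", best_match(table, value))
```

The sketch covers only the individual data cell level. Aligning at the column and multi-column levels, or using metadata and associated text, would need additional evidence that this toy model does not represent.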
Broader impacts resulting from the proposed work: 1. In general: To the extent this work succeeds,
it has the potential to significantly reduce the amount of human work involved in creating single-point
access to multiple heterogeneous databases. This problem is faced by thousands of large enterprises
with numerous data collections, from Government agencies at all levels to the chemical and
automotive industries to startup companies that link together and integrate websites. By automatically
postulating mappings across databases/metadata, the proposed algorithms will enable the database
wrapper builder (whether fully manual or semi-automated) to work more quickly and effectively. It
will also help with the creation of metadata standards.
2. In particular: We will provide our results to our partner agencies in the EPA so that they can
transform their data at will. After the first year, we will also work with our partners at the Federal
EPA on mapping appropriate data collections from other US states and countries (such as Mexico).