The Database Group @ University of Toronto  The iBench Project
 Project Description

The study of data integration is as old as the field of data management. Yet, given the maturity of this area, it is surprising that rigorous empirical evaluations of research ideas are so scarce. We argue that a stronger focus on empirical work would benefit the integration community as a whole, and we identify one major roadblock for this work: the lack of comprehensive benchmarks, scenario generators, and publicly available implementations of quality measures. This lack makes it difficult to compare integration solutions, to understand their generality, and to understand their performance in different application scenarios. Based on this observation, we discuss the requirements for such benchmarks. We argue that the main abstractions used in reasoning about integration problems have converged substantially over the last decade, and that this convergence enables the application of robust empirical methods to integration problems.

In the iBench project, we are developing a metadata generator for creating arbitrarily large and complex mappings, schemas, and schema constraints. Used together with a data generator, iBench can efficiently generate realistic data integration scenarios of varying size and complexity. It can also be used to create benchmarks for different integration tasks, including (virtual) data integration, data exchange, schema evolution, mapping operators such as composition and inversion, and schema matching.

Some noteworthy features are:

  • Support for generating logical mappings, using the languages of source-to-target tuple-generating dependencies (s-t tgds) and plain second-order (SO) tgds.
  • Support for generating schema constraints, including primary keys (PKs) and random multi-attribute functional dependencies (FDs).
  • Support for generating mappings with flexible value invention (also known as Skolemization in the literature).
  • Support for native and user-defined primitives (e.g., different variants of Vertical Partitioning).
  • Sharing of source and target schema elements among multiple instances of mapping primitives.
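
To make the features above more concrete, the following is a minimal sketch of what a Vertical Partitioning primitive with value invention might look like. It is not iBench code and does not use the iBench API: the function names (skolem_term, vertical_partition_tgd, exchange), the textual tgd syntax, and the relation naming scheme (R_1, R_2) are all assumptions made for illustration. The sketch treats "flexible" value invention as a free choice of which source attributes become arguments of the Skolem function, which is one plausible reading rather than a description of the actual implementation.

```python
from typing import List, Optional, Sequence, Set, Tuple

Row = Tuple[str, ...]


def skolem_term(fn: str, args: Sequence[str]) -> str:
    """Render a Skolem term such as f(a,b) as plain text."""
    return f"{fn}({','.join(args)})"


def vertical_partition_tgd(
    source: str,
    attrs: List[str],
    split_at: int,
    skolem_args: Optional[List[str]] = None,
) -> str:
    """Build a textual mapping that splits `source` into two fragments.

    With skolem_args=None the join key is an existential variable (s-t tgd
    style); otherwise it is a Skolem term over skolem_args (SO tgd style).
    The output syntax is hypothetical, chosen only for readability.
    """
    if not 0 < split_at < len(attrs):
        raise ValueError("split_at must leave both fragments non-empty")
    left, right = attrs[:split_at], attrs[split_at:]
    if skolem_args is None:
        key, quantifier = "k", "exists k. "
    else:
        key, quantifier = skolem_term("f", skolem_args), ""
    body = f"{source}({','.join(attrs)})"
    head = (
        f"{source}_1({','.join([key] + left)}) & "
        f"{source}_2({','.join([key] + right)})"
    )
    return f"{body} -> {quantifier}{head}"


def exchange(
    rows: Sequence[Row], split_at: int, skolem_positions: Sequence[int]
) -> Tuple[Set[Row], Set[Row]]:
    """Apply the Skolemized mapping to source rows.

    The invented key is the Skolem term over the chosen positions, so rows
    that agree on those positions share a key in the target fragments.
    """
    left_frag: Set[Row] = set()
    right_frag: Set[Row] = set()
    for row in rows:
        key = skolem_term("f", [row[i] for i in skolem_positions])
        left_frag.add((key,) + tuple(row[:split_at]))
        right_frag.add((key,) + tuple(row[split_at:]))
    return left_frag, right_frag


if __name__ == "__main__":
    attrs = ["name", "dept", "city"]
    print(vertical_partition_tgd("Emp", attrs, 1))
    print(vertical_partition_tgd("Emp", attrs, 1, skolem_args=["dept", "city"]))
    rows = [("ann", "cs", "toronto"), ("bob", "cs", "toronto")]
    print(exchange(rows, 1, skolem_positions=[1, 2]))
```

In this sketch, choosing all source attributes as Skolem arguments gives every distinct source row its own invented key, while choosing only dept and city makes the two example rows share one key and collapse into a single tuple in the second fragment. Varying that choice is the kind of knob a scenario generator could expose to produce mappings of different difficulty.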

Please also see our related project, BART, on generating dirty data to evaluate data cleaning algorithms.

  • Project Members
    Patricia Arocena (Researcher, University of Toronto)
    Boris Glavic (Assistant Professor, Illinois Institute of Technology)
    Radu Ciucanu (Research Assistant, University of Oxford)
    Renée J. Miller (Professor, University of Toronto)
  • Past Collaborators
    Mariana D'Angelo - Summer 2012 (Undergraduate Student, University of Toronto)

We have released a first prototype implementation of iBench (available as a git repository).

iBench was first used to conduct a large-scale empirical evaluation during 2012/2013.

 Real Metadata
  We maintain a public repository with example configuration files and real integration scenarios. Additions to this repository from the community are highly encouraged.
Copyright © 2015 Data Curation Lab @ University of Toronto | Last Updated August 1, 2015