The Database Group @ University of Toronto  Data Quality
 Project Description
  Identifying Data Quality Rules
  Dirty data is a serious problem for businesses: it leads to incorrect decision making and inefficient daily operations, and ultimately wastes both time and money. Dirty data often arises when domain constraints and business rules, meant to preserve data consistency and accuracy, are enforced incompletely or not at all in application code. In this work, we investigate the problem of discovering contextual data quality rules, in the form of conditional functional dependencies (CFDs), over a given data instance. Our data-driven tool returns a set of CFDs that hold over the data, and a list of non-conformant records that indicate potentially dirty data.
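  To make the notion of a CFD and its non-conformant records concrete, the following is a minimal sketch and not the project's tool. It only checks one given CFD against a small data instance; it does not discover rules. The function names, the "_" wildcard convention, and the sample records are all assumptions made for illustration. A CFD is modeled here as a functional dependency (lhs -> rhs) plus a pattern that restricts the rule to records matching the pattern's constants.

```python
from collections import defaultdict

WILDCARD = "_"  # assumed convention: "_" in a pattern matches any value


def matches(record, attrs, pattern):
    """True if the record agrees with every constant the pattern gives for attrs."""
    return all(pattern[a] == WILDCARD or record[a] == pattern[a] for a in attrs)


def cfd_violations(records, lhs, rhs, pattern):
    """Return sorted indices of records that violate the CFD (lhs -> rhs, pattern)."""
    violating = set()
    groups = defaultdict(list)
    for i, rec in enumerate(records):
        if not matches(rec, lhs, pattern):
            continue  # the rule does not apply to this record
        if pattern[rhs] != WILDCARD and rec[rhs] != pattern[rhs]:
            violating.add(i)  # single record contradicts a constant on the right-hand side
        groups[tuple(rec[a] for a in lhs)].append(i)
    for members in groups.values():
        if len({records[i][rhs] for i in members}) > 1:
            violating.update(members)  # records agree on lhs but disagree on rhs
    return sorted(violating)


# Hypothetical data: within the UK, the zip code is assumed to determine the city.
records = [
    {"country": "UK", "zip": "EH4", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4", "city": "London"},
    {"country": "UK", "zip": "W1B", "city": "London"},
    {"country": "US", "zip": "EH4", "city": "Springfield"},
]
pattern = {"country": "UK", "zip": "_", "city": "_"}
print(cfd_violations(records, ["country", "zip"], "city", pattern))  # [0, 1]
```

  The first two records share the same UK zip code but name different cities, so both are reported as potentially dirty. The US record is ignored because the pattern limits the rule to UK records.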
  Repairing Inconsistent Constraints
  Data repair techniques propose changes to the data so that it is consistent with the integrity constraints. However, over time, integrity constraints may become outdated (with respect to the data) as the data semantics evolve, business rules change, or as data is integrated with new sources. In this work, we investigate and quantify when a data repair is preferable to a constraint repair, and vice versa, and we recommend a list of possible repairs to resolve the inconsistency.
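  As a toy illustration of the two kinds of repair, and not the cost model used in this work, the sketch below contrasts them for a single violated functional dependency. The data repair cost is assumed to be the number of right-hand-side values that must be overwritten (keeping the majority value in each group), and a constraint repair is assumed to be the addition of one attribute to the left-hand side. All names and the sample records are hypothetical.

```python
from collections import Counter, defaultdict


def data_repair_cost(records, lhs, rhs):
    """Count the rhs cells to overwrite so that lhs -> rhs holds,
    keeping the most frequent rhs value within each lhs group."""
    groups = defaultdict(Counter)
    for rec in records:
        groups[tuple(rec[a] for a in lhs)][rec[rhs]] += 1
    return sum(sum(c.values()) - max(c.values()) for c in groups.values())


def constraint_repairs(records, lhs, rhs):
    """List attributes whose addition to lhs makes the dependency hold
    without changing any data."""
    candidates = [a for a in records[0] if a not in lhs and a != rhs]
    return [a for a in candidates if data_repair_cost(records, lhs + [a], rhs) == 0]


# Hypothetical data: dept -> manager was valid until a second site opened.
records = [
    {"dept": "Sales", "site": "Toronto", "manager": "Ann"},
    {"dept": "Sales", "site": "Toronto", "manager": "Ann"},
    {"dept": "Sales", "site": "Ottawa", "manager": "Bob"},
    {"dept": "Sales", "site": "Ottawa", "manager": "Bob"},
]
print(data_repair_cost(records, ["dept"], "manager"))    # 2
print(constraint_repairs(records, ["dept"], "manager"))  # ['site']
```

  Here a data repair would overwrite half of the manager values, while refining the constraint to (dept, site) -> manager resolves the inconsistency with no data changes, which suggests the constraint rather than the data is outdated.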
  Continuous Data Cleaning
  In this work, we consider resolving inconsistencies by learning, over time, a model that can recommend a hybrid of both data repairs and constraint repairs.
  Combining Logical and Quantitative Data Cleaning
  Quantitative data cleaning relies on statistical methods to identify and repair data quality problems, while logical data cleaning tackles the same problems using various forms of logical reasoning over declarative dependencies. Each of these approaches has its strengths: the logical approach is able to capture subtle data quality problems using sophisticated dependencies, while the quantitative approach excels at ensuring that the repaired data has the desired statistical properties. We propose a novel framework within which these two approaches can be used together to combine their respective strengths.
  Leader | Collaborators | Students
  • Nataliya Prokoshyna
  Please contact Fei Chiang to obtain code executables.
Copyright © 2011 The Database Group @ University of Toronto | Last Updated Sep, 2015