| Project Description |
|
| Identifying Data Quality Rules |
|
|
Dirty data is a serious problem for businesses, leading to
incorrect decision making, inefficient daily operations, and
ultimately wasted time and money. Dirty data often
arises when domain constraints and business rules, meant
to preserve data consistency and accuracy, are enforced incompletely or not at all in application code. In this work, we investigate
the problem of discovering contextual data quality rules, namely conditional
functional dependencies (CFDs), over a given data instance. Our data-driven tool
returns a set of CFDs that hold over the data, and a list of non-conformant records that
are potentially dirty.
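
To make the idea of a CFD and its non-conformant records concrete, the following is a minimal sketch of checking one candidate rule against a small data instance. It is not the project's discovery tool: the function name `cfd_violations`, the `"_"` wildcard convention, and the sample records are all assumptions made for illustration.

```python
from collections import defaultdict

WILDCARD = "_"  # assumed convention: "_" matches any value


def cfd_violations(records, lhs_pattern, rhs_attr, rhs_value=WILDCARD):
    """Return the indices of records that violate one CFD.

    The CFD reads: for records matching the constants in lhs_pattern,
    the left-hand-side attributes determine rhs_attr (and rhs_attr must
    equal rhs_value when a constant is given).
    """
    def matches(rec):
        return all(v == WILDCARD or rec[a] == v for a, v in lhs_pattern.items())

    # Group the matching records by their left-hand-side values.
    groups = defaultdict(list)
    for i, rec in enumerate(records):
        if matches(rec):
            groups[tuple(rec[a] for a in lhs_pattern)].append(i)

    violating = []
    for idxs in groups.values():
        if rhs_value != WILDCARD:
            # Constant right-hand side: each record must carry that constant.
            violating.extend(i for i in idxs if records[i][rhs_attr] != rhs_value)
        elif len({records[i][rhs_attr] for i in idxs}) > 1:
            # Variable right-hand side: the group must agree on one value.
            violating.extend(idxs)
    return sorted(violating)


# Hypothetical sample data.
records = [
    {"country": "UK", "zip": "EH4", "street": "Mayfield"},
    {"country": "UK", "zip": "EH4", "street": "Crichton"},
    {"country": "US", "zip": "07974", "street": "Mtn Ave"},
    {"country": "US", "zip": "07974", "street": "Main St"},
]

# Rule: within the UK only, zip determines street.
dirty = cfd_violations(records, {"country": "UK", "zip": "_"}, "street")
```

Here the two UK records share a zip but disagree on the street, so both are flagged, while the US records fall outside the rule's context and are ignored. That conditioning on a context is what separates a CFD from an ordinary functional dependency.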
| |
| Repairing Inconsistent Constraints |
|
|
Data repair techniques propose solutions to ensure that the integrity constraints and the data are consistent. However, over time,
integrity constraints may become outdated (with respect to the data) as the data semantics evolve, business rules change,
or as data is integrated with new sources. In this work, we investigate and quantify when a data repair
is preferable to a constraint repair (and vice versa), and we recommend a list of possible repairs to resolve the inconsistency.
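
The trade-off between the two kinds of repair can be illustrated with a small sketch for a plain functional dependency. This is only an assumed, simplified heuristic and not the project's cost model: the data repair cost counts the values that would have to change, and the only constraint repair considered is adding one attribute to the left-hand side. The names `data_repair_cost` and `suggest_repairs` and the sample records are hypothetical.

```python
from collections import Counter, defaultdict


def data_repair_cost(records, lhs, rhs):
    """Minimum number of rhs values to change so that lhs -> rhs holds."""
    groups = defaultdict(Counter)
    for rec in records:
        groups[tuple(rec[a] for a in lhs)][rec[rhs]] += 1
    # In each group, keep the most frequent rhs value and change the rest.
    return sum(sum(c.values()) - max(c.values()) for c in groups.values())


def suggest_repairs(records, lhs, rhs):
    """List possible repairs for a violated dependency lhs -> rhs."""
    suggestions = []
    cost = data_repair_cost(records, lhs, rhs)
    if cost == 0:
        return suggestions  # data and constraint are already consistent
    suggestions.append(("data", f"change {cost} value(s) of {rhs}"))
    # Constraint repair: try each unused attribute as an extra condition.
    for attr in records[0]:
        if attr not in lhs and attr != rhs:
            if data_repair_cost(records, lhs + [attr], rhs) == 0:
                suggestions.append(("constraint", f"add {attr} to the left-hand side"))
    return suggestions


# Hypothetical sample data: salary used to depend on dept alone.
records = [
    {"dept": "eng", "level": 1, "salary": 50},
    {"dept": "eng", "level": 2, "salary": 70},
    {"dept": "eng", "level": 1, "salary": 50},
    {"dept": "ops", "level": 1, "salary": 40},
]

options = suggest_repairs(records, ["dept"], "salary")
```

In this example the dependency dept -> salary can be restored either by changing one salary value or by refining the rule to (dept, level) -> salary. Which option is better depends on whether the outlying record is an error or a sign that the business rule has changed, and quantifying that choice is the question this work addresses.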
| |
| Continuous Data Cleaning |
|
|
In this work, we consider resolving inconsistencies by learning, over time,
a model that can recommend a hybrid of both data repairs and constraint repairs.
| |
| Combining Logical and Quantitative Data Cleaning |
|
|
Quantitative data cleaning relies on the use of statistical methods
to identify and repair data quality problems while logical data
cleaning tackles the same problems using various forms of logical
reasoning over declarative dependencies. Each of these approaches
has its strengths: the logical approach is able to capture subtle
data quality problems using sophisticated dependencies, while the
quantitative approach excels at ensuring that the repaired data has
desired statistical properties. We propose a novel framework within
which these two approaches can be used synergistically to combine
their respective strengths.
| |
| People |
|
Leader collaborators
Students
|
|
| Publications |
|
|
|
 |
|