Probabilistic Management of Duplicated Data
Datasets and Experimental Results



REAL DATASETS

 - CORA Dataset
 Data source:
  - Tables: Strings (All attributes merged) - Titles - Venues - Authors
     Full MySQL dump
     Original data: XML source from A Duplicate Detection Benchmark for XML (and Relational) Data project at HPI

 - Similarity Join Evaluation Results:  Precision/Recall Curves
 - Clustering Evaluation Results: Accuracy Results for Disjoint Clusterings
 - Probabilistic Databases: The following probabilistic database tables are created from the above datasets, using the Weighted Jaccard similarity measure, the MERGE-CENTER clustering algorithm, and the IB probability assignment algorithm.
   Threshold for similarity join:

   Table      0.1    0.2    0.3    0.4    0.5    0.7    None (perfect clustering)
   Strings    .CSV   .CSV   .CSV   .CSV   .CSV   .CSV   .CSV
   Titles     .CSV   .CSV   .CSV   .CSV   .CSV   .CSV   .CSV
   Authors    .CSV   .CSV   .CSV   .CSV   .CSV   .CSV   .CSV
   Venues     .CSV   .CSV   .CSV   .CSV   .CSV   .CSV   .CSV
 - Full MySQL dump

 - Sample probabilistic queries: Examples
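
The role of the similarity-join threshold can be illustrated with a minimal sketch. It assumes character q-gram tokens and IDF-style token weights; the exact tokenization and weighting used to build the tables on this page are not stated here, so the helper names (qgrams, idf_weights, weighted_jaccard, similarity_join) and the sample records are hypothetical. The clustering and probability assignment steps are not shown.

```python
import math
from collections import Counter


def qgrams(text, q=2):
    """Return the set of lowercase character q-grams of text.

    Padding with '#' is an assumption made for this sketch.
    """
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def idf_weights(records, q=2):
    """Assign each q-gram an IDF-style weight (assumed weighting scheme)."""
    n = len(records)
    doc_freq = Counter()
    for record in records:
        doc_freq.update(qgrams(record, q))
    # Rare tokens get larger weights; every weight is strictly positive.
    return {tok: math.log(1 + n / count) for tok, count in doc_freq.items()}


def weighted_jaccard(a, b, weights, q=2):
    """Weight of shared tokens divided by weight of all tokens in a or b."""
    tokens_a, tokens_b = qgrams(a, q), qgrams(b, q)
    union = sum(weights.get(t, 0.0) for t in tokens_a | tokens_b)
    if union == 0:
        return 0.0
    shared = sum(weights.get(t, 0.0) for t in tokens_a & tokens_b)
    return shared / union


def similarity_join(records, threshold, q=2):
    """Return (i, j, score) for every record pair scoring at least threshold."""
    weights = idf_weights(records, q)
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = weighted_jaccard(records[i], records[j], weights, q)
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


if __name__ == "__main__":
    # Hypothetical sample records, not taken from the datasets on this page.
    sample = ["IBM Corp.", "IBM Corporation", "Oracle Inc."]
    for threshold in (0.1, 0.5, 0.7):
        print(threshold, similarity_join(sample, threshold))
```

A lower threshold keeps more candidate pairs for the clustering step, while a higher threshold keeps only the closest matches.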

SYNTHETIC DATASETS
 
Data Generator: s-dbgen, a modified version of dbgen.
 The datasets used in the paper: datasets.zip
 Descriptions (from the paper):




Percentage of erroneous duplicates: The average number of records in each cluster that are erroneous
Percentage of errors in duplicates: The number of errors injected in each erroneous record



CLUSTERING ACCURACY RESULTS

Uniform Datasets / Company Names (Samples)

Mixed Error Types:
  All data | Grouped by classes of datasets | Grouped by threshold - Best accuracy
Single Error Types:
  All data | Grouped/Sorted by threshold - Best accuracy

Zipfian Datasets / Company Names (Samples)

  All data | Grouped by classes of datasets | Grouped by threshold - Best accuracy

Uniform Datasets / DBLP Titles (Samples)

  All data | Grouped by classes of datasets | Grouped by threshold - Best accuracy




The Database Group @ University of Toronto
Copyright © 2009 The Database Group @ University of Toronto | Last Updated June 14th, 2009