An Evaluation of the Effectiveness of Clustering
Algorithms for Fuzzy Duplicate Detection
Experimental Results



CLUSTERING ACCURACY RESULTS

Uniform Datasets / Company Names (Samples)

Mixed Error Types:
  All data | Grouped by classes of datasets | Grouped by threshold - Best accuracy
Single Error Types:
  All data | Grouped/Sorted by threshold - Best accuracy

Zipfian Datasets / Company Names (Samples)

  All data | Grouped by classes of datasets | Grouped by threshold - Best accuracy

Uniform Datasets / DBLP Titles (Samples)

  All data | Grouped by classes of datasets | Grouped by threshold - Best accuracy

DATASETS
 
Data generator: s-dbgen, a modified version of dbgen.
Datasets used in the paper: datasets.zip
Descriptions (from the paper):




Percentage of erroneous duplicates: The average number of records in each cluster that are erroneous
Percentage of errors in duplicates: The number of errors injected into each erroneous record
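
The two parameters above can be illustrated with a minimal sketch. It is not the s-dbgen implementation: the function names `inject_errors` and `make_cluster`, the choice of single-character edits as the error model, and the reading of "percentage of errors" as a share of the record's length are all assumptions made for illustration.

```python
import random


def inject_errors(text, error_pct, rng):
    """Apply about error_pct percent of len(text) random single-character edits.

    Assumption: an "error" is one character insertion, deletion, or replacement.
    """
    chars = list(text)
    n_errors = max(1, round(len(chars) * error_pct / 100))
    for _ in range(n_errors):
        op = rng.choice(["insert", "delete", "replace"])
        letter = rng.choice("abcdefghijklmnopqrstuvwxyz")
        if op == "insert" or not chars:
            # Also used as a fallback when there is nothing left to delete or replace.
            chars.insert(rng.randrange(len(chars) + 1), letter)
        elif op == "delete":
            del chars[rng.randrange(len(chars))]
        else:
            chars[rng.randrange(len(chars))] = letter
    return "".join(chars)


def make_cluster(clean, size, erroneous_pct, error_pct, seed=0):
    """Build one cluster of duplicates of `clean` (hypothetical helper).

    erroneous_pct: share of the records in the cluster that receive errors.
    error_pct: amount of error injected into each erroneous record.
    """
    rng = random.Random(seed)
    n_bad = round(size * erroneous_pct / 100)
    bad_positions = set(rng.sample(range(size), n_bad))
    return [
        inject_errors(clean, error_pct, rng) if i in bad_positions else clean
        for i in range(size)
    ]


if __name__ == "__main__":
    for record in make_cluster("Acme Software Inc", 10, 50, 20):
        print(record)
```

With these settings, 5 of the 10 records in the cluster are passed through the error injector and the rest stay exact copies of the clean record. A random edit can occasionally leave a record unchanged (for example, replacing a letter with itself), so the number of exact copies is at least 5.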



SUMMARY





The Database Group @ University of Toronto
Copyright © 2008 The Database Group @ University of Toronto | Last Updated March 29th, 2008