The Database Group @ University of Toronto  The STRINGER Project
         
 
   Project Description
 
  STRINGER: Duplicate Detection System for String Data
 
  String data is ubiquitous. Many important pieces of information, such as object identifiers, people's names and addresses, document titles, and web addresses (URIs), are stored as strings. However, string data is prone to several types of errors, including data entry errors (typos and misspellings) and differences in standards and notations. Detecting approximate duplicates in large sources of string data is an inherently difficult task that often requires human labour and expertise. Although standard techniques can automatically detect several types of errors in data, in many cases fully automatic detection of duplicates may not be feasible. In certain cases even a human expert may not be able to identify duplicates. For example, it is relatively easy to recognize that "Schwarzenegear" is a misspelling of "Schwarzenegger" and that both refer to the same entity, but it is not easy to judge whether "STRANGER project" is a misspelling of "STRINGER project" or a new item referring to a different entity.
This project deals with string data and with the challenges, such as the one described above, that arise in data sources containing strings. The goal of this project is to provide accurate yet efficient automatic and semi-automatic techniques for approximate string matching in large data sources.
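
The ambiguity in the example above can be made concrete with a small sketch. The Python code below computes Levenshtein edit distance and a length-normalized similarity score. This is a generic textbook measure chosen for illustration, not necessarily the measure used in STRINGER, and the function names `edit_distance` and `similarity` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program.

    Illustrative only: a generic measure, not the STRINGER system's method.
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or match
            ))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    pairs = [
        ("Schwarzenegear", "Schwarzenegger"),
        ("STRANGER project", "STRINGER project"),
    ]
    for left, right in pairs:
        print(left, "|", right, "|",
              edit_distance(left, right), "|",
              round(similarity(left, right), 4))
```

Under this measure the "STRANGER project" pair (distance 1) scores as more similar than the "Schwarzenegear" pair (distance 2), so a similarity threshold alone cannot separate the clear misspelling from the possibly distinct entity. Cases like this are where semi-automatic techniques, with a human in the loop, become relevant.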
 
   People
 
 
   Publications
 
  • Framework for Evaluating Clustering Algorithms in Duplicate Detection
    O. Hassanzadeh, F. Chiang, H. C. Lee, and R. J. Miller.
    To Appear in Proceedings of the 35th International Conference on Very Large Data Bases (VLDB 2009)
  • Creating Probabilistic Databases from Duplicated Data 
    O. Hassanzadeh and R. J. Miller.
    To Appear in The VLDB Journal (Accepted in June 2009).
  • Accuracy of Approximate String Joins
    O. Hassanzadeh, M. Sadoghi and R. J. Miller.
    QDB'07 at VLDB. Vienna, Austria
  • Probabilistic Management of Duplicated Data
    O. Hassanzadeh and R. J. Miller.
    Technical Report CSRG-568
 
   Related Projects
 
  • ConQuer: Efficient Management of Inconsistent Databases 
  • LinQuer: Linkage Query Writer
  • LinkedCT: Linked Data Space for Clinical Trials
  • LinkedMDB: Linked Movie Data Base
 
   Datasets and Experimental Results
 
 
   Demo
  A demo of our project will be available soon.
 
Copyright © 2007-2009 The Database Group @ University of Toronto | Last Updated March 15th, 2009