The Database Group @ University of Toronto  The STRINGER Project
 
   Project Description
 
  STRINGER: Duplicate Detection System for String Data
 
  String data is ubiquitous. Many important pieces of information, such as object identifiers, people's names and addresses, titles of documents, and web addresses (URIs), are stored as strings. However, string data is prone to several types of errors, including data entry errors (typos and misspellings) and differences in standards and notations. Approximate duplicate detection in large sources of string data is an inherently difficult task that often requires human labour and expertise. Although standard techniques can automatically detect several types of errors in data, in many cases fully automatic detection of duplicates may not be feasible, and in certain cases even a human expert may not be able to identify duplicates. For example, it is relatively easy to recognize that "Schwarzenegear" is a misspelling of "Schwarzenegger" and that both refer to the same entity, but it is not easy to judge whether "STRANGER project" is a misspelling of "STRINGER project" or a new item referring to a different entity.
This project addresses string data and the challenges, such as the one described above, that arise in data sources containing strings. Its goal is to provide accurate yet efficient automatic and semi-automatic techniques for approximate string matching in large data sources.
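
To make the ambiguity in the example above concrete, the following minimal sketch computes the classic Levenshtein (edit) distance for the two pairs of strings. It is an illustration only, not the STRINGER system's method: the function name `levenshtein` and the choice of edit distance as the similarity measure are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, substitutions.

    Illustrative helper only; not taken from the STRINGER project.
    """
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca -> cb (or match)
            ))
        prev = curr
    return prev[-1]


pairs = [
    ("Schwarzenegear", "Schwarzenegger"),
    ("STRANGER project", "STRINGER project"),
]
for left, right in pairs:
    print(f"{left!r} vs {right!r}: distance {levenshtein(left, right)}")
```

Under this measure the obvious misspelling is two edits away from the correct name, while the ambiguous pair differs by only one edit. A small distance alone therefore does not settle whether two strings refer to the same entity, which is why semi-automatic techniques and human judgment remain relevant.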
 
   People
 
 
   Publications
 
 
   Related Projects
 
  • ConQuer: Efficient Management of Inconsistent Databases 
  • LinQuer: Linkage Query Writer
 
   Datasets and Experimental Results
 
 
   Demo
 
 
Copyright © 2007-2011 The Database Group @ University of Toronto | Last Updated April 15th, 2011