The Database Group @ University of Toronto  The STRINGER Project
   Project Description
  STRINGER: Duplicate Detection System for String Data
  String data is ubiquitous. Many important pieces of information, such as object identifiers, names and addresses of people, document titles, and web addresses (URIs), are stored as strings. However, string data is prone to several types of errors, including data entry errors (typos and misspellings) and differences in standards and notation. Detecting approximate duplicates in large sources of string data is therefore an inherently difficult task that often requires human labour and expertise. Although standard techniques can automatically detect several types of errors, in many cases fully automatic duplicate detection may not be feasible, and in some cases even a human expert may not be able to identify duplicates. For example, it is easy to recognize that "Schwarzenegear" is a misspelling of "Schwarzenegger" and that both refer to the same entity, but it is hard to judge whether "STRANGER project" is a misspelling of "STRINGER project" or a new item referring to a different entity.
This project addresses the challenges that arise in sources of string data, such as the ambiguity illustrated above. Its goal is to provide accurate yet efficient automatic and semi-automatic techniques for approximate string matching in large data sources.
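To make the example pairs above concrete, here is a minimal Python sketch that scores both with Levenshtein edit distance. This measure is chosen only for illustration and is not necessarily what STRINGER uses; the function names `edit_distance` and `similarity`, and the normalization by the longer string's length, are assumptions made for this sketch.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    # previous[j] holds the distance between the prefix of a processed
    # so far and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb, or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Illustrative normalization (an assumption of this sketch):
    1.0 for identical strings, lower as more edits are needed."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    pairs = [
        ("Schwarzenegear", "Schwarzenegger"),
        ("STRANGER project", "STRINGER project"),
    ]
    for left, right in pairs:
        print(f"{left!r} vs {right!r}: "
              f"distance={edit_distance(left, right)}, "
              f"similarity={similarity(left, right):.3f}")
```

Under this measure the clear misspelling is two edits away from the correct name, while the ambiguous pair differs by a single edit and therefore scores as more similar. A fixed similarity threshold alone cannot separate the two cases, which is one reason semi-automatic techniques that involve a human remain useful.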
   Related Projects
  • ConQuer: Efficient Management of Inconsistent Databases 
  • LinQuer: Linkage Query Writer
   Datasets and Experimental Results
Copyright © 2007-2011 The Database Group @ University of Toronto | Last Updated April 15th, 2011