Project Description |
|
|
STRINGER: Duplicate Detection System for String Data |
|
|
String data is ubiquitous. Many important pieces of information, such as object
identifiers, people's names and addresses, titles of documents, and web
addresses (URIs), are stored as strings. String data is prone to several types
of errors, including data entry errors (typos and misspellings) and differences
in standards and notations. Detecting approximate duplicates in large sources of
string data is therefore inherently difficult and often requires human labour
and expertise. Although standard techniques can automatically detect several
types of errors in data, in many cases fully automatic detection of duplicates
is not feasible, and in certain cases even a human expert may not be able to
identify duplicates. For example, it is relatively easy to recognize that
"Schwarzenegear" is a misspelling of "Schwarzenegger" and that both refer to the
same entity, but it is not easy to judge whether "STRANGER project" is a
misspelling of "STRINGER project" or a new item referring to a different entity.
This project deals with string data and the challenges, like the one above, that
arise in data sources containing strings. The goal of this project is to provide
accurate yet efficient automatic and semi-automatic techniques for approximate
string matching in large data sources. |
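The difficulty in the two examples above can be made concrete with a small sketch. It uses plain Levenshtein edit distance and a length-normalized similarity score. Both are illustrative assumptions here, not necessarily the measures used in STRINGER, and the function names `edit_distance` and `similarity` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


# The two pairs discussed in the project description.
pairs = [
    ("Schwarzenegear", "Schwarzenegger"),
    ("STRANGER project", "STRINGER project"),
]
for left, right in pairs:
    print(left, "|", right, "|", edit_distance(left, right),
          "|", round(similarity(left, right), 3))
```

Under this measure the "STRANGER project" pair is one edit apart, while the "Schwarzenegear" pair is two edits apart. A purely distance-based rule would therefore rank the ambiguous pair as the more likely duplicate, which is one reason semi-automatic techniques that involve a human reviewer are useful.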
|
|
People |
|
|
|
|
Publications |
|
- Framework for Evaluating Clustering Algorithms in Duplicate Detection
O. Hassanzadeh, F. Chiang, H. C. Lee, and R. J. Miller.
Proceedings of the VLDB Endowment, Volume 2, Issue 1, August 2009
- Creating Probabilistic Databases from Duplicated Data
O. Hassanzadeh and R. J. Miller.
The VLDB Journal, Volume 18, Issue 5, October 2009
- Accuracy of Approximate String Joins
O. Hassanzadeh, M. Sadoghi, and R. J. Miller.
QDB'07 at VLDB, Vienna, Austria
- Probabilistic Management of Duplicated Data
O. Hassanzadeh and R. J. Miller.
Technical Report CSRG-568
|
|
|
Related Projects |
|
- ConQuer: Efficient Management of Inconsistent Databases
- LinQuer: Linkage Query Writer
|
|
|
Datasets and Experimental Results |
|
|
|
|
Demo |
|
|
|
 |
|