Welcome to the Miller Lab homepage!

Our group works broadly in the area of data curation. This includes data integration, data quality, data cleaning, data provenance, and big data analytics.


  • 25 March 2017

    Dr. Bobbie Cochrane from IBM New York will be giving an informal tutorial on Blockchain in the Database Lab (BA7230) on Monday, March 27 at 12:10 PM.

  • 08 March 2017

    Dr. Raul Castro Fernandez from MIT will give a talk on Data Discovery today in BA7230 at 11:00 AM. See details.

  • 10 January 2017

    Prof. Felix Naumann from Hasso-Plattner-Institut will give a talk on Data Profiling today at 11:00 AM in BA7231. See details.

  • 08 January 2017

    Christina’s work, VIQS: Visual Interactive Exploration of Query Semantics, has been accepted at ACM ESIDA 2017!

  • 20 December 2016

    Prof. Fabian M. Suchanek from Télécom ParisTech University in Paris will give a talk on “A hitchhiker’s guide to Ontology” today at 11:00 AM in BA7231. See details.

  • 24 October 2016

    Prof. Wolfgang Gatterbauer from CMU will give a talk on “Approximate lifted inference with probabilistic databases” today at 11:00 AM in BA7231. See details.

  • 14 October 2016

    Jiang’s work, DeepSea: Progressive Workload-Aware Partitioning of Materialized Views in Scalable Data Analytics, has been accepted at EDBT 2017!

  • 04 October 2016

    Prof. Olga Papaemmanouil from Brandeis University will give a talk on “Performance Management for Cloud Databases via Machine Learning” today at 11:00 AM in BA7231. See details.




The first metadata generator that can be used to evaluate a wide range of integration tasks.

Data Quality

Dirty data is a serious problem for businesses, leading to incorrect decisions, inefficient daily operations, and ultimately wasted time and money. Dirty data often arises when domain constraints and business rules, meant to preserve data consistency and accuracy, are enforced incompletely or not at all in application code. In this work, we are studying how to detect errors in data. In a sister project, BART, we provide a scalable way to generate dirty data to systematically evaluate data cleaning systems.
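As a rough illustration of what detecting errors against a declared constraint can look like, the sketch below checks a functional dependency (zip determines city) over a handful of made-up records. The function name `fd_violations`, the column names, and the sample rows are all hypothetical, chosen only for this example; this is not the lab's system.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return groups of rows that agree on `lhs` but disagree on `rhs`.

    Each returned group violates the functional dependency lhs -> rhs.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[lhs]].append(row)
    return {
        key: members
        for key, members in groups.items()
        if len({m[rhs] for m in members}) > 1
    }


# Hypothetical sample data: row 2 misspells the city for the same zip code.
rows = [
    {"id": 1, "zip": "M5S 2E4", "city": "Toronto"},
    {"id": 2, "zip": "M5S 2E4", "city": "Tornto"},
    {"id": 3, "zip": "L8S 4L8", "city": "Hamilton"},
]

violations = fd_violations(rows, "zip", "city")
for key, members in violations.items():
    print(key, [m["id"] for m in members])
```

The check only reports that rows 1 and 2 conflict; deciding which of the two values is the error is a separate repair step.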


A Linked Data Space for Clinical Drug Trials (part of the Linking Open Drug Data (LODD) project at W3C).

Data Exchange & Schema Mappings: the theory behind Clio

Data exchange is the problem of taking data structured under a source schema and creating an instance of a target schema that reflects the source data as accurately as possible. In this project, we address foundational and algorithmic issues related to the semantics of data exchange and to the query answering problem in the context of data exchange. These issues arise because, given a source instance, there may be many target instances that satisfy the constraints of the data exchange problem, or none at all.
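A toy sketch may make the "many solutions or none" point concrete. It assumes a hypothetical source relation Emp(name, city), a target relation Person(name, phone, city), the mapping Emp(name, city) -> exists phone: Person(name, phone, city), and a target key on name. None of these names come from the project itself. Unknown phone values are represented by placeholder strings such as "N1" (labeled nulls).

```python
from itertools import count


class NoSolution(Exception):
    """Raised when no target instance can satisfy the constraints."""


def exchange(emp_rows):
    """Build a Person(name, phone, city) instance from Emp(name, city) rows.

    Assumed mapping: Emp(name, city) -> exists phone: Person(name, phone, city)
    Assumed target constraint: name is a key of Person.
    """
    fresh = count(1)
    person = {}  # name -> (phone placeholder, city)
    for name, city in emp_rows:
        if name in person:
            if person[name][1] != city:
                # Two different constants clash under the key: no solution.
                raise NoSolution(
                    f"{name} maps to both {person[name][1]} and {city}"
                )
            continue  # the same fact was already produced
        # The phone is unknown, so invent a fresh labeled null for it.
        person[name] = (f"N{next(fresh)}", city)
    return [(name, phone, city) for name, (phone, city) in person.items()]


solution = exchange([("Ann", "Toronto"), ("Bob", "Hamilton")])
print(solution)
```

Replacing "N1" and "N2" with any concrete phone numbers yields another valid target instance, which is why there can be many solutions. A source containing both ("Ann", "Toronto") and ("Ann", "Hamilton") has none under the assumed key, and the function raises `NoSolution`.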

Past Projects

  • xCurator: Data Curation for Semistructured Web Data
  • Vivification in Business Intelligence
  • LinQuer: Linkage Query Writer
  • BibBase3 (BibBase Triplified): The easiest way to set up and maintain your publications page. BibBase facilitates the dissemination of scientific publications over the Internet and transforms your publications page into a Linked Data Space.
  • Stringer: Duplicate Detection System for String Data
  • Clio: Creating and managing schema mappings
  • ConQuer: Efficient management and querying of inconsistent data
  • Hyperion: Data management support for dynamic peer-to-peer data sharing applications
  • Iliads: Leveraging data and structure in ontology integration
  • Limbo: Structure discovery in large datasets
  • ToMAS: Managing and evolving schema mappings

Find Us

Bahen Centre, Room 7230
40 St George St, Toronto, ON, M5S 2E4



Renée J. Miller

Affiliated Researchers

Ken Pu
Periklis Andritsos
Patricia (Rodriguez-Gianolli) Arocena
Fei Chiang
Boris Glavic
Lise Getoor
Oktie Hassanzadeh
Gianni Mecca


Christina Christodoulakis
Jiang Du
Carolina Simões Gomes
Bahar Ghadiri
Farzaneh Mahdisoltani
Fatemeh Nargesian
Javed Siddique
Eric (Erkang) Zhu


Radu Ciucanu
Associate Professor, Université Clermont Auvergne
Adel Benaissa


Natasha Prokoshyna
MSc, IBM Toronto
Patricia (Rodriguez-Gianolli) Arocena
PhD, Post-Doc (former Research Associate and Associate Director of NSERC BIN)
Fei Chiang
PhD, Assistant Professor, Computer Science, McMaster University
Hojjat Ghaderi
Post-doc, Two Sigma Investments
Boris Glavic
Post-doc, Assistant Professor, Illinois Institute of Technology
Oktie Hassanzadeh
PhD, Researcher, IBM TJ Watson Research Labs
Reynold Xin
B.Eng, currently PhD Student at Berkeley, co-founder of Databricks
Flavio Rizzolo
PhD, Researcher at Statistics Canada
Ariel Fuxman
PhD, Research Scientist, Google
Ken Pu
PhD, Adjunct Professor, Associate Professor, Computer Science, Faculty of Science, UOIT
Yaron Kanza
Post-doc, Technion – Israel Institute of Technology
Chul (Hyun) Lee
Post-doc, CTO and Co-Founder of Thoora, now at LinkedIn
Anastasios Kementsietsidis
PhD, Senior Research Scientist, Google
Yannis Velegrakis
PhD, Associate Professor, University of Trento
Periklis Andritsos
PhD, Assistant Professor, Faculty of Information, University of Toronto