Welcome to the Miller Lab homepage!
Our group works broadly in the area of data curation. This includes data integration, data quality, data cleaning, data provenance, and big data analytics.
Prof. Jian Pei from Simon Fraser University will give a talk on Graph Computing Engines on October 18th at 3 PM in BA7231.
Dr. Dong Deng from MIT will give a talk on Finding Related Sets on October 18th at 4 PM in BA7231.
Our demo paper Interactive Navigation of Open Data Linkages won the VLDB ’17 Best Demo Award!
Dr. C. Mohan, an IBM fellow, will give a talk, “New Era in Distributed Computing with Blockchains and Databases”, on Wednesday, July 05, 2017, at 2 PM, in BA7231. See details.
Eric will be giving a practice talk of his paper “Auto-Join: Joining Tables by Leveraging Transformations”, accepted to VLDB’17, today at 2pm.
Xu Chu from the University of Waterloo will be giving a talk, “Data Cleaning: The Hardest Job in Data Science”, on Monday, June 5, 2017, in BA7231.
Prof. Fei Chiang from McMaster University will give a talk on Data Cleaning on April 20, 2017 at 11am.
Dr. Bobbie Cochrane from IBM New York will be giving an informal tutorial on Blockchain in the Database Lab (BA7230) on Monday, March 27 at 12:10pm.
In recent years, enterprises and governments have accumulated massive amounts of unprocessed analytical data sets, because consumption through data warehousing has not kept up with the speed of data ingestion. This trend has led to the emergence of data lakes (collections of data sets stored in their raw formats), and data scientists are increasingly obtaining data directly from data lakes rather than from traditional data warehouses. In this work, we attempt to answer the question “Are data lakes a positive asset to data scientists, or a source of endless trouble?” We gather first-hand information, through interviews and surveys, from diverse subjects including corporations, governments, and academic institutions, regarding the level of adoption of data lakes, sophistication of use, existing system architectures for data lakes, user interfaces, analytic tools, and critical problems that are hampering effective use of data lakes. We are conducting research to understand industry problems related to data lakes. Our research will create awareness in the data management research community of important practical problems involving data lakes and encourage new research. Survey Link. This research is conducted by PhD candidate Eric Zhu and postdoc consultant Dr. Clint V. Sieunarine under the supervision of Professor Renée Miller.
The first metadata generator that can be used to evaluate a wide range of integration tasks.
Dirty data is a serious problem for businesses, leading to incorrect decision making and inefficient daily operations, and ultimately wasting both time and money. Dirty data often arises when domain constraints and business rules, meant to preserve data consistency and accuracy, are enforced incompletely or not at all in application code. In this work, we are studying how to detect errors in data. In a sister project, BART, we provide a scalable way to generate dirty data to systematically evaluate data cleaning systems.
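As a rough illustration of constraint-based error detection (a sketch only, not the method used in this project or in BART), the snippet below flags rows that violate a functional dependency. The function name fd_violations, the rule "zip determines city", and the toy rows are all assumptions made up for this example.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return groups of rows that agree on `lhs` but disagree on `rhs`.

    `rows` is a list of dicts; `lhs` and `rhs` are lists of column names.
    This checks a functional dependency lhs -> rhs (hypothetical helper).
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        groups[key].append(row)
    # A group is a violation if it contains more than one distinct rhs value.
    return {
        key: members
        for key, members in groups.items()
        if len({tuple(r[a] for a in rhs) for r in members}) > 1
    }


# Toy data, invented for illustration: the second row has a misspelled city.
rows = [
    {"zip": "M5S 2E4", "city": "Toronto"},
    {"zip": "M5S 2E4", "city": "Tornto"},
    {"zip": "L8S 4L8", "city": "Hamilton"},
]
violations = fd_violations(rows, ["zip"], ["city"])
```

A check like this only reports which rows conflict. Deciding which of the conflicting values is correct is a separate repair problem.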
A Linked Data Space for Clinical Drug Trials (part of the Linking Open Drug Data (LODD) project at W3C).
Data exchange is the problem of taking data structured under a source schema and creating an instance of a target schema that reflects the source data as accurately as possible. In this project, we address foundational and algorithmic issues related to the semantics of data exchange and to the query answering problem in the context of data exchange. These issues arise because, given a source instance, there may be many target instances that satisfy the constraints of the data exchange problem, or none at all.
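To make the "many target instances" point concrete, here is a minimal sketch under stated assumptions. The schemas and the mapping Emp(name, dept) -> exists M . Works(name, dept) AND Manager(dept, M) are hypothetical, and the helper names exchange and satisfies are invented for this example. The code fills the unknown manager with a placeholder ("labeled null") and is a simplification, not the chase procedure or any algorithm from this project.

```python
from itertools import count


def exchange(emp_rows):
    """Build one target instance for the hypothetical mapping
    Emp(name, dept) -> exists M . Works(name, dept) AND Manager(dept, M).

    Unknown managers are represented by placeholders N1, N2, ...
    """
    fresh = count(1)
    works, manager = set(), {}
    for name, dept in emp_rows:
        works.add((name, dept))
        if dept not in manager:
            manager[dept] = f"N{next(fresh)}"  # placeholder for an unknown value
    return works, set(manager.items())


def satisfies(emp_rows, works, manager):
    """Check that every source row has its Works fact and some manager for its dept."""
    depts_with_manager = {d for d, _ in manager}
    return all((n, d) in works and d in depts_with_manager for n, d in emp_rows)


# Toy source instance, invented for illustration.
source = [("Alice", "DB"), ("Bob", "DB"), ("Carol", "AI")]
works, manager = exchange(source)

# A different target instance that also satisfies the mapping:
# it simply picks concrete (made-up) managers instead of placeholders.
alternative = (works, {("DB", "Dana"), ("AI", "Evan")})
```

Both the instance with placeholders and the alternative satisfy the mapping, which is why the semantics of query answering over the target needs care. The case where no target instance exists arises only with additional constraints on the target, which this sketch does not model.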
Bahen Centre, Room 7230
40 St George St, Toronto, ON, M5S 2E4
Canada