The Database Group @ University of Toronto  The LIMBO (scaLable InforMation BOttleneck) Project
 
  Project Description
 
  Clustering Categorical Data
 
  The problem of clustering becomes more challenging when the data is categorical, that is, when there is no inherent distance measure between data values. LIMBO is a scalable hierarchical algorithm for clustering categorical data. It builds on the Information Bottleneck (IB) framework, which quantifies the relevant information preserved when clustering. We use the IB framework to define a distance measure for categorical tuples. LIMBO handles large data sets by producing a memory-bounded summary model for the data.
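
  As an illustration of an IB-style distance, the sketch below treats each tuple as a uniform distribution over its (attribute, value) pairs and measures the distance between two tuples (or clusters) as the information lost by merging them: the combined weight times the weighted Jensen-Shannon divergence of their distributions. This is a minimal sketch under those assumptions, not the project's implementation; the function names and the movie tuples are hypothetical.

```python
import math

def tuple_distribution(row):
    """Uniform distribution over the (attribute, value) pairs of one tuple."""
    pairs = list(row.items())
    return {pair: 1.0 / len(pairs) for pair in pairs}

def merge(p, wp, q, wq):
    """Weighted mixture of two distributions with cluster weights wp and wq."""
    total = wp + wq
    keys = set(p) | set(q)
    return {k: (wp * p.get(k, 0.0) + wq * q.get(k, 0.0)) / total for k in keys}

def kl_divergence(p, m):
    """KL divergence D(p || m) in bits; m must cover the support of p."""
    return sum(pv * math.log2(pv / m[k]) for k, pv in p.items() if pv > 0)

def information_loss(p, wp, q, wq):
    """Information lost by merging two clusters:
    (wp + wq) times the weighted Jensen-Shannon divergence of p and q."""
    total = wp + wq
    m = merge(p, wp, q, wq)
    js = (wp / total) * kl_divergence(p, m) + (wq / total) * kl_divergence(q, m)
    return total * js

# Hypothetical data: three tuples, each carrying weight 1/3.
t1 = tuple_distribution({"director": "Scorsese", "actor": "De Niro", "genre": "Crime"})
t2 = tuple_distribution({"director": "Coppola", "actor": "De Niro", "genre": "Crime"})
t3 = tuple_distribution({"director": "Hitchcock", "actor": "Stewart", "genre": "Thriller"})
w = 1.0 / 3.0

loss_similar = information_loss(t1, w, t2, w)    # tuples sharing two values
loss_different = information_loss(t1, w, t3, w)  # tuples sharing no values
```

  Under this measure, tuples that share more values lose less information when merged, so `loss_similar` is smaller than `loss_different`, and merging a tuple with an identical copy loses nothing.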
 
  Identifying Structure in Large Data Sets
 
  When doing data design, the information content (or redundancy) of the data is measured with respect to a prescribed model for the data. We consider the problem of doing data redesign in an environment where the prescribed model is unknown or incomplete and propose a set of LIMBO-based techniques for finding structural clues in an instance of data, which may contain errors, missing values, and duplicate records.
 
  Clustering Software Data
 
  The majority of the algorithms in the software clustering literature utilize structural information in order to decompose large software systems. Other approaches, such as using file names or ownership information, have also demonstrated merit. However, there is no intuitive way to combine information obtained from these two different types of techniques. Using LIMBO, we combine structural and non-structural information in an integrated fashion.
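
  One way to picture the integration is to describe every source file by a single set of categorical features, some structural and some not, and then cluster on that combined description. The sketch below is only an assumption about how such a representation could look; the feature prefixes, the helper names, and the example file are hypothetical and not taken from the project.

```python
def file_features(dependencies, attributes):
    """Combine structural features (what the file depends on) with
    non-structural ones (e.g. owner, directory) into one feature set.
    Prefixes keep values from different sources from colliding."""
    features = {f"dep:{d}" for d in dependencies}
    features |= {f"{name}:{value}" for name, value in attributes.items()}
    return features

def to_distribution(features):
    """Uniform distribution over a file's features, ready for an
    information-theoretic comparison with other files."""
    return {f: 1.0 / len(features) for f in features}

# Hypothetical file with two dependencies and two non-structural attributes.
feats = file_features(
    dependencies=["util.c", "parser.c"],
    attributes={"owner": "alice", "dir": "frontend"},
)
dist = to_distribution(feats)
```

  Because both kinds of information end up in the same distribution, a single distance measure can weigh them together instead of reconciling two separate clusterings.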
 
  People
 
 
  Publications
 
  • Information-Theoretic Software Clustering
    Periklis Andritsos and Vassilios Tzerpos.
    In IEEE Transactions on Software Engineering, 31(2), p.150-165, 2005.
  • LIMBO: a Scalable Algorithm to Cluster Categorical Data
    Periklis Andritsos.
    PhD Thesis, University of Toronto, Department of Computer Science, 2004.
  • Information-Theoretic Tools for Mining Database Structure from Large Data Sets
    Periklis Andritsos, Renée J. Miller and Panayiotis Tsaparas.
    In Proceedings of the ACM SIGMOD International Conference on the Management of Data (SIGMOD), 2004.
  • LIMBO: Scalable Clustering of Categorical Data
    Periklis Andritsos, Panayiotis Tsaparas, Renée J. Miller and Kenneth C. Sevcik.
    In Proceedings of the International Conference on Extending DataBase Technology (EDBT), 2004.
  • LIMBO: a Scalable Categorical Data Clustering Algorithm
    Periklis Andritsos, Panayiotis Tsaparas, Renée J. Miller and Kenneth C. Sevcik.
    Poster, IBM Center for Advanced Studies Conference (CASCON), 2003.
  • LIMBO: Scalable Clustering of Categorical Data
    Periklis Andritsos, Panayiotis Tsaparas, Renée J. Miller and Kenneth C. Sevcik.
    Poster, 4th Annual MITACS Conference, 2003.
  • LIMBO: a Scalable Algorithm to Cluster Categorical Data
    Periklis Andritsos, Panayiotis Tsaparas, Renée J. Miller and Kenneth C. Sevcik.
    Technical Report CSRG-467, University of Toronto, 2003.
  • Software Clustering based on Information Loss Minimization
    Periklis Andritsos and Vassilios Tzerpos.
    In Proceedings of the 10th Working Conference on Reverse Engineering, 2003.
  • On Schema Discovery
    Renée J. Miller and Periklis Andritsos.
    In IEEE Data Engineering Bulletin, 26(3), p.41-47, 2003.
  • Clustering Categorical Data based on Information Loss Minimization
    Periklis Andritsos, Panayiotis Tsaparas, Renée J. Miller and Kenneth C. Sevcik.
    In Proceedings of the 2nd Hellenic Data Management Symposium (HDMS), 2003.
  • Using Categorical Clustering in Schema Discovery
    Periklis Andritsos and Renée J. Miller.
    In IJCAI Workshop on Information Integration on the Web, 2003.
Copyright © 2006 The Database Group @ University of Toronto | Last Updated July 12, 2007