Data Compression and Clustering: A Blind Approach to Classification

Carpentieri, Bruno

Data compression is essential today for a wide range of applications: for example, the Internet and World Wide Web infrastructures benefit from compression. New general-purpose compression methods continue to be developed, in particular those that allow indexing over compressed data or provide error resilience. Compression also inspires information-theoretic tools for pattern discovery and classification; in particular, data compression can be used as a metric for clustering. This leads to a powerful clustering strategy that uses no “semantic” information about the data to be classified, but instead performs a “blind” yet effective classification based only on the compressibility of the digital data and not on its “meaning”. Here we experiment with this strategy and show its effectiveness.
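
To make the idea of compression as a clustering metric concrete, the following is a minimal sketch of one widely used formulation, the normalized compression distance (NCD). It is an illustration under stated assumptions, not necessarily the exact measure or compressor used in the experiments: the choice of zlib and the helper names `compressed_size`, `ncd`, and `distance_matrix` are ours.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size. Values near 0 suggest that the
    inputs share most of their structure; values near 1 suggest little
    shared structure. No knowledge of the data's meaning is used.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(items: list[bytes]) -> list[list[float]]:
    """Pairwise NCD values, suitable as input to any clustering method."""
    return [[ncd(a, b) for b in items] for a in items]
```

The resulting matrix can be handed to a standard clustering procedure (for example, hierarchical clustering); the objects are grouped purely by how well they compress together, which is what makes the classification "blind".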