Java - Calculate File Compression


Is there a way to get the possible compression ratio of a file just reading it?
You know, some files are more compressible than others... my software has to tell me the percentage of space I could save by compressing each file.

e.g.
Compression Ratio: 50% -> I can save 50% of my file's space if I compress it
Compression Ratio: 99% -> I can save only 1% of my file's space if I compress it


There are 3 best solutions below

0
On BEST ANSWER

Firstly, you need some background in information theory. There are two main theories in this field:

  1. According to Shannon, one can compute the entropy (i.e. the compressed size) of a source from its symbol probabilities. So, the smallest compressed size is defined by a statistical model which produces symbol probabilities at each step. All compression algorithms use that approach, implicitly or explicitly. See the Wikipedia article on entropy (information theory) for more details.
  2. According to Kolmogorov, the smallest compressed size can be found by finding the smallest possible program which produces the source. In that sense, it is not computable. Some programs partially use that approach to compress data (e.g. you could write a small console application which produces 1 million digits of pi instead of zipping those 1 million digits).

So, you can't find the compressed size without doing the actual compression. But if you need an approximation, you can rely on Shannon's entropy and build a simple statistical model. Here is a very simple solution:

  1. Compute order-1 statistics for each symbol in the source file.
  2. Calculate entropy by using those statistics.

Your estimate will be more or less the same as what ZIP's default compression algorithm (Deflate) achieves. There is a more advanced version of the same idea (be aware that it uses lots of memory!) which actually uses entropy to determine block boundaries, segmenting the file into homogeneous blocks of data.
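
As a rough illustration of steps 1 and 2, here is a minimal sketch (class and method names are my own) that counts byte frequencies and derives an entropy-based size estimate. It only models single-symbol statistics, so it ignores repeated sequences and will often be pessimistic compared to what Deflate actually achieves on redundant data:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class EntropyEstimate {

    // Estimates compressed size / original size from per-byte statistics
    // (Shannon entropy). This is only an approximation of what a real
    // compressor would do.
    public static double estimatedRatio(String path) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(path));
        if (data.length == 0) {
            return 1.0; // nothing to compress
        }

        // 1. Count how often each byte value occurs.
        long[] counts = new long[256];
        for (byte b : data) {
            counts[b & 0xFF]++;
        }

        // 2. Shannon entropy in bits per byte: H = -sum(p * log2(p)).
        double entropy = 0.0;
        for (long c : counts) {
            if (c == 0) continue;
            double p = (double) c / data.length;
            entropy -= p * (Math.log(p) / Math.log(2));
        }

        // Estimated compressed size as a fraction of the original size.
        return entropy / 8.0;
    }

    public static void main(String[] args) throws IOException {
        double ratio = estimatedRatio(args[0]);
        System.out.printf("Estimated compressed size: %.1f%% of original%n", ratio * 100);
    }
}
```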

3
On

Not possible without examining the file. The only thing you can do is derive an approximate ratio by file extension, based on statistics gathered from a relatively large sample by doing the actual compression and measuring. For example, a statistical analysis will likely show that .zip and .jpg files are not heavily compressible, but files like .txt and .doc might be heavily compressible.

The results of this will be for rough guidance only and will probably be way off in some cases, as there is absolutely no guarantee of compressibility by file extension. A file could contain anything, no matter what the extension suggests.
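
If you go that route, it amounts to a lookup table built from your own measurements. A minimal sketch (the ratios below are made-up placeholders; you would replace them with values measured on your own sample set):

```java
import java.util.Map;

public class ExtensionHeuristic {

    // Placeholder values for illustration only: in practice these would come
    // from compressing a large sample of each file type and averaging.
    private static final Map<String, Double> TYPICAL_RATIO = Map.of(
            "txt", 0.35,   // plain text usually compresses well
            "doc", 0.50,
            "jpg", 0.98,   // already-compressed formats barely shrink
            "zip", 0.99
    );

    // Returns the expected compressed-size fraction for a file name,
    // or 1.0 (no savings assumed) when the extension is unknown.
    public static double expectedRatio(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return 1.0;
        }
        String ext = fileName.substring(dot + 1).toLowerCase();
        return TYPICAL_RATIO.getOrDefault(ext, 1.0);
    }
}
```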

UPDATE: Assuming you can examine the file, you can use the java.util.zip APIs to read the raw file, compress it, and see what the before/after difference is.
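
For instance, a minimal sketch of that idea (names are mine) compresses the whole file into memory with DeflaterOutputStream and compares the two sizes; for very large files you would want the streaming approach described in the next answer instead:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.DeflaterOutputStream;

public class MeasureByCompressing {

    // Compresses the whole file in memory with Deflate and reports
    // compressed size / original size.
    public static double measuredRatio(String path) throws IOException {
        byte[] original = Files.readAllBytes(Paths.get(path));
        if (original.length == 0) {
            return 1.0;
        }
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(compressed)) {
            out.write(original);
        } // closing the stream finishes the Deflate output
        return (double) compressed.size() / original.length;
    }

    public static void main(String[] args) throws IOException {
        double ratio = measuredRatio(args[0]);
        System.out.printf("Compressed size is %.1f%% of original%n", ratio * 100);
    }
}
```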

4
On

First, this will depend largely on the compression method you choose. And second, I seriously doubt it is possible without time and space costs comparable to actually doing the compression. Your best bet is to compress the file, keeping track of the size of what you have produced so far and discarding the output (once you are done with it, obviously) instead of writing it out.

To actually do this, unless you really want to implement it yourself, it will probably be easiest to use the java.util.zip package, in particular the Deflater class and its deflate method.
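
A minimal sketch of that approach (names are mine): the file is streamed through Deflater block by block, and the compressed bytes are counted but never stored, so memory use stays constant regardless of file size.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.Deflater;

public class StreamingRatio {

    // Feeds the file through Deflater chunk by chunk, counting the compressed
    // bytes and throwing them away instead of writing them out.
    public static double ratio(String path) throws IOException {
        Deflater deflater = new Deflater();
        byte[] input = new byte[8192];
        byte[] scratch = new byte[8192]; // compressed output is discarded
        long inTotal = 0;
        long outTotal = 0;

        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            int read;
            while ((read = in.read(input)) != -1) {
                inTotal += read;
                deflater.setInput(input, 0, read);
                while (!deflater.needsInput()) {
                    outTotal += deflater.deflate(scratch);
                }
            }
            deflater.finish();
            while (!deflater.finished()) {
                outTotal += deflater.deflate(scratch);
            }
        } finally {
            deflater.end();
        }
        return inTotal == 0 ? 1.0 : (double) outTotal / inTotal;
    }
}
```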