Find out MIME Type of compressed files downloaded from S3 for Java

Question

4.2k Views Asked by cavpollo At 19 August 2025 at 02:49

A client is supposed to upload a compressed file into an S3 folder. The compressed file is then downloaded and decompressed so that various operations can be performed on the files it contains. Originally we told our client to compress their files into a ZIP file, but this proved too difficult for them. Instead they submitted a RAR file with a ZIP extension... how clever. For obvious reasons, one can't decompress a RAR file using a ZIP decompression algorithm.

So, I'm looking for a way to find out the file type of the S3 downloaded files given that I'm working on a Java project with Amazon's SDK on a Linux OS. I'll take care of how to decompress the file depending on the obtained file type.

I've looked at many Stack Overflow questions, like this one, but none of the answers seem 100% effective, judging by the answers themselves and their comments.

What would be the best approach to find out the compressed file's type?

**cavpollo** · Accepted Answer

TL;DR;

When one uploads a file to Amazon S3 programmatically, one can specify the object's Content-Type. If none is specified, as @Michael-bot clarifies, the value assigned by default is binary/octet-stream. If one instead uploads the file through Amazon S3's GUI, the file gets its Content-Type from its file extension (sadly, not from its contents). If you can trust whoever uploaded the file to set the Content-Type correctly, go ahead and look at the ObjectMetadata, but if you can't (like me), you need another solution.
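As a rough illustration of that trust decision, the sketch below flags a declared Content-Type that carries no real information (missing, or one of the generic binary values). The class and method names are hypothetical and not part of the AWS SDK; the string passed in is assumed to be whatever ObjectMetadata.getContentType() returned. Note that it cannot catch a wrong but specific value such as application/zip on a RAR file, which is why content-based detection is still needed.

```java
import java.util.Locale;

// Hypothetical helper, not part of the AWS SDK.
final class DeclaredContentType {
    private DeclaredContentType() {
    }

    // Returns true when the declared Content-Type says nothing useful about the file:
    // it is missing, empty, or one of the generic "unknown binary" values.
    static boolean isGeneric(final String contentType) {
        if (contentType == null) {
            return true;
        }
        // Drop parameters such as "; charset=..." before comparing.
        final String base = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return base.isEmpty()
                || base.equals("binary/octet-stream")
                || base.equals("application/octet-stream");
    }
}
```

A caller could skip the metadata entirely when isGeneric(...) returns true and go straight to one of the detection approaches tested below.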

So, if you are looking for a solution that works on the most common file compression types, Files.probeContentType, Apache Tika and SimpleMagic seem to be acceptable solutions.

In the end I chose Files.probeContentType as it required no extra libraries and works just fine on a Linux machine (as long as the file doesn't have the wrong extension, for which there is a workaround: remove the file extension and let it do its magic).

The Test Setup

At first one would think that the response object when downloading the file from Amazon's S3 includes the file type. And it does contain this information, but the problem arises when the extension of the file doesn't match its contents.

import com.amazonaws.services.s3.model.S3Object;

final S3Object s3Object = ...;
final String contentType = s3Object.getObjectMetadata().getContentType();

This code returns application/zip even if the file is actually a Rar archive, because the metadata only reflects what was declared (or inferred from the extension) at upload time. So this solution doesn't work for me.

For this reason I took the time to build a sample project that tested various scenarios with the different approaches and libraries available. I'm using Java 8 by the way.

The file types tested are:

A Zip file with Zip extension and without extension
A Rar file with Rar extension, Zip extension and without extension
A 7z file with 7z extension, Zip extension and without extension
A Tar.xz with Tar.xz extension, Zip extension and without extension
A Tar.gz with Tar.gz extension, Zip extension and without extension

Beware, the implementations presented here are only for testing purposes. They are not in any way endorsed to be used in production code, as they don't consider file locking problems among other things that my imagination couldn't bother to consider. =)
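The sample project itself is not shown, so the following is only a guess at what a small test harness for it could look like. It runs one detector over a set of scenarios and formats each outcome in the "<scenario> is: <type>" layout used in the Results listings below. DetectorHarness, formatLine and run are hypothetical names, and the detector is assumed to take a file name and return a MIME type string.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical harness; the original sample project is not shown in this answer.
final class DetectorHarness {
    private DetectorHarness() {
    }

    // Pads "<scenario> is:" to a fixed width so the detected types line up in a column.
    static String formatLine(final String scenario, final String detected) {
        return String.format("%-33s%s", scenario + " is:", detected);
    }

    // Runs the detector on every scenario (scenario label -> file name).
    // A failure is reported inline so one bad file does not stop the run.
    static Map<String, String> run(final Map<String, String> scenarios,
                                   final Function<String, String> detector) {
        final Map<String, String> lines = new LinkedHashMap<>();
        for (final Map.Entry<String, String> scenario : scenarios.entrySet()) {
            String detected;
            try {
                detected = String.valueOf(detector.apply(scenario.getValue()));
            } catch (final Exception exception) {
                detected = "<EXCEPTION: " + exception.getMessage() + ">";
            }
            lines.put(scenario.getKey(), formatLine(scenario.getKey(), detected));
        }
        return lines;
    }
}
```

Each approach below would then be passed in as the detector, which keeps the per-library code down to the few lines shown under each Implementation heading.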

MimetypesFileTypeMap

Implementation

import java.io.File;
import javax.activation.MimetypesFileTypeMap;

final File file = new File(basePath + "/" + fileName);
try {
    return MimetypesFileTypeMap.getDefaultFileTypeMap().getContentType(file);
} catch (final Exception exception) {
    return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       application/octet-stream
Rar with Zip extension is:       application/octet-stream
Zip with Zip extension is:       application/octet-stream
7z with 7z extension is:         application/octet-stream
7z with Zip extension is:        application/octet-stream
Tar.xz with Tar.xz extension is: application/octet-stream
Tar.xz with Zip extension is:    application/octet-stream
Tar.gz with Tar.gz extension is: application/octet-stream
Tar.gz with Zip extension is:    application/octet-stream
Rar without extension is:        application/octet-stream
Zip without extension is:        application/octet-stream
7z without extension is:         application/octet-stream
Tar.xz without extension is:     application/octet-stream
Tar.gz without extension is:     application/octet-stream

Conclusion

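The value returned by this approach when a file type has not been recognized is application/octet-stream. MimetypesFileTypeMap looks the type up by file extension in a small mapping table, which apparently doesn't cover any of these archive extensions. All scenarios failed, so we should discard this approach.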

URLConnection.guessContentTypeFromStream

Implementation

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URLConnection;

final File file = new File(basePath + "/" + fileName);
// try-with-resources closes the stream even if detection throws
try (InputStream inputStream = new BufferedInputStream(new FileInputStream(file))) {
    // guessContentTypeFromStream needs a stream that supports mark/reset, hence the buffering
    return URLConnection.guessContentTypeFromStream(inputStream);
} catch (final Exception exception) {
    return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       null
Rar with Zip extension is:       null
Zip with Zip extension is:       null
7z with 7z extension is:         null
7z with Zip extension is:        null
Tar.xz with Tar.xz extension is: null
Tar.xz with Zip extension is:    null
Tar.gz with Tar.gz extension is: null
Tar.gz with Zip extension is:    null
Rar without extension is:        null
Zip without extension is:        null
7z without extension is:         null
Tar.xz without extension is:     null
Tar.gz without extension is:     null

Conclusion

Again, this method fails all scenarios. It seems its support is very limited.

Files.probeContentType

Implementation

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

try {
    final Path path = Paths.get(basePath + "/" + fileName);
    return Files.probeContentType(path);
} catch (final Exception exception) {
    return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       application/vnd.rar
Rar with Zip extension is:       application/zip
Zip with Zip extension is:       application/zip
7z with 7z extension is:         application/x-7z-compressed
7z with Zip extension is:        application/zip
Tar.xz with Tar.xz extension is: application/x-xz-compressed-tar
Tar.xz with Zip extension is:    application/zip
Tar.gz with Tar.gz extension is: application/x-compressed-tar
Tar.gz with Zip extension is:    application/zip
Rar without extension is:        application/vnd.rar
Zip without extension is:        application/zip
7z without extension is:         application/x-7z-compressed
Tar.xz without extension is:     application/x-xz
Tar.gz without extension is:     application/gzip

Conclusion

This method worked surprisingly well, but don't be fooled: there is a scenario where it consistently fails. If a file has the wrong extension (one that doesn't match its content), it reports the type implied by the extension. This should not happen very often, but if one needs to be strict, this method is not to be used as is.

Also, some warn that this approach doesn't work well on Windows.

Workaround: if one removes the extension from the file name before probing, this method returns the proper value for all the given scenarios.
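A minimal sketch of that workaround, assuming it is acceptable to copy the download to a temporary file whose name has no extension and to probe the copy instead. ExtensionlessProbe and its method names are hypothetical. Like the other snippets here, it is meant for experimenting rather than production: copying a large archive just to probe it is wasteful, and renaming the file in place may be preferable.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper illustrating the "remove the extension" workaround.
final class ExtensionlessProbe {
    private ExtensionlessProbe() {
    }

    // Copies the file to a temp file whose name has no extension.
    // The empty suffix matters: a null suffix would make createTempFile append ".tmp".
    static Path copyWithoutExtension(final Path original) throws IOException {
        final Path copy = Files.createTempFile("probe-", "");
        Files.copy(original, copy, StandardCopyOption.REPLACE_EXISTING);
        return copy;
    }

    // Probes the extension-less copy so the misleading extension cannot influence the result.
    // The returned value depends on the platform's detectors and may be null.
    static String probeIgnoringExtension(final Path original) throws IOException {
        final Path copy = copyWithoutExtension(original);
        try {
            return Files.probeContentType(copy);
        } finally {
            Files.deleteIfExists(copy);
        }
    }
}
```

On the Linux machine used for the tests above, this should correspond to the "without extension" rows of the results; on other platforms the outcome may differ.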

Apache Tika (tika-eval 1.18)

There seem to be many flavors of this library (app, server, eval, etc), but many around the web complain about it being somewhat "dependency-heavy".

Implementation

import java.io.File;
import org.apache.tika.Tika;

try {
    return new Tika().detect(new File(basePath + "/" + fileName));
} catch (final Exception exception) {
    return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       application/x-rar-compressed
Rar with Zip extension is:       application/x-rar-compressed
Zip with Zip extension is:       application/zip
7z with 7z extension is:         application/x-7z-compressed
7z with Zip extension is:        application/x-7z-compressed
Tar.xz with Tar.xz extension is: application/x-xz
Tar.xz with Zip extension is:    application/x-xz
Tar.gz with Tar.gz extension is: application/gzip
Tar.gz with Zip extension is:    application/gzip
Rar without extension is:        application/x-rar-compressed
Zip without extension is:        application/zip
7z without extension is:         application/x-7z-compressed
Tar.xz without extension is:     application/x-xz
Tar.gz without extension is:     application/gzip

Conclusion

All files were properly identified, but this approach has its disadvantages as well as its advantages.

Pros:

Maintained by Apache.
Does not get fooled by extensions.

Cons:

Really heavy, especially if one only wants to check the file type. The tika-eval JAR weighs over 40 MB.
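If pulling in a large library only to tell five archive formats apart feels excessive, the underlying idea of content-based detection can be sketched by hand: compare the first bytes of the file against each format's well-known signature ("magic number"). This is not how Tika is implemented internally, just a minimal stand-in with hypothetical names (ArchiveSignature, detect, detectFile). It only knows the formats tested in this answer, and for ZIP it only matches the usual local-file-header signature, so an empty or spanned ZIP archive would go unrecognized.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical hand-rolled detector covering only the formats tested in this answer.
final class ArchiveSignature {
    private static final int[] ZIP = {0x50, 0x4B, 0x03, 0x04};                 // "PK\3\4"
    private static final int[] RAR = {0x52, 0x61, 0x72, 0x21, 0x1A, 0x07};     // "Rar!\x1A\7"
    private static final int[] SEVEN_ZIP = {0x37, 0x7A, 0xBC, 0xAF, 0x27, 0x1C};
    private static final int[] XZ = {0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00};
    private static final int[] GZIP = {0x1F, 0x8B};

    private ArchiveSignature() {
    }

    // Returns a MIME type for the leading bytes, or null when no signature matches.
    static String detect(final byte[] header) {
        if (startsWith(header, ZIP)) return "application/zip";
        if (startsWith(header, RAR)) return "application/x-rar-compressed";
        if (startsWith(header, SEVEN_ZIP)) return "application/x-7z-compressed";
        if (startsWith(header, XZ)) return "application/x-xz";
        if (startsWith(header, GZIP)) return "application/gzip";
        return null;
    }

    // Reads at most the first six bytes of the file and delegates to detect(byte[]).
    static String detectFile(final Path file) throws IOException {
        final byte[] header = new byte[6];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int count;
            while (filled < header.length
                    && (count = in.read(header, filled, header.length - filled)) > 0) {
                filled += count;
            }
        }
        return detect(Arrays.copyOf(header, filled));
    }

    private static boolean startsWith(final byte[] data, final int[] signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }
}
```

As with the Tika results above, a Tar.gz or Tar.xz file is reported as gzip or xz, since the tar archive only becomes visible after decompression.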

URLConnection

Implementation

import java.net.URL;
import java.net.URLConnection;

try {
    final URL url = new URL("file://" + basePath + "/" + fileName);
    final URLConnection urlConnection = url.openConnection();
    return urlConnection.getContentType();
} catch (final Exception exception) {
    return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       content/unknown
Rar with Zip extension is:       application/zip
Zip with Zip extension is:       application/zip
7z with 7z extension is:         content/unknown
7z with Zip extension is:        application/zip
Tar.xz with Tar.xz extension is: content/unknown
Tar.xz with Zip extension is:    application/zip
Tar.gz with Tar.gz extension is: application/octet-stream
Tar.gz with Zip extension is:    application/zip
Rar without extension is:        content/unknown
Zip without extension is:        content/unknown
7z without extension is:         content/unknown
Tar.xz without extension is:     content/unknown
Tar.gz without extension is:     content/unknown

Conclusion

It identifies hardly any compression format, and it goes by the file's extension, not its contents.

SimpleMagic 1.14

This project seems to be updated at least once a year.

Implementation

import com.j256.simplemagic.ContentInfo;
import com.j256.simplemagic.ContentInfoUtil;

try {
    final ContentInfoUtil util = new ContentInfoUtil();
    final ContentInfo info = util.findMatch(basePath + "/" + fileName);

    return info.getMimeType();
} catch (final Exception exception) {
    return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       application/x-rar
Rar with Zip extension is:       application/x-rar
Zip with Zip extension is:       application/zip
7z with 7z extension is:         application/x-7z-compressed
7z with Zip extension is:        application/x-7z-compressed
Tar.xz with Tar.xz extension is: <EXCEPTION: null>
Tar.xz with Zip extension is:    <EXCEPTION: null>
Tar.gz with Tar.gz extension is: application/x-gzip
Tar.gz with Zip extension is:    application/x-gzip
Rar without extension is:        application/x-rar
Zip without extension is:        application/zip
7z without extension is:         application/x-7z-compressed
Tar.xz without extension is:     <EXCEPTION: null>
Tar.gz without extension is:     application/x-gzip

Conclusion

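It worked for almost all our scenarios, but it failed to detect the more "obscure" compression formats like Tar.xz (and threw an exception in the process). The null exception message suggests a NullPointerException, most likely because findMatch returned null for the unrecognized file and the code above calls getMimeType() on the result without a null check.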

MimeUtil 2.1.3

This project has not been modified since 2010, so don't expect support or updates. It is listed here just for the sake of completeness.

Implementation

import eu.medsea.mimeutil.MimeUtil2;

try {
    final MimeUtil2 mimeUtil = new MimeUtil2();
    mimeUtil.registerMimeDetector("eu.medsea.mimeutil.detector.MagicMimeMimeDetector");

    return MimeUtil2.getMostSpecificMimeType(mimeUtil.getMimeTypes(basePath + "/" + fileName)).toString();
} catch (final Exception exception) {
    return "<EXCEPTION: " + exception.getMessage() + ">";
}

Results

Rar with Rar extension is:       application/x-rar
Rar with Zip extension is:       application/x-rar
Zip with Zip extension is:       application/zip
7z with 7z extension is:         application/octet-stream
7z with Zip extension is:        application/octet-stream
Tar.xz with Tar.xz extension is: application/octet-stream
Tar.xz with Zip extension is:    application/octet-stream
Tar.gz with Tar.gz extension is: application/x-gzip
Tar.gz with Zip extension is:    application/x-gzip
Rar without extension is:        application/x-rar
Zip without extension is:        application/zip
7z without extension is:         application/octet-stream
Tar.xz without extension is:     application/octet-stream
Tar.gz without extension is:     application/x-gzip

Conclusion

It identifies some of the most popular file types, but fails with Tar.xz and 7z.

file - Command Line

Not the prettiest solution, but it had to be tried: the file command on Ubuntu.

Implementation

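import java.io.BufferedReader;
import java.io.InputStreamReader;

try {
    // Pass the path as its own argument so spaces in it can't break the command line.
    // --brief makes file print only the MIME type, without the "<path>: " prefix.
    final Process process = new ProcessBuilder("file", "--brief", "--mime-type", basePath + "/" + fileName).start();

    final StringBuilder text = new StringBuilder();
    // try-with-resources closes the reader (and the process's stdout) when done
    try (BufferedReader stdInput = new BufferedReader(new InputStreamReader(process.getInputStream()))) {
        String line;
        while ((line = stdInput.readLine()) != null) {
            text.append(line);
        }
    }

    return text.toString().trim();
} catch (final Exception exception) {
    return "<EXCEPTION: " + exception.getMessage() + ">";
}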

Results

Rar with Rar extension is:       application/x-rar
Rar with Zip extension is:       application/x-rar
Zip with Zip extension is:       application/zip
7z with 7z extension is:         application/x-7z-compressed
7z with Zip extension is:        application/x-7z-compressed
Tar.xz with Tar.xz extension is: application/x-xz
Tar.xz with Zip extension is:    application/x-xz
Tar.gz with Tar.gz extension is: application/gzip
Tar.gz with Zip extension is:    application/gzip
Rar without extension is:        application/x-rar
Zip without extension is:        application/zip
7z without extension is:         application/x-7z-compressed
Tar.xz without extension is:     application/x-xz
Tar.gz without extension is:     application/gzip

Conclusion

It works for all our scenarios, but it relies on the file command being present on the system running the code.
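One thing the result listings make visible is that the tools disagree on naming: the same Rar file is reported as application/vnd.rar, application/x-rar or application/x-rar-compressed depending on the approach. Since the goal is to pick a decompressor, it may help to normalize the value before branching on it. The sketch below does that for exactly the MIME strings observed in the results above; ArchiveKinds and Kind are hypothetical names, and other tools or versions may well return strings that are not listed here.

```java
import java.util.Locale;

// Hypothetical normalizer covering only the MIME strings seen in the results above.
final class ArchiveKinds {
    enum Kind { ZIP, RAR, SEVEN_ZIP, GZIP, XZ, UNKNOWN }

    private ArchiveKinds() {
    }

    // Maps a detected MIME type to the archive kind it stands for; anything else is UNKNOWN.
    static Kind fromMimeType(final String mimeType) {
        if (mimeType == null) {
            return Kind.UNKNOWN;
        }
        switch (mimeType.trim().toLowerCase(Locale.ROOT)) {
            case "application/zip":
                return Kind.ZIP;
            case "application/vnd.rar":
            case "application/x-rar":
            case "application/x-rar-compressed":
                return Kind.RAR;
            case "application/x-7z-compressed":
                return Kind.SEVEN_ZIP;
            case "application/gzip":
            case "application/x-gzip":
            case "application/x-compressed-tar":
                return Kind.GZIP;
            case "application/x-xz":
            case "application/x-xz-compressed-tar":
                return Kind.XZ;
            default:
                return Kind.UNKNOWN;
        }
    }
}
```

Values such as application/octet-stream, content/unknown or a null result all end up as UNKNOWN, which is a reasonable point at which to reject the upload or fall back to another detector.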
