A client is supposed to upload a compressed file into an S3 folder. The compressed file is then downloaded and decompressed so that various operations can be performed on the files it contains. Originally we told our client to compress its files into a ZIP file, but this proved too difficult for our client. Instead it submitted a RAR file with a ZIP extension... how clever. For obvious reasons one can't decompress a RAR file using a ZIP decompression algorithm.
So, I'm looking for a way to find out the file type of the S3 downloaded files given that I'm working on a Java project with Amazon's SDK on a Linux OS. I'll take care of how to decompress the file depending on the obtained file type.
I've looked at many Stack Overflow questions, like this one, but judging by the answers and their comments, none seems 100% reliable.
What would be the best approach to find out the compressed file's type?
TL;DR;
When one uploads a file to Amazon S3 programmatically, one can specify the object's Content-Type. If none is specified, as @Michael-bot clarifies, the value assigned by default is binary/octet-stream. If one instead uploads the file through Amazon S3's GUI, the file gets its Content-Type from its file extension (sadly, not its contents). If you can trust whoever uploaded the file to set the Content-Type correctly, go ahead and look at the ObjectMetadata, but if you can't (like me), you need another solution.
So, if you are looking for a solution that works on the most common file compression types, Files.probeContentType, Apache Tika and SimpleMagic seem to be acceptable solutions.
In the end I chose Files.probeContentType as it required no extra libraries and works just fine on a Linux machine (as long as the file doesn't have the wrong extension, for which there is a workaround: remove the file extension and let it do its magic).
The Test Setup
At first one would think that the response object when downloading the file from Amazon's S3 includes the file type. And it does contain this information, but the problem arises when the extension of the file doesn't match its contents.
Reading the Content-Type from the downloaded object's metadata would return application/zip even if the contents of the file are those of a RAR file. So this solution doesn't work for me.
For this reason I took the time to build a sample project that tested various scenarios with the different approaches and libraries available. I'm using Java 8, by the way.
The file types tested are:
Beware, the implementations presented here are only for testing purposes. They are not in any way endorsed to be used in production code, as they don't consider file locking problems among other things that my imagination couldn't bother to consider. =)
MimetypesFileTypeMap
Implementation
Results
Conclusion
The value returned by this approach when a file type has not been recognized is application/octet-stream. It seems all scenarios failed, so we should discard this approach.
URLConnection.guessContentTypeFromStream
Implementation
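A minimal sketch of how this method can be called, not necessarily the exact test code. The class and method names (StreamTypeGuesser, guess) are made up for illustration. The JDK method only works on streams that support mark/reset, so the sketch wraps the input when needed.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; the names are illustrative only.
class StreamTypeGuesser {

    // Returns the guessed MIME type, or null when the JDK does not
    // recognize the leading bytes of the stream.
    static String guess(InputStream in) throws IOException {
        // guessContentTypeFromStream requires mark/reset support.
        InputStream markable = in.markSupported() ? in : new BufferedInputStream(in);
        return URLConnection.guessContentTypeFromStream(markable);
    }

    // Convenience overload that opens the file and inspects its first bytes.
    static String guess(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return guess(in);
        }
    }
}
```

The method knows a handful of signatures such as GIF and PNG, so a null result for an archive is the expected failure mode here, not an error in the call.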
Results
Conclusion
Again, this method fails all scenarios. It seems its support is very limited.
Files.probeContentType
Implementation
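A minimal sketch of the call together with the "remove the extension" workaround, assuming it is acceptable to copy the file to a temporary location first. The class and method names (ProbeWithoutExtension, probe) are made up for illustration. Whether the contents are actually inspected depends on the JDK version and the file type detectors installed on the platform, so the extension-less copy may also yield null.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper; the names are illustrative only.
class ProbeWithoutExtension {

    // Copies the file to a temporary, extension-less name and probes the copy,
    // so a detector cannot be misled by a wrong suffix on the original.
    static String probe(Path file) throws IOException {
        Path dir = Files.createTempDirectory("probe");
        Path copy = dir.resolve("payload"); // deliberately no extension
        try {
            Files.copy(file, copy, StandardCopyOption.REPLACE_EXISTING);
            // May return null if no installed detector recognizes the file.
            return Files.probeContentType(copy);
        } finally {
            Files.deleteIfExists(copy);
            Files.deleteIfExists(dir);
        }
    }
}
```

Copying costs extra I/O for large archives; renaming in place would avoid that, at the price of touching the original file.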
Results
Conclusion
This method worked surprisingly well, but don't be fooled: there is a scenario where it consistently fails. If a file has the wrong extension (one that doesn't match its content), it will report the type implied by the extension. This should not happen very often, but if one needs to be strict about it, this method is not to be used.
Also, some warn that this approach doesn't work well on Windows.
Apache Tika (tika-eval 1.18)
There seem to be many flavors of this library (app, server, eval, etc), but many around the web complain about it being somewhat "dependency-heavy".
Implementation
Results
Conclusion
All files were properly identified, but the library has disadvantages as well as advantages.
Pros:
Cons:
URLConnection
Implementation
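A minimal sketch of a name-based lookup, assuming the static URLConnection.guessContentTypeFromName method is used; the tested code may have gone through an opened connection instead. The class and method names (NameBasedGuesser, guess) are made up for illustration.

```java
import java.net.URLConnection;

// Hypothetical helper; the names are illustrative only.
class NameBasedGuesser {

    // Looks only at the file name; the file's bytes are never read.
    // Returns null when the extension is not in the JDK's built-in table.
    static String guess(String fileName) {
        return URLConnection.guessContentTypeFromName(fileName);
    }
}
```

Because only the name is consulted, a RAR archive renamed to .zip gets exactly the same answer as a genuine ZIP file.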
Results
Conclusion
It identifies hardly any file compression format, and it goes by the file's extension, not its contents.
SimpleMagic 1.14
This project seems to be updated at least once a year.
Implementation
Results
Conclusion
It worked for almost all our scenarios, but it failed to detect the more "obscure" compression formats like Tar.xz (and threw an exception in the process).
MimeUtil 2.1.3
This project has not been modified since 2010, so don't expect support or updates. It is listed here only for the sake of completeness.
Implementation
Results
Conclusion
It identifies some of the most popular file types, but fails with Tar.xz and 7z.
file - Command Line
Not the prettiest solution, but it had to be tried: Ubuntu file command.
Implementation
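A minimal sketch of shelling out to the command, assuming file is on the PATH and supports the --brief and --mime-type options. The class and method names (FileCommandProbe, mimeType) are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical helper; the names are illustrative only.
class FileCommandProbe {

    // Runs `file --brief --mime-type <path>` and returns its first output line.
    // Returns Optional.empty() if the command cannot be started, exits with a
    // non-zero status, or prints nothing.
    static Optional<String> mimeType(Path file) throws InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "file", "--brief", "--mime-type", file.toString());
        pb.redirectErrorStream(true); // merge stderr so the process cannot block on it
        try {
            Process process = pb.start();
            String firstLine;
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                firstLine = reader.readLine();
                while (reader.readLine() != null) {
                    // drain any remaining output
                }
            }
            int exitCode = process.waitFor();
            if (exitCode != 0 || firstLine == null || firstLine.trim().isEmpty()) {
                return Optional.empty();
            }
            return Optional.of(firstLine.trim());
        } catch (IOException e) {
            // Typically means the `file` command is not installed.
            return Optional.empty();
        }
    }
}
```

Returning an empty Optional when the command is missing lets the caller fall back to another detector instead of failing outright.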
Results
Conclusion
It works for all our scenarios, but again, this relies on the file command being present on the system running the code.