I have a huge tar.gz file with lots of images in it. I need to find the MD5 hash of each image. The code below works for normal folders and image files, but I cannot hash the images inside the tar file. Is there any way to compute the hashes without extracting the tar?
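import java.io.IOException;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public static String digestAndBuildImageEntry(Path filePath) throws NoSuchAlgorithmException {
    try {
        // Read the whole file and compute its MD5 digest.
        byte[] data = Files.readAllBytes(filePath);
        byte[] hashBytes = MessageDigest.getInstance("MD5").digest(data);
        // byte[].toString() prints an object identity, not the digest,
        // so format the digest as a 32-character hex string instead.
        return String.format("%032x", new BigInteger(1, hashBytes));
    } catch (IOException ex) {
        // The path could not be read (for example, an entry inside an archive).
        return null;
    }
}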
I get the exception below when I run this code:
Caused by: java.nio.file.FileSystemException: /Users/myuser/old/file.tar.gz/1.jpg: Not a directory
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
at sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
at java.nio.file.Files.readAttributes(Files.java:1737)
at java.nio.file.FileTreeWalker.getAttributes(FileTreeWalker.java:219)
at java.nio.file.FileTreeWalker.visit(FileTreeWalker.java:276)
at java.nio.file.FileTreeWalker.walk(FileTreeWalker.java:322)
at java.nio.file.FileTreeIterator.<init>(FileTreeIterator.java:72)
at java.nio.file.Files.walk(Files.java:3574)
at java.nio.file.Files.walk(Files.java:3625)
at com.example.demo.ImageDeduplication.listFiles(ImageDeduplication.java:78)
at com.example.demo.SparkSQL.lambda$1(SparkSQL.java:82)
at org.apache.spark.sql.UDFRegistration.$anonfun$register$352(UDFRegistration.scala:775)
... 17 more
The following Path values worked:
- /Users/myuser/old/1.jpg - worked
- /Users/myuser/old/ - able to iterate and get all files inside the folder
- /Users/myuser/old/file.tar.gz - gives the hash of the entire tar file
It does not work for:
- /Users/myuser/old/file.tar.gz/1.jpg - fails with "Not a directory"
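Apache Commons Compress has classes that can stream the tar.gz format. From the examples and docs, this means wrapping the file's input stream in GzipCompressorInputStream and then TarArchiveInputStream, and reading each entry's bytes as the stream reaches it.
Another option for quick access to files inside a tar.gz is to mount it as a virtual file system with commons-vfs.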
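Where adding Commons Compress is not an option, the same streaming idea can be sketched with only the JDK. This is not the Commons Compress API: it swaps in GZIPInputStream plus hand-rolled parsing of the 512-byte tar headers, and the names TarGzMd5, md5OfEntries and field are made up for illustration. It assumes plain ustar-style entries with names up to 100 characters and octal sizes, and does not handle GNU long names, pax extended headers or base-256 sizes, so treat it as a rough sketch rather than a complete tar reader.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;

public class TarGzMd5 {

    /** Maps each regular-file entry name in a tar.gz stream to its MD5 hex digest. */
    public static Map<String, String> md5OfEntries(InputStream tarGz)
            throws IOException, NoSuchAlgorithmException {
        Map<String, String> hashes = new LinkedHashMap<>();
        DataInputStream in = new DataInputStream(new GZIPInputStream(tarGz));
        byte[] header = new byte[512];
        byte[] buffer = new byte[8192];
        while (true) {
            try {
                in.readFully(header);
            } catch (EOFException e) {
                break; // stream ended without the usual zero-block trailer
            }
            if (header[0] == 0) {
                break; // an all-zero block marks the end of the archive
            }
            String name = field(header, 0, 100);
            String sizeText = field(header, 124, 12).trim();
            long size = sizeText.isEmpty() ? 0 : Long.parseLong(sizeText, 8);
            byte type = header[156];
            boolean regular = type == '0' || type == 0;

            // Stream the entry data through the digest instead of buffering it all.
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            long remaining = size;
            while (remaining > 0) {
                int n = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
                if (n < 0) {
                    throw new EOFException("Truncated entry: " + name);
                }
                if (regular) {
                    md5.update(buffer, 0, n);
                }
                remaining -= n;
            }
            // Entry data is padded with zeros up to a multiple of 512 bytes.
            int padding = (int) ((512 - size % 512) % 512);
            in.readFully(buffer, 0, padding);

            if (regular) {
                hashes.put(name, String.format("%032x", new BigInteger(1, md5.digest())));
            }
        }
        return hashes;
    }

    /** Reads a NUL-terminated ASCII field from a tar header block. */
    private static String field(byte[] block, int offset, int length) {
        int end = offset;
        while (end < offset + length && block[end] != 0) {
            end++;
        }
        return new String(block, offset, end - offset, StandardCharsets.US_ASCII);
    }
}
```

It could be called with Files.newInputStream(Paths.get("/Users/myuser/old/file.tar.gz")) inside a try-with-resources block. The map keys are the entry names as stored in the archive (for example 1.jpg), and only one 8 KB buffer is held in memory at a time, which matters for a huge archive.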