For a while now I've had a system that scrapes files and compresses them with bz2. It does so using the following block of code, which I found on SO a few months back.
Let's assume for the purposes of this post that the filename is always file.XXXX, where XXXX is the relevant extension. We start with .txt:
### How to compress a text file
import bz2

filepath_compressed = "file.tar.bz2"
with open("file.txt", 'rb') as data:
    tarbz2contents = bz2.compress(data.read(), 9)
with bz2.BZ2File(filepath_compressed, 'wb') as f_comp:
    f_comp.write(tarbz2contents)
To decompress it, I've always used a decompression tool called Keka, which decompresses the .tar.bz2 file to .tar. I then run that through Keka again to get an "extensionless" file, add a .txt extension to it on my Mac, and then it works.
To decompress programmatically, I've tried a few things. I've tried the stuff from this post and the code from this post. I've tried using BZ2Decompressor, BZ2File, and everything else I could find. I just seem to be missing something and I'm not sure what it is.
Here is what I have so far, and I'd like to know what is wrong with this code:
import bz2, tarfile, shutil

# Decompress to tar
with bz2.BZ2File("file.tar.bz2") as fr, open("file.tar", "wb") as fw:
    shutil.copyfileobj(fr, fw)

# Decompress from tar to txt
with tarfile.open("file.tar", "r:") as tar:
    tar.extractall("file_out.txt")
This code crashes with a "tarfile.ReadError: truncated header" error. I think the first context manager outputs a binary text file, and I tried decoding that, but that failed too. What am I missing here? I feel like a noob.
If you would like a minimal runnable piece of code to replicate this, add the following to make a dummy file:
lines = ["Line 1", "Line 2", "Line 3"]
with open("file.txt", "w") as f:
    for line in lines:
        f.write(line + "\n")
The thing that you're making is not a .tar.bz2 file, but rather a .bz2.bz2 file. You are compressing twice with bzip2 (the second time with no effect), and there is no tar file generation anywhere to be seen.
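To make that concrete, here is a minimal sketch under the assumption that your existing files were all written by the compression code in the question, so each one holds two bzip2 layers and no tar layer. The function names (recover_double_bz2, make_tar_bz2, extract_tar_bz2) are hypothetical helpers made up for this illustration. The first one peels off both bzip2 layers to recover the original file; the other two show one way to build and unpack a genuine .tar.bz2 with the standard tarfile module, if a tar archive is what you actually want.

```python
import bz2
import os
import tarfile
import tempfile


def recover_double_bz2(src_path, dst_path):
    """Undo the two bzip2 layers written by the question's compression code."""
    with bz2.open(src_path, "rb") as f:   # outer layer (added by BZ2File)
        inner = f.read()
    data = bz2.decompress(inner)          # inner layer (added by bz2.compress)
    with open(dst_path, "wb") as out:
        out.write(data)


def make_tar_bz2(src_path, archive_path):
    """Create a real bzip2-compressed tar archive containing one file."""
    with tarfile.open(archive_path, "w:bz2") as tar:
        # arcname stores only the file name, not the full source path
        tar.add(src_path, arcname=os.path.basename(src_path))


def extract_tar_bz2(archive_path, dest_dir):
    """Extract a .tar.bz2 archive; dest_dir is a directory, not a file name."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:bz2") as tar:
        tar.extractall(dest_dir)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        txt = os.path.join(tmp, "file.txt")
        with open(txt, "w") as f:
            f.write("Line 1\nLine 2\nLine 3\n")

        # Reproduce the question's double compression, then undo it.
        legacy = os.path.join(tmp, "file.tar.bz2")
        with open(txt, "rb") as data, bz2.BZ2File(legacy, "wb") as f_comp:
            f_comp.write(bz2.compress(data.read(), 9))
        recover_double_bz2(legacy, os.path.join(tmp, "file_out.txt"))

        # Build and unpack a genuine tar.bz2 archive.
        archive = os.path.join(tmp, "real.tar.bz2")
        make_tar_bz2(txt, archive)
        extract_tar_bz2(archive, os.path.join(tmp, "extracted"))
```

Note that extractall takes a destination directory, so passing "file_out.txt" as in the question would create a folder with that name rather than a text file. If you only ever compress single files, a plain file.txt.bz2 written once with bz2.open is probably simpler than involving tar at all.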