How to *properly* compress and decompress a text file using bz2 and python


So I've had a system that scrapes files and compresses them with bz2 for a while now. It does so using the block of code below, which I found on SO a few months back.

Let's assume for the purposes of this post the filename is always file.XXXX where XXXX is the relevant extension. We start with .txt

### How to compress a text file
import bz2

filepath_compressed = "file.tar.bz2"
with open("file.txt", 'rb') as data:
    tarbz2contents = bz2.compress(data.read(), 9)
    with bz2.BZ2File(filepath_compressed, 'wb') as f_comp:
        f_comp.write(tarbz2contents)

Until now, I've decompressed these files with a tool called Keka: it turns the .tar.bz2 file into a .tar, then I run that through Keka again to get an "extensionless" file, which I rename with a .txt extension on my Mac, and then it works.

Now, to decompress programmatically, I've tried a few things. I've tried the stuff from this post and the code from this post. I've tried using BZ2Decompressor, BZ2File, and everything else I could find. I just seem to be missing something and I'm not sure what it is.

Here is what I have so far, and I'd like to know what is wrong with this code:

import bz2, tarfile, shutil

# Decompress to tar
with bz2.BZ2File("file.tar.bz2") as fr, open("file.tar", "wb") as fw:
    shutil.copyfileobj(fr, fw)
    
# Decompress from tar to txt
with tarfile.open("file.tar", "r:") as tar:
    tar.extractall("file_out.txt")

This code crashes with a "tarfile.ReadError: truncated header" error. I think the first context manager outputs a binary text file, and I tried decoding that, but that failed too. What am I missing here? I feel like a noob.


If you would like a minimum runnable piece of code to replicate this, add the following to make a dummy file:

lines = ["Line 1","Line 2", "Line 3"]

with open("file.txt", "w") as f:
    for line in lines:
        f.write(line+"\n")

There is 1 best solution below


The thing that you're making is not a .tar.bz2 file, but rather a .bz2.bz2 file. You are compressing twice with bzip2 (the second time with no effect), and there is no tar file generation anywhere to be seen.
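
That also explains the error in the question's decompression attempt: after one pass through BZ2File, the so-called file.tar is still bzip2 data rather than a tar archive, so tarfile cannot parse a header from it. Here is a minimal sketch of what the fix could look like, using only the standard library. The function names (compress_bz2, recover_double_bz2, and so on) and the output file names are my own choices for illustration, not anything from the original code. recover_double_bz2 assumes the input was produced exactly as in the question, i.e. bzip2 applied twice to a single file.

```python
import bz2
import os
import shutil
import tarfile
import tempfile


def compress_bz2(src, dst):
    """Compress src into a single bzip2 stream (e.g. file.txt -> file.txt.bz2)."""
    with open(src, "rb") as f_in, bz2.open(dst, "wb", compresslevel=9) as f_out:
        shutil.copyfileobj(f_in, f_out)


def decompress_bz2(src, dst):
    """Inverse of compress_bz2: one bzip2 pass back to the original bytes."""
    with bz2.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)


def recover_double_bz2(src, dst):
    """Recover a file written by the question's code (bzip2 applied twice)."""
    with bz2.open(src, "rb") as outer:
        inner = outer.read()  # first pass: still bzip2 data, not a tar archive
    with open(dst, "wb") as f_out:
        f_out.write(bz2.decompress(inner))  # second pass: the original bytes


def make_tar_bz2(src, dst):
    """Create a real .tar.bz2 archive containing src under its base name."""
    with tarfile.open(dst, "w:bz2") as tar:
        tar.add(src, arcname=os.path.basename(src))


def extract_tar_bz2(src, out_dir):
    """Extract a real .tar.bz2; out_dir is a directory, not an output file name."""
    os.makedirs(out_dir, exist_ok=True)
    with tarfile.open(src, "r:bz2") as tar:
        tar.extractall(out_dir)


if __name__ == "__main__":
    # Small round trip in a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        txt = os.path.join(tmp, "file.txt")
        with open(txt, "w") as f:
            f.write("Line 1\nLine 2\nLine 3\n")
        compress_bz2(txt, txt + ".bz2")
        decompress_bz2(txt + ".bz2", os.path.join(tmp, "roundtrip.txt"))
        with open(os.path.join(tmp, "roundtrip.txt")) as f:
            print(f.read(), end="")
```

For the files you already have, recover_double_bz2("file.tar.bz2", "file_out.txt") should give you the text back. Going forward, use compress_bz2 if you only ever compress one file at a time, and the tar variants only if you actually want an archive. Note that extractall takes a target directory, so passing "file_out.txt" as in the question would create a directory with that name.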