The following code is able to read in a bzipped file:
offset = 24
# Open the object
fobj = open(filey,'rb')
# Read the data
buffer = fobj.read()
# Decompress the bz2 blocks
buffer_unbzip,places_to_bzip = bzip_blocks_decompress_all(buffer,offset)
where the bzip_blocks_decompress_all function is defined as below:
def bzip_blocks_decompress_all(data,offset):
    import bz2
    frames = bytearray()
    places_to_bzip = []
    while offset < len(data):
        block_cmp_bytes = abs(int.from_bytes(data[offset:offset + 4], 'big', signed=True))
        offset += 4
        frames += bz2.decompress(data[offset:offset + block_cmp_bytes])
        places_to_bzip.append([offset,offset+block_cmp_bytes])
        offset += block_cmp_bytes
    return frames,places_to_bzip
This gives me the locations of the bzipped blocks (places_to_bzip), so I thought I should be able to do something like the following:
# Try to compress using bz2 just based on some of the places_to_bzip
a1 = buffer[places_to_bzip[0][0]:places_to_bzip[0][1]]
a2 = buffer_unbzip[places_to_bzip[0][0]:places_to_bzip[0][1]]
# Convert a2 back to a1 with a bzip compression
import bz2  # needed at module level; the import inside the function is local to it
a3 = bz2.compress(a2)
print(len(a1))
print(len(a2))
print(len(a3))
104
104
70
Why is this not recompressing properly? Below is the output from a1 and a2 for testing:
print(a1)
b'BZh51AY&SY\xe6\xb1\xacS\x00\x00\x02_\xab\xfe(@\x00\x10\x00@\x04\x00@\x00@\x800\x02\x00\x00\x01\x00@\x08\x00\x00\x18 \x00T4\x8d\x004\x01\xa0\x91(\x01\x90\xd3\xd2\x14\xac\xd6v\x85\xf0\x0fD\x85\xc3A}\xe09\xbc\xe1\x8b\x04Y\xbfb$"\xcc\x13\xc0B\r\x99\xf1Qa%S\x00|]\xc9\x14\xe1BC\x9a\xc6\xb1L'
print(a2)
bytearray(b'\x00\x0b\x00\x02\x05z\x00\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00X\x00\x00\x00\x00\x002\x04@\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01h\x00\x00\x00\x00\x002\x04@\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')
Per my comments, buffer_unbzip contains the decompressed data only, and the offsets in places_to_bzip are the start/end offsets of slices in the original compressed data. The offsets of the unbzipped frames are not known.
Below I've reverse-engineered the input file and generated one, then used the OP's code to extract the data. The code is modified to also return the start/end of each unbzipped frame, and then walks the offsets, re-compressing each frame and comparing the result with that frame's compressed data:
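A minimal sketch of that approach, under stated assumptions: the file layout (a 24-byte header followed by blocks, each prefixed with a 4-byte big-endian signed length) is inferred from the question's code, and the names HEADER_SIZE, build_sample, decompress_with_offsets and roundtrip_matches are hypothetical, introduced only for illustration. It also reads the compression level from the fourth byte of each bz2 header (the question's a1 starts with BZh5), since bz2.compress defaults to level 9 and would otherwise produce different bytes.

```python
import bz2
import struct

HEADER_SIZE = 24  # assumed header length, matching offset = 24 in the question


def build_sample(frames, level=5):
    """Build a synthetic file: header, then [4-byte length][bz2 block] per frame."""
    out = bytearray(b'\x00' * HEADER_SIZE)
    for frame in frames:
        block = bz2.compress(frame, compresslevel=level)
        out += struct.pack('>i', len(block)) + block
    return bytes(out)


def decompress_with_offsets(data, offset=HEADER_SIZE):
    """Decompress every block, recording its span in both buffers."""
    frames = bytearray()
    compressed_spans = []    # (start, end) into the original compressed data
    uncompressed_spans = []  # (start, end) into the decompressed frames
    while offset < len(data):
        size = abs(int.from_bytes(data[offset:offset + 4], 'big', signed=True))
        offset += 4
        chunk = bz2.decompress(data[offset:offset + size])
        compressed_spans.append((offset, offset + size))
        uncompressed_spans.append((len(frames), len(frames) + len(chunk)))
        frames += chunk
        offset += size
    return frames, compressed_spans, uncompressed_spans


def roundtrip_matches(data, frames, compressed_spans, uncompressed_spans):
    """Re-compress each frame and compare it with the original block."""
    results = []
    for (cs, ce), (us, ue) in zip(compressed_spans, uncompressed_spans):
        original = data[cs:ce]
        level = original[3] - ord('0')  # header b'BZh5' means level 5
        recompressed = bz2.compress(bytes(frames[us:ue]), compresslevel=level)
        results.append(recompressed == original)
    return results


sample = build_sample([b'\x00' * 300 + b'abc', b'hello world' * 50])
frames, c_spans, u_spans = decompress_with_offsets(sample)
matches = roundtrip_matches(sample, frames, c_spans, u_spans)
```

Here the recompressed blocks match because the sample was written by the same library at the same level; for a file produced by a different bzip2 implementation, byte-identical output is not something this sketch can promise.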
Output: