Does slicing a bytes object create a whole new copy of data in Python?


Say I have a very large bytes object (after loading a binary file) and I want to read it part by part, advancing the starting position until it reaches the end. I use slicing to accomplish this. I'm worried that Python will create a completely new copy each time I ask for a slice, instead of simply giving me a pointer into the memory at the position I want.

Simple example:

data = Path("binary-file.dat").read_bytes()
total_length = len(data)
start_pos = 0

while start_pos < total_length:
    bytes_processed = decode_bytes(data[start_pos:])  # <---- ***
    start_pos += bytes_processed

In the above example, does Python create a completely new copy of the bytes object starting from start_pos due to the slicing? If so, what is the best way to avoid the copy and instead pass just a pointer to the relevant position of the bytes array?

Answer by user3840170:

Yes, slicing a bytes object does create a copy, at least as of CPython 3.9.12. The closest the documentation comes to admitting this is in the description of the bytes constructor:

In addition to the literal forms, bytes objects can be created in a number of other ways:

  • A zero-filled bytes object of a specified length: bytes(10)
  • From an iterable of integers: bytes(range(20))
  • Copying existing binary data via the buffer protocol: bytes(obj)

which suggests that any creation of a bytes object produces a separate copy of the data. But since I had a hard time finding explicit confirmation that slicing does the same, I resorted to an empirical test.

>>> b = b'\1' * 100_000_000
>>> qq = [b[1:] for _ in range(20)]

After executing the first line, memory usage of the python3 process in top was about 100 MB. The second line executed after a considerable delay, making memory usage rise to about 2 GB. This seems pretty conclusive. PyPy 7.3.9 targeting Python 3.8 behaves largely the same, though of course PyPy's garbage collection is not as eager as CPython's, so the memory is not freed as soon as the bytes objects become unreachable.
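The same copy can also be confirmed in-process with the standard-library tracemalloc module instead of watching top. A minimal sketch (exact numbers vary by CPython version, but the order of magnitude is what matters):

```python
import tracemalloc

b = b'\x01' * 10_000_000  # ~10 MB of data, allocated before tracing starts

tracemalloc.start()
sliced = b[1:]            # the slice allocates a fresh ~10 MB bytes object
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(current)            # on the order of 10_000_000 bytes
```

If slicing merely returned a view into the existing buffer, the traced allocation would be a few dozen bytes, not megabytes.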

And it makes sense. If slicing created a reference to the original buffer instead of copying it, you could create a memory leak by allocating a large bytes object, slicing one byte from it, then dropping a reference to the original. (The V8 JavaScript engine used to have this exact problem.)

To avoid copying the underlying buffer, wrap your bytes in a memoryview and slice that:

>>> bm = memoryview(b)
>>> qq = [bm[1:] for _ in range(50)]
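A quick way to see that memoryview slices share the underlying buffer rather than copying it is to compare object sizes with sys.getsizeof. A sketch, assuming CPython (the exact byte counts vary by version):

```python
import sys

b = b'\x01' * 1_000_000

# Slicing bytes allocates a new object about as large as the slice.
print(sys.getsizeof(b[1:]))   # roughly 1 MB

# Slicing a memoryview allocates only a small, fixed-size view object.
bm = memoryview(b)
view = bm[1:]
print(sys.getsizeof(view))    # a couple hundred bytes, regardless of slice length
print(view.obj is b)          # True: the view is still backed by the original bytes
```

One caveat: code that requires a real bytes object will need bytes(view), which does copy the sliced range; but anything that accepts an arbitrary buffer (for example struct.unpack_from) can consume the memoryview directly without a copy.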