How to monitor the total size of output produced by a subprocess in real time?


The code below is a toy example of the actual situation I am dealing with.¹ (Warning: this code will loop forever.)

import subprocess
import uuid

class CountingWriter:
    def __init__(self, filepath):
        self.file = open(filepath, mode='wb')
        self.counter = 0

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.file.close()

    def __getattr__(self, attr):
        return getattr(self.file, attr)

    def write(self, data):
        written = self.file.write(data)
        self.counter += written
        return written

with CountingWriter('myoutput') as writer:
    with subprocess.Popen(['/bin/gzip', '--stdout'],
                          stdin=subprocess.PIPE,
                          stdout=writer) as gzipper:
        while writer.counter < 10000:
            gzipper.stdin.write(str(uuid.uuid4()).encode())
            gzipper.stdin.flush()
            writer.flush()
            # writer.counter remains unchanged

        gzipper.stdin.close()

In English: I start a subprocess, gzipper, which receives input through its stdin and writes compressed output to a CountingWriter object. The code features a while-loop, conditioned on the value of writer.counter, that feeds some random content to gzipper at each iteration.

This code does not work!

More specifically, writer.counter never gets updated, so execution never leaves the while-loop.

This example is certainly artificial, but it captures the problem I would like to solve: how to stop feeding data into gzipper once it has written a certain number of bytes.

Q: How must I change the code above to get this to work?


FWIW, I thought the problem had to do with buffering, hence all the calls to *.flush() in the code; they have no noticeable effect, though. Incidentally, I cannot call gzipper.stdout.flush(), because gzipper.stdout is not a CountingWriter object, as I had expected; surprisingly, it is None.
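(A minimal illustration of that last point: Popen only creates a .stdout attribute when you pass stdout=subprocess.PIPE; anything else, including a file-like object, leaves .stdout set to None. The file name scratch.out below is arbitrary.)

import subprocess

# .stdout is populated only when a pipe is explicitly requested
p = subprocess.Popen(['/bin/echo', 'hi'], stdout=subprocess.PIPE)
print(p.stdout is None)   # False: a pipe was created for us
p.communicate()

# with a real file (or a file-like object), .stdout stays None
with open('scratch.out', 'wb') as f:
    q = subprocess.Popen(['/bin/echo', 'hi'], stdout=f)
    q.wait()
    print(q.stdout is None)   # True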


¹ In particular, I am using a /bin/gzip --stdout subprocess only for the sake of this example, because it is a more readily available stand-in for the compression program I am actually working with. If I really wanted to gzip-compress my output, I would use Python's standard gzip module.

1 Answer

Your "writer" is an arbitrary Python object - subprocess piping needs real files - as those will be used by their O.S. handlers in the subprocess. The only reason you get any data written to the output file at all is because you proxied getattr - so the code in subprocess have retrieved the fileno() for your proxied file - the real, operating system level, file is the only thing seen in the actual subprocess (gzip) - not your writer object.

What can be done instead is to promote counter to a property that calls os.stat on your output file:

import subprocess
import uuid
import os

class CountingWriter:
    def __init__(self, filepath):
        self.filepath = filepath
        
    @property
    def counter(self):
        # report the number of bytes the OS has seen so far by
        # stat'ing the output file, rather than by counting writes
        if not hasattr(self, "file"):
            return 0

        return os.stat(self.filepath).st_size

    def __enter__(self):
        # by bringing the actual file opening into the `__enter__`,
        # we avoid side effects just by instantiating the object.
        self.file = open(self.filepath, mode='wb')
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # closing is enough; `counter` keeps working afterwards,
        # because it stats the path, not the (now closed) file object
        self.file.close()

    def __getattr__(self, attr):
        # guard against infinite recursion when `self.file` does not
        # exist yet (i.e. before `__enter__` has run)
        if attr == "file":
            raise AttributeError(attr)
        return getattr(self.file, attr)

    def write(self, data):
        return self.file.write(data)
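With that change, the driving loop from the question terminates as intended. A minimal sketch, reusing the question's setup (note that gzip buffers its output internally, so writer.counter lags behind the input until compressed blocks are flushed to the descriptor):

with CountingWriter('myoutput') as writer:
    with subprocess.Popen(['/bin/gzip', '--stdout'],
                          stdin=subprocess.PIPE,
                          stdout=writer) as gzipper:
        # writer.counter now reflects the file's on-disk size, which
        # grows as gzip flushes compressed output
        while writer.counter < 10000:
            gzipper.stdin.write(str(uuid.uuid4()).encode())
            gzipper.stdin.flush()
        gzipper.stdin.close()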