Memory leak while retrieving data from a proxy class

I am multi-processing data from a series of files. To do this, I built a class that distributes the data and started four processes that all access the same instance through a manager proxy to retrieve chunks. The problem is that if the workers call the class method retrieve() to fetch the data, memory usage keeps growing; if they don't, memory stays stable, even though getData() keeps refreshing the data. How can I keep memory usage stable while retrieving data? Or is there another way to achieve the same goal?

import pandas as pd
from multiprocessing import Process, RLock
from multiprocessing.managers import BaseManager 

class myclass():
    def __init__(self, path):
        self.path = path
        self.lock = RLock()
        self.getIter()

    def getIter(self):
        self.iter = pd.read_csv(self.path, chunksize=1000)

    def getData(self):
        with self.lock:
            try:
                self.data = next(self.iter)
            except StopIteration:
                # Iterator exhausted: reopen the file and start over
                self.getIter()
                self.data = next(self.iter)

    def retrieve(self):
        return self.data

def worker(c):
    while True:
        c.getData()
        # With the following line, memory usage keeps growing;
        # without it, memory stays stable
        data = c.retrieve()

# Guard so that spawn-based platforms don't re-run this setup in the children
if __name__ == '__main__':
    # Generate a test file
    with open('tmp.csv', 'w') as f:
        for i in range(1000000):
            f.write('%f\n' % (i * 1.))

    # Serve a single myclass instance through a manager; every worker
    # gets a proxy to the same object
    BaseManager.register('myclass', myclass)
    bm = BaseManager()
    bm.start()
    c = bm.myclass('tmp.csv')

    workers = []
    for i in range(4):
        p = Process(target=worker, args=(c,))
        p.start()
        workers.append(p)

1 Answer

I wasn't able to find the cause or fix it directly, but after changing the type of the returned variable from pandas.DataFrame to a str (a JSON string), the problem went away.
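
A minimal sketch of that workaround, assuming the same setup as in the question (the class is registered through BaseManager exactly as before): retrieve() serializes the current chunk with DataFrame.to_json() so only a plain str crosses the manager proxy, and the worker rebuilds a DataFrame with pandas.read_json() when it needs one.

import io
import pandas as pd
from multiprocessing import RLock

class myclass():
    def __init__(self, path):
        self.path = path
        self.lock = RLock()
        self.getIter()

    def getIter(self):
        self.iter = pd.read_csv(self.path, chunksize=1000)

    def getData(self):
        with self.lock:
            try:
                self.data = next(self.iter)
            except StopIteration:
                # Iterator exhausted: reopen the file and start over
                self.getIter()
                self.data = next(self.iter)

    def retrieve(self):
        # Return a JSON string rather than a DataFrame; a plain str
        # crosses the manager proxy without the observed memory growth
        return self.data.to_json()

def worker(c):
    while True:
        c.getData()
        payload = c.retrieve()                     # str, not DataFrame
        data = pd.read_json(io.StringIO(payload))  # rebuild locally if needed

Serializing every chunk adds some CPU overhead, but it keeps the objects crossing the proxy boundary limited to builtin types.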