This is a minimal test to reproduce a memory-leak issue on a remote Dask Kubernetes cluster.
```python
def load_geojson(pid):
    # Imports are inside the function so they are executed on the worker.
    import sys
    import requests

    # Download and parse the country-boundaries GeoJSON, record the size of the
    # parsed object, then drop the reference.
    r = requests.get("https://github.com/datasets/geo-countries/raw/master/data/countries.geojson")
    temp = r.json()
    size_temp = sys.getsizeof(temp)
    del temp
    return size_temp

L_geojson = client.map(load_geojson, range(200))
del L_geojson
```
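For reference, the snippet assumes a `dask.distributed.Client` already connected to the remote cluster, along these lines (the scheduler address is illustrative):

```python
from dask.distributed import Client

# Illustrative only: address of the scheduler running in the Kubernetes cluster.
client = Client("tcp://dask-scheduler:8786")
```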
Observation: steady increase in worker memory (bytes stored) of roughly 30 MB on each run, which keeps growing until all memory is used. In another test using urllib instead of requests, I observed memory randomly increasing and decreasing on each run.
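As a quick illustration (not part of the original report), per-worker process memory can also be polled from the client between runs; `psutil` is already a dependency of distributed:

```python
import psutil

def worker_rss_mib():
    # Resident set size of the worker process, in MiB.
    return psutil.Process().memory_info().rss / 2**20

# Returns a mapping of worker address -> RSS in MiB.
print(client.run(worker_rss_mib))
```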
Desired behavior: memory should be cleaned up after the reference `L_geojson` is deleted.
Could someone please help with this memory leak?
I can confirm an increase in memory and "full garbage collections took X% CPU time recently" messages. If I allow the futures to run, memory also increases, but more slowly.
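One way to check whether that memory is actually reclaimable is to force a collection on every worker, for example:

```python
import gc

# Run a full garbage collection on each worker; returns the number of
# unreachable objects found, keyed by worker address.
client.run(gc.collect)
```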
Using `fsspec` does not have this problem, as you found with `urllib`, and this is what Dask typically uses for its IO (`fsspec` switched from `requests` to `aiohttp` for communication). Your modified function might look like
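something along these lines (a sketch, assuming `fsspec` and its `aiohttp`-backed HTTP filesystem are available on the workers):

```python
def load_geojson(pid):
    import sys
    import json
    import fsspec

    url = "https://github.com/datasets/geo-countries/raw/master/data/countries.geojson"
    # fsspec's HTTP filesystem uses aiohttp under the hood instead of requests.
    with fsspec.open(url, "rt") as f:
        temp = json.load(f)
    size_temp = sys.getsizeof(temp)
    del temp
    return size_temp
```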
Even with this version, you still get garbage collection warnings.