python importlib seems to be sharing data between instances


OK, so I am having a weird one. I am running Python in SideFX's Hython (their custom build) with PDG. The only real difference between Hython and vanilla Python is some internal functions for handling geometry data and compiled nodes, which shouldn't be an issue here even though they are being used.

The way the code runs, I generate a list of files from disk, which creates PDG work items. Those work items are then processed in parallel by PDG. Here is the code for that:

import importlib.util
import pdg
import os
from pdg.processor import PyProcessor
import json

class CustomProcessor(PyProcessor):
    def __init__(self, node):
        PyProcessor.__init__(self,node)
        self.extractor_module = 'GeoExtractor'

    def onGenerate(self, item_holder, upstream_items, generation_type):
        for upstream_item in upstream_items:
            new_item = item_holder.addWorkItem(parent=upstream_item, inProcess=True)
        return pdg.result.Success
    
    def onCookTask(self, work_item):
        # load Geo2Custom.py fresh for this work item under the name "callback"
        spec = importlib.util.spec_from_file_location("callback", "Geo2Custom.py")
        GE = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(GE)
        GE.convert(f"{work_item.attribValue('directory')}/{work_item.attribValue('filename')}{work_item.attribValue('extension')}", work_item.index, f'FRAME { work_item.index }', self.extractor_module)
        return pdg.result.Success

def bulk_convert (path_pattern, extractor_module = 'GeoExtractor'):
    type_registry = pdg.TypeRegistry.types()
    try:
        type_registry.registerNode(CustomProcessor, pdg.nodeType.Processor, name="customprocessor", label="Custom Processor", category="Custom")
    except Exception:
        pass  # the node type may already be registered from an earlier run
    whereItWorks = pdg.GraphContext("testBed")
    whatWorks    = whereItWorks.addScheduler("localscheduler")
    whatWorks.setWorkingDir(os.getcwd (), '$HIP')
    
    whereItWorks.setValues(f'{whatWorks.name}', {'maxprocsmenu':-1, 'tempdirmenu':0, 'verbose':1})
    
    findem       = whereItWorks.addNode("filepattern")
    whereItWorks.setValue(f'{findem.name}', 'pattern', path_pattern, 0)
    
    generic      = whereItWorks.addNode("genericgenerator")
    whereItWorks.setValue(generic.name, 'itemcount', 4, 0)
     
    custom       = whereItWorks.addNode("customprocessor")
    custom.extractor_module = extractor_module
    node1 = [findem]
    node2 = [custom]*len(node1)
    
    for n1, n2 in zip(node1, node2):
        whereItWorks.connect(f'{n1.name}.output', f'{n2.name}.input')
        n2.cook(True)
        for node in whereItWorks.graph.nodes():
            node.dirty(False)
        whereItWorks.disconnect(f'{n1.name}.output', f'{n2.name}.input')
    
    print ("FULLY DONE")
Here is Geo2Custom.py, which onCookTask() loads and whose convert() function it calls:

import os
import hou
import traceback

import CustomWriter
import importlib

def convert (filename, frame_id, marker, extractor_module = 'GeoExtractor'):
    Extractor = importlib.__import__ (extractor_module)
    base, ext = os.path.splitext (filename)
    if ext == '.sc':
        base = os.path.splitext (base)[0]
    dest_file = base + ".custom"

    geo = hou.Geometry ()
    geo.loadFromFile (filename)
    try:
        frame = Extractor.extract_geometry (geo, frame_id)
    except Exception as e:
        print (f'F{ frame_id } Geometry extraction failed: { traceback.format_exc () }.')
        return None

    print (f'F{ frame_id } Geometry extracted. Writing file { dest_file }.')
    try:
        CustomWriter.write_frame (frame, dest_file)
    except Exception as e:
        print (f'F{ frame_id } writing failed: { e }.')

    print (marker + " SUCCESS")

The onCookTask() code runs when a work item is processed.

Inside GeoExtractor.py I load the geometry file defined by the work item, then convert it into a couple of Pandas DataFrames to collate and process the massive volume of data quickly. The result is then passed to a custom set of functions that write binary files to disk from the Pandas data.
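To give a sense of the shape of that code without the proprietary parts, the core of the extraction looks roughly like this (a simplified stand-in, not the real function; the full stripped version is in the pastebin linked below):

import pandas as pd

def extract_geometry(geo, frame_id):
    # 'P' point positions come back as one flat tuple: (x0, y0, z0, x1, ...)
    pos = geo.pointFloatAttribValues('P')
    df = pd.DataFrame({'x': pos[0::3], 'y': pos[1::3], 'z': pos[2::3]})
    df['frame'] = frame_id
    return df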

Everything appears to run flawlessly until I check my output binaries and see that their file sizes escalate much more than they should. That indicates that either something is being shared between instances or something is not being cleared from memory, and subsequent loads of the extractor code are appending to DataFrames that have the same names.
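For illustration only, not my actual code: module-level state like the following inside an extractor would produce exactly this kind of growth, because the list survives between calls whenever the same module object is reused:

import pandas as pd

frames = []  # module-level: lives as long as the module object does

def extract_geometry(geo, frame_id):  # geo unused in this toy sketch
    frames.append(pd.DataFrame({'frame': [frame_id]}))
    # unless 'frames' is cleared, every call returns all previous frames too
    return pd.concat(frames, ignore_index=True)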

I have run the exact same GeoExtractor code sequentially, with the Python instance closing between each file conversion, and the files are fine, growing only very slowly as the geometry data volume grows. So the issue has to lie somewhere in the parallelization under PDG, which calls the GeoExtractor.py code over and over for each work item.
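My understanding (which may be exactly where I am going wrong) is that Python caches imports in sys.modules, so within a single interpreter the importlib.__import__ call in convert() keeps handing back the same cached module object rather than executing the file again. A quick check, assuming GeoExtractor is importable on sys.path:

import importlib
import sys

first = importlib.__import__('GeoExtractor')   # executes the module the first time
second = importlib.__import__('GeoExtractor')  # returned from the sys.modules cache
print(first is second)                         # True: same object, same module-level state
print('GeoExtractor' in sys.modules)           # True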

I have contemplated moving the importlib logic into the class's __init__(), leaving only the call to the member function in onCookTask(). I have even considered going so far as to pass a unique variable for each work item into GeoExtractor, creating a closure over the internal functions so that each one is a unique instance in memory; see the sketch below.
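Something like this untested sketch is what I have in mind for the per-item approach: load a private copy of the extractor under a unique module name for each work item, since importlib.util.module_from_spec() builds a fresh module object each time and never touches sys.modules on its own (the function name and the path default here are placeholders):

import importlib.util

def load_private_extractor(frame_id, path='GeoExtractor.py'):
    # unique module name per work item, so no two items share module globals
    spec = importlib.util.spec_from_file_location(f'GeoExtractor_{frame_id}', path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # executes with a fresh globals dict
    return module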

I tried to make a stripped-down version of GeoExtractor, but since I'm not sure where the leak is, I just ended up pulling out comments with proprietary or superfluous information and changing some custom library names. The file ended up fairly long, so I am including a pastebin: https://pastebin.com/4HHS8D2W

As for CustomGeometry and CustomWriter, there is no working form of either library that would be NDA-safe, so unfortunately they have to stay black-boxed. CustomGeometry is a handful of container classes that organize all of the data coming out of the geometry, and CustomWriter is a formatter/writer for the binary format we are using. I am hoping the issue isn't in either of them.

Edit 1: I fixed an issue in the example code.
Edit 2: Added larger examples.
