Array inside object becomes empty when passed in using multiprocessing, despite having contents


I'm new to Python and having trouble passing an object into a function. Basically, I'm trying to read a large file of over 1.4B lines.

I am passing in an object that contains information about the file. One of its fields is a very large array containing the offset of the start of each line in the file.

Because this array is so large, by passing just the object reference I was hoping to have a single instance of the array shared by all the processes, although I don't know whether that is what actually happens.

The problem is that the array is empty by the time it arrives inside the process_line function, which leads to errors.

Here is where the function is called (see the p.starmap):

line_args = []
with open(file_name, 'r') as f:
    line_start = file_inf.start_offset
    # Iterate over all lines and construct arguments for `process_line`
    while line_start < file_inf.file_size:
        # End is the smaller of the file size and line start + line size
        line_end = min(file_inf.file_size, line_start + line_size)

        # Save `process_line` arguments
        args = [file_name, line_start, line_end, file_inf.line_offset]
        line_args.append(args)

        # Move to the next line
        line_start = line_end

print(line_args[1])
# Run the lines in parallel; starmap() is like map() except that each
# element of line_args is unpacked into multiple arguments
with multiprocessing.Pool(cpu_count) as p:
    line_result = p.starmap(process_line, line_args)

This is the function:

def process_line(file_name, line_start, line_end, file_obj):
    line_results = register()
    c2 = register()
    c1 = register()
    with open(file_name, 'r') as f:
        # Move the stream position to the start of `line_start`
        f.seek(file_obj[line_start])
        i = 0
        if line_start == 63400:
            print("hello")
        # Read and process lines until `line_end`
        for line in f:
            line_start += 1
            if line_start > line_end:
                line_results.__append__(c2)
                c2.clear()
                break
            c1 = func(line)
            c2.__add__(c1)
            i = i + 1
    return line_results.countf

Here the file_obj parameter receives file_inf.line_offset, which is the array in question.

Now, if I remove the multiprocessing and just use line_result = starmap(process_line, line_args), the array is passed in just fine, although of course then there is no multiprocessing.
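My understanding (and I may be wrong) is that Pool pickles every task's arguments before sending them to the workers, so if the object's pickling drops the array, each worker receives an empty copy. A toy illustration with a made-up class whose __getstate__ omits the array:

```python
import pickle

class FileInfo:
    """Made-up class; only illustrates how pickling can drop a field."""
    def __init__(self, offsets):
        self.line_offset = offsets

    def __getstate__(self):
        # If a class (deliberately or accidentally) excludes the big
        # array from its pickled state, every worker's copy is empty
        return {'line_offset': []}

info = FileInfo([0, 10, 20])
clone = pickle.loads(pickle.dumps(info))  # what a Pool worker would see
print(clone.line_offset)  # prints []
```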

Also, if I pass in just the array instead of the entire object, that works too, but then for some reason only 2 processes do any work (on Linux; on Windows, Task Manager shows only 1 working while the rest just use memory but no CPU), instead of the expected 20, which is critical for this task.
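For reference, this is a sketch of the kind of sharing I would like to end up with, using multiprocessing.shared_memory (Python 3.8+; assumes numpy is available). Only the block's name string is pickled per task, and every worker attaches to the same memory:

```python
import multiprocessing
from multiprocessing import shared_memory

import numpy as np

def worker(shm_name, n, i):
    # Attach to the existing block by name; only the name was pickled
    shm = shared_memory.SharedMemory(name=shm_name)
    arr = np.ndarray((n,), dtype=np.int64, buffer=shm.buf)
    val = int(arr[i])
    shm.close()
    return val

if __name__ == "__main__":
    offsets = np.arange(0, 100, 10, dtype=np.int64)  # stand-in offsets
    shm = shared_memory.SharedMemory(create=True, size=offsets.nbytes)
    buf = np.ndarray(offsets.shape, dtype=offsets.dtype, buffer=shm.buf)
    buf[:] = offsets  # copy the array into shared memory once
    try:
        with multiprocessing.Pool(2) as p:
            args = [(shm.name, len(offsets), i) for i in (0, 5, 9)]
            print(p.starmap(worker, args))  # prints [0, 50, 90]
    finally:
        shm.close()
        shm.unlink()
```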


Is there any solution to this? Please help.
