Efficiently processing a very large unicode string into CSV


Usually I'm able to find the answers to my dilemmas pretty quickly on this site, but perhaps this problem requires a more specific touch.

I have a unicode string of roughly 50 million characters that I download from a Tektronix oscilloscope. Getting it assigned is a pain in the a** for memory (sys.getsizeof() reports ~100 MB).

The problem is that I need to turn this into a CSV so that I can grab 10,000 of the 10 million comma-separated values at a time (this chunk size is fixed). I have tried two approaches:

1) The split(",") method. With this, RAM usage in the Python kernel spikes by another ~300 MB, but the processing itself is very fast, except that when I loop it ~100 times in one routine, somewhere between iterations 40 and 50 the kernel throws a memory error.

2) My own script that, after downloading the absurdly long string, scans the number of commas until it has counted 10,000 and stops, turning all the values between the commas into floats and populating an np array. This is pretty efficient memory-wise (from before importing the file to after running the script, memory usage only changes by ~150 MB), but it is MUCH slower and usually results in a kernel crash shortly after completing the 100x loop.

Below is the code used to process this file; if you PM me, I can send you a copy of the string to experiment with (though I'm sure it may be easier to just generate one).

Code 1 (using the split() method):

import numpy as np

# PPSinst is the instrument session; samples, yoff, ymult and yzero are the
# record length and waveform scaling constants defined earlier in the script
PPStrace = PPSinst.query('CURV?')
PPStrace = PPStrace.split(',')
PPSvals = []
for iii in range(len(PPStrace)): #does some algebra to values
    PPStrace[iii] = ((float(PPStrace[iii]))-yoff)*ymult+yzero

maxes=np.empty(shape=(0,0))
iters=int(samples/1000)
for i in range(1000): #looks for max value in 10,000 sample increments, adds to "maxes"
    print i
    maxes = np.append(maxes,max(PPStrace[i*iters:(i+1)*iters]))
PPS = 100*np.std(maxes)/np.mean(maxes)
print PPS," % PPS Noise"

Code 2 (self-generated script):

import numpy as np

# PPSinst, samples, yoff, ymult and yzero are defined as in Code 1
PPStrace = PPSinst.query('CURV?')
walkerR=1
walkerL=0
length=len(PPStrace)
maxes=np.empty(shape=(0,0))
iters=int(samples/1000) #samples is 10 million, iters then is 10000

for i in range(1000):
    sample=[] #initialize 10k sample list
    commas=0 #commas are 0
    while commas<iters: #if the number of commas found is less than 10,000, keep adding values to sample
        while PPStrace[walkerR]!=unicode(","):#indexes commas for value extraction
            walkerR+=1
            if walkerR==length:
                break
        sample.append((float(str(PPStrace[walkerL:walkerR]))-yoff)*ymult+yzero)#add value between commas to sample list
        walkerL=walkerR+1
        walkerR+=1
        commas+=1
    maxes=np.append(maxes,max(sample))
PPS = 100*np.std(maxes)/np.mean(maxes)
print PPS,"% PPS Noise"

I also tried a Pandas DataFrame with StringIO for the CSV conversion. That gets a memory error just trying to read the string into a frame.
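
Roughly, that attempt was along these lines (just a sketch; the exact read_csv arguments are not the point):

import pandas as pd
from StringIO import StringIO   # Python 2; use io.StringIO on Python 3

# Because the CURV? data is one long line, read_csv ends up building a
# single row with ~10 million columns, which is presumably why it falls over.
buf = StringIO(PPStrace)
frame = pd.read_csv(buf, header=None)   # MemoryError here on the real data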

I am thinking the solution would be to load this into a SQL table and then pull the CSV out in 10,000-sample chunks (which is the intended purpose of the script). But I would love to not have to do this!

Thanks for all your help guys!

2 Answers

Answer 1:

Have you tried the cStringIO class? It works just like file I/O, but uses a string as the buffer instead of a file on disk. Frankly, I suspect that you're suffering from a chronic speed problem. Your self-generated script takes the right approach. You might get some speed-up if you read a block at a time and then parse that while the next block is being read.
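
For example, something along these lines (just a sketch; the block size and the parsing step are whatever suits your data):

import cStringIO   # Python 2 only; io.StringIO plays the same role in Python 3

# Wrap the big response string in a file-like buffer and read it back in
# fixed-size blocks instead of splitting the whole thing at once.
buf = cStringIO.StringIO(PPStrace)
block_size = 1024 * 1024          # 1 MB of text per read; tune as needed
leftover = ''
while True:
    block = buf.read(block_size)
    if not block:
        break
    pieces = (leftover + block).split(',')
    leftover = pieces.pop()       # the last piece may be a partial value
    # ...convert pieces to floats / track maxima here...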


For parallel processing, use the multiprocessing package. See the official documentation or this tutorial for details and examples.

Briefly, you create a function that embodies the work you want to run in parallel. You then create a Process with that function as the target parameter and start it. When you want to merge its work back into the main program, use join.
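
A minimal sketch of that pattern (parse_chunk and the toy chunk_text below are placeholders, not part of your script):

from multiprocessing import Process, Queue

def parse_chunk(text_chunk, out_queue):
    # convert one comma-separated chunk to floats and report its max
    values = [float(v) for v in text_chunk.split(',') if v]
    out_queue.put(max(values))

if __name__ == '__main__':
    chunk_text = '1.0,2.5,3.25,0.75'   # stand-in for one slice of PPStrace
    q = Queue()
    p = Process(target=parse_chunk, args=(chunk_text, q))
    p.start()               # parse_chunk now runs in a separate process
    chunk_max = q.get()     # blocks until the worker puts its result
    p.join()                # merge the worker back into the main program
    print(chunk_max)        # -> 3.25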

Answer 2:

Take a look at numpy.frombuffer (http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.frombuffer.html). It lets you specify a count and an offset, so you should be able to put the big string into a buffer and then process it in chunks to avoid huge memory spikes.
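
For example, a toy illustration of count and offset (note that this assumes fixed-width binary items rather than the ASCII CURV? string, which is the catch discussed in the edit below):

import numpy as np

raw = np.arange(10, dtype=np.int16).tobytes()      # stand-in binary blob
chunk = np.frombuffer(raw, dtype=np.int16, count=4, offset=3 * 2)
print(chunk)                                       # -> [3 4 5 6]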


EDIT 2016-02-01

Since frombuffer needs a fixed byte width, I tried numpy.fromregex instead, and it seems to be able to parse the string quickly. It has to process the whole string in one go, though, which might cause some memory issues: http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.fromregex.html

Something like this:

import numpy as np
import StringIO  # Python 2; use io.StringIO on Python 3

buf = StringIO.StringIO(big_string)
output = np.fromregex(buf, r'(-?\d+),', dtype=[('val', np.int64)])
# output['val'] is the array of values