Efficiently reading a text file in Cython

I have a very large text file (58 million lines) that I want to process using Cython. It looks like this:

 262.000    1.000        2.2900
 263.000    1.000       -2.1562
 264.000    1.000       -1.7201
 265.000    1.000        1.1220
 266.000    1.000       -0.0108
 267.000    1.000        0.4371
 268.000    1.000       -0.1519
 269.000    1.000        0.1389

And so on. I need to go line by line and extract the three values: the first two are converted to integers, while the last one remains a float. A dictionary (a C++ map) converts the first two numbers into two new numbers, which I then use as indices into a 2D array; at that position I add the absolute value of the third value.
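
In plain Python, one pass over the file would look roughly like this (the mapping, the matrix size, and the file name below are made-up stand-ins for illustration, not the real values):

import numpy as np

# Made-up stand-ins: the real mapping and matrix size come from elsewhere in the class.
atom_to_residue = {262: 0, 263: 0, 264: 1, 265: 1}  # ... and so on for every atom id
num_residues = 2
path = "interactions.txt"  # hypothetical file name

res_res = np.zeros((num_residues, num_residues))
with open(path) as f:
    for line in f:
        a, b, v = line.split()
        i = atom_to_residue[int(float(a))]  # first column -> new index
        j = atom_to_residue[int(float(b))]  # second column -> new index
        if i != j:
            res_res[i, j] += abs(float(v))  # accumulate |third column|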

By far my biggest bottleneck is just processing the data. The fastest algorithm I have been able to write uses Pandas to read in the file and convert it to a NumPy array, but this still takes about 30 seconds, with the whole script taking about 35 seconds.

The code for this implementation is below:

@cython.boundscheck(False)
@cython.wraparound(False)
def GetInteractionMatrices(self, path: str) -> np.ndarray:
    cdef:
        np.ndarray[np.float64_t, ndim=2] res_res
        np.ndarray[np.float64_t, ndim=2] data
        int length
        int i
        int atomResidue_i
        int atomResidue_j
        double value

    res_res = np.zeros((self.numResidues,self.numResidues), dtype=np.float64)
            
    data = np.abs(pd.read_csv(path, sep=r"\s+", header=None, skiprows=1).to_numpy())
    length = data.shape[0]

    # Needed to ensure the data is contiguous in memory
    res_res = res_res.copy(order='C')
    data = data.copy(order='C')
    
    cdef np.float64_t[:, ::1] res_resView = res_res
    cdef double[:, ::1] dataView = data
    
    for i in range(length):
        atomResidue_i = self.atomResidueDict[<int>dataView[i, 0]]
        atomResidue_j = self.atomResidueDict[<int>dataView[i, 1]]
        
        if atomResidue_i == atomResidue_j: continue
        
        value = dataView[i, 2]

        res_resView[atomResidue_i, atomResidue_j] += value

    return res_res

But some more consideration of the algorithm led me to realize that I don't really need to read the data into memory at all; I can just process it as a stream. However, my best implementation of this is about 2x slower than the code above, taking about 55 seconds.

@cython.boundscheck(False)
@cython.wraparound(False)
def GetInteractionMatrices(self, path: str) -> np.ndarray:
    cdef:
        np.ndarray[np.float64_t, ndim=2] res_res
        np.ndarray[np.float64_t, ndim=1] dataTemp
        int i
        int atomResidue_i
        int atomResidue_j
        double value

        FILE* p
        char *token
        char line[50]
        char path_c[100]

    res_res = np.zeros((self.numResidues,self.numResidues), dtype=np.float64)
    dataTemp = np.zeros(3, dtype=np.float64)
    
    # Needed to ensure the data is contiguous in memory
    res_res = res_res.copy(order='C')

    cdef np.float64_t[:, ::1] res_resView = res_res
    cdef double[:] dataTempView = dataTemp
    
    strncpy(path_c, path.encode('utf-8'), sizeof(path_c)-1)
    path_c[sizeof(path_c)-1] = b'\0'
    p = fopen(<char*>path_c, "r")
    if p is NULL:
        PyErr_SetFromErrnoWithFilenameObject(OSError, <char*>path_c)
    fseek(p, 0, SEEK_SET)  # Move to the beginning of the file

    while fgets(line, 50, p) != NULL:
        i=0
        token = strtok(line, b" ")
        while token != NULL and i < 3:
            dataTempView[i] = atof(token)
            token = strtok(NULL, b" ")
            i += 1

        atomResidue_i = self.atomResidueDict[<int>dataTempView[0]]
        atomResidue_j = self.atomResidueDict[<int>dataTempView[1]]
        
        if atomResidue_i == atomResidue_j: continue

        res_resView[atomResidue_i, atomResidue_j] += np.abs(dataTempView[2])

    return res_res

Surprisingly, using NumPy just to take the absolute value adds 30 additional seconds to the process; being able to call np.abs once on the whole array is, I think, probably the only real advantage of reading the data into memory.
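
In hindsight (see Edit 1 below), for a single typed double there is no need to go through NumPy at all: fabs from libc.math, or simply the built-in abs on a typed value (which Cython compiles to a C-level call), does the same thing without the per-call overhead. A minimal sketch of the change to the accumulation line:

from libc.math cimport fabs

# Inside the fgets loop, replacing the per-element np.abs call:
res_resView[atomResidue_i, atomResidue_j] += fabs(dataTempView[2])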

I still think this has the potential to be faster. I suspect an additional inefficiency comes from the nested while loops and splitting the string on spaces, though at this point I'm not sure how I could optimize it further.
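
For reference, this is roughly what a variant that lets fscanf do both the splitting and the numeric conversion in a single call per row might look like. It is only a sketch (not benchmarked) and reuses p, res_resView, and self.atomResidueDict from the code above:

from libc.stdio cimport fscanf
from libc.math cimport fabs

cdef double a, b, v

# fscanf skips whitespace (including newlines) between conversions,
# so one call consumes a whole row of three values.
while fscanf(p, b"%lf %lf %lf", &a, &b, &v) == 3:
    atomResidue_i = self.atomResidueDict[<int>a]
    atomResidue_j = self.atomResidueDict[<int>b]

    if atomResidue_i == atomResidue_j: continue

    res_resView[atomResidue_i, atomResidue_j] += fabs(v)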

-----EDIT 1-----

I took some suggestions and made the following changes:

  • Replaced np.abs with the built-in abs, which Cython optimizes to a C-level call on typed values
  • Converted my dictionary to a NumPy array (a rough sketch of this change is below)

This managed to bring the time down from 55 seconds to only 14!
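
As a rough sketch of the dictionary-to-array change (max_atom_id is a made-up name for the largest key; the atom ids are assumed to be smallish non-negative integers):

# Build the lookup table once, outside the hot loop.
lut = np.full(max_atom_id + 1, -1, dtype=np.intc)
for atom, residue in self.atomResidueDict.items():
    lut[atom] = residue
cdef int[::1] lutView = lut

# In the loop, each Python dict lookup becomes a plain C array index:
atomResidue_i = lutView[<int>dataTempView[0]]
atomResidue_j = lutView[<int>dataTempView[1]]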

I then tried to implement my own parsing function using knowledge of the specific format of the file, but I seem to have made it worse. If anyone would like to take a look at it, I'd love some feedback on it:

cdef ParseLine(bytes line):
    cdef:
        int i = 0
        int num1 = 0
        int num2 = 0
        double num3 = 0

    # Ignore leading spaces
    while line[i] == 32:
        i += 1

    # Build first number
    while 48 <= line[i] <= 57:
        num1 = line[i] - 48 + num1 * 10
        i += 1

    # Ignore decimal point and any trailing zeros
    i += 4

    # Ignore spaces
    while line[i] == 32:
        i += 1

    # Build second number
    while 48 <= line[i] <= 57:
        num2 = line[i] - 48 + num2 * 10
        i += 1

    # Ignore decimal point and any trailing zeros
    i += 4

    # Ignore spaces or sign (we would've taken abs anyway)
    while line[i] == 32 or line[i] == 45:
        i += 1

    # Build integer part of third number
    while 48 <= line[i] <= 57:
        num3 = line[i] - 48 + num3 * 10
        i += 1

    # i now sits on the decimal point; read the four fractional digits
    num3 = num3 + (line[i+1] - 48) * 0.1
    num3 = num3 + (line[i+2] - 48) * 0.01
    num3 = num3 + (line[i+3] - 48) * 0.001
    num3 = num3 + (line[i+4] - 48) * 0.0001

    return num1, num2, num3

This feels inefficient, but from what I've been able to find, it is basically how atoi and atof are implemented. Even so, it is about 50 seconds slower, so there must be optimizations I'm missing somewhere.

-----EDIT 2-----

If anyone else is in this position: changing the function signature from cdef ParseLine(bytes line) to cdef (int, int, double) ParseLine(char* line) (note char* rather than bytes) sped the code up from ~60 seconds to ~3 seconds! I think it's because bytes is a Python object, so indexing it carries Python-level overhead, while char* is a plain C pointer that Cython can index directly.
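
For anyone curious, a minimal sketch of the new signature and its call site (the surrounding loop is the same fgets loop from the second implementation; num1, num2, and num3 are just illustrative names):

cdef (int, int, double) ParseLine(char* line):
    # body identical to the parser above, just indexing a raw C buffer
    ...

cdef int num1, num2
cdef double num3

while fgets(line, 50, p) != NULL:
    # fgets fills a plain C char buffer, so no Python bytes object
    # is created for each line before parsing it.
    num1, num2, num3 = ParseLine(line)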
