I have a very large text file (58 million lines) that I want to process using Cython. It looks like this:
262.000 1.000 2.2900
263.000 1.000 -2.1562
264.000 1.000 -1.7201
265.000 1.000 1.1220
266.000 1.000 -0.0108
267.000 1.000 0.4371
268.000 1.000 -0.1519
269.000 1.000 0.1389
And so on. I need to go line by line and extract the three values: the first two should be converted to integers, while the last one stays a float. I use a dictionary (a C++ map) to convert the first two numbers into two new numbers, then use those new numbers as indices into a 2D array and, at that position, add the absolute value of the third value.
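Conceptually, the per-line work is just the following (a plain-Python sketch with the names simplified; numResidues, atomResidueDict and path are as described above, and this version is of course far too slow to actually use on 58 million lines):
# Plain-Python sketch of the logic, for illustration only
res_res = [[0.0] * numResidues for _ in range(numResidues)]
with open(path) as f:
    next(f)                                  # first line skipped, as with skiprows=1 below
    for line in f:
        a, b, v = line.split()
        i = atomResidueDict[int(float(a))]   # map the first value to a new index
        j = atomResidueDict[int(float(b))]   # map the second value to a new index
        if i != j:
            res_res[i][j] += abs(float(v))   # accumulate |third value| at (i, j)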
By far my biggest bottleneck is just processing the data. The fastest algorithm that I have been able to write uses Pandas to read in the file and converts it to a Numpy array, but this still takes about 30 seconds, with the whole script taking about 35 seconds.
The code for this implementation is below:
@cython.boundscheck(False)
@cython.wraparound(False)
def GetInteractionMatrices(self, path: str) -> np.ndarray:
    cdef:
        np.ndarray[np.float64_t, ndim=2] res_res
        np.ndarray[np.float64_t, ndim=2] data
        int length
        int i
        int atomResidue_i
        int atomResidue_j
        double value
    res_res = np.zeros((self.numResidues, self.numResidues), dtype=np.float64)
    data = np.abs(pd.read_csv(path, sep=r"\s+", header=None, skiprows=1).to_numpy())
    length = data.shape[0]
    # Needed to ensure the data is contiguous in memory
    res_res = res_res.copy(order='C')
    data = data.copy(order='C')
    cdef np.float64_t[:, ::1] res_resView = res_res
    cdef double[:, ::1] dataView = data
    for i in range(length):
        atomResidue_i = self.atomResidueDict[<int>dataView[i, 0]]
        atomResidue_j = self.atomResidueDict[<int>dataView[i, 1]]
        if atomResidue_i == atomResidue_j: continue
        value = dataView[i, 2]
        res_resView[atomResidue_i, atomResidue_j] += value
    return res_res
But some more consideration of the algorithm led me to realize that I don't really need to read the data into memory at all; I can just process it as a stream. However, my best implementation of this is about 2x slower than the code above, taking about 55 seconds.
@cython.boundscheck(False)
@cython.wraparound(False)
def GetInteractionMatrices(self, path: str) -> np.ndarray:
    cdef:
        np.ndarray[np.float64_t, ndim=2] res_res
        np.ndarray[np.float64_t, ndim=1] dataTemp
        int i
        int atomResidue_i
        int atomResidue_j
        double value
        FILE* p
        char *token
        char line[50]
        char path_c[100]
    res_res = np.zeros((self.numResidues, self.numResidues), dtype=np.float64)
    dataTemp = np.zeros(3, dtype=np.float64)
    # Needed to ensure the data is contiguous in memory
    res_res = res_res.copy(order='C')
    cdef np.float64_t[:, ::1] res_resView = res_res
    cdef double[:] dataTempView = dataTemp
    strncpy(path_c, path.encode('utf-8'), sizeof(path_c)-1)
    path_c[sizeof(path_c)-1] = b'\0'
    p = fopen(<char*>path_c, "r")
    if p is NULL:
        PyErr_SetFromErrnoWithFilenameObject(OSError, <char*>path_c)
    fseek(p, 0, SEEK_SET)  # Move to the beginning of the file
    while fgets(line, 50, p) != NULL:
        i = 0
        token = strtok(line, ' ')
        while token != NULL and i < 3:
            dataTempView[i] = atof(token)
            token = strtok(NULL, ' ')
            i += 1
        atomResidue_i = self.atomResidueDict[<int>dataTempView[0]]
        atomResidue_j = self.atomResidueDict[<int>dataTempView[1]]
        if atomResidue_i == atomResidue_j: continue
        res_resView[atomResidue_i, atomResidue_j] += np.abs(dataTempView[2])
    return res_res
Surprisingly, using NumPy to get the absolute value alone adds 30 additional seconds to the process; I think being able to take it over the whole array at once is probably the only real advantage of reading the data into memory.
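For reference, the absolute value of a single double doesn't need NumPy at all; libc's fabs (or the built-in abs on a typed double, which Cython compiles to a plain C operation) would replace the per-row np.abs call, something like:
from libc.math cimport fabs   # at the top of the module
# ...inside the fgets loop, instead of np.abs(dataTempView[2]):
res_resView[atomResidue_i, atomResidue_j] += fabs(dataTempView[2])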
I still think this has the potential to be faster. I suspect an additional inefficiency lies in the nested while loops and in splitting the string on spaces, though at this point I'm not sure how to optimize it further.
-----EDIT 1-----
I took some suggestions and made the following changes:
- Replaced np.abs with abs, which Cython optimizes
- Converted my dictionary to a NumPy array (see the sketch below)
This managed to bring the time down from 55 seconds to only 14!
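Roughly, the dictionary change turns the lookup into plain array indexing. A simplified sketch (maxAtomNumber is just an illustrative name for the largest atom number, and it assumes the atom numbers are non-negative and small enough to index directly):
import numpy as np

# Built once up front: atomResidueArray[atom] gives the residue index for that atom.
# (Assumes the mapping behaves like a Python dict here; adjust if it's a libcpp.map.)
atomResidueArray = np.zeros(maxAtomNumber + 1, dtype=np.intc)
for atom, residue in self.atomResidueDict.items():
    atomResidueArray[atom] = residue
cdef int[::1] atomResidueLookup = atomResidueArray

# In the parsing loop, the dictionary lookups then become pure C indexing:
atomResidue_i = atomResidueLookup[<int>dataTempView[0]]
atomResidue_j = atomResidueLookup[<int>dataTempView[1]]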
I then tried to implement my own parsing function using knowledge of the specific format of the file, but I seem to have made it worse. If anyone would like to take a look at it, I'd love some feedback on it:
cdef ParseLine(bytes line):
    cdef:
        int i = 0
        int num1 = 0
        int num2 = 0
        double num3 = 0
    # Ignore leading spaces
    while line[i] == 32:
        i += 1
    # Build first number
    while 48 <= line[i] <= 57:
        num1 = line[i] - 48 + num1 * 10
        i += 1
    # Skip the decimal point and the three trailing zeros
    i += 4
    # Ignore spaces
    while line[i] == 32:
        i += 1
    # Build second number
    while 48 <= line[i] <= 57:
        num2 = line[i] - 48 + num2 * 10
        i += 1
    # Skip the decimal point and the three trailing zeros
    i += 4
    # Ignore spaces or a minus sign (we would've taken abs anyway)
    while line[i] == 32 or line[i] == 45:
        i += 1
    # Build the integer part of the third number
    while 48 <= line[i] <= 57:
        num3 = line[i] - 48 + num3 * 10
        i += 1
    # i now sits on the decimal point; add the four decimal digits
    num3 = num3 + (line[i+1] - 48) * 0.1
    num3 = num3 + (line[i+2] - 48) * 0.01
    num3 = num3 + (line[i+3] - 48) * 0.001
    num3 = num3 + (line[i+4] - 48) * 0.0001
    return num1, num2, num3
This feels inefficient, but from what I've been able to find, it's basically how atoi and atof are implemented. Still, it's about 50 seconds slower, so there must be optimizations I'm missing somewhere.
-----EDIT 2-----
If anyone else is in the same position: changing the function signature from cdef ParseLine(bytes line) to cdef (int, int, double) ParseLine(char* line) (note char* rather than bytes) sped the code up from ~60 seconds to ~3 seconds! I think it's because bytes is a Python object with significant access overhead, whereas char* is a plain C pointer that Cython can work with directly.
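For completeness, the declaration is the only thing that changes; the digit-by-digit body stays as above, and (as far as I understand) the (int, int, double) return is a C tuple, so no Python objects are created per call:
cdef (int, int, double) ParseLine(char* line):
    cdef:
        int i = 0
        int num1 = 0
        int num2 = 0
        double num3 = 0
    # ... identical parsing to the bytes version above ...
    return num1, num2, num3   # returned as a C tuple, not a Python tuple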