In a basic script I have the following process.
import csv
reader = csv.reader(open('huge_file.csv', 'rb'))
for line in reader:
    process_line(line)
See this related question. I want to hand process_line batches of 100 rows, to implement batch sharding.
The problem with implementing the related answer is that the csv reader object is not subscriptable and does not support len():
>>> import csv
>>> reader = csv.reader(open('dataimport/tests/financial_sample.csv', 'rb'))
>>> len(reader)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: object of type '_csv.reader' has no len()
>>> reader[10:]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: '_csv.reader' object is unsubscriptable
>>> reader[10]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: '_csv.reader' object is unsubscriptable
How can I solve this?
Just make your reader subscriptable by wrapping it into a list. Obviously this will break on really large files (see alternatives in the Updates below).

Further reading: How do you split a list into evenly sized chunks in Python?
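A sketch of the list-wrapping idea; the in-memory sample data stands in for the real file, and the slicing step is one common way to chunk the resulting list:

```python
import csv
import io

# Stand-in for the real file; in Python 3 you would open CSV files
# in text mode with newline='' rather than 'rb'.
data = io.StringIO("\n".join(f"row{i},value{i}" for i in range(250)))

# Wrapping the reader in a list makes it subscriptable and gives it a
# len() -- at the cost of loading every row into memory at once.
rows = list(csv.reader(data))

# Slice the list into chunks of 100 rows each.
chunks = [rows[i:i + 100] for i in range(0, len(rows), 100)]
print(len(rows), [len(c) for c in chunks])  # 250 [100, 100, 50]
```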
Update 1 (list version): Another possible way would be to just process each chunk as it arrives, while iterating over the lines:
Update 2 (generator version): I haven't benchmarked it, but maybe you can increase performance by using a chunk generator:
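A sketch of such a generator; the function name and default chunk size are illustrative choices:

```python
import csv
import io

def gen_chunks(reader, chunksize=100):
    """Yield lists of up to chunksize rows from any row iterator."""
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunksize:
            yield chunk
            chunk = []   # rebind a fresh list rather than clearing in place
    if chunk:            # final partial chunk
        yield chunk

data = io.StringIO("\n".join(f"row{i},value{i}" for i in range(250)))
sizes = [len(chunk) for chunk in gen_chunks(csv.reader(data))]
print(sizes)  # [100, 100, 50]
```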
There is a minor gotcha, as @totalhack points out:
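The comment text itself isn't shown here, but the gotcha with this pattern is, as I understand it: if the generator clears and reuses one list object (del chunk[:]) instead of rebinding a fresh list, every yielded chunk is the same object, so collecting the chunks (e.g. with list()) leaves you holding several references to a single list whose contents have since changed. A small demonstration of that variant:

```python
def gen_chunks_shared(rows, chunksize=100):
    # Gotcha-prone variant: reuses one list object between yields.
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunksize:
            yield chunk
            del chunk[:]   # clears in place: every yielded chunk is this same list
    if chunk:
        yield chunk

collected = list(gen_chunks_shared(range(250)))
# All entries alias the same list object, which now holds only the
# rows appended after the last clear -- not three distinct batches.
print(all(c is collected[0] for c in collected))  # True
print([len(c) for c in collected])                # [50, 50, 50]
```

Rebinding with chunk = [] after each yield, as in the generator above, avoids this because each batch is a distinct list.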