I am trying to load 100 billion multi-dimensional time series data points (thousands of columns, millions of rows) into InfluxDB from a CSV file.
I am currently doing it through line protocol as follows (my codebase is in Python):
import influxdb
from datetime import datetime
from tqdm import tqdm

# args, rows, columns, client, get_datetime, get_size and initial_time
# are defined elsewhere in the script
f = open(args.file, "r")
l = []
bucket_size = 100
if rows > 10000:
    bucket_size = 10

for x in tqdm(range(rows)):
    s = f.readline()[:-1].split(" ")
    v = {}
    for y in range(columns):
        v["dim" + str(y)] = float(s[y + 1])
    # convert the timestamp to epoch nanoseconds
    time = (get_datetime(s[0])[0] - datetime(1970, 1, 1)).total_seconds() * 1000000000
    time = int(time)
    body = {"measurement": "puncte", "time": time, "fields": v}
    l.append(body)
    if len(l) == bucket_size:
        # retry the batch until the server accepts it
        while True:
            try:
                client.write_points(l)
            except influxdb.exceptions.InfluxDBServerError:
                continue
            break
        l = []

# flush whatever is left in the last, partially filled batch
client.write_points(l)

final_time = datetime.now()
final_size = get_size()
seconds = (final_time - initial_time).total_seconds()
As the code above shows, my script reads the dataset CSV file, builds small batches of rows (bucket_size at a time), and sends each batch with client.write_points(l).
However, this method is not very efficient. I am trying to load 100 billion data points, and it is taking far longer than expected: loading just 3 million rows of 100 columns each has already been running for 29 hours, with an estimated 991 hours still to go!
I am certain there is a better way to load the dataset into InfluxDB. Any suggestions for faster data loading?
Try loading the data from multiple concurrent threads. Each write is mostly waiting on the network, so concurrent writers keep both the client and the (typically multi-CPU) server busy and should give a noticeable speedup.
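A minimal sketch of that approach, assuming the influxdb-python client, the same space-separated layout as in the question, and timestamps already in epoch nanoseconds (the host, database name, chunk size and worker count are placeholders):

from concurrent.futures import ThreadPoolExecutor
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="mydb")

def write_chunk(lines):
    # build one batch of points from a chunk of raw CSV lines and send it
    points = []
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        fields = {"dim" + str(i): float(v) for i, v in enumerate(parts[1:])}
        points.append({
            "measurement": "puncte",
            # assumes parts[0] is already an epoch-nanosecond integer;
            # otherwise reuse the get_datetime() conversion from the question
            "time": int(parts[0]),
            "fields": fields,
        })
    client.write_points(points, time_precision="n")

CHUNK = 10000   # rows per write; tune for your row width
WORKERS = 8     # number of concurrent writers

with open("data.csv") as f, ThreadPoolExecutor(max_workers=WORKERS) as pool:
    chunk = []
    for line in f:
        chunk.append(line)
        if len(chunk) == CHUNK:
            pool.submit(write_chunk, chunk)
            chunk = []
    if chunk:
        pool.submit(write_chunk, chunk)

Each write_points() call is an HTTP request, so the threads mostly overlap network round trips. If sharing one client across threads causes problems, give each worker its own InfluxDBClient, and for a very large file you may also want to bound how many chunks are queued at once.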
Another option is to feed the CSV file directly to the time series database without additional transformations. See this example.
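If you stay with the Python client, a related shortcut is to build the line protocol strings yourself and pass them with protocol="line", which skips the per-point dictionaries the client would otherwise have to convert. A minimal sketch under the same assumptions as above (epoch-nanosecond timestamps in the first column, placeholder host and database):

from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="mydb")

BATCH = 10000
batch = []
with open("data.csv") as f:
    for raw in f:
        parts = raw.rstrip("\n").split(" ")
        fields = ",".join("dim%d=%s" % (i, v) for i, v in enumerate(parts[1:]))
        # line protocol: measurement[,tags] fields timestamp
        batch.append("puncte %s %s" % (fields, parts[0]))
        if len(batch) == BATCH:
            client.write_points(batch, protocol="line", time_precision="n")
            batch = []
if batch:
    client.write_points(batch, protocol="line", time_precision="n")

This avoids building a dictionary per row and lets you send much larger batches per request, which cuts the number of HTTP round trips.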