Efficiently loading large time series data into InfluxDB

I am trying to load 100 billion multi-dimensional time series data points (millions of rows with thousands of columns each) into InfluxDB from a CSV file.

I am currently doing it through line protocol as follows (my codebase is in Python):

from datetime import datetime

import influxdb
from tqdm import tqdm

# args, rows, columns, client (an influxdb.InfluxDBClient), get_datetime()
# and get_size() are defined elsewhere in the script.
initial_time = datetime.now()

l = []
bucket_size = 100
if rows > 10000:
    bucket_size = 10  # use smaller batches for larger datasets

with open(args.file, "r") as f:
    for x in tqdm(range(rows)):
        s = f.readline().strip().split(" ")
        # One point per row: every column value becomes a field dim0..dimN.
        v = {}
        for y in range(columns):
            v["dim" + str(y)] = float(s[y + 1])
        # Timestamp as integer nanoseconds since the Unix epoch.
        time = (get_datetime(s[0])[0] - datetime(1970, 1, 1)).total_seconds() * 1000000000
        time = int(time)
        body = {"measurement": "puncte", "time": time, "fields": v}
        l.append(body)
        if len(l) == bucket_size:
            # Retry the batch until the server accepts it.
            while True:
                try:
                    client.write_points(l)
                except influxdb.exceptions.InfluxDBServerError:
                    continue
                break
            l = []
if l:
    client.write_points(l)  # flush the final partial batch

final_time = datetime.now()
final_size = get_size()

seconds = (final_time - initial_time).total_seconds()

As the code above shows, I read the CSV file line by line, convert each row into a point dict, group the points into batches of bucket_size rows, and send each batch with client.write_points(l).

However, this method is not efficient. Loading just 3 million rows of 100 columns each has already been running for 29 hours, with an estimated 991 hours remaining, so ingesting the full 100 billion data points this way is clearly infeasible.

I am certain there is a better way to load the dataset into InfluxDB. Any suggestions for faster data loading?

1 Answer

Try loading the data in multiple concurrent threads. This should give a speedup on multi-CPU machines.
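A minimal sketch of that approach with Python's concurrent.futures, reusing the influxdb-python client from the question. The parse_line() helper is hypothetical and stands in for the row-to-point conversion shown in the question; the connection details and file path are placeholders:

from concurrent.futures import ThreadPoolExecutor

from influxdb import InfluxDBClient

# Placeholder connection details; one client per thread may be safer
# if you run into issues sharing a single instance.
client = InfluxDBClient(host="localhost", port=8086, database="mydb")

def parse_line(line):
    # Hypothetical: convert one CSV row into a point dict, as in the
    # question's inner loop.
    ...

def batches(path, batch_size=5000):
    # Yield lists of point dicts so the whole file is never in memory.
    batch = []
    with open(path) as f:
        for line in f:
            batch.append(parse_line(line))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

def write_batch(batch):
    # Each call is an independent HTTP request, so threads overlap the
    # network round-trips even under the GIL.
    client.write_points(batch)

with ThreadPoolExecutor(max_workers=8) as pool:
    in_flight = []
    for batch in batches("data.csv"):
        in_flight.append(pool.submit(write_batch, batch))
        if len(in_flight) >= 32:  # bound queued work to cap memory use
            for fut in in_flight:
                fut.result()  # re-raises any write error
            in_flight = []
    for fut in in_flight:
        fut.result()

Since the bottleneck here is the HTTP round-trip rather than CPU, threads already help; if the CSV parsing itself becomes the bottleneck, multiprocessing (one process per CSV shard) sidesteps the GIL as well.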

Another option is to feed the CSV file directly to the time series database without additional transformations. See this example.
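If you are on InfluxDB 2.x, for instance, the bundled influx CLI can ingest annotated CSV without any client-side conversion. A sketch, where the bucket name and file path are placeholders and the file must carry the annotation rows the tool expects:

influx write --bucket mybucket --file data.csv --format csv

InfluxDB 1.x has no native CSV endpoint, so the closest equivalents there are streaming pre-built line protocol to the /write endpoint instead of constructing point dicts in Python, or using Telegraf's csv input parser.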