I need to read large volume data(app. 800M records) from teradata, my code is working fine for a million record. for larger sets its taking time to build metadata. Could someone please suggest how to make it faster. Below is the code snippet which I am using for my application.
def get_partitions(num_partitions):
list_range =[]
initial_start=0
for i in range(num_partitions):
amp_range = 3240//num_partitions
start = (i*amp_range+1)*initial_start
end = (i+1)*amp_range
list_range.append((start,end))
initial_start = 1
return list_range
@delayed
def load(query,start,end,connString):
df = pd.read_sql(query.format(start, end),connString)
engine.dispose()
return df
connString = "teradatasql://{user}:{password}@{hostname}/?logmech={logmech}&encryptdata=true"
results = from_delayed([load(query,start, end,connString) for start,end in get_partitions(num_partitions)])
The build time is probably taken in finding out the metadata of your table. This is done by fetching the whole of the first partition and analysing it.
You would be better off either specifying it explcitly, if you know the dtypes upfront, e.g.,
{col: dtype, ...}
for all the columns, or generating it from a separate query that you limit to just as many rows as it takes to be sure you have the right types: