Reading large-volume data from Teradata using a Dask cluster, teradatasql and SQLAlchemy


I need to read a large volume of data (approx. 800M records) from Teradata. My code works fine for a million records, but for larger sets it takes a long time to build the metadata. Could someone please suggest how to make this faster? Below is the code snippet I am using in my application.

import pandas as pd
from dask import delayed
from dask.dataframe import from_delayed
from sqlalchemy import create_engine


def get_partitions(num_partitions):
    # Split the 3240-AMP range into contiguous, inclusive (start, end) chunks,
    # one per partition.
    amp_range = 3240 // num_partitions
    list_range = []
    for i in range(num_partitions):
        start = 0 if i == 0 else i * amp_range + 1
        end = (i + 1) * amp_range
        list_range.append((start, end))
    return list_range


@delayed
def load(query, start, end, connString):
    # Each task creates (and disposes of) its own engine so that workers
    # never try to share a single connection.
    engine = create_engine(connString)
    try:
        df = pd.read_sql(query.format(start, end), engine)
    finally:
        engine.dispose()
    return df


connString = "teradatasql://{user}:{password}@{hostname}/?logmech={logmech}&encryptdata=true"

results = from_delayed(
    [load(query, start, end, connString) for start, end in get_partitions(num_partitions)]
)
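
For reference, query is a format-string template whose two {} placeholders are filled by load() with each partition's (start, end) range. A simplified sketch of the remaining pieces (the table name and partitioning predicate here are placeholders) and one way to trigger the computation:

# Placeholder query template: the real predicate depends on how the
# table is distributed/partitioned across the 3240 AMPs.
query = """
SELECT *
FROM my_database.my_table
WHERE partition_col BETWEEN {} AND {}
"""
num_partitions = 100

# results is lazy; nothing is read from Teradata until the graph is
# executed, e.g. by writing the partitions out as Parquet files.
results.to_parquet("/data/teradata_extract/", write_index=False)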

1 Answer

The build time is probably spent working out the metadata of your table: Dask does this by fetching the whole of the first partition and analysing it.

You would be better off either specifying it explicitly, if you know the dtypes up front, e.g. {col: dtype, ...} for all the columns, or generating it from a separate query limited to just as many rows as it takes to be sure you have the right types:

import dask

# Sample a small range just to learn the column dtypes; dask.compute
# returns a tuple, so unpack the single result.
meta, = dask.compute(load(query, 0, 10, connString))

results = from_delayed(
    [
        load(query, start, end, connString)
        for start, end in get_partitions(num_partitions)
    ],
    meta=meta.iloc[:0],  # zero-length version of the table
)
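
If you do know the dtypes up front, the first option avoids running any query at graph-construction time. A minimal sketch, with placeholder column names and dtypes standing in for your table's real schema:

import pandas as pd

# Placeholder schema: replace with the table's actual columns and dtypes.
meta = pd.DataFrame({
    "customer_id": pd.Series(dtype="int64"),
    "txn_amount": pd.Series(dtype="float64"),
    "txn_date": pd.Series(dtype="datetime64[ns]"),
})

results = from_delayed(
    [
        load(query, start, end, connString)
        for start, end in get_partitions(num_partitions)
    ],
    meta=meta,  # already zero-length, so Dask never samples the data
)

Either way, the key point is the same: give from_delayed a zero-length frame with the right dtypes so it does not have to execute one of the delayed loads just to infer them.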