I searched the internet for this but couldn't find any code. What comes to mind is to first load the data into pandas (RAM) and then load it into cuDF (GPU memory).
import cudf
import pandas as pd
from sqlalchemy import create_engine

# Connect to PostgreSQL and pull the table into host memory with pandas
db_url = "postgresql://username:password@localhost:5432/database_name"
engine = create_engine(db_url)
query = "SELECT * FROM your_table_name"
pandas_df = pd.read_sql(query, engine)

# Copy the pandas DataFrame from RAM into GPU memory as a cuDF DataFrame
cudf_df = cudf.DataFrame.from_pandas(pandas_df)
print(cudf_df)
However, with this approach (in a WSL2 environment) loading the data takes longer, and after the conversion the data is still sitting in RAM as a pandas DataFrame, which we then need to drop.
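Something like this is what I mean by dropping the pandas copy once the data is on the GPU (a minimal sketch; pandas_df is the DataFrame from the snippet above):

import gc

# Release the host-memory copy now that the data lives on the GPU
del pandas_df
gc.collect()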
Is there a more efficient way to achieve this?
Thanks for the question. cuDF doesn't have a way to pull data from external SQL databases directly onto the GPU, so your solution with pandas.read_sql is a reasonable choice.

If you are able to export the database to a Parquet file, you could use cuDF's GPU-accelerated Parquet reader to load the data very quickly. That would probably give nearly optimal performance. Also be aware of features like GPUDirect Storage, which can boost performance on supported GPUs.
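For example, a minimal sketch of that Parquet path (the file path and table name are placeholders, and the export step assumes you can write the table out from pandas or from the database itself):

import cudf

# One-time export of the table to Parquet, e.g. from the pandas DataFrame
# you already have (or with any export tool the database supports):
# pandas_df.to_parquet("your_table_name.parquet")

# cuDF's GPU-accelerated reader then loads the file straight into GPU memory,
# skipping the pandas round-trip on subsequent loads.
cudf_df = cudf.read_parquet("your_table_name.parquet")
print(cudf_df.head())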
You can also use dask-sql (https://dask-sql.readthedocs.io/en/latest/) to run SQL queries directly on the GPU; dask-sql supports cuDF for GPU acceleration. You would load the data some other way (such as via pandas.read_sql or cudf.read_parquet) and then use dask-sql to execute queries on the cuDF DataFrame in memory.
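A rough sketch of that flow (the Parquet path and table name are placeholders, and this assumes dask-sql is installed alongside cuDF):

import cudf
from dask_sql import Context

# Get the data onto the GPU first, e.g. via the Parquet reader
cudf_df = cudf.read_parquet("your_table_name.parquet")

# Register the in-memory cuDF DataFrame as a SQL table
ctx = Context()
ctx.create_table("your_table_name", cudf_df)

# Run SQL against it; .compute() materializes the lazy result
result = ctx.sql("SELECT COUNT(*) FROM your_table_name").compute()
print(result)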