Given a set of files, each of which is read into a distinct dataframe, how might a pandasql query reference them?
In the following snippet we have a list of dataframes, but the same question would apply to a dict:
import pandas as pd
from pandasql import sqldf
# Read in a set of 10 files each containing columns `id` and `estimate`
dfs = [pd.read_csv('file%d.csv' %d) for d in range(1,10+1)]
sql_res = sqldf("select d2.estimate - d1.estimate \
from dfs[1] d1 join dfs[2] d2 on d2.id = d1.id", locals())
The dfs[1] and dfs[2] show what I'd like to do, but they are not valid syntax. Any suggestions on how to structure this kind of problem in a way that pandasql can support?
You can give pandasql a dict mapping table names/aliases to dataframes instead of just passing locals(), as per the docstring of PandaSQL.__call__ (can't find an online version of the docs). Note that you must put all tables that you want to query there, though.
Here is a small example, using the PandaSQL class instead of sqldf as recommended in the docstring:
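A minimal sketch of that approach, assuming the class is importable as `PandaSQL` from `pandasql` and that calling an instance accepts the query followed by a name-to-dataframe dict. The helper names `name_tables` and `estimate_diff` are hypothetical and exist only for illustration; the columns `id` and `estimate` come from the question.

```python
def name_tables(dfs, prefix="d"):
    """Map SQL table names d1, d2, ... to the dataframes in `dfs`.

    Names are 1-based, so d1 refers to dfs[0], d2 to dfs[1], and so on.
    """
    return {f"{prefix}{i}": df for i, df in enumerate(dfs, start=1)}


def estimate_diff(dfs):
    """Join the first two dataframes on `id` and subtract their estimates."""
    # Imported lazily so name_tables() can be used without pandasql installed.
    from pandasql import PandaSQL

    pdsql = PandaSQL()  # assumed default: an in-memory SQLite database
    query = """
        select d2.estimate - d1.estimate as diff
        from d1 join d2 on d2.id = d1.id
    """
    # Pass an explicit name -> dataframe mapping instead of locals().
    # Every table referenced in the query must be a key of this dict.
    return pdsql(query, name_tables(dfs))
```

With the list from the question, `estimate_diff(dfs)` would run the join on the first two files. For a dict of dataframes the same idea applies: pass the dict itself, provided its keys are valid SQL table names.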