I'm building an Arrow Flight SQL server in Go wrapping DuckDB to query a large number of parquet files(each with a different schema). I want those parquet files to be loaded into DuckDB on-demand(i.e. when a query is made over it) instead of loading all those files into DuckDB database at startup.
DuckDB CLI supports such behaviour by allow SQL queries over a group of files matching a glob pattern which is perfect. However when I start a simple Go Arrow Flight SQL server to connect to DuckDB and pass a single parquet file to use as the data source, I get an error when the server starts up.
Here is a snippet of my test code:
func CreateDB() (*sql.DB, error) {
// Connect to the DuckDB database.
db, err := sql.Open("duckdb", "parquet:/path/to/local/parquet/file.parquet")
if err != nil {
log.Fatal(err)
return nil, err
}
defer db.Close()
fmt.Println("Connection to DuckDB created successfully!")
return db, nil
}
func main() {
host := flag.String("host", "localhost", "IP/Hostname to bind the server to")
port := flag.String("port", "31337", "Port to bind the server to")
db, err := CreateDB()
if err != nil {
log.Fatal(err)
}
defer db.Close()
srv := NewDuckDBFlightSQLServer(db)
server := flight.NewServerWithMiddleware(nil)
server.RegisterFlightService(flightsql.NewFlightServer(srv))
server.Init(net.JoinHostPort(*host, *port))
server.SetShutdownOnSignals(os.Interrupt, os.Kill)
fmt.Println("DuckDB Flight SQL Server started on ", server.Addr(), "...")
if err := server.Serve(); err != nil {
log.Fatal(err)
}
}
The error seen when running this code is:
could not open database: Binder Error: Unrecognized storage type "parquet"
2023/09/12 10:27:13 could not open database: Binder Error: Unrecognized storage type "parquet"
Basically it doesn't like parquet:
prefix in the sql.Open()
call. Removing that prefix gives a different(and a more obvious) error
could not open database: IO Error: The file "/path/to/local/parquet/file.parquet" exists, but it is not a valid DuckDB database file!
Connecting to DuckDB with in-memory database(passing second arg of sql.Open()
as empty string) works fine and the server starts successfully.
Whats the right way to achieve this in Go? I do not find reference implementation(or starter docs) for "DuckDB + parquet + arrowFlightSQL" in Go yet. Does the duckDB driver in Go not support querying parquet files directly?
I've tested this on Go 1.18.10.
First create a database the same way you do for sqlite:
and after that you can load you parquet file this way:
Finally you can query your data this way:
See the doc here
Hope it helps.