How to read parquet files using DuckDB in Go?

834 Views Asked by At

I'm building an Arrow Flight SQL server in Go wrapping DuckDB to query a large number of parquet files(each with a different schema). I want those parquet files to be loaded into DuckDB on-demand(i.e. when a query is made over it) instead of loading all those files into DuckDB database at startup.

DuckDB CLI supports such behaviour by allow SQL queries over a group of files matching a glob pattern which is perfect. However when I start a simple Go Arrow Flight SQL server to connect to DuckDB and pass a single parquet file to use as the data source, I get an error when the server starts up.

Here is a snippet of my test code:

func CreateDB() (*sql.DB, error) {
    // Connect to the DuckDB database.
    db, err := sql.Open("duckdb", "parquet:/path/to/local/parquet/file.parquet")
    if err != nil {
        log.Fatal(err)
        return nil, err
    }
    defer db.Close()
    fmt.Println("Connection to DuckDB created successfully!")
    return db, nil
}

func main() {
    host := flag.String("host", "localhost", "IP/Hostname to bind the server to")
    port := flag.String("port", "31337", "Port to bind the server to")

    db, err := CreateDB()
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    srv := NewDuckDBFlightSQLServer(db)
    server := flight.NewServerWithMiddleware(nil)
    server.RegisterFlightService(flightsql.NewFlightServer(srv))
    server.Init(net.JoinHostPort(*host, *port))
    server.SetShutdownOnSignals(os.Interrupt, os.Kill)

    fmt.Println("DuckDB Flight SQL Server started on ", server.Addr(), "...")

    if err := server.Serve(); err != nil {
        log.Fatal(err)
    }
}

The error seen when running this code is:

could not open database: Binder Error: Unrecognized storage type "parquet"
2023/09/12 10:27:13 could not open database: Binder Error: Unrecognized storage type "parquet"

Basically it doesn't like parquet: prefix in the sql.Open() call. Removing that prefix gives a different(and a more obvious) error

could not open database: IO Error: The file "/path/to/local/parquet/file.parquet" exists, but it is not a valid DuckDB database file!

Connecting to DuckDB with in-memory database(passing second arg of sql.Open() as empty string) works fine and the server starts successfully.

Whats the right way to achieve this in Go? I do not find reference implementation(or starter docs) for "DuckDB + parquet + arrowFlightSQL" in Go yet. Does the duckDB driver in Go not support querying parquet files directly?

I've tested this on Go 1.18.10.

1

There are 1 best solutions below

0
On

First create a database the same way you do for sqlite:

db, err := sql.Open("duckdb", "mydb.db")
if err != nil {
    log.Fatal(err)
    return nil, err
}

and after that you can load you parquet file this way:

_, _ = db.execute("install 'parquet'")
_, _ = db.execute("load 'parquet'")
_, _ = db.execute("create view mydata as select * from read_parquet('path/to/your/parquet/file'")

Finally you can query your data this way:

_, _ = db.execute("select * from mydata")

See the doc here

Hope it helps.