Has anyone worked with the Amazon Quantum Ledger Database (QLDB) Amazon ion files? If so, do you know how to extract the "data" part to formulate tables? Maybe use python to scrape the data? I am trying to get the "data" information from these files which are stored in s3 (I don't have access to QLDB so I cannot query directly) and then upload the results to Glue.
I am trying to perform an ETL job using GLue, but Glue doesn't like Amazon Ion files so I need to either query data from these files or scrape the files for relevant information.
Thanks. PS : by "data" information I mean this:
{
PersonId:"4tPW8xtKSGF5b6JyTihI1U",
LicenseNumber:"LEWISR261LL",
LicenseType:"Learner",
ValidFromDate:2016–12–20,
ValidToDate:2020–11–15
}
ref : https://docs.aws.amazon.com/qldb/latest/developerguide/working.userdata.html
Nosiphiwe,
AWS Glue is able to read Amazon Ion input. Many other services and applications can't, though, so it's a good idea to use Glue to convert the Ion data to JSON. Note that Ion is a super-set of JSON, adding some data types to JSON, so converting Ion to JSON may cause some down-conversion.
One good way to get access to your QLDB documents from the QLDB S3 export is to use Glue to extract the document data, store it in S3 as JSON, and query it with Amazon Athena. The process would go as follows:
Take a look at the PySpark script below. It extracts just the revision metadata and data payload from the QLDB export files.
The QLDB export maps the table for each document, but separately from the revisions data. You'll have to do some extra coding to include the table name in your revision data in the output. The code below doesn't do this, so you'll end up with all of your revisions in one table in the output.
Also note that you'll get whatever revisions happen to be in the exported data. That is, you might get multiple document revisions for a given document ID. Depending on your intended use of the data, you may need to figure out how to grab just the latest revision of each document ID.
I hope this helps!