I have a dataset large enough that I would like to bulk index its JSON objects into AWS OpenSearch.
I cannot see how to achieve this using any of: boto3, awswrangler, opensearch-py, elasticsearch, elasticsearch-py.
Is there a way to do this without making Python HTTP requests (PUT/POST) directly?
Note that this is not for Elasticsearch or AWS Elasticsearch.
Many thanks!
I finally found a way to do it using opensearch-py, as follows.
First, establish the client:
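A minimal sketch of the client setup, assuming SigV4 authentication with credentials picked up by boto3. The helper names, the endpoint and the region are placeholders of mine, not values from the original post.

```python
def client_kwargs(host, port=443):
    """Connection settings for an HTTPS OpenSearch endpoint (hypothetical helper)."""
    return {
        "hosts": [{"host": host, "port": port}],
        "use_ssl": True,
        "verify_certs": True,
    }


def make_client(host, region):
    """Build an opensearch-py client that signs requests with AWS SigV4."""
    # Imported lazily so this module loads even where the packages are absent.
    import boto3
    from opensearchpy import AWSV4SignerAuth, OpenSearch, RequestsHttpConnection

    credentials = boto3.Session().get_credentials()
    auth = AWSV4SignerAuth(credentials, region)
    return OpenSearch(
        http_auth=auth,
        connection_class=RequestsHttpConnection,
        **client_kwargs(host),
    )


# Placeholder endpoint and region: substitute your own domain.
# client = make_client("my-domain.eu-west-2.es.amazonaws.com", "eu-west-2")
```

Nothing touches the network until a request is made, so a wrong host or missing credentials will only show up on the first call.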
Phew! Let's create the data now:
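For illustration, the data can be any list of JSON-serialisable dicts. The records and field names below are made up.

```python
import json

# Hypothetical sample documents; any JSON-serialisable dicts will do.
data = [
    {"title": "First document", "views": 10},
    {"title": "Second document", "views": 25},
    {"title": "Third document", "views": 3},
]

# Sanity check: every record must survive JSON serialisation,
# otherwise building the bulk body will fail later on.
for record in data:
    json.dumps(record)
```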
TIP! Create the index if you need to -
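One way to do that, sketched under the assumption that client is an opensearch-py OpenSearch instance; ensure_index is a hypothetical helper name.

```python
def ensure_index(client, index_name, body=None):
    """Create index_name if it does not already exist.

    client is assumed to be an opensearchpy.OpenSearch instance.
    body may carry optional settings and mappings for the new index.
    Returns True if the index was created, False if it was already there.
    """
    if client.indices.exists(index=index_name):
        return False
    client.indices.create(index=index_name, body=body)
    return True
```

Calling ensure_index(client, "my-index") twice is harmless: the second call finds the index and does nothing.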
This is where things go a bit nutty. I hadn't realised that every single bulk action needs an, er, action, e.g. "index", "create", "update" or "delete", so let's define that now. You can read all about it in the bulk REST API documentation.
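A minimal "index" action might look like this; the index name is a placeholder.

```python
index_name = "my-index"  # placeholder index name

# One action line precedes every document in the bulk body.
# "index" adds the document, or replaces it if the ID already exists.
action = {"index": {"_index": index_name}}
```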
The next quirk is that the OpenSearch bulk API requires Newline Delimited JSON (see https://www.ndjson.org), which is basically JSON objects serialized as strings, one per line, with a newline after the last one too. Someone wrote on SO that this "bizarre" API looked like one designed by a data scientist; far from taking offence, I think that rocks. (I agree ndjson is weird though.)
Hideously, now let's build up the full JSON string, combining the data and actions. A helper fn is at hand!
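A sketch of such a helper, using only the standard library. The function name and the sample data are my own; the point is that the action line and the document line alternate, and every line ends with a newline.

```python
import json


def payload_constructor(data, action):
    """Interleave the action with each document, one JSON object per line."""
    action_line = json.dumps(action)
    lines = []
    for doc in data:
        lines.append(action_line)
        lines.append(json.dumps(doc))
    # Every line, including the last, is terminated by a newline.
    return "".join(line + "\n" for line in lines)


action = {"index": {"_index": "my-index"}}
data = [{"title": "a"}, {"title": "b"}]
payload = payload_constructor(data, action)
print(payload)
```

json.dumps never emits raw newlines with its default settings, so each document is guaranteed to stay on a single line.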
OK so now we can finally invoke the bulk API. I suppose you could mix in all sorts of actions (out of scope here) - go for it!
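The call itself is a one-liner, but the response deserves a look. A hedged sketch, assuming client is an opensearch-py OpenSearch instance and payload is the NDJSON string built in the previous step; bulk_index is a hypothetical wrapper name.

```python
def bulk_index(client, payload):
    """Send an NDJSON payload to the bulk API and collect per-item failures.

    client is assumed to be an opensearchpy.OpenSearch instance.
    Returns the raw response and a list of the failed item results.
    """
    response = client.bulk(body=payload)
    # A bulk request can succeed overall while individual items fail,
    # so check the "errors" flag and then the per-item results.
    failed = []
    if response.get("errors"):
        for item in response["items"]:
            # Each item is keyed by its action type, e.g. {"index": {...}}.
            result = next(iter(item.values()))
            if "error" in result:
                failed.append(result)
    return response, failed
```

Typical use would be response, failed = bulk_index(client, payload), followed by logging or retrying whatever ends up in failed.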
That's probably the most boring punchline ever but there you have it.
You can also just get (geddit) .bulk() to take the index name via index= and set the action to an empty "index" action, i.e. one with no _index in it. Hey presto!
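A sketch of that variant, again assuming client is an opensearch-py OpenSearch instance; bulk_index_into is a hypothetical helper name. The index passed to .bulk() acts as the default for every action that does not name one.

```python
import json


def bulk_index_into(client, index_name, data):
    """Bulk index data into index_name using the index= argument.

    client is assumed to be an opensearchpy.OpenSearch instance.
    """
    # With a default index supplied, the action needs no "_index".
    action_line = json.dumps({"index": {}})
    body = "".join(f"{action_line}\n{json.dumps(doc)}\n" for doc in data)
    return client.bulk(body=body, index=index_name)
```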
Now, choose your poison - the other solution looks crazily shorter and neater.
PS The well-hidden opensearch-py documentation on this is located here.