I am trying to use the MongoDB aggregation pipeline in a Synapse Spark notebook. The use case is to convert ObjectId-typed fields, such as the _id field, to strings with $addFields.
However, my attempts fail with:
IllegalArgumentException: Unrecognized configuration specified: (pipeline,[{ $addFields: { _id: { '$toString': '$_id' } } } ])
I have been brute-forcing quote-mark combinations, drawing on examples from Copilot and the documentation. Here is one attempt:
pipeline = "[{ '$addFields': { '_id': { '$toString': $_id } } } ]"
df = spark.read\
.format("cosmos.olap")\
.option("spark.synapse.linkedService", "CosmosDbMongoDb1")\
.option("spark.cosmos.container", "<COLLECTION NAME>")\
.option("pipeline", pipeline)\
.load()
display(df.limit(10))
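Whatever the connector support story turns out to be, the quoting can at least be taken out of the equation. The pipeline string above is not valid JSON ($_id is unquoted), and rather than hand-tuning quote marks, the stages can be built as Python dicts and serialized, which guarantees a well-formed literal. A minimal sketch, plain Python with no connector involved:

```python
import json

# Build the stage as Python dicts, then serialize. json.dumps emits
# double-quoted keys and values, so operators like "$toString" and
# field paths like "$_id" come out correctly quoted every time.
stages = [{"$addFields": {"_id": {"$toString": "$_id"}}}]
pipeline = json.dumps(stages)
print(pipeline)  # [{"$addFields": {"_id": {"$toString": "$_id"}}}]
```

That the error still reads "Unrecognized configuration specified" even for a well-formed string suggests the option itself, not the quoting, is the problem.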
Did I make some noob mistake in the literal format, or is this a case of missing support in the Spark connector?
EDIT: This paragraph in the connector docs could very well point to an answer for someone more experienced with MongoDB.
> Custom aggregation pipelines must be compatible with the partitioner strategy. For example, aggregation stages such as $group do not work with any partitioner that creates more than one partition.
I found a Scala code example in the Azure documentation that works for converting the ObjectId field to a string. This solution also answers the question here.
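The Scala example is not reproduced here, but its core is converting the raw 12-byte ObjectId value, which the analytical store surfaces as binary, into the familiar 24-character hex string. A minimal Python sketch of that conversion (the helper name and the _id.objectId subfield path are my assumptions, not taken from the question):

```python
def objectid_bytes_to_hex(raw: bytes) -> str:
    # A BSON ObjectId is 12 bytes; its canonical string form is the
    # lowercase hex encoding of those bytes (24 characters).
    return raw.hex()

# In Spark, the same logic can be wrapped in a UDF (sketch; assumes the
# ObjectId bytes surface under a subfield such as _id.objectId):
# from pyspark.sql.functions import udf, col
# to_hex = udf(objectid_bytes_to_hex)
# df = df.withColumn("_id_str", to_hex(col("_id.objectId")))
```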