Converting pyspark column to ObjectId for mongo spark connector v10.x

I've seen this question, but it isn't working for me. I think the versions of my tools are the problem.

Here is some sample code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import struct

# Create a SparkSession
spark = SparkSession.builder.master("local[*]").appName("pySpark") \
                    .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.12:10.1.0') \
                    .config('spark.mongodb.read.connection.uri', 'mongodb://127.0.0.1:27017/') \
                    .config('spark.mongodb.write.connection.uri', 'mongodb://127.0.0.1:27017/') \
                    .config('spark.mongodb.write.database' ,'projectdb') \
                    .config('spark.mongodb.write.collection', 'cc') \
                    .config('spark.mongodb.write.convertJson', True) \
                    .getOrCreate()

data = [("123456789012",), ("678907890876",)]
df = spark.createDataFrame(data, ["_id"])
df = df.withColumn("_id", struct(df["_id"].alias("oid")))

# Print the DataFrame schema
df.printSchema()
df.show()

df.write.format("mongodb").mode("append").save()

Its output is:

root
 |-- _id: struct (nullable = false)
 |    |-- oid: string (nullable = true)

+--------------+                                                                
|           _id|
+--------------+
|{123456789012}|
|{678907890876}|
+--------------+

but the MongoDB document stores _id as follows:

> db.cc.find()
{ "_id" : { "oid" : NumberLong("123456789012") } }
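A likely reason the stored value appears as NumberLong rather than a string: with spark.mongodb.write.convertJson enabled, the connector treats string values as JSON and parses them, so the digits-only string "123456789012" is read as a number. A minimal illustration of that parsing behavior, using Python's json module as a stand-in for the connector's JSON conversion (an assumption for illustration, not the connector's actual code path):

```python
import json

# A digits-only string is valid JSON for a number, so parsing it
# yields an int (stored by MongoDB as NumberLong), not a string.
parsed = json.loads("123456789012")
print(type(parsed).__name__, parsed)  # int 123456789012

# The value stays a string only if the JSON itself quotes it.
parsed_str = json.loads('"123456789012"')
print(type(parsed_str).__name__, parsed_str)  # str 123456789012
```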

I want it to look like this:

{"_id" : ObjectId(123456789012)}

I'm using MongoDB version 5.0.15, PySpark version 3.4.0, and mongo-spark connector version 10.1.0. Any help is appreciated.
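One approach that might work with connector 10.x is to write _id as a MongoDB Extended JSON string of the form {"$oid": "<24 hex chars>"}, which the convertJson option should then parse into an ObjectId on write. Note that an ObjectId must be exactly 24 hexadecimal characters, so a short id like "123456789012" would first need padding or re-encoding. A hedged sketch of building such strings in plain Python (the helper name and the zero-padding choice are illustrative assumptions, not part of the connector):

```python
# Sketch: build Extended JSON "$oid" strings for the connector's
# convertJson option to interpret as ObjectId values on write.
# An ObjectId is exactly 24 hex characters; here short hex-compatible
# ids are left-padded with zeros (an illustrative choice, not a rule).

def to_oid_json(raw_id: str) -> str:
    hex_id = raw_id.lower().rjust(24, "0")
    if len(hex_id) != 24 or any(c not in "0123456789abcdef" for c in hex_id):
        raise ValueError(f"cannot form a 24-hex-char ObjectId from {raw_id!r}")
    return '{"$oid": "%s"}' % hex_id

data = ["123456789012", "678907890876"]
oid_strings = [to_oid_json(x) for x in data]
print(oid_strings)
```

In the Spark job this string would replace the struct column, e.g. built with a UDF wrapping to_oid_json and assigned via df.withColumn("_id", ...), with convertJson left enabled so the connector interprets the string on write.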
