I've seen this question, but the approach there is not working for me. I suspect my tool versions are the problem.
Here is a code sample:
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct
# Create a SparkSession
spark = SparkSession.builder.master("local[*]").appName("pySpark") \
.config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.12:10.1.0') \
.config('spark.mongodb.read.connection.uri', 'mongodb://127.0.0.1:27017/') \
.config('spark.mongodb.write.connection.uri', 'mongodb://127.0.0.1:27017/') \
.config('spark.mongodb.write.database', 'projectdb') \
.config('spark.mongodb.write.collection', 'cc') \
.config('spark.mongodb.write.convertJson', True) \
.getOrCreate()
data = [("123456789012",), ("678907890876",)]
df = spark.createDataFrame(data, ["_id"])
df = df.withColumn("_id", struct(df["_id"].alias("oid")))
# Print the DataFrame schema
df.printSchema()
df.show()
df.write.format("mongodb").mode("append").save()
Its output is:
root
|-- _id: struct (nullable = false)
| |-- oid: string (nullable = true)
+--------------+
| _id|
+--------------+
|{123456789012}|
|{678907890876}|
+--------------+
but the stored MongoDB document has _id as follows:
> db.cc.find()
{ "_id" : { "oid" : NumberLong("123456789012") } }
I want it stored like this:
{ "_id" : ObjectId("123456789012") }
I'm using MongoDB 5.0.15, PySpark 3.4.0, and Mongo Spark Connector 10.1.0. Any help is appreciated.
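For context, in case it matters for an answer: a MongoDB ObjectId is a 12-byte value, conventionally written as a 24-character hex string, so my 12-character sample ids above may not even be valid ObjectId inputs. Here is a minimal stdlib sketch of what an ObjectId-shaped string looks like (make_oid is just an illustrative helper I wrote for testing, not part of the connector):

```python
import secrets

def make_oid() -> str:
    # Generate 12 random bytes and render them as a 24-character
    # lowercase hex string -- the textual shape of an ObjectId.
    # (A real ObjectId also encodes a timestamp; this is only a
    # shape-compatible test value.)
    return secrets.token_hex(12)

oid = make_oid()
print(len(oid))                                       # 24
print(all(c in "0123456789abcdef" for c in oid))      # True
```

If the connector only converts 24-hex-character strings to ObjectId, that might explain why my 12-character ids fall back to NumberLong, but I haven't been able to confirm this.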