Hi all, I'm loading data from MongoDB into an Oracle DB using Python (pymongo). It runs too slowly and eventually fails with a 'cursor not found' error. If I add a $match on a date range to my pipeline and read only a few days, it works, but I actually need to load the whole history of about 10 years.
pipeline = [
    {"$match": {"create": {"$gte": s_days_ago, "$lte": e_days_ago}}},
    {'$project': {
        '_id': '$_id',
        'address': '$address',
        'latitude': '$latitude',
        'longitude': '$longitude',
        'country': '$country',
        'state': '$state',
        'city': '$city',
        'create': '$create',
    }}
]
Any suggestion would be appreciated for these two questions:
- How can I read the data faster?
- How can I prevent the 'cursor not found' error after a long run?
My full code is below:
from pymongo import MongoClient
from pymongo import ReadPreference
from pandas import DataFrame
from datetime import datetime
from datetime import timedelta
import cx_Oracle
import pandas as pd

E_DAYS_AGO = 18
S_DAYS_AGO = 20
today = datetime.now()
s_days_ago = today - timedelta(days=S_DAYS_AGO)
e_days_ago = today - timedelta(days=E_DAYS_AGO)

connection = MongoClient('10.1.1.1', 27017)
db = connection.get_database('mydb', read_preference=ReadPreference.SECONDARY)
collection = db.location

pipeline = [
    {'$project': {
        '_id': '$_id',
        'address': '$address',
        'latitude': '$latitude',
        'longitude': '$longitude',
        'country': '$country',
        'state': '$state'
    }}
]

# read every document into a DataFrame, one row at a time
df = pd.DataFrame()
for doc in collection.aggregate(pipeline):
    df = df.append(doc, ignore_index=True)
df = df.fillna('')

# strip the DBRef wrapper from the _id column
df['_id'] = df['_id'].apply(lambda x: str(x).replace('DBRef(', '')) \
                     .apply(lambda x: str(x).replace(', ObjectId(', ''))) \
                     .apply(lambda x: str(x).replace('))', '')) \
                     .apply(lambda x: str(x).replace("'", ''))
df = df.reindex(columns=['_id', 'address', 'latitude', 'longitude', 'state', 'country'])

# write the rows into Oracle one INSERT at a time
db = cx_Oracle.connect('bi/[email protected]:1521/orcl')
cursor = db.cursor()
cursor.execute("TRUNCATE TABLE location")
sql = 'INSERT INTO BI.location VALUES (:1,:2,:3,:4,:5,:6,:7,:8)'
df_list = df.values.tolist()
n = 0
for i in df.iterrows():
    cursor.execute(sql, df_list[n])
    n += 1
db.commit()
cursor.close()
db.close()
This fragment reads all the results into a DataFrame before processing them one row at a time into Oracle. I would recommend you process them one at a time instead, i.e. create a single-element DataFrame and pass that on to Oracle as you go; building the whole DataFrame up front is what is causing the delay. You then compound this by running apply() on the DataFrame, which does a second iteration over the data. Why not do the apply() processing while you are creating the DataFrame? In essence, pymongo is not slow: you are processing the whole dataset twice on the pymongo side and then doing simple row-by-row writes on the Oracle side.
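For illustration, here is a rough sketch of doing that cleanup once per document while streaming the cursor, instead of three apply() passes over the finished DataFrame. It reuses the collection and pipeline from your code; clean_id is just a name I made up, and the batchSize value is an assumption you would tune:

def clean_id(value):
    # same DBRef -> plain string cleanup as your apply() chain, done per document
    return (str(value).replace('DBRef(', '')
                      .replace(', ObjectId(', '')
                      .replace('))', '')
                      .replace("'", ''))

# batchSize (assumed value) controls how many documents each getMore returns;
# keeping the per-document work this cheap also keeps the cursor busy between
# fetches, which is usually what a 'cursor not found' timeout points to.
for doc in collection.aggregate(pipeline, batchSize=1000):
    doc['_id'] = clean_id(doc.get('_id', ''))
    # ... hand the cleaned document straight on to Oracle here ...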
The DataFrame looks like a pretty heavyweight object here. I think you could do all that preprocessing in a plain dict instead, but I am not an Oracle expert, so I am not sure what you are doing at the Oracle end.
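For what it is worth, here is a sketch of the shape I have in mind, skipping pandas entirely: build one tuple per document and write them to Oracle in batches with executemany(). It reuses clean_id from the sketch above plus the connection details from your code; the six columns follow your $project, so adjust the INSERT to match your actual table, and the batch size of 1000 is arbitrary:

import cx_Oracle

ora = cx_Oracle.connect('bi/[email protected]:1521/orcl')
ora_cur = ora.cursor()
ora_cur.execute("TRUNCATE TABLE location")
# six bind variables to match the six projected fields (adjust to your table)
sql = 'INSERT INTO BI.location VALUES (:1,:2,:3,:4,:5,:6)'

BATCH = 1000  # rows per executemany() round trip (assumption)
buffer = []
for doc in collection.aggregate(pipeline, batchSize=1000):
    buffer.append((clean_id(doc.get('_id', '')),
                   doc.get('address', ''),
                   doc.get('latitude', ''),
                   doc.get('longitude', ''),
                   doc.get('state', ''),
                   doc.get('country', '')))
    if len(buffer) >= BATCH:
        ora_cur.executemany(sql, buffer)  # one round trip per batch
        buffer.clear()

if buffer:
    ora_cur.executemany(sql, buffer)
ora.commit()
ora_cur.close()
ora.close()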