Importing data from MongoDB into Oracle DB with pymongo is running too slow


Dears, I'm loading data from MongoDB into Oracle DB using Python (pymongo). It runs too slowly and eventually fails with a 'cursor not found' error. If I add a $match on the date field to my pipeline and read only a few days of data, it works, but I actually need to load the whole history for 10 years.

pipeline = [
    {"$match": {"create": {"$gte": s_days_ago, "$lte": e_days_ago}}},
    {"$project": {
        "_id": "$_id",
        "address": "$address",
        "latitude": "$latitude",
        "longitude": "$longitude",
        "country": "$country",
        "state": "$state",
        "city": "$city",
        "create": "$create",
    }},
]

So any suggestion would be appreciated on these two questions:

  1. How can I read the data as fast as possible?
  2. How can I prevent the 'cursor not found' error on a long-running read? (A chunked-read sketch follows this list.)
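
One approach that may help with both (a sketch, not a tested solution): break the 10-year history into smaller date windows and run one aggregation per window, so no single cursor stays open long enough to time out. The window length, batch size, and the handle() callback below are assumptions; the database, collection, and field names are taken from the question.

from datetime import datetime, timedelta
from pymongo import MongoClient, ReadPreference

client = MongoClient('10.1.1.1', 27017)
db = client.get_database('mydb', read_preference=ReadPreference.SECONDARY)
collection = db.location

start = datetime(2011, 1, 1)   # assumed start of the 10-year history
end = datetime.now()
window = timedelta(days=30)    # assumed window size; tune to your data volume

project = {'$project': {
    '_id': 1, 'address': 1, 'latitude': 1, 'longitude': 1,
    'country': 1, 'state': 1, 'city': 1, 'create': 1,
}}

current = start
while current < end:
    upper = min(current + window, end)
    pipeline = [
        {'$match': {'create': {'$gte': current, '$lt': upper}}},
        project,
    ]
    # batchSize keeps each getMore small; allowDiskUse helps with large stages
    for doc in collection.aggregate(pipeline, batchSize=1000, allowDiskUse=True):
        handle(doc)   # hypothetical per-document handler (e.g. buffer rows for Oracle)
    current = upper

An index on create keeps each windowed $match fast; without it, every window still scans the whole collection.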

My full code is below:

from pymongo import MongoClient
from pymongo import ReadPreference
from pandas import DataFrame
from datetime import datetime
from datetime import timedelta
import cx_Oracle
import pandas as pd

E_DAYS_AGO = 18
S_DAYS_AGO = 20
today = datetime.now()    
s_days_ago = today - timedelta(days=S_DAYS_AGO)
e_days_ago = today - timedelta(days=E_DAYS_AGO)

connection = MongoClient('10.1.1.1',27017)
db = connection.get_database('mydb', read_preference=ReadPreference.SECONDARY)
collection = db.location

pipeline = [
    {'$project': {
        '_id': '$_id',
        'address': '$address',
        'latitude': '$latitude',
        'longitude': '$longitude',
        'country': '$country',
        'state': '$state'
    }}
]

df = pd.DataFrame()

for doc in collection.aggregate(pipeline):
    df = df.append(doc, ignore_index=True)

df = df.fillna('')
df['_id'] = df['_id'].apply(lambda x: str(x).replace('DBRef(', '')) \
                     .apply(lambda x: str(x).replace(', ObjectId(', '')) \
                     .apply(lambda x: str(x).replace('))', '')) \
                     .apply(lambda x: str(x).replace("'", ''))

df = df.reindex(columns=['_id', 'address', 'latitude', 'longitude', 'state', 'country'])

db = cx_Oracle.connect('bi/[email protected]:1521/orcl')
cursor = db.cursor()

cursor.execute("TRUNCATE TABLE location")
sql = 'INSERT INTO BI.location VALUES (:1,:2,:3,:4,:5,:6)'

df_list = df.values.tolist()
for row in df_list:
    cursor.execute(sql, row)

db.commit()
cursor.close()
db.close()

1 Answer

Joe Drumgoole:

This fragment:

for doc in collection.aggregate(pipeline):
    df = df.append(doc, ignore_index=True)

Reads all the results into a dataframe before processing them one at a time into Oracle. I would recommend you process them one at a time, i.e. create a single-element dataframe and pass that on to Oracle. That is what is causing the delay. You then compound this by running an apply() on the dataframe, which does a second iteration over the data. Why not do the apply() processing while you are creating the dataframe?
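
As a hedged illustration of that suggestion (reusing the collection and pipeline from the question, with a hypothetical clean_id() helper that mirrors the string replacements in the question), the cleanup can happen while iterating the cursor, and the DataFrame can be built once at the end instead of calling append() per document:

import pandas as pd

def clean_id(raw):
    # same string cleanup of the DBRef _id as in the question's apply() chain
    return (str(raw)
            .replace('DBRef(', '')
            .replace(', ObjectId(', '')
            .replace('))', '')
            .replace("'", ''))

records = []
for doc in collection.aggregate(pipeline):
    doc['_id'] = clean_id(doc.get('_id', ''))
    records.append(doc)

# one DataFrame construction instead of one append() per document
df = pd.DataFrame(records).fillna('')
df = df.reindex(columns=['_id', 'address', 'latitude', 'longitude', 'state', 'country'])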

In essence, pymongo is not slow: you are processing the whole dataset twice on the pymongo side and then doing simple writes on the Oracle side.

The dataframe looks like a pretty heavyweight object. I think you could do all that preprocessing in a dict, but I am not an Oracle expert so I am not sure what you are doing at the Oracle end.
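
Building on that, here is a minimal sketch that skips the DataFrame entirely: rows are assembled as tuples while iterating the cursor and written in batches with cx_Oracle's executemany(), which sends one round trip per batch instead of one per row. The batch size and the placeholder connection string are assumptions, clean_id() is the hypothetical helper from the previous sketch, and the column list should be adjusted to the actual table definition.

import cx_Oracle

BATCH = 5000   # assumed batch size
sql = 'INSERT INTO BI.location VALUES (:1, :2, :3, :4, :5, :6)'

ora = cx_Oracle.connect('bi/password@host:1521/orcl')   # placeholder connection string
ora_cur = ora.cursor()

rows = []
for doc in collection.aggregate(pipeline, batchSize=1000):
    rows.append((
        clean_id(doc.get('_id', '')),
        doc.get('address', ''),
        doc.get('latitude', ''),
        doc.get('longitude', ''),
        doc.get('state', ''),
        doc.get('country', ''),
    ))
    if len(rows) >= BATCH:
        ora_cur.executemany(sql, rows)   # one batched insert instead of per-row execute()
        rows.clear()

if rows:
    ora_cur.executemany(sql, rows)

ora.commit()
ora_cur.close()
ora.close()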