Random key lookup on LMDB/python vs BerkeleyDB/python (How to make LMDB lookup faster)


I have a program written in Python that uses BerkeleyDB to store data (event logs), which I migrated to LMDB. Before an event is written, the program checks whether the event already exists. I noticed that the BerkeleyDB version is much faster at this single-value lookup with 13k+ records (the LMDB version seems about 1 second slower per lookup), even with transactions enabled in BerkeleyDB. Any idea how to speed up the LMDB version? Note that I already have 70 GB+ (about 30 million records) of data stored in BerkeleyDB, and additional processing on those events takes more than an hour, so I hoped that switching to LMDB would reduce the processing time.

My LMDB environment was opened this way (I even set readahead to False, but the database is only about 35 MB, so I don't think it matters):

env = lmdb.open(db_folder, map_size=100000000000, max_dbs=4, readahead=False)
database = env.open_db('events'.encode())

My berkeleydb was opened this way:

env = db.DBEnv()
env.open(db_folder, db.DB_INIT_MPOOL | db.DB_CREATE | db.DB_INIT_LOG | db.DB_INIT_TXN | db.DB_RECOVER, 0)
database = db.DB(env)

BerkeleyDB version of check:

if event['eId'].encode('utf-8') in database:
    duplicate_count += 1
else:
    try:
        txn = env.txn_begin(None)
        database[event['eId'].encode('utf-8')] = json.dumps(event).encode('utf-8')
    except:
        if txn is not None:
            txn.abort()
            txn = None
        raise
    else:
        txn.commit()
        txn = None
    event_count += 1

LMDB version of check:

# write=True is required for txn.put(); begin() defaults to a read-only transaction
with env.begin(write=True, buffers=True, db=database) as txn:
    if txn.get(event['eId'].encode()) is not None:
        dup_event_count += 1
    else:
        txn.put(event['eId'].encode(), json.dumps(event).encode('utf-8'))
        event_count += 1

Solution:

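Place the with env.begin() block outside the loop, so that one transaction serves all lookups (first case) instead of a new transaction being opened for every key (second case):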

@case('rand lookup')
def test():
    with env.begin() as txn:
        for word in words:
            txn.get(word)
    return len(words)

@case('per txn rand lookup')
def test():
    for word in words:
        with env.begin() as txn:
            txn.get(word)
    return len(words)

There is 1 best solution below

mark On

Figured this out myself. I was doing a per-transaction random lookup, that is, opening a new transaction for every key. I just had to move with env.begin() outside of the for loop (the loop is not visible in my example), as suggested in this example: https://raw.githubusercontent.com/jnwatson/py-lmdb/master/examples/dirtybench.py
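
Applied to the duplicate check from the question, the fix could look roughly like the sketch below. It assumes env is the lmdb.Environment and database the handle returned by env.open_db(), as in the question. The function name store_new_events and the events iterable are hypothetical names made up for illustration, not part of the original program.

```python
import json


def store_new_events(env, database, events):
    """Insert events whose eId is not stored yet, using one write transaction.

    env is assumed to be an lmdb.Environment and database a handle from
    env.open_db(); events is a hypothetical iterable of dicts with an 'eId' key.
    """
    event_count = 0
    dup_event_count = 0
    # Begin the transaction once, outside the loop, so that transaction
    # setup and commit are paid a single time instead of once per event.
    with env.begin(write=True, db=database) as txn:
        for event in events:
            key = event['eId'].encode('utf-8')
            if txn.get(key) is not None:
                # Key already present: count it as a duplicate.
                dup_event_count += 1
            else:
                txn.put(key, json.dumps(event).encode('utf-8'))
                event_count += 1
    # Leaving the with block commits the transaction (or aborts it if an
    # exception was raised inside the block).
    return event_count, dup_event_count
```

Because everything is written in a single transaction, an exception part-way through discards all events from that call. Whether that is acceptable, or whether the events should be committed in smaller batches, depends on the program.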