Equal distribution of CouchDB document keys for parallel processing

I have a CouchDB instance where each document has a unique id (a string). I would like to go over each document in the db and perform an external operation based on its contents (for example, connecting to another web server to fetch specific details). However, instead of processing each document sequentially, is it possible to first split the document keys into k buckets, each represented by a starting key and an ending key (the id being the key), then query for all documents in each bucket separately and run the external operation on each bucket's documents in parallel?
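
For illustration, the bucketing step might look something like this (a minimal sketch using couchdb-python, assuming the view emits each document's id as its key; the server URL, database name, and the make_buckets helper are placeholders, not part of the original code):

import couchdb

server = couchdb.Server('http://localhost:5984/')  # assumed URL
db = server['mydb']                                # assumed database name

def make_buckets(view_name, k):
    # Fetch every key once (rows only, no document bodies), then split the
    # sorted key list into k roughly equal contiguous (start, end) ranges.
    keys = [row.key for row in db.view(view_name)]
    size = max(1, -(-len(keys) // k))  # ceiling division
    return [(keys[i], keys[min(i + size, len(keys)) - 1])
            for i in range(0, len(keys), size)]

buckets = make_buckets('mydbviews/id', 4)  # four (startkey, endkey) pairs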

I currently use couchdb-python to access my db and views. For example, this is the code I currently use:

for res in db.view("mydbviews/id"):
  doc = db[res.id]
  do_external_operation(doc) # Time consuming operation

It would be great if I could do something like a 'parallel for' over the above loop.

1 Answer

Assuming that you're only emitting one result per document in the view, running the view with start and end keys, combined with a Python parallelisation technique, should be sufficient here. As @Ved says, the bigger issue is the parallel processing itself, rather than generating the subsets of documents. I'd recommend the multiprocessing module, like so:

import couchdb
import multiprocessing

COUCH_URL = 'http://localhost:5984/'  # adjust to your server
DB_NAME = 'mydb'                      # adjust to your database

def work_on_subset(viewname, key_low, key_high):
    # Each worker opens its own connection; sharing one connection object
    # across processes is unsafe.
    db = couchdb.Server(COUCH_URL)[DB_NAME]
    rows = db.view(viewname, startkey=key_low, endkey=key_high)
    for row in rows:
        pass # Do your work here

viewname = '_design/designname/_view/viewname'
key_list = [('a', 'z'), ('1', '10')] # Or whatever subsets you want
pool = multiprocessing.Pool(processes=10) # Or however many you want
results = []
for (key_low, key_high) in key_list:
    results.append(pool.apply_async(work_on_subset, args=(viewname, key_low, key_high)))
pool.close()
pool.join()
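
Each apply_async call returns an AsyncResult; calling get() on each one after the pool finishes will re-raise any exception that occurred inside a worker (and return the worker's result, if work_on_subset returns one):

for r in results:
    r.get()  # re-raises here if the worker failed

One caveat: on platforms that spawn new interpreters rather than fork (e.g. Windows), the pool setup needs to live under an if __name__ == '__main__': guard.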