Is it possible to implement populate such that it utilizes insert instead of insert1?


Is it possible to implement populate such that it utilizes insert instead of insert1? For large data sets, I am getting a server timeout error because using insert1 takes hours to populate a table. Here is what I have so far:

import scipy.io as spio

def make(self, key):

    keys = Session.fetch('KEY')  # get the primary key(s) from Session

    filename = 'data/AJ0{mouse_id}_{session_id}'.format(**key)  # filename of the session of interest
    mat = spio.loadmat(filename, squeeze_me=True, struct_as_record=False)  # load the data in .mat format
    data = mat[list(mat)[-1]]  # unpack the dictionaries to select the specific data

    activity_arr = data.deResp
    n_trials, n_neuron = activity_arr.shape

    for neuro_id in range(n_neuron):
        for trial_id in range(n_trials):
            key['neuro_id'] = neuro_id
            key['activity'] = activity_arr[trial_id, neuro_id]
            self.insert1(key, skip_duplicates=True)

What I would like is something like this:

for neuro_id in range(n_neuron):
    key['neuro_id'] = np.asarray([neuro_id] * n_trials)
    key['activity'] = activity_arr[0:n_trials, neuro_id]
    self.insert(key, skip_duplicates=True)

In short, I would like to insert a list of dictionaries while still being able to call the populate method. Any suggestions?
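In other words, building one dictionary per row and inserting them all in one call, roughly like this (just a sketch of what I am after; it assumes the table also has a trial_id attribute, as in the older make shown below):

def make(self, key):
    filename = 'data/AJ0{mouse_id}_{session_id}'.format(**key)
    mat = spio.loadmat(filename, squeeze_me=True, struct_as_record=False)
    data = mat[list(mat)[-1]]

    activity_arr = data.deResp
    n_trials, n_neuron = activity_arr.shape

    # build one dict per row, then insert everything in a single call
    rows = [
        dict(key, neuro_id=neuro_id, trial_id=trial_id,
             activity=activity_arr[trial_id, neuro_id])
        for neuro_id in range(n_neuron)
        for trial_id in range(n_trials)
    ]
    self.insert(rows, skip_duplicates=True)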

In the past, I implemented the make function below, but with that version I am not able to call populate on it.

import numpy as np

def make(self):

    activity_arr = data.deResp
    n_trials, n_neuron = activity_arr.shape

    for neuro_id in range(n_neuron):
        trial_ids = np.arange(n_trials)
        mouse_id = np.asarray([self.mouse_id] * n_trials)
        neuro_ids = np.asarray([neuro_id] * n_trials)
        sess = np.asarray([self.session_id] * n_trials)
        acts = activity_arr[0:n_trials, neuro_id]
        arr = np.vstack((mouse_id, sess, trial_ids, neuro_ids, acts)).T
        self.insert(list(arr), skip_duplicates=True)

There is 1 solution below.


Implementing populate in a different way would probably not solve the server timeout problem in this case. When you call populate, it pulls the keys from upstream tables in the pipeline and calls your make function on each of them. Each make call is wrapped in a MySQL transaction, which can time out if your make function takes too long.
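Conceptually, populate behaves roughly like this (a simplified sketch, not the actual implementation; MyTable stands in for your auto-populated table class):

import datajoint as dj

table = MyTable()
# loop over keys that are in the key source but not yet in the table
for key in (table.key_source - table).fetch('KEY'):
    with dj.conn().transaction:   # each make call runs in its own transaction
        table.make(key)           # a slow make here is what hits the timeout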

There are a couple ways to address this:

1: Increase the connection timeout within your make function. This is a band-aid fix, but it will at least get your tables populating.

ex:

import datajoint as dj
conn = dj.conn()
conn.query("SET SESSION wait_timeout=3600").fetchall()  # timeout in seconds; pick a value longer than your slowest make
print(conn.query("SHOW VARIABLES LIKE 'wait_timeout'").fetchall())

The timeout is in seconds; the print statement will show you whether it changed properly. I suggest changing it before you call populate and then changing it back to the original timeout after populate is done.
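For example (a sketch; MyTable stands in for your auto-populated table class):

import datajoint as dj

conn = dj.conn()

# remember the current timeout so it can be restored afterwards
original = conn.query("SHOW VARIABLES LIKE 'wait_timeout'").fetchall()[0][1]

conn.query("SET SESSION wait_timeout=3600")      # raise the timeout (seconds)
MyTable.populate()                               # MyTable is a placeholder for your table
conn.query("SET SESSION wait_timeout={}".format(original))  # restore the original value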

2: Ping the server to refresh the timeout. This may not help in your case, because I suspect it is the spio.loadmat() call that is taking so long.

Use this code to ping the server between pieces of code that take a long time to execute:

import datajoint as dj
conn = dj.conn()
conn.ping()
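For instance, a ping could go right after the slow load in the make function from the question (a sketch):

import datajoint as dj
import scipy.io as spio

def make(self, key):
    filename = 'data/AJ0{mouse_id}_{session_id}'.format(**key)
    mat = spio.loadmat(filename, squeeze_me=True, struct_as_record=False)  # slow step
    dj.conn().ping()  # refresh the idle timeout before moving on to the inserts
    # ... build the rows and insert as before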

3: Optimize your code to be more performant so it does not time out. In this case I don't think you can do that in a meaningful way, as most of the computational overhead is likely in the spio.loadmat() call that loads the MATLAB binary.

A note about the populate method:

When you run it with the suppress_errors=True flag, instead of erroring out, the function moves on to the next key and appends the error and its associated key to a list, which it returns. This behavior can help with populates that take a long time, since you can come back later and address only the keys that failed to populate.
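For example (a sketch; MyTable is a placeholder, and the exact return format depends on your DataJoint version):

# collect failures instead of raising on the first error
errors = MyTable.populate(suppress_errors=True)

# the failed keys and their errors are returned (exact format depends on the
# DataJoint version), so the failing keys can be revisited later
print(errors)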