I have two data frames one have start Data and End Date, second data is having Just date. Basically One frame is having group and other have child data. So I want to join all the date which comes between start date and End date. Below is the code I am trying but understandably it takes quite a long time as I have large data.
def is_between_dates(date, start, end):
return start <= date <= end
result_df = pd.DataFrame()
for idx, row in alarm_data.iterrows():
mask = event_data['Date'].apply(lambda x: is_between_dates(x, row['AlarmStart'], row['AlarmEnd']))
temp_df2 = event_data[mask].copy()
temp_df2['AlarmStart'] = row['AlarmStart']
temp_df2['AlarmEnd'] = row['AlarmEnd']
result_df = pd.concat([result_df, temp_df2])
I have like 2 million data in event data frame and 200000 data in alarm data which is expected. I don't want to use spark to achieve this as it will create infrastructure burden.
Some direction will be helpful
Original answer
Your question needs to be rephrased as it is not obvious what you want to achieve without reading it multiple times inspecting the code. Nevertheless, if I understood correctly, you have a large
event_datawithDate, and you want to add information about the correspondingAlarmStartandAlarmEndfromalarm_data.The main performance issue in your code is the
pandas.concat, which creates a new dataframe every time, 200000 times in your case. This is clearly inefficient, since a large amount of memory is being copied a number of times. Moreover,dataframe.iterrows()is also really slow for any large dataframe, and is heavily discouraged (read more about it here).What I would suggest instead is to make a copy of
event_data(preferably with dummy columns forAlarmStartandAlarmEnd) and modify it in place. I assume there is no overlap between the different intervals betweenAlarmStartandAlarmEnd, so that you can uniquely assign these values to eachevent. UseAlarmStartandAlarmEndfromalarm_datawithpandas.IntervalIndexto create your list of bins which you can pass topandas.cutaspandas.cut(event_data['Date'], bins, labels=False)to categorise your events and obtain the corresponding bin number. Use these bin numbers to indexAlarmStartandAlarmEndand substitute the values inevent_data. Note thatpandas.cutdoes not accept overlappingIntervalIndex, so this will be the option for performance only if there is no overlap between your intervals. Also, the intervals might need to be ordered.I hope the explanation was clear enough. Read the examples from the documentation and try playing with a subset of your data first.
Example
Following comments that this does not really answer the question, I use a similar dataset as the other answer for demonstration. Note that this WILL NOT work if your intervals overlap (You did not specify this in the question). The other answer is more general, but I believe this is quicker if the intervals do not overlap, although I haven't tested either solutions to any large dataset.