I have a huge NetFlow database, (it contains a Timestamp, Source IP, Dest IP, Protocol, Source and Dest Port Num., Packets Exchanged, Bytes and more). I want to create custom attributes based on the current and previous rows.
I want to calculate new columns based on the source ip and timestamp of the current row. This what i want to do logically:
- Get the source ip for the current row.
- Get the Timestamp for the current row.
- Based on the source IP, and Timestamp, I want to get all the Previous rows of the entire dataframe, that matches the source IP, and the communicaton happened in the last half an hour. This is very important.
- For the rows(Flows, in my example), that matches the criteria (source ip and happened in the last half hour), I want to count the sum and mean of all the packets and all the bytes.
Snippets of relevant code:
df = pd.read_csv(path, header = None, names=['ts','td','sa','da','sp','dp','pr','flg','fwd','stos','pkt','byt','lbl'])
df['ts'] = pd.to_datetime(df['ts'])
def prev_30_ip_sum(ts,sa,size):
global joined
for (x,y) in zip(df['sa'], df['ts']):
...
return sum
df['prev30ipsumpkt'] = df.apply(lambda x: prev_30_ip_sum(x['ts'],x['sa'],x['pkt']), axis = 1)
I know that there's probably a better, more efficient way to do this, but I'm sadly not the best programmer.
Thanks.
refer this to understand timedelta