Resampling pandas DataFrame for every nth row with different parameters on each column?

I've got minute-by-minute stock data with the column headings Date, Time, Open, High, Low, Close, Volume. I need to resample it in blocks of n rows (in this example, every 3 rows), applying a different aggregation to each column. The code I have so far is:

import pandas as pd
import numpy as np

spy = pd.read_csv("C:\\Users\\PC\\Desktop\\spy_test.csv")

#Sample for every n minutes
n = 3
b = n-1


spy_date = pd.DataFrame(spy['Date'])
date = spy_date.iloc[b::n, :]

spy_time = pd.DataFrame(spy['Time'])
time = spy_time.iloc[b::n, :]
time = time.reset_index(drop=True)
spy_open = pd.DataFrame(spy['Open'])
open = spy_open.iloc[::n, :]
open = open.reset_index(drop=True)

spy_high = pd.DataFrame(spy['High'])
high_s = spy_high.iloc[::n, :].max()
high = pd.DataFrame(high_s)
high = high.reset_index(drop=True)

spy_low = pd.DataFrame(spy['Low'])
low_s = spy_low.iloc[::n, :].min()
low = pd.DataFrame(low_s)
low = low.reset_index(drop=True)

spy_close = pd.DataFrame(spy['Close'])
close = spy_close.iloc[::n, :]
close = close.reset_index(drop=True)

spy_volume = pd.DataFrame(spy['Volume'])
volume_s = spy_volume.iloc[n::3, :].sum()
volume = pd.DataFrame(volume_s)
volume = volume.reset_index(drop=True)

joined = [date, time, open, high, low, close, volume]

result = pd.concat(joined, axis=1)
result.columns = ['Date', 'Time', 'Open', 'High', 'Low', 'Close', 'Volume']

print(result)

This code worked, except that the High, Low, and Volume columns each returned only a single value. I wanted the max/min/sum of each 3-minute block, but got the max/min/sum of the entire column instead.

Also, if you know of an easier way to do this that a beginner would more or less understand, I am all ears. I've only been coding a couple weeks so I really have no clue what I'm doing.

Note: I considered using qcut but as far as I can tell I would need to figure out how many bins I would want ahead of time. As I'm going to be running varying sets of data through this same basic format it seemed less than ideal.

Best answer

Since your times are in military (24-hour) format, first convert them to timedelta with the code below. Because the times are stored as integers, you will likely also need to add leading zeros, which I have done with zfill. Here df is your DataFrame (spy in the question).

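# Zero-pad integer times such as 930 to '0930', parse them as HHMM, then convert to timedelta
df['Time'] = pd.to_timedelta(pd.to_datetime(df.Time.map(lambda x: str(x).zfill(4)), format='%H%M').dt.time.astype(str))
# pd.Grouper(freq=...) replaces pd.TimeGrouper, which newer pandas versions no longer provide; Open is the first value of each block
df.set_index('Time').groupby(['Date', pd.Grouper(freq='3min')]).agg({'Open':'first', 'High':'max', 'Low':'min', 'Close':'last', 'Volume':'sum'})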
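If you really want blocks of every n rows rather than clock-based 3-minute bins, one possible approach is to group on the row position instead of the time. The sketch below is not from the original answer: the sample data is made up to stand in for your CSV, and the choice of "last" for Date and Time and "first" for Open is an assumption about what you want from each block. It also shows why your version returned one value: `spy_high.iloc[::n, :].max()` picks every nth row and then reduces the whole selection to a single number, whereas `groupby(...).agg(...)` reduces each block separately.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for spy_test.csv from the question.
spy = pd.DataFrame({
    "Date": ["2017-01-03"] * 6,
    "Time": [930, 931, 932, 933, 934, 935],
    "Open": [10.0, 10.5, 10.2, 10.4, 10.9, 10.7],
    "High": [10.6, 10.8, 10.5, 11.0, 11.2, 10.9],
    "Low": [9.9, 10.1, 10.0, 10.3, 10.6, 10.5],
    "Close": [10.5, 10.2, 10.4, 10.9, 10.7, 10.8],
    "Volume": [100, 200, 150, 300, 250, 50],
})

n = 3  # number of rows per block

# Integer-divide each row position by n: rows 0..n-1 get label 0,
# rows n..2n-1 get label 1, and so on.
block = np.arange(len(spy)) // n

# Aggregate each block with a different rule per column.
result = spy.groupby(block).agg({
    "Date": "last",    # assumption: label the block by its last row
    "Time": "last",
    "Open": "first",   # first price in the block
    "High": "max",
    "Low": "min",
    "Close": "last",   # last price in the block
    "Volume": "sum",
})

print(result)
```

Note that if the number of rows is not a multiple of n, the final block is simply shorter and is aggregated over whatever rows it has. Unlike qcut, you do not need to know the number of bins ahead of time, since it follows from the length of the data.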