I would like to understand whether my code is working correctly.
The dataframe df2 is a vertically stacked time series of a single feature for several stocks.
stock_id | log_target_vol_corr_32_clusters_stnd |
---|---|
1 | 0.4 |
1 | 0.8 |
1 | 0.7 |
2 | 0.3 |
2 | 0.4 |
2 | 0.0 |
3 | 0.4 |
3 | 0.8 |
3 | 0.7 |
4 | 0.9 |
4 | 0.9 |
4 | 0.1 |
5 | 0.9 |
5 | 0.9 |
5 | 0.1 |
Notice that stocks (1 & 3) and (4 & 5) have the same feature values, so I want to group each pair together into a cluster. Ultimately, I want to find all the stock ids belonging to each cluster.
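To make the example reproducible, df2 can be rebuilt from the table above roughly as follows. This is only a minimal sketch: the helper name `values` is made up for illustration, and I am assuming the real dataframe has the same two columns and the same row order per stock.

```python
import pandas as pd

column = 'log_target_vol_corr_32_clusters_stnd'

# Toy data copied from the table above: three observations per stock.
# `values` is a hypothetical helper dict, not part of the real pipeline.
values = {
    1: [0.4, 0.8, 0.7],
    2: [0.3, 0.4, 0.0],
    3: [0.4, 0.8, 0.7],
    4: [0.9, 0.9, 0.1],
    5: [0.9, 0.9, 0.1],
}

# Stack the per-stock series vertically, one row per observation.
df2 = pd.DataFrame(
    [(stock_id, v) for stock_id, series in values.items() for v in series],
    columns=['stock_id', column],
)
```

With this data, the grouping I expect is stocks 1 and 3 together, stocks 4 and 5 together, and stock 2 on its own.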
## find stock ids of clusters having same feature values
import numpy as np

column = 'log_target_vol_corr_32_clusters_stnd'
remaining_stocks = df2['stock_id'].unique().astype(int)
clusters = {}
for s in remaining_stocks:
    print(s)
    clusters[s] = []
    a1 = df2[df2['stock_id'] == s][column]
    remaining_stocks = np.delete(remaining_stocks, np.where(remaining_stocks == s))
    for s1 in remaining_stocks:
        a2 = df2[df2['stock_id'] == s1][column]
        if np.array_equal(a1, a2):
            print(s1)
            remaining_stocks = np.delete(remaining_stocks, np.where(remaining_stocks == s1))
            clusters[s].append(s1)
    print(remaining_stocks)
When I run the code above, I seem to get more clusters than actually exist in the dataframe.
Could you please explain what the error in this code is?