Thank you so much for any advice in advance. I'm trying to calculate slope over windows in a dataset but I need the window size to be variable and dependent on values contained within the dataset. To best demonstrate what I mean, take my dataset (df) which contains genetic data and consists of three columns, "Scaffold", "centiMorgan", and "Position". There are several hundred rows of data in the real dataset and the centiMorgan and Position columns are continuous variables. The toy dataset of only a few lines is represented:
df
Scaffold centiMorgan Position
Scaffold01 0.0 10004
Scaffold01 0.1 10006
Scaffold01 0.5 10008
Scaffold02 1.5 10450
Scaffold02 2.9 11100
Scaffold02 3.0 11102
Scaffold03 3.8 12600
Scaffold04 4.6 12610
I want to calculate the slope of a sliding window in this dataset such that the window corresponds to all the rows in which the Position column are within a certain numerical distance from each other (a set threshold).
For example, in the case of my toy dataset above, I need to be able to calculate the distance between the first entry in the Position column and then the second entry in the Position column. If this value ( the "distance" between positions) is greater than my set threshold (e.g. 20), then the slope of that window is calculated and returned as a list (that will then be written to a text file).
For example, using a set threshold of 20, I would attempt to use a for loop to calculate position distances row by row, until I had a >20 break, storing the values of the centiMorgan and the position of the first row in the window, and then storing the last centiMorgan and last position in the window into the same list.
However, if the "distance" between Position values is less than the set threshold, then the window must continue to loop through the lines in the file until that "distance" value exceeds the set threshold. At that point, all of the rows that were passed over, that exist within that threshold-value window, will be treated as the window and the slope calculated and appended to the list of slopes that is being compiled.
So in the example of this toy dataset, given a set threshold window value of 20, the first three lines would be one window as the Position values (10004, 10006, and 10008) all exist within less than 20 count from each other. Thus the slope would be calculated from the first point and the last point in the window.
Scaffold centiMorgan Position
Scaffold01 0.0 10004
Scaffold01 0.1 10006
Scaffold01 0.5 10008
To calculate slope we use the formula:
m = (Y2 - Y1 / X2 - X1 )
whereas the centiMorgan column corresponds to the Y and the Position corresponds to the X and the first line in the window is the first X1 and Y1 values and the last row in the window contains the X2 and Y2 columns.
However at the break between Scaffold01 position of 10008 and Scaffold02 position of 10450, because this difference exceeds our threshold of 20, we must begin a new window and then loop through the lines again until the next break of >20. So, the following windows would be:
Scaffold02 1.5 10450
Scaffold02 2.9 11100
Scaffold02 3.0 11102
Scaffold03 3.8 12600
Scaffold04 4.6 12610
My first attempt at conducting this analysis is through a pandas approach, using .diff to calculate the distances between position values, and then using the values in the subsequently created column to subset the windows:
import pandas as pd
df = pd.dataframe()
threshold = 20
# Subset position column:
posCol = ['position']
# create a new column 'Dist' in original df with results of .diff
# use .fillna(0) to fill in 0 where there is no previous row
df['Dist'] = posCol.diff().fillna(0)
# use a nested for/while loop to read through every line in df[Dist], the result column from .diff
for dist in df[Dist]:
while dist < threshold
# write entire row to a new dataframe called window1
# extract the first line of window1
# extract the last line of window1
# calculate the slope of window1 using the first and last points
if dist > 20:
break
# Move on to the next window and repeat.
I am still thinking through this analysis and haven't run any code yet, but wanted to check to see if my approach makes sense or is optimal. Any advice or feedback is greatly appreciated, either with pandas or python. Thank you.
IIUC you can use
itertools.groupbyfor the task:Prints: