I have a .txt file that looks as follows:
Name 1
@Name( ) Value WATER WHP
Date Unit Unit Unit
-------------- ---------- ---------- ---------- ----------
Name 1 20081220 2900.00 0.00 3300.00
Name 1 20081221 0.00 0.00 3390.00
Name 1 20081222 2500.00 0.00 2802.00
Name 1 20081223 0.00 0.00 3022.00
Name 1 20081224 0.00 0.00 3022.00
I used the following code to import into python:
import pandas as pd
df = pd.read_csv(r'test_prd.txt', skiprows=6, engine="python", header=None)
df.columns = ['Test']
df.drop(df.tail(1).index, inplace = True) # because of file format
df = df.Test.str.split(expand=True)
df.rename(columns ={0:'Name', 1:'Number', 2:'Date', 3:'Value', 4:'Water', 5:'WHP'}
,inplace=True)
df['Date'] = pd.to_datetime(df['Date']).dt.floor('D').dt.strftime('%Y-%m-%d')
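df['Value'] = df['Value'].astype(float) # str.split() yields strings, so convert before comparing with 0
df['Note'] = df['Value'].apply(lambda x: 'Yes' if x == 0 else '')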
del df['Water']
del df['WHP']
df['Name'] = df['Name'].astype(str) + ' ' + df['Number'].astype(str)
del df['Number']
After using this code the data frame looks like:
Name Date Value Note
0 Name 1 2008-12-20 2900.00
1 Name 1 2008-12-21 0.00 Yes
2 Name 1 2008-12-22 2500.00
3 Name 1 2008-12-23 0.00 Yes
4 Name 1 2008-12-24 0.00 Yes
... ... ... ... ...
78 Name 2009-03-15 0.00 Yes
79 Name 2009-03-16 3000.00
80 Name 2009-03-17 0.00 Yes
... ... ... ... ...
I want to print the periods of time (start date - end date) for which the 'Value' column equals zero, i.e., when 'Note' = Yes. Any other row where the value is non-zero can be removed from the data frame. If there is a standalone value of zero (preceded and followed by a non-zero value), the start and end date would be the same.
The expected output should look like this:
Name Start Date End Date Value Note
1 Name 2008-12-21 2008-12-21 0.00 Yes
2 Name 2008-12-23 2009-03-15 0.00 Yes
3 Name 2009-03-17 *** 0.00 Yes
... ... ... ... ...
I was trying to use a conditional if statement or df.loc but I don't know my way around Python enough to put it together. Any advice would be appreciated.
First, let's import your dataframe using read_csv. Be sure to replace the value of sep with the correct column separator. Here I assume the separator is one or more whitespace characters; if not, please adapt it.
Also be sure to convert the "Date" column to a datetime by using the parse_dates parameter, and check that the "Value" column is of type float.
Now, with df being your dataframe, this snippet should do what you want.
The point here is to use pandas functions to achieve what you want in an efficient way. Don't use loops: if your dataframe is big, they will take a long time to run.
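For the import step, a minimal sketch might look like the following. It assumes the header lines have already been skipped and that the columns are separated by whitespace, so "Name 1" arrives as two fields that are joined afterwards. The in-memory string is a hypothetical stand-in for test_prd.txt, and the sketch calls pd.to_datetime with an explicit format rather than parse_dates so that the compact YYYYMMDD dates are read unambiguously.

```python
import io

import pandas as pd

# Hypothetical stand-in for test_prd.txt: only the data rows from the question.
raw = """\
Name 1 20081220 2900.00 0.00 3300.00
Name 1 20081221 0.00 0.00 3390.00
Name 1 20081222 2500.00 0.00 2802.00
Name 1 20081223 0.00 0.00 3022.00
Name 1 20081224 0.00 0.00 3022.00
"""

df = pd.read_csv(
    io.StringIO(raw),
    sep=r"\s+",            # one or more whitespace characters (assumed separator)
    header=None,
    names=["Name", "Number", "Date", "Value", "Water", "WHP"],
    dtype={"Date": str},   # keep the raw YYYYMMDD text for explicit parsing
)

# Parse the dates explicitly; "Value" is already read as float64.
df["Date"] = pd.to_datetime(df["Date"], format="%Y%m%d")

# Re-join the two halves of the name and drop the columns that are not needed.
df["Name"] = df["Name"] + " " + df["Number"].astype(str)
df = df.drop(columns=["Number", "Water", "WHP"])

print(df.dtypes)
```

Reading the real file would mean passing its path instead of the StringIO object and adding a skiprows value that matches the actual header length.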
I create a "Start" column that marks each row where a series of zero-valued rows begins. I do it by shifting the "Value" column forward by 1 position and comparing each row with the previous one.
The "Start" column then has a True value for each row where an interval should start.
I remove the non-zero "Value" rows.
I use cumsum to sum over the "Start" column. This makes a new column that I can use to group together the rows that belong to the same interval.
I use groupby with apply to join the groups together, creating for each group a single row of a new dataframe that takes the earliest and latest date from the "Date" column.
With the rows you have posted, the final result is:
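A minimal sketch of those steps, run on the five posted rows, could look like the following. It is not necessarily the exact code of this answer: it uses named aggregation with agg rather than apply, and the helper column "Group" is an assumed name.

```python
import pandas as pd

# The five rows posted in the question, already cleaned up.
df = pd.DataFrame({
    "Name": ["Name 1"] * 5,
    "Date": pd.to_datetime([
        "2008-12-20", "2008-12-21", "2008-12-22", "2008-12-23", "2008-12-24",
    ]),
    "Value": [2900.0, 0.0, 2500.0, 0.0, 0.0],
})

is_zero = df["Value"].eq(0)

# A zero row starts an interval when the previous row is not zero
# (the first row has no predecessor, so it is treated as preceded by non-zero).
df["Start"] = is_zero & ~is_zero.shift(1, fill_value=False)

# The running count of starts gives every row its interval number.
df["Group"] = df["Start"].cumsum()

# Keep only the zero rows, then collapse each interval into a single row.
periods = (
    df[is_zero]
    .groupby(["Name", "Group"])
    .agg(**{
        "Start Date": ("Date", "min"),
        "End Date": ("Date", "max"),
        "Value": ("Value", "first"),
    })
    .reset_index()
    .drop(columns="Group")
)
periods["Note"] = "Yes"

print(periods)
```

On these five rows the sketch yields two periods: a standalone zero on 2008-12-21 (start and end date equal) and an interval from 2008-12-23 to 2008-12-24.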