FIlrer csv table to have just 2 columns. Python pandas pd .pd

82 Views Asked by At

i got .csv file with lines like this :

result,table,_start,_stop,_time,_value,_field,_measurement,device
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:35Z,44.61,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:40Z,17.33,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:45Z,41.2,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:51Z,33.49,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:56Z,55.68,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:57Z,55.68,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:13:02Z,25.92,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:13:08Z,5.71,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0

I need to make them look like this:

                   time  value
0  2022-10-24T12:12:35Z  44.61
1  2022-10-24T12:12:40Z  17.33
2  2022-10-24T12:12:45Z  41.20
3  2022-10-24T12:12:51Z  33.49
4  2022-10-24T12:12:56Z  55.68

I will need that for my anomaly detection code so I dont have to manualy delete columns and so on. At least not all of them. I cant do it with the program that works with the mashine that collect wattage info. I tried this but it doeasnt work enough:

df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv')
df['_time'] = pd.to_datetime(df['_time'], format='%Y-%m-%dT%H:%M:%SZ')
df = pd.pivot(df, index = '_time', columns = '_field', values = '_value')
df.interpolate(method='linear') # not neccesary

It gives this output:

            0
9      83.908
10     80.342
11     79.178
12     75.621
13     72.826
...       ...
73522  10.726
73523   5.241
3

There are 3 best solutions below

0
On

Here is the canonical way to project down to a subset of columns in the pandas ecosystem.

df = df[['_time', '_value']]
0
On

You can simply use the keyword argument usecols of pandas.read_csv :

df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv', usecols=["_time", "_value"])

NB: If you need to read the entire data of your (.csv) and only then select a subset of columns, Pandas core developers suggest you to use pandas.DataFrame.loc. Otherwise, by using df = df[subset_of_cols] synthax, the moment you'll start doing some operations on the (new?) sub-dataframe, you'll get a warning :

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

So, in your case you can use :

df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv')
df = df.loc[:, ["_time", "_value"]] #instead of df[["_time", "_value"]]

Another option is pandas.DataFrame.copy,

df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv')
df = df[["_time", "_value"]].copy()
0
On

.read_csv has a usecols parameter to specify which columns you want in the DataFrame.

df = pd.read_csv(f,header=0,usecols=['_time','_value'] )
print(df)

                  _time  _value
0  2022-10-24T12:12:35Z   44.61
1  2022-10-24T12:12:40Z   17.33
2  2022-10-24T12:12:45Z   41.20
3  2022-10-24T12:12:51Z   33.49
4  2022-10-24T12:12:56Z   55.68
5  2022-10-24T12:12:57Z   55.68
6  2022-10-24T12:13:02Z   25.92
7  2022-10-24T12:13:08Z    5.71