Python pandas: Assign values in column based on predecessor values

65 Views Asked by At

I am new to Python, and learning it because I would like to be able to use it in KNIME (www.knime.com). KNIME is a data pipeliner/transformer (similar to Talend, Alteryx or Azure Data Factory), and the data table objects it uses can be easily converted to pandas objects. Since some time ago, KNIME comes with an integrated Python environment and can execute Python code.

So what is my question?

I have a data table that I would like to perform a specific operation on, which I have in KNIME only be able to do by using Java code, and then only by bending the rules of what Java is allowed to do in there. With Python, this operation should hopefully be a bit more straightforward to implement.

In the data table, there is a column that looks more or less like

Arrival
something else
something else
Departure
something else
Arrival
something else
Departure
something else
something else
something else
Arrival

This should be converted to

At Sea
At Port
At Port
At Port
At Sea
At Sea
At Port
At Port
At Sea
At Sea
At Sea
At Sea

The logic is:

  • if the value is Arrival, it becomes At Sea
  • if the value is Departure, it becomes At Port
  • if the value is something else, it becomes
    • At Port if the nearest predecessor that is either Arrival or Departure is Arrival
    • At Sea if the nearest predecessor that is either Arrival or Departure is Departure

From what I have read so far in tutorials and intros to Python and Pandas, you should NEVER use loops to transform data in Pandas objects, since the runtime is much, much faster if you use vector processing and SQL query-like processing. I would have known how to do this with loops, but I am apparently not supposed to do so.

My question is: how would an experienced Pandas programmer go about this? I don't need a final and complete solution, I am more interested in knowing how I should tackle a problem like this in a "pandas" way.

My guess is that it would have something to do with extracting the column to a series, splitting it based on the occurrences of Arrival and Departure, processing each sub-series and then putting all of them together again. But honestly, I don't know how to start.

If you could help set me on the right track, that would be amazing!

Thanks a lot, Jan

I tried solving it with loops, but then got discouraged by reading https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-pandas-dataframe

UPDATE: Copied the code I used for my solution into the original ticket. Maybe someone finds this and can make some suggestions on how to do this in a more "pandas" way.

import knime.scripting.io as knio
import pandas as pd

# Create a pandas data frame with the contents of the KNIME node input table
inputTable = knio.input_tables[0].to_pandas()

# Resetting index for the main table to integer from 0 to n
inputTable.reset_index(drop=True, inplace=True)

# Create a series out of the colummn Event
eventSeries = inputTable.loc[:, "Event"]

# Create a new series with "none" values to use for writing the Sea/Port info into
length = eventSeries.size
seaPortValueSeries = pd.Series("none", index=range(length))
seaPortValueSeries = seaPortValueSeries.astype(str)

# Iterate over the elements of the series, writing a value into a second
# series that contains only "At Sea" and "At Port" as entries.
# For the first iteration, it makes no sense to check predecessors
# or retrieve predecessor values, so a shortened version is used for that iteration
for i in range(length):

if i == 0:
    
        if eventSeries[i] == "Arrival":
            seaPortValueSeries[i] = "At_Sea"
        if eventSeries[i] == "Departure":
            seaPortValueSeries[i] = "In_Port"
    else:

        if eventSeries[i] == "Arrival":
            seaPortValueSeries[i] = "At_Sea"
        elif eventSeries[i] == "Departure":
            seaPortValueSeries[i] = "In_Port"
        elif eventSeries[i-1] == "Arrival":
            seaPortValueSeries[i] = "In_Port"
        elif eventSeries[i-1] == "Departure":
            seaPortValueSeries[i] = "At_Sea"
        else:
            seaPortValueSeries[i] = seaPortValueSeries[i-1]

# Append the second series as an additional column into the data frame
inputTable["Sea_Port"] = seaPortValueSeries

knio.output_tables[0] = knio.Table.from_pandas(inputTable)
0

There are 0 best solutions below