Fill null values in simple dataframe with sum of surrounding values

I am looking to compute null values 'inside' a dataframe. Basically, each boundary cell of this dataframe contains a value, and all the interior values are null.

So I want to fill these null values by summing the surrounding 4 cells and dividing by 4, such that the value at any given cell is $h_{i,j} = \frac{1}{4}\left(h_{i-1,j} + h_{i+1,j} + h_{i,j-1} + h_{i,j+1}\right)$.

Col0 Col1 Col2 Col4
100 95 90 85
95 NaN NaN 80
90 NaN NaN 75
85 NaN NaN 70
80 NaN NaN 65
75 70 65 60

I am unsure how to iterate over this dataset and apply the above formula.

My expected output, based on my Excel version of this:

Col0 Col1 Col2 Col4
100 95 90 85
95 90 85 80
90 85 80 75
85 80 75 70
80 75 70 65
75 70 65 60
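
As a quick sanity check, the expected table can be tested against the averaging formula. The sketch below uses plain Python lists; the name `satisfies_average_rule` is made up for illustration and does not come from any library.

```python
# Expected table from above, as a list of rows.
expected = [
    [100, 95, 90, 85],
    [95, 90, 85, 80],
    [90, 85, 80, 75],
    [85, 80, 75, 70],
    [80, 75, 70, 65],
    [75, 70, 65, 60],
]

def satisfies_average_rule(h, tol=1e-9):
    """Return True if every interior cell equals the mean of its four neighbours."""
    for i in range(1, len(h) - 1):
        for j in range(1, len(h[0]) - 1):
            mean = (h[i - 1][j] + h[i + 1][j] + h[i][j - 1] + h[i][j + 1]) / 4
            if abs(h[i][j] - mean) > tol:
                return False
    return True

print(satisfies_average_rule(expected))  # True
```

Every interior cell of the expected table is the average of the cells above, below, left and right of it, so the table is consistent with the formula.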

My initial idea was to use the following loop:

for i in df:
        i.fillna(
(i[:, :, 1:] + i[:, :, :-1] + i[:, :-1, :] + i[:, 1:, :])/4, inplace=True
)

I.e. fill each NaN value with the sum of the four surrounding cells divided by four.

But this doesn't work; it just raises the error 'cannot unpack non-iterable int object'.

Does anyone have an idea of how I can (a) correctly develop a formula to access all surrounding cell values, and (b) actually apply it to calculate these values?

I can do this straightforwardly in Excel, which lets you iterate this type of calculation relatively easily, but I am struggling to transfer the idea to Python.

There is 1 best solution below

Given the input csv as below:

Col0,Col1,Col2,Col4
100,95,90,85
95,NaN,NaN,80
90,NaN,NaN,75
85,NaN,NaN,70
80,NaN,NaN,65
75,70,65,60

And given the output csv as below (this is what the code produces):

Col0,Col1,Col2,Col4
100.0,95.0,90.0,85.0
95.0,95.0,88.3,80.0
90.0,92.5,85.3,75.0
85.0,88.8,81.3,70.0
80.0,79.6,72.7,65.0
75.0,70.0,65.0,60.0

This is the code:

import sys

# Read the CSV, skipping blank lines and stripping whitespace around each field.
with open('input.csv', 'r') as file:
    rows = [[field.strip() for field in line.strip().split(',')]
            for line in file if line.strip()]

header_list = rows[0]

# Convert the data rows to floats, using None for missing ('NaN') cells.
lines2 = [[float(x) if x != 'NaN' else None for x in row] for row in rows[1:]]

# First out-of-range index in each direction.
row_index_exceeded = len(lines2)
column_index_exceeded = len(lines2[0])

def generate_neighbours(row_index, column_index):
    # Return the positions above, below, left and right that lie inside the grid.
    return_list = []
    if row_index - 1 != -1:
        return_list.append([row_index - 1, column_index])
    if row_index + 1 != row_index_exceeded:
        return_list.append([row_index + 1, column_index])
    if column_index - 1 != -1:
        return_list.append([row_index, column_index - 1])
    if column_index + 1 != column_index_exceeded:
        return_list.append([row_index, column_index + 1])
    return return_list

def process_list_of_lists(input_list):
    # Fill the first missing cell (scanning row by row) with the mean of its
    # known neighbours. Return True if a cell was filled, False if none is missing.
    for row_index in range(len(input_list)):
        for column_index in range(len(input_list[row_index])):
            if input_list[row_index][column_index] is None:
                known = [input_list[r][c]
                         for r, c in generate_neighbours(row_index, column_index)
                         if input_list[r][c] is not None]
                if len(known) == 0:
                    print("Critical error. Top-left cells require values.")
                    sys.exit()
                input_list[row_index][column_index] = sum(known) / len(known)
                return True
    return False

# Fill one missing cell per pass until none remain. Cells filled earlier
# count as known neighbours for cells filled later.
while process_list_of_lists(lines2):
    pass

with open('output.csv', 'w') as file:
    file.write(','.join(header_list) + "\n")
    for x in lines2:
        y = [f'{num:.1f}' for num in x]
        file.write(','.join(y) + "\n")

print("Execution complete. Check output csv file.")

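This differs from your sample output because I am not sure of the logic behind it. The code above scans row by row and fills each NaN with the average of whichever neighbouring cells already hold a value.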
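
If the intended logic is the formula from the question applied over and over until the values stop changing (roughly what iterative calculation in a spreadsheet does), a minimal sketch in plain Python could look like the following. It updates cells in place, sweep after sweep, which is a Gauss-Seidel-style relaxation. The function name `relax`, the starting guess of 0.0, and the `tol` and `max_sweeps` parameters are assumptions for illustration. The sketch also assumes every missing cell is an interior cell, so all four neighbours exist.

```python
grid = [
    [100, 95, 90, 85],
    [95, None, None, 80],
    [90, None, None, 75],
    [85, None, None, 70],
    [80, None, None, 65],
    [75, 70, 65, 60],
]

def relax(grid, tol=1e-9, max_sweeps=10_000):
    """Repeatedly replace each missing cell by the mean of its four neighbours."""
    unknown = [(i, j)
               for i, row in enumerate(grid)
               for j, value in enumerate(row)
               if value is None]
    # Work on a copy: missing cells start at 0.0, known cells stay fixed.
    h = [[0.0 if v is None else float(v) for v in row] for row in grid]
    for _ in range(max_sweeps):
        biggest_change = 0.0
        for i, j in unknown:
            new = (h[i - 1][j] + h[i + 1][j] + h[i][j - 1] + h[i][j + 1]) / 4
            biggest_change = max(biggest_change, abs(new - h[i][j]))
            h[i][j] = new
        # Stop once a full sweep changes no cell by more than tol.
        if biggest_change < tol:
            break
    return h

solved = relax(grid)
for row in solved:
    print(" ".join(f"{v:.1f}" for v in row))
```

For the sample grid this should settle on the table from the question (90, 85 / 85, 80 / 80, 75 / 75, 70 in the interior), because that table already satisfies the averaging formula at every interior cell.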