Using Pandas to read and write values from an Excel column with line breaks (text wrapped)

166 Views Asked by At

I have a Dataframe containing a point column with multiple values separated using commas, hyphens, alphabets and other values are text wrapped meaning there are more than two pairs of values(see highlighted rows). I have managed to split the values separated by hyphens, and those separated by alphabets using my python script below:

import os
import pandas as pd

data_path= r'pppppp'
table = 'ddddd.xlsx'

file = os.path.join(data_path, table)

df = pd.read_excel(file) 


point = df.iloc[:, 1].copy()  # Create a copy as a DataFrame

print(point)

def split_column(value):
    # Handle NaN values
    if pd.isna(value):
        return pd.NaT, pd.NaT

    #Remove any '?' and characters after it
    value = str(value).rstrip('...').split('...')[0]

    # Function to split the values based on the provided criteria
    if '-' in value:
        parts = value.split('-')
        left, right = parts[-1], parts[0]
        return right[-7:], left[-7:]
    elif ',' in value:
        parts = value.split(',')
        left, right = parts[-1].strip(), parts[0].strip()
        return right[-7:], left[-7:]
    elif any(c.isalpha() for c in value):
        # Find the index of the first alphabet character
        index = next((i for i, c in enumerate(value) if c.isalpha()), None)
        if index is not None:
            left = str(value)[:index]
            right = str(value)[index + 1:index + 8]
            return left[-7:], right
    else:
        return str(value)[-7:], str(value)[:-7]

# Apply the split_column function to the specified column
points = point.dropna().explode().str.split('\n').explode()
pointseparated = points.apply(split_column).apply(pd.Series)

print(pointseparated)

# Rename the resulting columns
singlepoint = pointseparated.rename(columns={0: 'start', 1:'end'})

#If only one value is found, assign it to start
if singlepoint.shape[1] == 1:
    singlepoint.column = ['start']

#Reset the index of both DataFrames
df.reset_index(drop=True, inplace=True)
singlepoint.reset_index(drop=True, inplace=True)

# Concatenate the original Dataframe and the new columns
results_df = pd.concat([df, singlepoint], axis=1)

# Export the Dataframe to an Excel Worksheet
results_df.to_excel(test.xlsx, index=False)

I generated the start and end columns by splitting the values in the point column as shown in the code above. The challenge I now have is in splitting the point column row values which have line breaks meaning having more than one set of values in them in the point column as shown below. In the example below, I would like to return the start and end values of the point values after the linebreak.

number      point                                         start      end
06-102.1    5103023A5104021A                              5103023    5104021
            5104055…  

06-221      4411020-4411030                               4411020  (4411141)
            4411140A4411141A

19-083      4114057,                                      4114057    4114010
            4114048O4114010O     
                         
19-130      5002038, 5002040, 5002042, 5002043, 5002044   5002038  (5002044)

20-038      4514011, 4714026, 4517019                     4514011  (4517019)

Also I am trying to write all the values in the point column that are separated by a comma, only to the start column and not in the end column as shown above.The last two values in the end column enclosed with brackets() should not be written instead all the point values should be returned on the start column.

How can I add these changes in my script?

0

There are 0 best solutions below