Data Formatting Using Python3


I am trying to reformat some data using Python 3. The formatted data will then be saved and consumed downstream as a CSV.

I have really messy data (500k rows), but I will simplify the format for the question. Let's say we have this data:

data=Global{abc{198},cdf{121},nvm,121}

As we can see, Global is a master group that contains everything inside its curly brackets. Inside Global there are other groups, such as abc and cdf, and a few individual records, nvm and 121.

Now I have to extract the individual groups and records, wrap each of them in the master group, and join the results with a pipe delimiter.

The code I wrote is as follows:

import re

data = "Global{abc{198},cdf{121},nvm,121}"
regex_pattern = "\\b([\\w-]+)\\{([^{}]+)\\}"

def extract_text(input_text):
    result = []
    while '{' in input_text:
        match = re.search(regex_pattern, input_text)
        if match:
            prefix = match.group(1)
            result.append(prefix + '{' + match.group(2) + '}')
            input_text = input_text[:match.start()] + input_text[match.end():]
        else:
            break
    return ' | '.join(result)

result = extract_text(data)
print(result)

This gives me the following result:

abc{198} | cdf{121} | Global{,,nvm,121}

The logic I have used grabs everything inside an innermost pair of curly brackets and prefixes it with the word that comes before the opening bracket.

But my expected output is:

Global{abc{198}} | Global{cdf{121}} | Global{nvm,121}

I am trying to work out the logic here. Any suggestions would be appreciated.

The actual data and the expected output are given below:

raw data:

GLOBAL-VPN-121{GLOBAL-VPN-ALL{AUS-VPN-128{npm_192.168.101.1/24:192.167.101.1/24,npm_121.147.101.1:121.147.101.1},npm_192.168.101.1:192.168.101.1,GLOBAL-VPN-SUB{HK-VPN-128{npm_192.168.101.1/24:192.167.101.1/24,npm_121.147.101.1:121.147.101.1}},npm_192.168.101.1:192.168.101.1}}

Expected Data

GLOBAL-VPN-121{GLOBAL-VPN-ALL{AUS-VPN-128{npm_192.168.101.1/24:192.168.101.1/24,npm_121.147.101.1:121.147.101.1}}} | GLOBAL-VPN-121{GLOBAL-VPN-ALL{npm_192.168.101.1:192.168.101.1}} | GLOBAL-VPN-121{GLOBAL-VPN-ALL{GLOBAL-VPN-SUB{HK-VPN-128{npm_192.168.101.1/24:192.168.101.1/24,npm_121.147.101.1:121.147.101.1}}{}}} | GLOBAL-VPN-121{GLOBAL-VPN-ALL{npm_192.168.101.1:192.168.101.1}}

There is 1 solution below


The task you're attempting involves parsing and restructuring nested data, which can be quite complex. To achieve your desired output, you need a recursive approach that can handle multiple levels of nesting. The challenge is to extract nested groups and individual records, then reconstruct them with the master group appended to each.

Let's break down the steps:

Parse the Nested Structure: We need to recursively parse the nested structure. When we encounter a group (e.g., abc{...}), we extract the group and its contents, then continue parsing the contents.

Reconstruct the Data: After extracting a group or a run of individual records, we wrap it in every enclosing group, up to and including the master group, and format it as required.

Handle Edge Cases: The data may contain varying levels of nesting, so our solution must be robust enough to handle these cases.
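To make the first two steps more concrete, here is a minimal sketch of how the enclosing groups can be tracked while scanning the text one character at a time. The function name group_paths is made up for this illustration, and the sketch assumes the braces in the input are balanced.

```python
def group_paths(text):
    """Return the chain of enclosing group names for every group in text.

    Assumption: braces are balanced; an unmatched '}' would raise IndexError.
    """
    paths, stack, token = [], [], ""
    for ch in text:
        if ch == "{":
            # The token read so far is the name of the group being opened
            stack.append(token.strip())
            paths.append(list(stack))
            token = ""
        elif ch == "}":
            # Leaving the innermost open group
            stack.pop()
            token = ""
        elif ch == ",":
            # A comma ends a plain record; nothing to remember for paths
            token = ""
        else:
            token += ch
    return paths


for path in group_paths("Global{abc{198},cdf{121},nvm,121}"):
    print(" > ".join(path))
# Global
# Global > abc
# Global > cdf
```

Each path is exactly the list of names that has to be wrapped around whatever is found inside that group.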

Here is a Python function that tries to accomplish this:

def split_top_level(text):
    """Split text on commas that are not inside any {...} group."""
    items, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        elif ch == "," and depth == 0:
            items.append(text[start:i])
            start = i + 1
    items.append(text[start:])
    # Drop empty pieces (e.g. from an empty group or a doubled comma)
    return [item for item in items if item.strip()]


def wrap(path, content):
    """Wrap content in every group name of path, outermost name first."""
    for name in reversed(path):
        content = f"{name}{{{content}}}"
    return content


def extract_and_format(data):
    def recurse(text, path):
        results, records = [], []
        for item in split_top_level(text):
            if "{" in item:
                # A group: first emit the plain records collected so far
                if records:
                    results.append(wrap(path, ",".join(records)))
                    records = []
                # Group names may contain hyphens, so split on the first brace
                name, _, rest = item.partition("{")
                if not rest.endswith("}"):
                    raise ValueError(f"Unbalanced braces in group {name!r}")
                # Recursively process inside the group
                results.extend(recurse(rest[:-1], path + [name]))
            else:
                records.append(item)
        # Emit plain records that follow the last group
        if records:
            results.append(wrap(path, ",".join(records)))
        return results

    # The master group is simply the outermost group, so start with an empty path
    return " | ".join(recurse(data, []))

# Example usage
raw_data = "GLOBAL-VPN-121{GLOBAL-VPN-ALL{AUS-VPN-128{npm_192.168.101.1/24:192.167.101.1/24,npm_121.147.101.1:121.147.101.1},npm_192.168.101.1:192.168.101.1,GLOBAL-VPN-SUB{HK-VPN-128{npm_192.168.101.1/24:192.167.101.1/24,npm_121.147.101.1:121.147.101.1}},npm_192.168.101.1:192.168.101.1}}"
formatted_data = extract_and_format(raw_data)
print(formatted_data)

This code uses recursion to navigate through the nested structure and reconstructs the data in the desired format. It should work for the given example and similar structures. However, please test it thoroughly with your actual dataset to ensure it handles all cases correctly.
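Since the real data is described as messy, it may help to set aside rows with broken nesting before formatting them. The following is only a sketch of such a pre-check; brace_problem is a hypothetical helper name, and it looks at nothing except whether the curly brackets pair up.

```python
def brace_problem(row):
    """Return a short description of the first brace problem in row, or None."""
    depth = 0
    for position, ch in enumerate(row):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                # A closing brace with no matching opening brace
                return f"unexpected '}}' at position {position}"
    if depth > 0:
        return f"{depth} group(s) never closed"
    return None


rows = [
    "Global{abc{198},cdf{121},nvm,121}",
    "Global{abc{198},cdf{121}",
    "Global}abc{198}",
]
for row in rows:
    print(row, "->", brace_problem(row) or "ok")
# Global{abc{198},cdf{121},nvm,121} -> ok
# Global{abc{198},cdf{121} -> 1 group(s) never closed
# Global}abc{198} -> unexpected '}' at position 6
```

Rows that fail the check can be written to a separate file for manual review instead of stopping the whole run.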