Need to parse a file and create a data structure out of it


We want to parse a file and build a data structure from it to be used later (in Python). The content of the file looks like this:

plan HELLO
   feature A 
       measure X :
          src = "Type, Name"
       endmeasure //X

       measure Y :
          src = "Type, Name"
       endmeasure //Y

       feature Aa
           measure AaX :
              src = "Type, Name"
           endmeasure //AaX

           measure AaY :
              src = "Type, Name"
           endmeasure //AaY
           
           feature Aab
              .....
           endfeature // Aab
         
       endfeature //Aa
 
   endfeature // A
   
   feature B
     ......
   endfeature //B
endplan

plan HOLA
endplan //HOLA

So the file contains one or more plans, and each plan contains one or more features. Each feature contains measures, each of which holds some info (src, type, name), and a feature can itself contain further features.

We need to parse the file and create a data structure with the following shape:

                     plan (HELLO) 
            ------------------------------
             ↓                          ↓ 
          Feature A                  Feature B
  ----------------------------          ↓
   ↓           ↓             ↓           ........
Measure X    Measure Y    Feature Aa
                         ------------------------------
                            ↓           ↓             ↓ 
                       Measure AaX   Measure AaY   Feature Aab
                                                        ↓
                                                        .......

I am trying to parse the file line by line and build a list of lists that would hold plan -> feature -> (measure, feature).


There are 2 best solutions below

trincot On BEST ANSWER

Here is a function that would turn your string into a dictionary:

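def getplans(s):
    # stack[-1] is the dictionary of the block currently being filled;
    # stack[0] is the root that collects all plans.
    stack = [{}]
    for line in s.splitlines():
        if "=" in line:  # leaf: key = "value"
            key, value = line.split("=", 1)
            stack[-1][key.strip()] = value.strip(' "')
        elif line.strip()[:3] == "end":  # endplan / endfeature / endmeasure
            stack.pop()
        elif line.strip():  # opening line such as "feature A" or "measure X :"
            collection, name, *_ = line.split()
            stack.append({})
            # register the new block in its parent, e.g. under "features"
            stack[-2].setdefault(collection + "s", {})[name] = stack[-1]
    return stack[0]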

Here is an example call:

s = """plan HELLO
   feature A 
       measure X :
          src = "Type, Name"
       endmeasure //X

       measure Y :
        src = "Type, Name"
       endmeasure //Y

       feature Aa
           measure AaX :
              src = "Type, Name"
           endmeasure //AaX

           measure AaY :
              src = "Type, Name"
           endmeasure //AaY
           
           feature Aab
                measure Car :
                  src = "Model, Make"
               endmeasure //car
           endfeature // Aab
         
       endfeature //Aa
 
   endfeature // A
   
   feature B
       measure Hotel :
          src = "Stars, Reviews"
       endmeasure //Hotel
    endfeature //B
endplan

plan HOLA
endplan //HOLA
"""

import json
print(json.dumps(getplans(s), indent=4))

The output:

{
    "plans": {
        "HELLO": {
            "features": {
                "A": {
                    "measures": {
                        "X": {
                            "src": "Type, Name"
                        },
                        "Y": {
                            "src": "Type, Name"
                        }
                    },
                    "features": {
                        "Aa": {
                            "measures": {
                                "AaX": {
                                    "src": "Type, Name"
                                },
                                "AaY": {
                                    "src": "Type, Name"
                                }
                            },
                            "features": {
                                "Aab": {
                                    "measures": {
                                        "Car": {
                                            "src": "Model, Make"
                                        }
                                    }
                                }
                            }
                        }
                    }
                },
                "B": {
                    "measures": {
                        "Hotel": {
                            "src": "Stars, Reviews"
                        }
                    }
                }
            }
        },
        "HOLA": {}
    }
}
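
Since the question asks for a structure "to be used later", here is a rough sketch of how the nested dictionary could be walked afterwards. The helper name iter_measures and the small hand-written tree are my own illustration, not part of the question; the tree only mimics the "plans" / "features" / "measures" shape shown above.

```python
def iter_measures(node, path=()):
    """Yield (path, measure_dict) for every measure below node, depth first."""
    # Measures directly inside this block come first.
    for name, measure in node.get("measures", {}).items():
        yield path + (name,), measure
    # Then descend into nested plans and features.
    for key in ("plans", "features"):
        for name, child in node.get(key, {}).items():
            yield from iter_measures(child, path + (name,))


# Hand-written example data in the same shape as the output above.
tree = {
    "plans": {
        "HELLO": {
            "features": {
                "A": {
                    "measures": {"X": {"src": "Type, Name"}},
                    "features": {
                        "Aa": {"measures": {"AaX": {"src": "Type, Name"}}}
                    },
                }
            }
        },
        "HOLA": {},
    }
}

for path, measure in iter_measures(tree):
    print("/".join(path), "->", measure["src"])
```

Each yielded path is a tuple of names from the plan down to the measure, so it can also serve as a lookup key if a flat index is more convenient than the nested form.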

If your input has some other syntax -- not included in your question -- you'll probably need to tune the script further to deal with that.
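
As one possible direction for such tuning, here is a stricter variant sketched under a few assumptions of mine: "//" starts a trailing comment, the ":" after a measure name is optional, every opening line is exactly "keyword name", and an "endX" line must close a block opened with "X". The name parse_plans_strict is hypothetical. It builds the same dictionary shape, but raises ValueError on lines it cannot interpret instead of silently misparsing them.

```python
def parse_plans_strict(text):
    """Parse plan/feature/measure blocks; raise ValueError on malformed input."""
    root = {}
    stack = [(None, root)]  # pairs of (opening keyword, dictionary being filled)
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if "=" in line:  # leaf: key = "value"
            key, value = line.split("=", 1)
            stack[-1][1][key.strip()] = value.strip().strip('"')
            continue
        # Assumption: "//" starts a comment, and a trailing ":" is optional.
        words = line.split("//", 1)[0].strip().rstrip(":").split()
        if not words:
            continue  # comment-only line
        keyword = words[0]
        if keyword.startswith("end"):
            opened = stack[-1][0]
            if opened is None or keyword != "end" + opened:
                raise ValueError(f"line {lineno}: unexpected {keyword!r}")
            stack.pop()
        elif len(words) == 2:
            child = {}
            stack[-1][1].setdefault(keyword + "s", {})[words[1]] = child
            stack.append((keyword, child))
        else:
            raise ValueError(f"line {lineno}: cannot parse {line!r}")
    if len(stack) > 1:
        raise ValueError(f"unclosed {stack[-1][0]!r} block at end of input")
    return root


broken = "plan P\n  feature A\nendplan\n"
try:
    parse_plans_strict(broken)
except ValueError as err:
    print(err)  # line 3: unexpected 'endplan'
```

With this version a placeholder line such as "....." is reported as an error with its line number, which may or may not be what you want for your real files.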

gog On

For quick and dirty parsing, you can use a few regex substitutions, e.g.

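import re

# "plan NAME", "feature NAME", "measure NAME :" -> opening tag
text = re.sub(
    r'(?mx)^ \s* (plan|feature|measure) \s+ (\w+) .*',
    r'<\1 name="\2">',
    text)
# "endplan", "endfeature //A", ... -> closing tag
text = re.sub(
    r'(?mx)^ \s* end (plan|feature|measure) .*',
    r'</\1>',
    text)
# key = value -> <key>value</key>
text = re.sub(
    r'(?mx)^ \s* (\w+) \s*=\s* (.*)',
    r'<\1>\2</\1>',
    text)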

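This converts the text to XML, which you can parse with built-in tools, e.g. xml.etree.ElementTree. Since the file can hold several plans and an XML document needs exactly one root element, wrap the result in a single root element before parsing.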
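
A minimal sketch of that last step, assuming the substitutions behave as described on input like the sample in the question. The function name to_xml_tree and the wrapper element name "plans" are my own choices; the wrapper is only there to give the document a single root. Values keep their surrounding quotes, so they are stripped when read.

```python
import re
import xml.etree.ElementTree as ET


def to_xml_tree(text):
    """Apply the three substitutions, then parse the result under one root."""
    text = re.sub(
        r'(?mx)^ \s* (plan|feature|measure) \s+ (\w+) .*',
        r'<\1 name="\2">',
        text)
    text = re.sub(
        r'(?mx)^ \s* end (plan|feature|measure) .*',
        r'</\1>',
        text)
    text = re.sub(
        r'(?mx)^ \s* (\w+) \s*=\s* (.*)',
        r'<\1>\2</\1>',
        text)
    # Several <plan> elements need a common parent to form valid XML.
    return ET.fromstring("<plans>" + text + "</plans>")


sample = '''plan HELLO
   feature A
       measure X :
          src = "Type, Name"
       endmeasure //X
   endfeature // A
endplan

plan HOLA
endplan //HOLA
'''

root = to_xml_tree(sample)
for plan in root.findall("plan"):
    # List the names of all measures anywhere below this plan.
    print(plan.get("name"), [m.get("name") for m in plan.iter("measure")])

src = root.find("plan[@name='HELLO']/feature/measure/src")
print(src.text.strip('"'))
```

Note that the substitutions do not escape characters such as < or & inside values, so input containing them would need extra handling before it parses as XML.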