Converting XML to JSON in Python with separated schema


I am looking to convert incoming XML data to JSON to allow for more efficient processing of the data in Python. The XML is of a non-standard format where the schema is defined above the relevant values section (example below).

I am able to read in the schema correctly but am having an issue with creating the correct nesting of tags within the values part of the XML.

Caveats:

  • It is possible to have multiple values for all blocks other than Block 1
  • Sub-blocks, such as Block 3_1, can be empty and should be represented as an empty list in the JSON

I'm looking to solve the problem in a generalised way (avoiding: for subblock in "Block 3") so it can adapt to small changes in structure/naming convention.

Example XML:

<body>
    <schema>
        <name>Block 1</name>
        <attributes>
            <string>block_1_attribute_1_name</string>
        </attributes>
        <subblocks>
            <block>
                <name>Block 2</name>
                <attributes>
                    <string>block_2_attribute_1_name</string>
                </attributes>
            </block>
            <block>
                <name>Block 3</name>
                <attributes>
                    <string>block_3_attribute_1_name</string>
                </attributes>
                <subblocks>
                    <block>
                        <name>Block 3_1</name>
                        <attributes>
                            <string>block_3_1_attribute_1_name</string>
                        </attributes>
                    </block>
                </subblocks>
            </block>
        </subblocks>
    </schema>
    <profiles>
        <values>
            <string>block_1_attribute_1_value_1</string>
        </values>
        <subblocks>
            <subblock>
                <values>
                    <string>block_2_attribute_1_value_1</string>
                </values>
            </subblock>
            <subblock>
                <values>
                    <string>block_3_attribute_1_value_1</string>
                </values>
                <subblocks>
                    <subblock>
                        <values>
                            <string>block_3_1_attribute_1_value_1</string>
                        </values>
                    </subblock>
                </subblocks>
            </subblock>
            <subblock>
                <values>
                    <string>block_3_attribute_1_value_2</string>
                </values>
                <subblocks>
                    <subblock>
                        <values>
                            <string>block_3_1_attribute_1_value_2</string>
                        </values>
                    </subblock>
                </subblocks>
            </subblock>
            <subblock>
                <values>
                    <string>block_3_attribute_1_value_3</string>
                </values>
                <subblocks>
                    <!-- empty subblock -->
                    <subblock/>  
                </subblocks>
            </subblock>
        </subblocks>
    </profiles>
</body>

Example Output:

{
    "Block 1": {
        "block_1_attribute_1_name": "block_1_attribute_1_value_1",
        "Block 2": [
            {
                "block_2_attribute_1_name": "block_2_attribute_1_value_1"
            }
        ],
        "Block 3": [
            {
                "block_3_attribute_1_name": "block_3_attribute_1_value_1",
                "Block 3_1": [
                    {
                        "block_3_1_attribute_1_name": "block_3_1_attribute_1_value_1"
                    }
                ]
            },
            {
                "block_3_attribute_1_name": "block_3_attribute_1_value_2",
                "Block 3_1": [
                    {
                        "block_3_1_attribute_1_name": "block_3_1_attribute_1_value_2"
                    }
                ]
            },
            {
                "block_3_attribute_1_name": "block_3_attribute_1_value_3",
                "Block 3_1": []
            }
        ]
    }
}

The code I've written so far extracts the schema. It works, but the result is a flat list rather than a nested structure (nesting isn't strictly necessary). I'm not sure where to start with the rest.

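from lxml import etree  # getnext() below is lxml-specific; root should come from etree.fromstring(...)


def get_schema(root):
    """Retrieves the schema from the XML root element.

    Args:
        root (Element): The root element of the XML string (an lxml element).

    Returns:
        list: A list of tuples representing the schema. Each tuple contains
            the name of a given schema block and a list of attribute names
            associated with that element.
    """
    schema = []
    # Every <name> element marks one schema block, at any nesting depth
    for name in root.findall(".//name"):
        # The sibling that follows <name> is the block's <attributes> element
        attribute_names = [elem.text for elem in name.getnext().findall(".//string")]
        schema.append((name.text, attribute_names))
    return schema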

There is 1 answer below.

Answer by Robert Haas:

You can use the package xmltodict (fetch it with pip install xmltodict) to easily convert the XML string to a Python dictionary and then modify its structure if necessary. In a second step, you can convert the modified dictionary to a JSON string if you require it.

There are also other packages like lxml to parse XML strings. They may differ in how well they can handle non-standard formats.
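As a rough sketch of the parser route, here is how the schema could be read recursively with the standard library's xml.etree.ElementTree instead of lxml. The function name parse_schema and the output keys "name", "attributes" and "subblocks" are illustrative choices, not part of any library, and the sample is a trimmed copy of the schema from the question.

```python
import xml.etree.ElementTree as ET


def parse_schema(block):
    """Recursively turn a <schema> or <block> element into a nested dict."""
    return {
        "name": block.findtext("name"),
        # Only the block's own attribute names, not those of nested blocks
        "attributes": [s.text for s in block.findall("attributes/string")],
        # Direct child blocks only; the recursion handles deeper levels
        "subblocks": [parse_schema(b) for b in block.findall("subblocks/block")],
    }


xml_text = """
<body>
    <schema>
        <name>Block 1</name>
        <attributes><string>block_1_attribute_1_name</string></attributes>
        <subblocks>
            <block>
                <name>Block 2</name>
                <attributes><string>block_2_attribute_1_name</string></attributes>
            </block>
            <block>
                <name>Block 3</name>
                <attributes><string>block_3_attribute_1_name</string></attributes>
                <subblocks>
                    <block>
                        <name>Block 3_1</name>
                        <attributes><string>block_3_1_attribute_1_name</string></attributes>
                    </block>
                </subblocks>
            </block>
        </subblocks>
    </schema>
</body>
"""

root = ET.fromstring(xml_text)
schema = parse_schema(root.find("schema"))
print(schema)
```

Because the paths "attributes/string" and "subblocks/block" only look at direct children, each block keeps its own attribute names, and the nesting of the schema is preserved instead of being flattened.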

Here's an example with xmltodict:

import json
from pprint import pprint

import xmltodict

s = """
<body>
    <schema>
        <name>Block 1</name>
        <attributes>
            <string>block_1_attribute_1_name</string>
        </attributes>
        <subblocks>
            <block>
                <name>Block 2</name>
                <attributes>
                    <string>block_2_attribute_1_name</string>
                </attributes>
            </block>
            <block>
                <name>Block 3</name>
                <attributes>
                    <string>block_3_attribute_1_name</string>
                </attributes>
                <subblocks>
                    <block>
                        <name>Block 3_1</name>
                        <attributes>
                            <string>block_3_1_attribute_1_name</string>
                        </attributes>
                    </block>
                </subblocks>
            </block>
        </subblocks>
    </schema>
    <profiles>
        <values>
            <string>block_1_attribute_1_value_1</string>
        </values>
        <subblocks>
            <subblock>
                <values>
                    <string>block_2_attribute_1_value_1</string>
                </values>
            </subblock>
            <subblock>
                <values>
                    <string>block_3_attribute_1_value_1</string>
                </values>
                <subblocks>
                    <subblock>
                        <values>
                            <string>block_3_1_attribute_1_value_1</string>
                        </values>
                    </subblock>
                </subblocks>
            </subblock>
            <subblock>
                <values>
                    <string>block_3_attribute_1_value_2</string>
                </values>
                <subblocks>
                    <subblock>
                        <values>
                            <string>block_3_1_attribute_1_value_2</string>
                        </values>
                    </subblock>
                </subblocks>
            </subblock>
            <subblock>
                <values>
                    <string>block_3_attribute_1_value_3</string>
                </values>
                <subblocks>
                    <!-- empty subblock -->
                    <subblock/>  
                </subblocks>
            </subblock>
        </subblocks>
    </profiles>
</body>
"""

# Parse the XML string into nested dictionaries
d = xmltodict.parse(s)
schema = d['body']['schema']
profiles = d['body']['profiles']

# Serialize each part to a JSON string
schema_json = json.dumps(schema)
profiles_json = json.dumps(profiles)

pprint(schema)
print('='*120)
pprint(schema_json)

print('#'*120)

pprint(profiles)
print('='*120)
pprint(profiles_json)

Output:

{'attributes': {'string': 'block_1_attribute_1_name'},
 'name': 'Block 1',
 'subblocks': {'block': [{'attributes': {'string': 'block_2_attribute_1_name'},
                          'name': 'Block 2'},
                         {'attributes': {'string': 'block_3_attribute_1_name'},
                          'name': 'Block 3',
                          'subblocks': {'block': {'attributes': {'string': 'block_3_1_attribute_1_name'},
                                                  'name': 'Block 3_1'}}}]}}
========================================================================================================================
('{"name": "Block 1", "attributes": {"string": "block_1_attribute_1_name"}, '
 '"subblocks": {"block": [{"name": "Block 2", "attributes": {"string": '
 '"block_2_attribute_1_name"}}, {"name": "Block 3", "attributes": {"string": '
 '"block_3_attribute_1_name"}, "subblocks": {"block": {"name": "Block 3_1", '
 '"attributes": {"string": "block_3_1_attribute_1_name"}}}}]}}')
########################################################################################################################
{'subblocks': {'subblock': [{'values': {'string': 'block_2_attribute_1_value_1'}},
                            {'subblocks': {'subblock': {'values': {'string': 'block_3_1_attribute_1_value_1'}}},
                             'values': {'string': 'block_3_attribute_1_value_1'}},
                            {'subblocks': {'subblock': {'values': {'string': 'block_3_1_attribute_1_value_2'}}},
                             'values': {'string': 'block_3_attribute_1_value_2'}},
                            {'subblocks': {'subblock': None},
                             'values': {'string': 'block_3_attribute_1_value_3'}}]},
 'values': {'string': 'block_1_attribute_1_value_1'}}
========================================================================================================================
('{"values": {"string": "block_1_attribute_1_value_1"}, "subblocks": '
 '{"subblock": [{"values": {"string": "block_2_attribute_1_value_1"}}, '
 '{"values": {"string": "block_3_attribute_1_value_1"}, "subblocks": '
 '{"subblock": {"values": {"string": "block_3_1_attribute_1_value_1"}}}}, '
 '{"values": {"string": "block_3_attribute_1_value_2"}, "subblocks": '
 '{"subblock": {"values": {"string": "block_3_1_attribute_1_value_2"}}}}, '
 '{"values": {"string": "block_3_attribute_1_value_3"}, "subblocks": '
 '{"subblock": null}}]}}')
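To get from these two dictionaries to the nested structure asked for in the question, something along the following lines might work. It is only a sketch under one important assumption: the values section does not say which schema block a given subblock belongs to, so the code guesses by shape (same number of values as attribute names, and the same presence or absence of a subblocks element). If two sibling blocks share the same shape, this rule cannot tell them apart. The helper names as_list, fits and build are made up for this example, and the schema and profiles literals simply repeat the dictionaries printed above so that the snippet runs without xmltodict.

```python
import json


def as_list(node, key):
    """Return node[key] as a list (xmltodict gives None, one item, or a list)."""
    value = (node or {}).get(key)
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def fits(block, instance):
    """Assumed matching rule: same number of values, same presence of 'subblocks'."""
    n_names = len(as_list(block.get("attributes"), "string"))
    n_values = len(as_list(instance.get("values"), "string"))
    return n_names == n_values and ("subblocks" in block) == ("subblocks" in instance)


def build(block, instance):
    """Merge one schema block with one values instance, recursing into children."""
    names = as_list(block.get("attributes"), "string")
    values = as_list(instance.get("values"), "string")
    result = dict(zip(names, values))
    children = as_list(block.get("subblocks"), "block")
    for child in children:
        result[child["name"]] = []  # stays an empty list if no instance matches
    for sub in as_list(instance.get("subblocks"), "subblock"):
        if sub is None:  # an empty <subblock/> inside a longer list
            continue
        child = next((c for c in children if fits(c, sub)), None)
        if child is not None:
            result[child["name"]].append(build(child, sub))
    return result


# Same shape as the xmltodict output shown above
schema = {
    "name": "Block 1",
    "attributes": {"string": "block_1_attribute_1_name"},
    "subblocks": {"block": [
        {"name": "Block 2", "attributes": {"string": "block_2_attribute_1_name"}},
        {"name": "Block 3", "attributes": {"string": "block_3_attribute_1_name"},
         "subblocks": {"block": {"name": "Block 3_1",
                                 "attributes": {"string": "block_3_1_attribute_1_name"}}}},
    ]},
}
profiles = {
    "values": {"string": "block_1_attribute_1_value_1"},
    "subblocks": {"subblock": [
        {"values": {"string": "block_2_attribute_1_value_1"}},
        {"values": {"string": "block_3_attribute_1_value_1"},
         "subblocks": {"subblock": {"values": {"string": "block_3_1_attribute_1_value_1"}}}},
        {"values": {"string": "block_3_attribute_1_value_2"},
         "subblocks": {"subblock": {"values": {"string": "block_3_1_attribute_1_value_2"}}}},
        {"values": {"string": "block_3_attribute_1_value_3"},
         "subblocks": {"subblock": None}},
    ]},
}

result = {schema["name"]: build(schema, profiles)}
print(json.dumps(result, indent=4))
```

For the example data this should give one entry under "Block 2", three under "Block 3", and "Block 3_1": [] for the empty subblock. The as_list helper is there because xmltodict returns a single child as a plain dictionary and an empty element as None, so both cases need to be normalised to lists before iterating.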