Remove special characters from keys of a parsed xml file using xmltodict


I have parsed an xml file using the xmltodict module and the result is stored in a dictionary of dictionaries.

Now I want to remove the special characters @ and # in every key of the dictionary.

def remove_using_json(parse_result):
    data = {}
    data = json.dumps(parse_result)
    #print data
    #for d in data:
    for key, value in data.iterkeys():
        if key[0] == '@':
            data[key]=key.strip("@")
        elif key[0] == '#':
            data[key] =key.strip("#")

There are 4 best solutions below

0
On

This happens because your function doesn't descend into the nested dictionaries. Let's take the sample dict from @Matthew's answer as an example:

d = xmltodict.parse("""
    <root>
        <abc><def>ab</def></abc>
        <abc id="a3">efg</abc>
    </root>
""")

In [29]: d
Out[29]: {'root': {'abc': [{'def': 'ab'}, {'#text': 'efg', '@id': 'a3'}]}}

Your function will find only one key in this dict: root. But you can walk all the items recursively, like this:

# The nested containers might not be plain dicts (xmltodict may return
# OrderedDict, and you could meet defaultdict or other Mapping types),
# so check against the Mapping interface instead of dict.
from collections.abc import Mapping

def transform(element, strip_chars="#@"):
    if isinstance(element, Mapping):
        # Pass strip_chars down so nested levels use the same characters
        return {key.strip(strip_chars): transform(value, strip_chars)
                for key, value in element.items()}
    elif isinstance(element, list):
        return [transform(item, strip_chars) for item in element]
    else:
        return element

In [27]: d1 = transform(d)

In [28]: d, d1
Out[28]: 
({'root': {'abc': [{'def': 'ab'}, {'#text': 'efg', '@id': 'a3'}]}},
 {'root': {'abc': [{'def': 'ab'}, {'id': 'a3', 'text': 'efg'}]}})
3
On

There is no direct way to eliminate those during parsing, as they are used to denote attributes and text nodes, allowing them to be distinguished from elements (if they weren't there, the output would be ambiguous).

For example

xmltodict.parse("""
    <root>
        <abc><def>ab</def></abc>
        <abc id="a3">efg</abc>
    </root>
""")

produces a nested ordered dict with the structure

{'root': {'abc': [ 
                     {'def': 'ab'},
                     {'@id': 'a3', '#text': 'efg'}
                 ]
         }
}

The @ symbol tells me that the @id is an attribute. Without that symbol, I couldn't tell if it was an attribute or an element named id. Similarly, the # symbol tells me that #text is the text value of that element. Without that I couldn't tell if it was the element's text, or if it was an element named text.

However, when dealing with the keys, you can strip them using ky[1:] where ky is the key.

For example, if I assign the above parsed output to the variable doc, I can do the following (see note 1 below):

for abcelem in doc["root"]["abc"]:
    for ky in abcelem:
        if ky[0]=="@": print("Attribute:",ky[1:])
        elif ky[0]=="#": print("Text Content")
        else: print("Element:",ky)

Which would output

Element: def
Attribute: id
Text Content

where I have stripped the @ symbol from the attribute.


If you really want to remove these symbols completely from the parsed value, you can write a recursive function to do this.

def remover(x):
    if isinstance(x, list):
        return [remover(y) for y in x]
    elif isinstance(x, dict):  # dict also covers OrderedDict
        # Iterate over a snapshot of the keys because x is modified in the loop
        for ky in list(x.keys()):
            if ky[0] in ["@", "#"]:
                x[ky[1:]] = remover(x[ky])
                del x[ky]
            else:
                x[ky] = remover(x[ky])
        return x
    else:
        return x

Thus, in the above, remover(doc) removes all of the @ and # symbols from the keys. Data will be silently overwritten if any node has an element and an attribute with the same name, or an element or attribute named text, which is precisely why those symbols are there in the first place. The function modifies the object in place, so if the original needs to be preserved, make a copy with copy.deepcopy and pass the copy to the function.
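
The collision risk described above can be made explicit. The following is a minimal sketch, not part of the original answer: strip_prefixes is a hypothetical helper that builds a new structure instead of modifying its input, and raises ValueError when two keys would collapse to the same name. It assumes all keys are strings and that only a single leading @ or # should be dropped.

```python
def strip_prefixes(node, prefixes=("@", "#")):
    """Return a copy of node with one leading prefix character removed from each key.

    Hypothetical helper for illustration; raises ValueError on a key collision.
    """
    if isinstance(node, list):
        return [strip_prefixes(item, prefixes) for item in node]
    if isinstance(node, dict):  # also matches OrderedDict
        cleaned = {}
        for key, value in node.items():
            # str.startswith accepts a tuple of candidate prefixes
            new_key = key[1:] if key.startswith(prefixes) else key
            if new_key in cleaned:
                raise ValueError("key collision on %r" % new_key)
            cleaned[new_key] = strip_prefixes(value, prefixes)
        return cleaned
    return node


doc = {"root": {"abc": [{"def": "ab"}, {"@id": "a3", "#text": "efg"}]}}
print(strip_prefixes(doc))

# An attribute and an element that share a name cannot both survive
try:
    strip_prefixes({"@id": "a3", "id": "element value"})
except ValueError as exc:
    print("refused:", exc)
```

Because the helper returns a new structure, the original parsed document keeps its prefixes and no deepcopy is needed.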


Note 1: This uses Python 3 syntax, where print is a function. To make this example work in Python 2.6 or 2.7, first issue from __future__ import print_function, or change the print function calls to statements like print "Attribute: "+ky[1:].

1
On

You shouldn't remove these special characters from your response.

There is an option not to get them in your response at all. ;-)

result = xmltodict.parse(response, attr_prefix='@', cdata_key='#text')

These are the default options, but you may set attr_prefix="" to get rid of @ symbols, and change cdata_key in the same way.

Furthermore, you may also add dict_constructor=dict to get plain dictionaries in your parsed response instead of OrderedDicts, if you don't want to convert it back to XML with xmltodict.unparse().

0
On

To remove @ from the keys of the dictionary, pass attr_prefix='' as an argument to xmltodict.parse(). To remove # from the keys, pass cdata_key='text' to the same function.

Text values for nodes can be specified with the cdata_key key in the python dict, while node properties can be specified with the attr_prefix prefixed to the key name in the python dict. The default value for attr_prefix is @ and the default value for cdata_key is #text.
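
Putting both options together, a minimal sketch might look like the following. It assumes the third-party xmltodict package is installed; the sample XML is borrowed from the other answers, and the variable names are only illustrative.

```python
import xmltodict  # third-party package: pip install xmltodict

xml_text = '<root><abc><def>ab</def></abc><abc id="a3">efg</abc></root>'

# attr_prefix="" drops the "@" in front of attribute names;
# cdata_key="text" stores element text under "text" instead of "#text".
result = xmltodict.parse(xml_text, attr_prefix="", cdata_key="text")

second = result["root"]["abc"][1]
print(second["id"])    # attribute value of the second <abc>
print(second["text"])  # text content of the second <abc>
```

With these settings, an attribute and a child element that share a name can no longer be told apart in the parsed result, so this is best used when you know the document has no such clashes.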
