UltraEdit/Notepad - XML Remove nodes with empty properties


I'm currently facing an issue with software I'm working on. It receives several XML files from an external program, and we need to process them. The problem is that those XML files contain a lot of nodes that are totally useless and make the files very large, so our program is very slow to process each one. This should be changed in the future, and I'd like to prove that removing those nodes would greatly improve our processing time.

As a first step I'd like to do this manually, using a sample XML file and applying a regex to remove all the nodes whose value property is empty. This is the regex I'm using now; with the Replace function in Notepad++ I'm able to remove those rows and then remove the empty lines:

<.*(\s\w+?[^=]*?="[^"]*?")*?\s+?value="[""]*?".*?>

Example

<TEST_NODE value="1"/>
<TEST_NODE value=""/>
<TEST_NODE value="0"/>

In my case nodes can be named differently and can have different properties, but the only ones I care about are those that contain something in the value property. Therefore, in this case I should remove the second row.

This seems to work fine. However, with very large files (10 MB) the Notepad++ replace function seems to have issues: it stops working properly and breaks a lot of tags...

I've tried another program called UltraEdit, but I guess the syntax is different there: I can use regular expressions, but I have to select one of these engines: Perl, Unix, or UltraEdit. Only with Perl am I able to do this replacement, but there too it does not work for big files, and I get the following error:

The complexity of matching the expression has exceeded available resources..

Can anyone help me out with this? Unfortunately I'm not that good with regex, and I'm not sure whether the above expression is good or bad.


There are 3 best solutions below

1
On BEST ANSWER

Try this regular expression in Notepad++:

<[^<]+value=""[^>]*>
0
On

You're using the wrong tool for the job. If you're going to be manipulating XML then you need to add XSLT and/or XQuery to your tool kit. Using regular expressions for the job is slow and error-prone.

For example, here are just a few of the bugs in the answer that you accepted:

  • Elements that use single quotes (value='') won't be matched
  • Elements with whitespace around the equals sign won't be matched
  • Elements with an attribute whose name ends in value (e.g. xvalue="") will be matched
  • value="" will be matched inside comment and CDATA nodes
  • value="" can be matched inside text nodes: <x>value=""</x>
  • Elements split across multiple lines won't be matched (I suspect)
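Several of the points above can be checked quickly outside the editor. The following is only a sketch: it uses Python's re module rather than the Notepad++ engine (which should behave the same for a pattern this simple, though that is an assumption), and the sample strings are made up for illustration.

```python
import re

# Pattern from the accepted answer, copied verbatim.
ACCEPTED = re.compile(r'<[^<]+value=""[^>]*>')

# Hypothetical inputs, one per bullet that is easy to verify.
samples = {
    "single quotes": "<a value=''/>",
    "spaces around =": '<a value = ""/>',
    "name ending in value": '<a xvalue=""/>',
    "inside a comment": '<!-- value="" -->',
    "inside a text node": '<x>value=""</x>',
}

# True means the pattern found a match somewhere in the sample.
results = {label: bool(ACCEPTED.search(text)) for label, text in samples.items()}

for label, matched in results.items():
    print(f"{label}: {'matched' if matched else 'not matched'}")
```

The first two samples are elements that should be removed but are missed; the last three are false positives that would be deleted by mistake.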

In XSLT 3.0 this is simply:

<xsl:transform version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:mode on-no-match="shallow-copy"/>
 <xsl:template match="*[@value='']"/>
</xsl:transform>
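If no XSLT 3.0 processor is at hand, the same idea (parse the XML, then drop elements by attribute value) can be sketched with Python's standard library. This is not the stylesheet above, only a rough equivalent: the function name strip_empty_value and the ROOT wrapper element are made up for the example, and unlike the stylesheet it never removes the root element itself.

```python
import xml.etree.ElementTree as ET

def strip_empty_value(root):
    """Remove every descendant element whose value attribute is exactly ''."""
    # Snapshot the elements first so removals do not disturb the traversal.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.get("value") == "":
                parent.remove(child)
    return root

# Hypothetical document built from the sample in the question.
doc = (
    "<ROOT>"
    '<TEST_NODE value="1"/>'
    "<TEST_NODE value=''/>"
    '<TEST_NODE value="0"/>'
    "</ROOT>"
)

root = strip_empty_value(ET.fromstring(doc))
remaining = [child.get("value") for child in root]
print(ET.tostring(root, encoding="unicode"))
```

Because the parser handles quoting, whitespace, comments, and CDATA, the single-quoted value='' in the sample is removed just like value="" would be.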
0
On

Try this:

<(?=[^><]*?value\s*=\s*"")[^><]*>

Replace with nothing.

The failure might be a case of catastrophic backtracking, caused by too many quantifiers applied to overly broad character classes such as the dot (.).

The quantifiers in this answer are applied only to the class [^><] (anything except < or >), which should stop the expression from backtracking across XML tags.
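As a quick sanity check, here is a sketch that applies this expression to the sample from the question. It uses Python's re module, not the Notepad++ or UltraEdit engine, so treat it as an illustration of the pattern rather than of either editor; the blank-line cleanup at the end stands in for the manual step described in the question.

```python
import re

# Expression from this answer, copied verbatim.
LOOKAHEAD = re.compile(r'<(?=[^><]*?value\s*=\s*"")[^><]*>')

# Sample rows from the question.
sample = (
    '<TEST_NODE value="1"/>\n'
    '<TEST_NODE value=""/>\n'
    '<TEST_NODE value="0"/>\n'
)

# "Replace with nothing", then drop the blank lines left behind.
stripped = LOOKAHEAD.sub("", sample)
cleaned = "\n".join(line for line in stripped.splitlines() if line.strip())
print(cleaned)
```

Since neither the lookahead nor the main body can consume a < or >, each match attempt is confined to a single tag, so the work per tag stays small even on large files.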