Regex for newline in XML


I've been trying hard to figure this out, with no luck. I'm trying to parse this XML data in Postgres:

<map>
  <entry>
    <string>id</string>
    <string>555</string>
  </entry>
  <entry>
    <string>label</string>
    <string>Need This Value</string>
  </entry>
  <entry>
    <string>key</string>
    <string>748</string>
  </entry>
</map>

I'm trying to get the value in the string element right after <string>label</string>. Note that the Postgres version I'm working with does not have XML (libxml) support installed, so the XML functions are not available.

I have tried many variations of:

substring(xmlStringData from E'<string>label</string>\\n<string>(.*?)</string>')

but with no luck.


Answer by cYn (best answer):

So I seem to have figured it out. I just needed to account for the spaces after the newline. The solution was:

substring(event_data from E'<string>label</string>\\n\\s*?<string>(.*?)</string>')
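
To see why the bare \n failed, here is a minimal sketch that runs both patterns against a fragment of the sample data. The CTE name t and column x are illustrative stand-ins for the real table and the event_data column.

```sql
-- Illustrative only: "t" and "x" stand in for the real table and column.
WITH t(x) AS (
   SELECT '<entry>
    <string>label</string>
    <string>Need This Value</string>
  </entry>'::text
)
SELECT -- newline only: no match, because indentation follows the newline
       substring(x from E'<string>label</string>\\n<string>(.*?)</string>') AS newline_only
       -- newline plus optional whitespace: \s*? absorbs the indentation
     , substring(x from E'<string>label</string>\\n\\s*?<string>(.*?)</string>') AS newline_and_spaces
FROM   t;
```

The first column should come back NULL and the second as Need This Value. If the data ever uses Windows line endings (\r\n), the literal \n no longer sits directly after </string>, so matching all whitespace with \s alone is the more forgiving choice.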

Answer by Federico Piazza:

If your <entry> list always has the same layout, you can use the following regex and read the capturing group of the 4th match to get the content:

<string>(.*?)<\/string>

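In Postgres itself, the "4th match" idea could be sketched with regexp_matches() and the 'g' flag, which returns one row per match. This assumes the rows come back in match order (true in practice, although no ORDER BY enforces it) and that standard_conforming_strings is on; t and x are illustrative names.

```sql
-- Illustrative only: "t" and "x" stand in for the real table and column.
WITH t(x) AS (SELECT '<map>
  <entry>
    <string>id</string>
    <string>555</string>
  </entry>
  <entry>
    <string>label</string>
    <string>Need This Value</string>
  </entry>
  <entry>
    <string>key</string>
    <string>748</string>
  </entry>
</map>'::text
)
-- One row per <string> element; [1] picks the single capturing group.
SELECT (regexp_matches(x, '<string>(.*?)</string>', 'g'))[1] AS val
FROM   t
OFFSET 3   -- skip the first three matches: id, 555, label
LIMIT  1;  -- keep the fourth
```

This only works while the label entry stays in second position. If entries can be reordered, anchoring on <string>label</string> is the safer approach.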

On the other hand, if you want to read the value from the first capturing group, you can use the following regex:

<string>id<\/string>|<string>\d+<\/string>|<string>label<\/string>|<string>(.*?)<\/string>


Answer by Erwin Brandstetter:

xpath() would be the right tool here. Because, you know ...

While stuck with your unfortunate situation, this would do the trick:

WITH t(x) AS (SELECT '<map>
  <entry>
    <string>id</string>
    <string>555</string>
  </entry>
  <entry>
    <string>label</string>
    <string>Need This Value</string>
  </entry>
  <entry>
    <string>key</string>
    <string>748</string>
  </entry>
</map>'::text
)
SELECT substring(x, '<string>label</string>[\s]*?<string>(.*?)</string>')
FROM  t

Returns:

substring
---------------
Need This Value

regexp explained:

<string>label</string> .. finds the position
[\s] .. whitespace (including \n and \r)
*? .. zero or more of the above, "non-greedy", so skip whitespace up until ...
<string> .. the next string element
(.*?) .. capturing parentheses, any characters, non-greedy
</string> .. up to the next appearance of the end tag

This is halfway reliable, unless you throw in unconventional XML formatting - which is why you should use an XML parser to begin with ...
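
As a variation on the pattern explained above, the sketch below swaps the non-greedy (.*?) for the negated class [^<]*, so the capture cannot run past the next tag regardless of greediness rules. This is an alternative formulation, not the query from the answer. The three test rows are made up to exercise different whitespace, and the pattern assumes standard_conforming_strings is on.

```sql
-- Hypothetical test rows: LF plus indentation, CRLF, and no whitespace at all.
WITH t(x) AS (
   VALUES (E'<string>label</string>\n    <string>Need This Value</string>'::text)
        , (E'<string>label</string>\r\n<string>Other Value</string>')
        , ('<string>label</string><string>No Whitespace</string>')
)
-- \s* skips any whitespace between the tags; [^<]* stops at the next "<".
SELECT substring(x, '<string>label</string>\s*<string>([^<]*)</string>') AS label
FROM   t;
```

All three rows should yield their value. Like any regex approach, this still does not decode entities such as &lt; or handle CDATA sections, which a real XML parser would.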