Postgres regexp_replace: inability to replace source text with first captured group

243 Views Asked by At

Using PostgreSQL, I am unable to design the correct regex pattern to achieve the desired output of an SQL statement that uses regexp_replace.

My source text consists of several scattered blocks of text of the form 'PU*' followed by a date string in the form of 'YYYY-MM'--for example, 'PU*2020-11'. These blocks are surrounded by strings of unpredictable, arbitrary text (including other instances of 'PU*' followed by the above date string format, such as 'PU*2017-07), white space, and line feeds.

My desire is to replace the entire source text with the FIRST instance of the 'YYYY-MM' text pattern. In the above example, the desired output would be '2020-11'.

Currently, my search pattern results in the correct replacement text in place of the first capturing group, but unfortunately, all of the text AFTER the first capturing group also inadvertently appears in the output, which is not the desired output.

Specifically:

Version: postgres (PostgreSQL) 13.0

A more complex example of source text:

First line
Exec committee
PU*2020-08
PU*2019-09--cancelled
PU*2017-10

added by Terranze

My pattern so far:

(\s|\S)*?PU\*(\d{4}-\d{2})(\s|\S*)*

Current SQL statement:

select regexp_replace('First line\nExec committee; PU*2020-08\nPU*2019-09\nPU*2017-10\n\nadded by Terranze\n', '(\s|\S)*?PU\*(\d{4}-\d{2})(\s|\S*)*', '\2') as _regex;

Current output on https://regex101.com/

2020-08

Current output on psql

                              _regex                               
───────────────────────────────────────────────────────────────────
 2020-08\nPU*2019-09--cancelled\nPU*2017-10\n\nadded by Terranze\n
(1 row)

Desired output:

2020-08

Any help appreciated. Thanks--

1

There are 1 best solutions below

5
On BEST ANSWER

How about this expression:

'^.*?PU\*(\d{4}-\d{2}).*$'