Remove extra delimiters created by using characters within a delimited file with Notepad ++

137 Views Asked by At

I have a delimited file exported with pipe as the delimiter by someone who didn't use a text qualifier and the text contains pipe characters. There should be 13 columns.

Additional columns have been created in several lines. Some 3, some 4. So, it's not the same number of additional columns created.

The delimiters are fine up till the text/column after the 10th pipe and the text/columns after the last two pipes. I want to replace the extra pipe characters after the 10th pipe and before the last two pipes with spaces regardless of whether it is 1 or more pipes in one go using a Regex pattern.

I have tried this:

^([^|]*)(?!(?:\|[^|\r\n]*){12}$)((?:\|[^|\r\n]*){10})(?:(\|)([^|\r\n]*))((?:\|[^|\r\n]*){2})$ 

replace with $1$2$4$5.

This only worked for where there were 14 columns. Lines with a larger number of columns are not matched.

Here's a line with more than 14 columns:

5070599047|SA|03/31/2023 00:00:00|04/03/2023 00:00:00|23121|152|65642.950000|0|0|XP-OFS-222761665|ZIB|BRH|NPAU|31-03-2023|749542|30271246.390000|342

The result I need:

5070599047|SA|03/31/2023 00:00:00|04/03/2023 00:00:00|23121|152|65642.950000|0|0|XP-OFS-222761665|ZIB BRH NPAU 31-03-2023 749542|30271246.390000|342

I need to remove all pipes after ZIB and before |30271246.390000|342 and keep the text in that space. I also need a pattern that will work for lines with more than 13 columns regardless of the additional number of columns to reduce them all to 13 columns.

1

There are 1 best solutions below

4
Cary Swoveland On

The first 10 and last two pipes in each line are to be preserved. Each of the four others is to be replaced with a space. Each match of the following regular expression (with g and m flags1 set) can be replaced with a space.

(?:^(?:[^|\r\n]*\|){10}[^|\r\n]*\K\||(?!^)\G[^|\r\n]*\K\|)(?=.*(?:\|[^|\r\n]*){2}$)

If there are 12 or fewer pipes no substitutions are made.

Demo


Notepad++ uses the Boost regex engine, which differs only slightly from the PCRE engine, which is the one used at the "Demo" link. In particular, both support \K and \G. \K causes previously-matched tokens to be discarded from the match that is returned and resets the start of match to the current location in the string. \G sets the current position to be at the end of the previous match.


The expression can be broken down as follows.2

(?:             # begin non-capture group
  ^             # match the beginning of the string
  (?:           # begin a non-capture group
    [^|\r\n]*\| # match >= 0 characters other than '|', '\r' and '\n', then '|'
  )             # end the non-capture group
  {10}          # execute preceding non-capture group 10 times
  [^|\r\n]*     # match >= 0 characters other than '|', '\r' and '\n'
  \K            # discard previously-consumed tokens and reset start of match 
  \|            # match '|' 
  |             # or
  (?!^)         # negative lookahead asserts next character is not at the start of the string 
  \G            # sets the current position to be at the end of the previous match
  [^|\r\n]*     # match >= 0 characters other than '|', '\r' and '\n'
  \K            # discard previously-consumed tokens and reset start of match  
  \|            # match '|'
)               # end non-capture group
(?=             # begin the positive lookahead
  .*            # match >= 0 characters other than line terminators
  (?:           # begin a non-capture group
    \|[^|\r\n]* # match '|' then >= 0 characters other than '|', '\r' and '\n'
  )             # end the non-capture group
  {2}           # execute preceding non-capture group twice
  $             # match the end of the string
)               # end the positive lookahead

1. g causes the regex engine to make multiple matches, if possible. m causes ^ and $ to represent end-of-line (rather than end-of-string) anchors, respectively.

2. Also, by hovering over each part of the expression at the link (the cursor, not you, yourself) one can obtain an explanation of its function.