I have a .csv file with this error that I want to correct with regular expression, some fields contain line breaks, example:
"abc
de
f 123",123,456
Should be in one line, I want to remove line-breaks only but to keep the text
"abcdef 123",123,456
I tried to isolate within double quotes, but this removes only first line-break:
^(?:"[0-9a-zA-Z])\r?\n(?=",)
(Text can include numbers and other characters, so I tried to include them, I only want to remove line-breaks and keep all other characters, hope it is clear)
It's not possible to capture multiple newlines in a single query if you're using a repeated capture group; the regex engine can only grab the last match. That being said, if you're using powergrep (or some other search-and-replace that can selectively replace capture groups, not the whole match), do you really need a one-liner?
^"(?:[^"\n]|(\n+))*",will look between"and",and match any amount of text that doesn't include a", keeping it within your quoted statement - or, alternatively, will capture the last group of newlines it finds. If your tool can remove/replace text only in the capture group, why not use this regex a few times in a row? It'll leave your typo-free rows alone, but will remove one chunk of whitespace from your erroneous rows each time it's run. (Try it here! Note that this matches all lines, but only captures whitespace for the badly formatted ones)How will you know when you're done? Try using
^(?=.*\n.*)"[^"]*",- it'll match any lines in your csv file that still have newlines, but will ignore properly formatted lines. When this regex returns no matches, you can be confident your file is typo-free. (Try it here!)It's not a very elegant solution, but if you run it enough times, you'll get rid of all the whitespace.