Remove non-printable characters from a string

I ran OCR on a PDF image and extracted the text. For some reason, the OCR converted a single space into a double carriage return/line feed.

For example:

"\r\n\r\n"

The following doesn't work; I think my 4 characters are not really a string but 4 non-printable characters.

DocumentData = DocumentData.Replace(@"\r\n\r\n", "");

I only want to replace those 4 non-printable characters with a space when they occur together.

How can this be achieved without too much fuss?

There are 3 best solutions below


Is this what you want?

DocumentData = DocumentData.Replace("\r\n\r\n", " "); // <-- change "" to " ", remove @ char

The problem is the use of "@". Prefixing a string literal with it makes it a verbatim string, so escape sequences such as \r and \n are not processed and the backslashes are taken literally. Just use:

DocumentData = DocumentData.Replace("\r\n\r\n", " ");
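To see why the "@" matters, here is a minimal sketch comparing the two literals. The class name, constant names, and sample text are made up for illustration and are not from the question.

```csharp
using System;

static class EscapeDemo
{
    // Verbatim literal: backslashes are kept as-is, so this is 8 ordinary characters.
    public const string Verbatim = @"\r\n\r\n";

    // Regular literal: escapes are processed, so this is CR LF CR LF (4 characters).
    public const string Escaped = "\r\n\r\n";

    static void Main()
    {
        string text = "Hello\r\n\r\nworld"; // hypothetical sample input

        Console.WriteLine(Verbatim.Length);                     // 8
        Console.WriteLine(Escaped.Length);                      // 4
        Console.WriteLine(text.Replace(Verbatim, " ") == text); // True: nothing matched
        Console.WriteLine(text.Replace(Escaped, " "));          // Hello world
    }
}
```

The verbatim version searches for a literal backslash followed by "r", which never appears in the OCR output, so Replace returns the input unchanged.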

If you want to ensure it doesn't matter what system you (or the sender) are running on, and that you always catch the non-printable characters, I would use regular expressions:

DocumentData = Regex.Replace(DocumentData, @"\s+", " "); // \s already matches \r and \n, so each run of whitespace becomes one space

Edit: I made the expression a touch more robust. It now also checks for runs of extra whitespace and replaces each with a single space, which avoids excessive spacing after the replacement and keeps it specific to this question. My bad.
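For comparison, here is a rough sketch of the two approaches side by side: replacing only the exact four-character sequence, and collapsing every run of whitespace. It uses a plain `\s+` pattern, since `\s` already matches `\r` and `\n`. The class name, method names, and sample text are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class WhitespaceDemo
{
    // Replaces only the exact CR LF CR LF sequence; other whitespace is left alone.
    public static string ReplaceExact(string s) => s.Replace("\r\n\r\n", " ");

    // Collapses every run of whitespace (spaces, tabs, CR, LF) into one space.
    public static string CollapseAll(string s) => Regex.Replace(s, @"\s+", " ");

    static void Main()
    {
        string sample = "one\r\n\r\ntwo\nthree  four"; // hypothetical sample input

        // Only the double CRLF is replaced; the lone \n and the double space remain.
        Console.WriteLine(ReplaceExact(sample));

        // Every whitespace run is normalized.
        Console.WriteLine(CollapseAll(sample)); // one two three four
    }
}
```

Note that the regex version also removes line breaks that may be intentional, such as paragraph breaks, so the exact replacement is the safer choice if only the four-character sequence should change.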