Removing non-breaking space characters in R


I have a data frame with several columns and more than 50,000 observations. Let's call it df1. One of the variables is PLATES (denoted here as "y"), which contains the plate numbers of buses in a city. I want to match this data frame with another one (df2) that also has plate data, and keep only the matching records. While looking at the data in df1, which comes from a CSV file, I realized that for y several observations have symbols before the plate number that correspond to a non-breaking space. How do I get rid of them so that they aren't an issue when I do the matching? Here's some code to help illustrate. Let's say you have 5 plate numbers:

y <- c("0740170", "0740111", "0740119", "0740115", "0740048")

But upon further inspection

view(y)

You see the following

<c2><a0>0740170
<c2><a0>0740111
<c2><a0>0740119
<c2><a0>0740115
<c2><a0>0740048

I tried the following, taken from this post: https://blog.tonytsai.name/blog/2017-12-04-detecting-non-breaking-space-in-r/, but it didn't work

y <- gsub("\u00A0", " ", y, fixed = TRUE)

I would really appreciate your help on how to deal with this issue. Thanks!


There are 2 best solutions below

1
On BEST ANSWER

I'm not quite sure this will help, as I can't recreate your problem and so can't test my answer. But since a non-breaking space is a non-ASCII character, stripping everything outside the printable ASCII range should remove it:

y <- gsub("[^ -~]+", "", y)

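The pattern matches any run of characters outside the range from space to tilde, i.e. all non-ASCII characters (and also control characters such as tabs), and the replacement is the empty string, so they are deleted. Hope this helps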
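
To illustrate the pattern on data shaped like the question's, here is a small sketch. It assumes the plates are character strings with a leading U+00A0, built here with \u escapes so that R marks them as UTF-8; strings read from a CSV in some other encoding may behave differently. The second call shows the side effect of this approach: any other non-ASCII character is dropped too.

```r
# Plate numbers with a leading non-breaking space (U+00A0), as in the question.
# Assumption: the strings are UTF-8 (the \u escape guarantees that here).
y <- c("\u00a00740170", "\u00a00740111")

# Delete every run of characters outside the printable ASCII range (space to tilde).
clean <- gsub("[^ -~]+", "", y)
print(clean)

# Side effect: accented letters are non-ASCII as well, so they disappear.
accented <- gsub("[^ -~]+", "", "Diab\u00e8te")
print(accented)
```

For plate numbers made of digits and plain letters this side effect should not matter, but it does for free text.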

6
On

EDIT 1 This works under R 4.0.3 and 4.1.2 on Windows, but no longer under 4.2.2 or 4.3.1.

The other answer matches any non-ASCII character, but what if you need to keep some non-ASCII characters, e.g. letters with accents? In this situation I wanted to match specifically the non-breaking space that shows up as <c2><a0> in the question. What worked for me was matching \xa0:

test # nbsp between type and II
# [1] "Diabète de type II"
tools::showNonASCII(test) 
# 1: Diab<c3><a8>te de type<c2><a0>II

# other answer
gsub("[^ -~]+", " ", test) # has missing è
# [1] "Diab te de type II"
tools::showNonASCII(gsub("[^ -~]+", " ", test))# no output as no non-ascii chars left

gsub("\xa0+", " ", test)
# [1] "Diabète de type II"
tools::showNonASCII(gsub("\xa0+", " ", test)) # the <c2><a0> nbsp is replaced
# 1: Diab<c3><a8>te de type II

Hat tip to http://www.pmean.com/posts/non-breaking-space/

EDIT 2 This example can be made to work on Windows and R 4.3.1 by also matching the <c2>

test = rawToChar(as.raw(c(0x44, 0x69, 0x61, 0x62, 0xc3, 0xa8, 0x74, 0x65, 0x20, 0x64, 0x65, 0x20, 0x74, 0x79, 0x70, 0x65, 0xc2, 0xa0, 0x49, 0x49)))
tools::showNonASCII(test)
# 1: Diab<c3><a8>te de type<c2><a0>II
tools::showNonASCII(gsub('\xc2\xa0+', '_', test))
# 1: Diab<c3><a8>te de type_II
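
A sketch of a variant that does not depend on matching raw bytes: write the non-breaking space as the Unicode escape \u00a0 and make sure the data is UTF-8 before matching. The second half applies this to the question's use case. The contents of df1 and df2, the column name PLATES in df2, and the route column are made up for illustration. Note that the replacement is the empty string: replacing with a regular space, as in the question's attempt, would still leave a leading space in the plate and could be one reason the matching failed.

```r
# Build the test string with \u escapes so it is marked as UTF-8 on any platform.
test <- "Diab\u00e8te de type\u00a0II"

# Replace only the non-breaking space (U+00A0); the accented letter is untouched.
fixed <- gsub("\u00a0", " ", test, fixed = TRUE)
print(fixed)

# Hypothetical stand-ins for df1 and df2 from the question.
df1 <- data.frame(PLATES = c("\u00a00740170", "\u00a00740111", "\u00a00740119"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(PLATES = c("0740170", "0740119", "0749999"),
                  route = c("A", "B", "C"),
                  stringsAsFactors = FALSE)

# Convert to UTF-8, delete U+00A0 (empty replacement), then trim ordinary whitespace.
df1$PLATES <- trimws(gsub("\u00a0", "", enc2utf8(df1$PLATES), fixed = TRUE))

# merge() performs an inner join by default, so only plates found in both remain.
matched <- merge(df1, df2, by = "PLATES")
print(matched)
```

If the column still shows <c2><a0> afterwards, the CSV may have been read with the wrong encoding; declaring it when reading the file (for example through the encoding or fileEncoding arguments of read.csv) is worth trying before the cleanup step.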