extract comma separated strings

I have a data frame as shown below. This is a sample of the data with uniform-looking patterns, but the whole data set is not very uniform:

locationid      address     
1073744023  525 East 68th Street, New York, NY      10065, USA
1073744022  270 Park Avenue, New York, NY 10017, USA      
1073744025  Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA 
1073744024  1251 Avenue of the Americas, New York, NY 10020, USA
1073744021  1301 Avenue of the Americas, New York, NY 10019, USA 
1073744026  44 West 45th Street, New York, NY 10036, USA

I need to find the city and country name from this address. I tried the following:

1) strsplit: this gives me a list, but I cannot access the last or third-to-last element from it.

2) Regular expressions: finding the country is easy

str_sub(str_extract(address, "\\d{5},\\s.*"),8,11)

but for city

str_sub(str_extract(address, ",\\s.+,\\s.+\\d{5}"),3,comma_pos)

I cannot find comma_pos, as that leads me back to the same problem. I believe there is a more efficient way to solve this using either of the above approaches.

There are 6 best solutions below

1
On BEST ANSWER

Split the data

ss <- strsplit(data, ",")

Then

n <- sapply(ss, length)

will give the number of elements (so you can work backward). Then

mapply("[[", ss, n)

gives you the last element. Or you could do

sapply(ss,tail,1)

to get the last element.

To get the second-to-last element (or, more generally, the k-th element from the end) you need

sapply(ss,function(x) tail(x,2)[1])
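
As a minimal sketch of how these pieces fit the question's data, the following applies the same strsplit and tail idea to three of the sample addresses. Splitting on a comma plus trailing whitespace (rather than a bare comma) and the variable names `address`, `city` and `country` are assumptions added for illustration.

```r
# Sample addresses copied from the question
address <- c(
  "525 East 68th Street, New York, NY      10065, USA",
  "Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA",
  "44 West 45th Street, New York, NY 10036, USA"
)

# Split on a comma plus any following whitespace, so pieces have no leading space
ss <- strsplit(address, ",\\s*")

# Last piece of each address: the country
country <- sapply(ss, tail, 1)

# Third piece counted from the end: the city
city <- sapply(ss, function(x) tail(x, 3)[1])

data.frame(city, country)
```

Counting from the end keeps the result stable when an address has an extra leading field, as the Rockefeller Center row does.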
0
On

How about this pattern :

,\s(?<city>[^,]+?),\s(?<shortCity>[^,]+?)(?i:\d{5},)(?<country>\s.*)

This pattern captures these three groups:

  1. "group": "city", "value": "New York"
  2. "group": "shortCity", "value": "NY "
  3. "group": "country", "value": " USA"
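
One way to try this pattern from R is sketched below, assuming base R's `regexpr` with `perl = TRUE`, which exposes named groups through the `capture.start`, `capture.length` and `capture.names` attributes. The helper `get_group` is a hypothetical name introduced here, and `trimws` is added to drop the padding visible in the group values above.

```r
address <- c(
  "525 East 68th Street, New York, NY      10065, USA",
  "Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA",
  "44 West 45th Street, New York, NY 10036, USA"
)

# The pattern from above, with backslashes doubled for an R string
pattern <- ",\\s(?<city>[^,]+?),\\s(?<shortCity>[^,]+?)(?i:\\d{5},)(?<country>\\s.*)"

m <- regexpr(pattern, address, perl = TRUE)
starts <- attr(m, "capture.start")
lens <- attr(m, "capture.length")
group_names <- attr(m, "capture.names")

# Hypothetical helper: pull one named group out of every address
get_group <- function(name) {
  j <- match(name, group_names)
  trimws(substring(address, starts[, j], starts[, j] + lens[, j] - 1))
}

city <- get_group("city")
state <- get_group("shortCity")
country <- get_group("country")
```

Because the pattern is not anchored at the start, the regex engine skips past the extra "Rockefeller Center" field on its own.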
0
On

Here's an approach using the tidyr package. Personally, I'd just split the whole thing into its various elements using tidyr's extract. This uses regex, but in a different way than you asked for.

library(tidyr)

extract(x, address, c("address", "city", "state", "zip", "country"), 
    "([^,]+),\\s([^,]+),\\s+([A-Z]+)\\s+(\\d+),\\s+([A-Z]+)")

##   locationid                       address     city state   zip country
## 1 1073744023          525 East 68th Street New York    NY 10065     USA
## 2 1073744022               270 Park Avenue New York    NY 10017     USA
## 3 1073744025          50 Rockefeller Plaza New York    NY 10020     USA
## 4 1073744024   1251 Avenue of the Americas New York    NY 10020     USA
## 5 1073744021   1301 Avenue of the Americas New York    NY 10019     USA
## 6 1073744026           44 West 45th Street New York    NY 10036     USA

Here's a visual explanation of the regular expression, taken from http://www.regexper.com/ (image not included).

1
On

I think you want something like this.

> x <- "1073744026 44 West 45th Street, New York, NY 10036, USA"
> regmatches(x, gregexpr('^[^,]+, *\\K[^,]+', x, perl=T))[[1]]
[1] "New York"
> regmatches(x, gregexpr('^[^,]+, *[^,]+, *[^,]+, *\\K[^\n,]+', x, perl=T))[[1]]
[1] "USA"

Regex explanation:

  • ^ Asserts that we are at the start of the string.
  • [^,]+ Matches any character other than , one or more times. Change it to [^,]* if your data frame contains empty fields.
  • , Matches a literal ,
  • <space>* Matches zero or more spaces.
  • \K Discards the previously matched characters from the reported match. Only the characters matched by the pattern following \K are returned.
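
The sketch below applies the first pattern to a vector and contrasts it with a variant that counts fields from the end of the string. The end-anchored `sub` calls are an assumption added here, not part of the answer above; they are meant for rows that carry an extra leading field.

```r
address <- c(
  "44 West 45th Street, New York, NY 10036, USA",
  "Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA"
)

# Second field counted from the left, using \K as in the answer above
from_left <- unlist(regmatches(
  address,
  regexpr("^[^,]+, *\\K[^,]+", address, perl = TRUE)
))

# Third field counted from the right (assumed variant)
city <- sub("^.*,\\s*([^,]+),\\s*[^,]+,\\s*[^,]+$", "\\1", address, perl = TRUE)

# Everything after the last comma
country <- sub("^.*,\\s*", "", address, perl = TRUE)
```

For the second address, counting from the left picks up the street ("50 Rockefeller Plaza") rather than the city, which is why the end-anchored form may be the safer choice for less uniform data.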
0
On

Using rex to construct the regular expression may make this type of task a little simpler.

x <- data.frame(
  locationid = c(
    1073744023,
    1073744022,
    1073744025,
    1073744024,
    1073744021,
    1073744026
    ),
  address = c(
    '525 East 68th Street, New York, NY      10065, USA',
    '270 Park Avenue, New York, NY 10017, USA',
    'Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA',
    '1251 Avenue of the Americas, New York, NY 10020, USA',
    '1301 Avenue of the Americas, New York, NY 10019, USA',
    '44 West 45th Street, New York, NY 10036, USA'
    ))

library(rex)

sep <- rex(",", spaces)

re <-
  rex(
    capture(name = "address",
      except_some_of(",")
    ),
    sep,
    capture(name = "city",
      except_some_of(",")
    ),
    sep,
    capture(name = "state",
      uppers
    ),
    spaces,
    capture(name = "zip",
      some_of(digit, "-")
    ),
    sep,
    capture(name = "country",
      something
    ))

re_matches(x$address, re)
#>                      address     city state   zip country
#>1        525 East 68th Street New York    NY 10065     USA
#>2             270 Park Avenue New York    NY 10017     USA
#>3        50 Rockefeller Plaza New York    NY 10020     USA
#>4 1251 Avenue of the Americas New York    NY 10020     USA
#>5 1301 Avenue of the Americas New York    NY 10019     USA
#>6         44 West 45th Street New York    NY 10036     USA

This regular expression will also handle 9-digit ZIP codes (12345-1234) and countries other than the USA.

2
On

Try this code:

library(gsubfn)

cn <- c("Id", "Address", "City", "State", "Zip", "Country")

pat <- "(\\d+) (.+), (.+), (..) (\\d+), (.+)"
read.pattern(text = Lines, pattern = pat, col.names = cn, as.is = TRUE)

giving the following data.frame, from which it's easy to pick off components:

          Id                                  Address     City State   Zip Country
1 1073744023                     525 East 68th Street New York    NY 10065     USA
2 1073744022                          270 Park Avenue New York    NY 10017     USA
3 1073744025 Rockefeller Center, 50 Rockefeller Plaza New York    NY 10020     USA
4 1073744024              1251 Avenue of the Americas New York    NY 10020     USA
5 1073744021              1301 Avenue of the Americas New York    NY 10019     USA
6 1073744026                      44 West 45th Street New York    NY 10036     USA

Explanation: It uses this pattern (when within quotes the backslashes must be doubled):

(\d+) (.+), (.+), (..) (\d+), (.+)

visualized via a Debuggex railroad diagram (image not included) and explained in words as follows:

  • "(\\d+)" - one or more digits (representing the Id) followed by
  • " " - a space followed by
  • "(.+)" - any non-empty string (representing the Address) followed by
  • ", " - a comma and a space followed by
  • "(.+)" - any non-empty string (representing the City) followed by
  • ", " - a comma and a space followed by
  • "(..)" - two characters (representing the State) followed by
  • " " - a space followed by
  • "(\\d+)" - one or more digits (representing the Zip) followed by
  • ", " - a comma and a space followed by
  • "(.+)" - any non-empty string (representing the Country)

It works because regular expression quantifiers are greedy: each one always tries to match the longest string it can, backtracking each time a subsequent portion of the regular expression fails to match.
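
To see the greedy behaviour on the one row with an extra comma, here is a small sketch using base R's `regexec` and `regmatches` instead of `read.pattern`; that swap, the lazy variant `pat_lazy` and the variable names are assumptions made for illustration.

```r
line <- "1073744025 Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA"
cn <- c("Id", "Address", "City", "State", "Zip", "Country")

# Greedy pattern from the answer: the first (.+) takes as much as it can
pat <- "(\\d+) (.+), (.+), (..) (\\d+), (.+)"
greedy <- setNames(regmatches(line, regexec(pat, line, perl = TRUE))[[1]][-1], cn)

# Assumed lazy variant: the first group stops at the earliest comma instead
pat_lazy <- "(\\d+) (.+?), (.+), (..) (\\d+), (.+)"
lazy <- setNames(regmatches(line, regexec(pat_lazy, line, perl = TRUE))[[1]][-1], cn)

greedy[c("Address", "City")]
lazy[c("Address", "City")]
```

With the greedy group the extra field stays in Address and City comes out as "New York"; with the lazy group the extra field spills into City instead.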

The advantage of this approach is that the regular expression is quite simple and straightforward, and the entire code is quite concise, as one read.pattern statement does it all.

Note: We used this for Lines:

Lines <- "1073744023 525 East 68th Street, New York, NY 10065, USA
1073744022 270 Park Avenue, New York, NY 10017, USA
1073744025 Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA
1073744024 1251 Avenue of the Americas, New York, NY 10020, USA
1073744021 1301 Avenue of the Americas, New York, NY 10019, USA
1073744026 44 West 45th Street, New York, NY 10036, USA"