Choosing the right regex expression for coordinates


I have coordinates in various formats and am trying to write a more or less universal conversion routine.

For this I try to parse the individual elements of the string with a regular expression and then pick out the degrees, minutes and seconds by their order of appearance in the string.

For some strings it works, but not for all. I am fairly sure the problem comes down to my limited understanding of regex.

Thus the question: can someone with a better understanding of regex patterns help?

I compiled a short piece of code to demonstrate the problem. Running the example below shows that I get three components for the first four and the last three coordinates. The ones in between yield just two components.

coords = c("-53°30''30.54'",
       "s55°30' 30.54",
       "55°30'30.54n",
       "0°1 0.5S",
       "-0°30'30''s",
       "S55 30 30",
       "-55°30'30''",
       "-55° 30' 30''",
       "-55°   30'   30",
       "-55 sometimes with text rests 30 30''",
       "55°30'30,54S",
       "S55° 30' 30,54",
       "-55° 30' 30.54''"
       )

for (i in 1:length (coords)) {
    pattern   <- gregexpr ("[0-9.]+", coords [i])
    print (as.character (unique (unlist (regmatches (coords [i], pattern)))))
}


<Output>
[1] "53"    "30"    "30.54"
[1] "55"    "30"    "30.54"
[1] "55"    "30"    "30.54"
[1] "0"   "1"   "0.5"
[1] "0"  "30"
[1] "55" "30"
[1] "55" "30"
[1] "55" "30"
[1] "55" "30"
[1] "55" "30"
[1] "55" "30" "54"
[1] "55" "30" "54"
[1] "55"    "30"    "30.54"

The regex below is a pretty impressive monster ;-) Nevertheless, it has some problems when the coordinates are in a slightly different format (e.g. decimal degrees, dec_deg). In this case the first or the second number in the string is not correctly identified. I have compiled a list that includes such coordinates:

coords = c("-53°30''30.54'", "s55°30' 30.54", "55°30'30.54n", "0°1 0.5S", "-0°30'30''s", "S55 30 30", "-55°30'30''", "-55° 30' 30''", "-55° 30' 30", "-55 sometimes with text rests 30 30''", "55°30'30,54S", "S55° 30' 30,54", "-55° 30' 30.54''", "-55.5432 30 30.54", "-55.30.30", "55.555", "55,555S", "S55,555", "S55.555", "55,555°S", "55.555°", "-55,555", "-55.555")

There are 2 best solutions below

We can try using regexec along with regmatches to match exactly three numbers in each row. A "number" here is defined as either an integer or an integer with a decimal component (the decimal point being either dot or comma).

We can convert the list-of-vector output from the above to a matrix using do.call.

regex <- "^.*?(-?\\d+(?:[,.]\\d+)?).*?(-?\\d+(?:[,.]\\d+)?).*?(-?\\d+(?:[,.]\\d+)?).*$"
do.call(rbind, lapply(regmatches(coords, regexec(regex, coords)), function(x) x[2:4]))

      [,1]  [,2] [,3]   
 [1,] "-53" "30" "30.54"
 [2,] "55"  "30" "30.54"
 [3,] "55"  "30" "30.54"
 [4,] "0"   "1"  "0.5"  
 [5,] "-0"  "30" "30"   
 [6,] "55"  "30" "30"   
 [7,] "-55" "30" "30"   
 [8,] "-55" "30" "30"   
 [9,] "-55" "30" "30"   
[10,] "-55" "30" "30"   
[11,] "55"  "30" "30,54"
[12,] "55"  "30" "30,54"
[13,] "-55" "30" "30.54"
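
As a side note, the two-component results in the question appear to come from the call to unique() rather than from the pattern itself: when minutes and seconds are both "30", the duplicate is dropped. The comma in "30,54" is a separate effect, because the character class [0-9.] does not include ",". A minimal base R sketch of both effects (the variable names are only for illustration):

```r
x <- "-55°30'30''"

# All numeric tokens, duplicates kept
tokens <- regmatches(x, gregexpr("[0-9.]+", x))[[1]]
tokens           # "55" "30" "30"

# unique() collapses the repeated "30", as in the loop from the question
unique(tokens)   # "55" "30"

# A decimal comma splits the seconds unless "," is part of the character class
y <- "55°30'30,54S"
comma_split <- regmatches(y, gregexpr("[0-9.]+", y))[[1]]    # "55" "30" "30" "54"
comma_kept  <- regmatches(y, gregexpr("[0-9.,]+", y))[[1]]   # "55" "30" "30,54"
```

So simply dropping unique() from the original loop should already return three components for the rows in the middle of the list.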

It seems to work OK with stringr...

library(stringr)
str_extract_all(str_replace_all(coords, ",", "."), "[0-9.\\-]+")

[[1]]
[1] "-53"   "30"    "30.54"

[[2]]
[1] "55"    "30"    "30.54"

[[3]]
[1] "55"    "30"    "30.54"

[[4]]
[1] "0"   "1"   "0.5"

[[5]]
[1] "-0" "30" "30"

[[6]]
[1] "55" "30" "30"

[[7]]
[1] "-55" "30"  "30" 

[[8]]
[1] "-55" "30"  "30" 

[[9]]
[1] "-55" "30"  "30" 

[[10]]
[1] "-55" "30"  "30" 

[[11]]
[1] "55"    "30"    "30.54"

[[12]]
[1] "55"    "30"    "30.54"

[[13]]
[1] "-55"   "30"    "30.54"
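
Neither approach above goes all the way to a number, so here is a rough base R sketch of the conversion step. The helper name dms_to_decimal is hypothetical, and the rules it applies are assumptions rather than anything stated in the question: at most three numbers per string (degrees, minutes, seconds), missing minutes or seconds count as 0, and a leading "-" or an S/W letter at either end of the string means a negative coordinate. The sign is read from the string because as.numeric("-0") would lose it.

```r
# Hypothetical helper: free-form coordinate strings -> signed decimal degrees.
# Assumes 1 to 3 numbers per string; anything else yields NA.
dms_to_decimal <- function(x) {
  x <- gsub(",", ".", x, fixed = TRUE)   # normalise decimal commas
  # Unsigned numbers only; the sign is handled separately below
  nums <- regmatches(x, gregexpr("[0-9]+(?:\\.[0-9]+)?", x, perl = TRUE))
  vapply(seq_along(x), function(i) {
    n <- as.numeric(nums[[i]])
    if (length(n) == 0 || length(n) > 3) return(NA_real_)
    n <- c(n, 0, 0)[1:3]                 # pad to degrees, minutes, seconds
    # Negative if the string starts with "-" or starts/ends with S or W
    negative <- grepl("^\\s*-", x[i]) ||
      grepl("^\\s*[SsWw]|[SsWw]\\s*$", x[i])
    value <- n[1] + n[2] / 60 + n[3] / 3600
    if (negative) -value else value
  }, numeric(1))
}

res <- dms_to_decimal(c("-55°30'30''", "55°30'30,54S", "-0°30'30''s",
                        "55.555", "S55,555"))
res
```

With these assumptions "-0°30'30''s" comes out at roughly -0.5083 and "S55,555" at -55.555. Ambiguous inputs such as "-55.30.30" or "-55.5432 30 30.54" are not resolved in any meaningful way by this sketch, and free text that happens to end in "s" or "w" would be misread as a hemisphere letter, so such cases would need extra rules.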