I have some property sale data downloaded from the Internet as a PDF file. When I copy and paste the data into a text file, it looks like this:

> a
[1] "Airport West 1/26 Cameron St 3 br t $830000 S Nelson Alexander" "Albert Park 106 Graham St 2 br h $0 SP RT Edgar"  

Take the first element as an example. Each element is the record of one property: suburb (Airport West), address (1/26 Cameron St), number of bedrooms (3), property type (t), price ($830000) and sale type (S). The trailing text (Nelson Alexander) is the agent, which I do not need here.

Before I can analyse this data, I need to extract these fields. I would like to end up with a data frame b like this:

> b
        Suburb         Address Bedroom PropertyType  Price SoldType
1 Airport West 1/26 Cameron St       3            t 830000        S
2  Albert Park   106 Graham St       2            h      0       SP

Could anyone please tell me how to use the stringr package, or another method, to split each long string into the substrings that I need?


1) gsubfn::read.pattern

read.pattern in the gsubfn package takes a regular expression and treats its capture groups (the parts within parentheses) as the fields of each input line, assembling them into a data frame.

library(gsubfn)

pat <- "^(.*?) (\\d.*?) (\\d) br (.) [$](\\d+) (\\w+) .*"
cn <- c("Suburb", "Address", "Bedroom", "PropertyType", "Price", "SoldType")
read.pattern(text = a, pattern = pat, col.names = cn, as.is = TRUE)

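In the pattern, the suburb is taken to end at the first space followed by a digit, so this assumes every address starts with a number; (\\d) likewise assumes a single-digit bedroom count. This gives the following data frame: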

        Suburb         Address Bedroom PropertyType  Price SoldType
1 Airport West 1/26 Cameron St       3            t 830000        S
2  Albert Park   106 Graham St       2            h      0       SP

2) no packages

This can also be done without any packages (pat and cn are from above):

replacement <- "\\1,\\2,\\3,\\4,\\5,\\6"
read.table(text = sub(pat, replacement, a), col.names = cn, as.is = TRUE, sep = ",")
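The sub/read.table approach passes through an intermediate comma-separated string, which would misparse if a suburb or address ever contained a comma. As a hedged variant, still without packages, the capture groups can be pulled out directly with base R's regexec and regmatches. This is only a sketch: it reuses pat and cn from above, uses perl = TRUE for the regex engine, and assumes every element of a matches the pattern (a non-matching line would yield an empty result and break the row binding).

```r
a <- c("Airport West 1/26 Cameron St 3 br t $830000 S Nelson Alexander",
       "Albert Park 106 Graham St 2 br h $0 SP RT Edgar")

pat <- "^(.*?) (\\d.*?) (\\d) br (.) [$](\\d+) (\\w+) .*"
cn <- c("Suburb", "Address", "Bedroom", "PropertyType", "Price", "SoldType")

# For each input string, regmatches() returns the full match followed by
# one element per capture group
m <- regmatches(a, regexec(pat, a, perl = TRUE))

# Drop the full match (first element) and stack the groups row by row
fields <- do.call(rbind, lapply(m, function(x) x[-1]))

b <- as.data.frame(fields, stringsAsFactors = FALSE)
names(b) <- cn

# Captured fields are character strings; convert the numeric columns explicitly
b$Bedroom <- as.integer(b$Bedroom)
b$Price <- as.numeric(b$Price)
b
```

Because the fields never pass through a delimited text form, the column types are not guessed by read.table and have to be converted by hand, as in the last lines.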

Note: The input a in reproducible form is:

a <- c("Airport West 1/26 Cameron St 3 br t $830000 S Nelson Alexander", 
"Albert Park 106 Graham St 2 br h $0 SP RT Edgar")