I scraped a website and now need to clean the "service" column, which is a string.
In the service column of the fl_data dataset, you can see that there are multiple service types, such as Testing Services and Prevention Services. Each service type appears between \n and :, but not every row has every service type.
I need to split the string into columns, so that each column holds one type of service and its elements.
This is my dataset:
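library(rvest)    # read_html, html_nodes, html_text, html_text2
library(purrr)    # map_df
library(dplyr)    # filter, mutate
library(stringr)  # str_remove

url_base <- "https://npin.cdc.gov/search?=type%3Aorganization&page="

# Scrape one results page per index and row-bind the pages into one data frame
raw_data <- map_df(0:0, function(i) {
  cat(".")  # progress indicator
  # url_base has no placeholder for the page number (and its "%" would be
  # read by sprintf as a format specifier), so append the number with paste0
  pg <- read_html(paste0(url_base, i))
  data.frame(
    org_name = html_text2(html_nodes(pg, ".block-field-blocknodeorganizationtitle")),
    street   = html_text(html_nodes(pg, ".address-line1")),
    city     = html_text(html_nodes(pg, ".locality")),
    state    = html_text(html_nodes(pg, ".administrative-area")),
    zip      = html_text(html_nodes(pg, ".postal-code")),
    service  = html_text2(html_nodes(pg, ".services-fieldset")),
    stringsAsFactors = FALSE
  )
})

# Keep Florida organizations and drop the fixed header text from each service string
fl_data <- raw_data |>
  filter(state == "FL") |>
  mutate(service = str_remove(service, "Services\nPlease contact organization for eligibility requirements"))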
You can use for loops to extract the services and their corresponding items. In the result, items are separated with ",".

Created on 2024-03-14 with reprex v2.1.0
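
The loop-based approach described above can be sketched roughly as follows, using base R only. This is a sketch under assumptions, not the original answer's code: the sample strings are made up, and the layout they mimic (a header line ending in ":" followed by one item per line) is a guess at what fl_data$service looks like after the header text is removed. The names parse_services and services_wide are hypothetical.

```r
# Hypothetical sample mimicking the ASSUMED layout of fl_data$service
service <- c(
  "\nTesting Services:\nHIV Testing\nSTD Testing\nPrevention Services:\nCondom Distribution",
  "\nPrevention Services:\nEducation"
)

# Parse one string into a named character vector:
# names are service types, values are that type's items joined with ", ".
# Assumption: only header lines contain a colon.
parse_services <- function(x) {
  lines <- trimws(strsplit(x, "\n", fixed = TRUE)[[1]])
  lines <- lines[nzchar(lines)]
  out <- character(0)
  current <- NA_character_
  for (ln in lines) {
    if (grepl(":", ln, fixed = TRUE)) {
      # Header line: text before the first colon is the service type;
      # anything after the colon is treated as the first item(s)
      current <- trimws(sub(":.*$", "", ln))
      out[current] <- trimws(sub("^[^:]*:", "", ln))
    } else if (!is.na(current)) {
      # Item line: append it to the current service type
      out[current] <- if (nzchar(out[current])) {
        paste(out[current], ln, sep = ", ")
      } else {
        ln
      }
    }
  }
  out
}

parsed <- lapply(service, parse_services)
types <- unique(unlist(lapply(parsed, names)))

# One column per service type, NA where a row lacks that type
services_wide <- as.data.frame(
  matrix(NA_character_, nrow = length(parsed), ncol = length(types)),
  stringsAsFactors = FALSE
)
names(services_wide) <- types

for (i in seq_along(parsed)) {
  for (nm in names(parsed[[i]])) {
    services_wide[i, nm] <- parsed[[i]][[nm]]
  }
}

print(services_wide)
```

With the real data, service would be fl_data$service, and the resulting columns could then be bound to fl_data with cbind(). If item lines can themselves contain a colon, the header test would need to be stricter, for example by matching only lines that end in "Services:".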