I am trying to extract HTML links from a data set. I am using strsplit and then grep to find the substrings containing the links, but the results have unwanted characters at the beginning or the end of the string. How can I extract only the part of the string that matches the desired pattern?
Here is what I am currently doing.
1) I split a chunk of text using strsplit with " " (space) as the delimiter
2) Next I grep the result of strsplit (r) to find the pattern
e.g. grep("https://support\\.google\\.com/blogger/topic/[0-9]", r)
3) A few variations of the result are shown below:
https://support.google.com/blogger/topic/12457
https://support.google.com/blogger/topic/12457.
[https://support.google.com/blogger/topic/12457]
<<https://support.google.com/blogger/topic/12457>>
https://support.google.com/blogger/topic/12457,
https://support.google.com/blogger/topic/12457),
xxxxxxhttps://support.google.com/blogger/topic/12457),hhhththta
etc...
How can I extract just "https://support.google.com/blogger/topic/12457", or, after extracting the dirty data, how can I remove the unwanted punctuation?
Thanks in advance.
The qdapRegex package has an awesome function called rm_url that is perfect for this example.
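If you would rather stay in base R, here is a minimal sketch of another option: grep only tells you which elements contain a match, whereas regmatches combined with gregexpr returns just the matched text, so the surrounding punctuation never makes it into the result. This assumes the topic id is always a run of digits; the vector r below is a stand-in for the tokens produced by strsplit, filled with the examples from the question plus one made-up non-matching token.

```r
# Stand-in for the strsplit output: the variations from the question,
# plus one hypothetical token that contains no link
r <- c(
  "https://support.google.com/blogger/topic/12457",
  "https://support.google.com/blogger/topic/12457.",
  "[https://support.google.com/blogger/topic/12457]",
  "<<https://support.google.com/blogger/topic/12457>>",
  "https://support.google.com/blogger/topic/12457,",
  "https://support.google.com/blogger/topic/12457),",
  "xxxxxxhttps://support.google.com/blogger/topic/12457),hhhththta",
  "nolinkhere"
)

# Dots are escaped so they match literally; [0-9]+ takes the whole id
# and stops at the first non-digit (assumption: ids are digits only)
pattern <- "https://support\\.google\\.com/blogger/topic/[0-9]+"

# gregexpr finds every match in each element; regmatches keeps only
# the matched substrings, and unlist flattens the per-element list
links <- unlist(regmatches(r, gregexpr(pattern, r)))
print(links)
```

Elements without a match simply contribute nothing to links, so no separate grep step is needed. The same call also works on the original unsplit text, which would let you skip strsplit altogether.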