Suppose I have the following submitted text. Note the spaces in most of the URLs:
According to NASA (htt ps://www.nasa. gov) and the New York Times (https://www.nytimes.com/topic/organization/national-aeronautics- and-space -administration), scientists are making lots of new discoveries! There are all kinds of exciting new findings. The
astro-ph
category of ArXiv (https:// arxiv.org /list/astro-ph.GA/new) lists a bunch of new research that is going on. Lucky me! A Google search (https://www.google.com/) turned up more new discoveries!
I want to detect the URLs and replace them with the corrected URL using Python.
The fixed URLs in this text are:
- https://www.nasa.gov
- https://www.nytimes.com/topic/organization/national-aeronautics-and-space-administration
- https://arxiv.org/list/astro-ph.GA/new
- https://www.google.com/
Not all the URLs have spaces. Some URLs have more than one space. The spaces may be in any part of the URL. I can assume the URLs are web URLs (http/https). I think I can assume there will only be spaces (no tabs or newlines). I think I can assume there will not be more than one consecutive space. I think I can assume that tokens/words will not be broken by a space — in other words, spaces will be next to punctuation marks. I cannot assume that all URLs are enclosed in parentheses.
Note: My question is similar to this one, except that the URLs I am hoping to fix are in written text, the spaces may be in any part of the URL, and I am limiting myself to web URLs.
Note: I currently am using the excellent (if overkill) Liberal Regex Pattern for Web URLs here, but it seems insufficient for this job.
Note: I need to both detect and replace the URLs. For my own use, I scan the text and convert it to LaTeX. The URLs get converted to hyperlinks via the \href{}{}
command. In doing so, I need to detect good and bad URLs, fix any bad URLs, create a hyperlink using the correct URL, and then replace the original good or bad URL the corrected URL inside the text body.
I'm assuming the URLs are between
(
and)
. Then you can try to usere
module andurlparse()
for checking the URLs:Prints: