Suppose I have the following submitted text. Note the spaces in most of the URLs:
According to NASA (htt ps://www.nasa. gov) and the New York Times (https://www.nytimes.com/topic/organization/national-aeronautics- and-space -administration), scientists are making lots of new discoveries! There are all kinds of exciting new findings. The astro-ph category of ArXiv (https:// arxiv.org /list/astro-ph.GA/new) lists a bunch of new research that is going on. Lucky me! A Google search (https://www.google.com/) turned up more new discoveries!
I want to detect the URLs and replace them with the corrected URL using Python.
The fixed URLs in this text are:
- https://www.nasa.gov
- https://www.nytimes.com/topic/organization/national-aeronautics-and-space-administration
- https://arxiv.org/list/astro-ph.GA/new
- https://www.google.com/
Not all the URLs have spaces. Some URLs have more than one space. The spaces may be in any part of the URL. I can assume the URLs are web URLs (http/https). I think I can assume there will only be spaces (no tabs or newlines). I think I can assume there will not be more than one consecutive space. I think I can assume that tokens/words will not be broken by a space — in other words, spaces will be next to punctuation marks. I cannot assume that all URLs are enclosed in parentheses.
Note: My question is similar to this one, except that the URLs I am hoping to fix are in written text, the spaces may be in any part of the URL, and I am limiting myself to web URLs.
Note: I currently am using the excellent (if overkill) Liberal Regex Pattern for Web URLs here, but it seems insufficient for this job.
Note: I need to both detect and replace the URLs. For my own use, I scan the text and convert it to LaTeX. The URLs get converted to hyperlinks via the \href{}{} command. In doing so, I need to detect good and bad URLs, fix any bad URLs, create a hyperlink using the correct URL, and then replace the original good or bad URL with the corrected URL inside the text body.
I'm assuming the URLs are between ( and ). Then you can try to use the re module and urlparse() to check the URLs.
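A minimal sketch of that idea, under the same assumption that every URL sits between ( and ): strip the spaces from each parenthesized span and keep the result only if urlparse() sees an http/https scheme and a host. The names PAREN_RE and fix_urls are made up for this illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical helper names; assumes URLs are wrapped in ( and ) and
# contain no nested parentheses.
PAREN_RE = re.compile(r"\(([^()]*)\)")


def fix_urls(text):
    """Remove spaces from parenthesized spans that parse as http/https URLs."""

    def repair(match):
        # Candidate URL: the parenthesized content with all spaces removed.
        candidate = match.group(1).replace(" ", "")
        try:
            parsed = urlparse(candidate)
        except ValueError:
            # urlparse() can reject malformed input; leave such spans alone.
            return match.group(0)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            return "(" + candidate + ")"
        # Not a web URL: return the original span untouched.
        return match.group(0)

    return PAREN_RE.sub(repair, text)


sample = ("According to NASA (htt ps://www.nasa. gov), "
          "a Google search (https://www.google.com/) helps.")
print(fix_urls(sample))
# According to NASA (https://www.nasa.gov), a Google search (https://www.google.com/) helps.
```

This does not cover URLs outside parentheses, which the question says can occur, and a parenthetical that merely starts with a URL, such as "(http://example.com is nice)", would be wrongly collapsed into one string. Those cases would need a different detection step.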