How can I detect and "fix" URLs in a body of text if they have spaces?

Suppose I have the following submitted text. Note the spaces in most of the URLs:

According to NASA (htt ps://www.nasa. gov) and the New York Times (https://www.nytimes.com/topic/organization/national-aeronautics- and-space -administration), scientists are making lots of new discoveries! There are all kinds of exciting new findings. The astro-ph category of ArXiv (https:// arxiv.org /list/astro-ph.GA/new) lists a bunch of new research that is going on. Lucky me! A Google search (https://www.google.com/) turned up more new discoveries!

I want to detect the URLs and replace them with the corrected URL using Python.

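The fixed URLs in this text are:

https://www.nasa.gov
https://www.nytimes.com/topic/organization/national-aeronautics-and-space-administration
https://arxiv.org/list/astro-ph.GA/new
https://www.google.com/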

Not all the URLs have spaces. Some URLs have more than one space. The spaces may be in any part of the URL. I can assume the URLs are web URLs (http/https). I think I can assume there will only be spaces (no tabs or newlines). I think I can assume there will not be more than one consecutive space. I think I can assume that tokens/words will not be broken by a space — in other words, spaces will be next to punctuation marks. I cannot assume that all URLs are enclosed in parentheses.

Note: My question is similar to this one, except that the URLs I am hoping to fix are in written text, the spaces may be in any part of the URL, and I am limiting myself to web URLs.

Note: I currently am using the excellent (if overkill) Liberal Regex Pattern for Web URLs here, but it seems insufficient for this job.

Note: I need to both detect and replace the URLs. For my own use, I scan the text and convert it to LaTeX. The URLs get converted to hyperlinks via the \href{}{} command. In doing so, I need to detect good and bad URLs, fix any bad URLs, create a hyperlink using the correct URL, and then replace the original good or bad URL with the corrected URL inside the text body.

There are 2 best solutions below

I'm assuming the URLs are between ( and ). Then you can use the re module to find them and urlparse() to check them:

import re
from urllib.parse import urlparse

text = """\
According to NASA (htt ps://www.nasa. gov) and the New York Times (https://www.nytimes.com/topic/organization/national-aeronautics- and-space -administration), scientists are making lots of new discoveries! There are all kinds of exciting new findings. The astro-ph category of ArXiv (https:// arxiv.org /list/astro-ph.GA/new) lists a bunch of new research that is going on. Lucky me! A Google search (https://www.google.com/) turned up more new discoveries!
"""

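# Capture everything from "(http" (tolerating spaces inside "http") up to the next ")".
pat = r"\(\s*(h\s*t\s*t\s*p[^)]+)"

for url in re.findall(pat, text):
    # Remove the stray spaces.
    url = url.replace(" ", "")

    # Try to parse the URL. urlparse() is lenient and raises ValueError
    # only for a few malformed inputs, so this is a light sanity check.
    try:
        urlparse(url)
    except ValueError:
        continue

    print(url)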

Prints:

https://www.nasa.gov
https://www.nytimes.com/topic/organization/national-aeronautics-and-space-administration
https://arxiv.org/list/astro-ph.GA/new
https://www.google.com/
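
Since the question also asks to replace the URLs inside the text, here is a minimal sketch of that step using re.sub() with a callback. It keeps the same assumption that every URL sits between ( and ). The names fix_urls and as_href, and the fixed link text "link", are made up for illustration and are not part of any library.

```python
import re

# Assumption: every URL is enclosed in parentheses, as in the sample text.
URL_IN_PARENS = re.compile(r"\(\s*(h\s*t\s*t\s*p[^)]+)\)")

def fix_urls(text, as_href=False):
    """Return text with the spaces removed from each parenthesised URL."""
    def _repl(match):
        url = match.group(1).replace(" ", "")
        if as_href:
            # Hypothetical formatting choice; adjust the link text as needed.
            return r"(\href{%s}{link})" % url
        return "(" + url + ")"
    # A function passed to sub() returns the replacement verbatim,
    # so the backslash in \href is not treated as an escape.
    return URL_IN_PARENS.sub(_repl, text)

sample = "NASA (htt ps://www.nasa. gov) and ArXiv (https:// arxiv.org /list/astro-ph.GA/new)."
print(fix_urls(sample))
# NASA (https://www.nasa.gov) and ArXiv (https://arxiv.org/list/astro-ph.GA/new).
```

Calling fix_urls(sample, as_href=True) wraps each cleaned URL in \href{...}{link} instead. URLs outside parentheses are left untouched by this sketch.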

Well, I'm pretty new to Python but I wanted to give it a shot and try to solve the issue with what I'm learning, which is splitting and slicing. So assuming the URLs are always inside ()...

content = "According to NASA (htt ps://www.nasa. gov) and the New York Times (https://www.nytimes.com/topic/organization/national-aeronautics- and-space -administration), scientists are making lots of new discoveries! There are all kinds of exciting new findings. The astro-ph category of ArXiv (https:// arxiv.org /list/astro-ph.GA/new) lists a bunch of new research that is going on. Lucky me! A Google search (https://www.google.com/) turned up more new discoveries!"

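# Split on "(" so that every chunk after the first starts just inside a parenthesis.
chunks = content.split('(')

for chunk in chunks[1:]:
    if ')' in chunk:
        # Keep everything up to the closing parenthesis.
        new_link = chunk[:chunk.find(')')]
        # Drop the stray spaces.
        clean_link = new_link.replace(' ', '')
        print(clean_link)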

This prints:

https://www.nasa.gov
https://www.nytimes.com/topic/organization/national-aeronautics-and-space-administration
https://arxiv.org/list/astro-ph.GA/new
https://www.google.com/
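
One caveat: the loop above prints whatever is inside any pair of parentheses, URL or not. Below is a rough sketch of the same split-based idea that keeps only snippets that look like http/https URLs once the spaces are gone. It still assumes the URLs are inside (), and the function name urls_in_parens is just something I made up.

```python
from urllib.parse import urlparse

def urls_in_parens(text):
    """Collect space-free http(s) URLs found between '(' and ')'."""
    found = []
    for chunk in text.split("(")[1:]:
        inside, closed, _ = chunk.partition(")")
        if not closed:
            continue  # this chunk has no closing parenthesis
        candidate = inside.replace(" ", "")
        try:
            parts = urlparse(candidate)
        except ValueError:
            continue  # urlparse rejected the candidate outright
        # Keep only candidates that have a web scheme and a host.
        if parts.scheme in ("http", "https") and parts.netloc:
            found.append(candidate)
    return found

print(urls_in_parens("NASA (htt ps://www.nasa. gov) is an agency (founded in 1958)."))
# ['https://www.nasa.gov']
```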