Resolve masked/shortened URL twint is scraping from twitter

I am using twint for scraping twitter profiles.

When I run this script:

    c = twint.Config()
    c.Username = username
    c.Store_object = True
    c.Store_object_users_list = users
    c.Hide_output = True
    twint.run.Lookup(c)
    try:
        userna = users[0]
    except IndexError:  # the lookup returned no user; skip to the next username in the enclosing loop
        continue
    web = userna.url

I get the masked/shortened URL instead of the real one. How can I get the real URL?

What would you advise?

There are 2 best solutions below

The following works.

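    import requests

    # short_link is the shortened URL, e.g. the value of userna.url
    resp = requests.head(short_link)  # HEAD requests do not follow redirects by default
    # resp.status_code holds the redirect status (3xx) when the link is shortened
    true_url = resp.headers["Location"]  # only present when the response is a redirect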

Preface: This answer is based on my results and findings from evaluating UVuuMe's answer (in its initial version).


To translate a shortened URL into the full URL it represents, you can use the requests package. It is installed indirectly with twint (twint requires googletransx, which in turn requires requests), so there is no need to run pip install requests.

Send a HEAD request and check the response's status_code for a redirect. The example URL below answers with 303; other shorteners commonly use 301 or 302 instead. Only then read the Location header: a redirect response carries one, but an HTTP 200 OK response does not.

    import requests

    # Short URL for a Python Requests tutorial
    short_url = 'https://youtu.be/tb8gHvYlCFs'

    # requests.head() does not follow redirects by default
    res = requests.head(short_url)
    if res.status_code in (301, 302, 303, 307, 308):  # redirect, e.g. 303 "See Other"
        full_url = res.headers['location']
    elif res.status_code == 200:  # "OK"
        # conclude that short_url is already the URL we are looking for
        full_url = short_url
    else:
        # replace with your own error handling
        raise RuntimeError(f'unexpected status {res.status_code} for {short_url}')
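
A shortened link can also point at another shortened link, so a single HEAD request may not reach the final page. The following is only a minimal sketch of following the chain hop by hop, not a definitive implementation. The names resolve_url, head, max_hops and REDIRECT_CODES are my own, not part of requests or twint, and the head parameter is a hypothetical hook that lets you pass a stand-in for requests.head.

```python
from urllib.parse import urljoin

# Status codes treated as redirects in this sketch
REDIRECT_CODES = {301, 302, 303, 307, 308}


def resolve_url(short_url, head=None, max_hops=10, timeout=10):
    """Follow HEAD redirects one hop at a time and return the final URL.

    `head` is a hypothetical injection point: any callable that accepts the
    same arguments as requests.head. It defaults to requests.head.
    """
    if head is None:
        import requests  # imported lazily so a stub can be used instead
        head = requests.head

    url = short_url
    for _ in range(max_hops):
        res = head(url, allow_redirects=False, timeout=timeout)
        location = res.headers.get("location")
        if res.status_code not in REDIRECT_CODES or not location:
            # Not a redirect: treat the current URL as the final one
            return url
        # The Location header may be relative, so resolve it against url
        url = urljoin(url, location)
    raise RuntimeError(f"more than {max_hops} redirects for {short_url}")
```

With the question's code this would be called as web = resolve_url(userna.url). If you do not need control over each hop, requests can follow the chain itself: requests.head(short_url, allow_redirects=True).url gives the URL of the last response.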