I'm trying to extract the URLs from a few links using tldextract. Since my links are in different formats, can anybody help me extract the URL?
import tldextract
ext = tldextract.extract('booking.com__booking.com_content_privacy.html?label=gen173nr-1FCAEoggI46AdIM1gEaLUBiAEBmAExuAEHyAEP2AEB6AEB-AECiAIBqAIDuALVsdeSBsACAdICJDBkZWExNDc4LWZ')
In the example above, I want to extract booking.com, but it doesn't give the desired result.
You need to provide the right input.
booking.com__booking.com_content_privacy.html?label=gen173nr-1FCAEoggI46AdIM1gEaLUBiAEBmAExuAEHyAEP2AEB6AEB-AECiAIBqAIDuALVsdeSBsACAdICJDBkZWExNDc4LWZ is NOT a valid URL. For examples of the input it expects and its usage, see: https://github.com/john-kurkowski/tldextract
Probably, tldextract isn't the right lib for you. You need to clean those URLs first and then process them. Maybe replace __ with /. It's more of a data cleaning task and is very specific to your input data. This might help: Extract domain from URL in python
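As a rough sketch of the cleaning step suggested above, something like the following might work. It assumes that `__` stands in for the first `/` after the hostname, which is only a guess based on the single example in the question; the function name `host_from_mangled_link` is made up for illustration. It uses only the standard library, so tldextract is not required for this part.

```python
from urllib.parse import urlsplit


def host_from_mangled_link(raw: str) -> str:
    """Return the hostname of a link where '__' replaced '/' (assumed format)."""
    # Assumption: '__' marks where the path starts, so turn it back into '/'.
    cleaned = raw.replace("__", "/")
    # urlsplit only recognises a hostname when a scheme is present.
    if "://" not in cleaned:
        cleaned = "http://" + cleaned
    # hostname is None when nothing host-like was found; fall back to ''.
    return urlsplit(cleaned).hostname or ""


link = "booking.com__booking.com_content_privacy.html?label=gen173nr"
print(host_from_mangled_link(link))
```

If you still want to split the result into subdomain, domain and suffix, the cleaned hostname can then be passed to `tldextract.extract(...)`. Links in other formats will likely need their own cleaning rules.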