I'm using mechanize to get the top results from a Yahoo search and scrape data from them, but Yahoo provides only "dirty" redirect URLs, which cause errors in further processing. Is there a way to obtain the original link?
Example: for the result stackoverflow.com, I get the following tag:
<a dirtyhref="http://r.search.yahoo.com/_ylt=A0SO8zEuKGZUteYAEHRXNyoA;_ylu=X3oDMTEzODh2cDk0BHNlYwNzcgRwb3MDMQRjb2xvA2dxMQR2dGlkA1ZJUDI0NF8x/RV=2/RE=1416009903/RO=10/RU=http%3a%2f%2fstackoverflow.com%2f/RK=0/RS=a.mWRIy6IMjJQysgixByd8053hE-" id="link-1" class="yschttl spt" href="http://r.search.yahoo.com/_ylt=A0SO8zEuKGZUteYAEHRXNyoA;_ylu=X3oDMTEzODh2cDk0BHNlYwNzcgRwb3MDMQRjb2xvA2dxMQR2dGlkA1ZJUDI0NF8x/RV=2/RE=1416009903/RO=10/RU=http%3a%2f%2fstackoverflow.com%2f/RK=0/RS=a.mWRIy6IMjJQysgixByd8053hE-" target="_blank" data-bk="5054.1"> <b>Stack Overflow</b> - Official Site </a>
This tag represents http://stackoverflow.com
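Assuming that you can easily isolate the content of dirtyhref (you can use BeautifulSoup to parse the tag, http://www.crummy.com/software/BeautifulSoup/bs4/doc/), you can use the urlparse package to get only the path (https://docs.python.org/2/library/urlparse.html#urlparse.urlparse). Now you'll have it in a string like:
/_ylt=A0SO8zEuKGZUteYAEHRXNyoA;_ylu=X3oDMTEzODh2cDk0BHNlYwNzcgRwb3MDMQRjb2xvA2dxMQR2dGlkA1ZJUDI0NF8x/RV=2/RE=1416009903/RO=10/RU=http%3a%2f%2fstackoverflow.com%2f/RK=0/RS=a.mWRIy6IMjJQysgixByd8053hE-
It looks to me that the fields are separated by /, so you can split the path on that character into a list, say fields. Assuming that the field you are interested in is always the sixth, it is fields[5] (the list starts with an empty string, because the path begins with /), which here is RU=http%3a%2f%2fstackoverflow.com%2f.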
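The steps up to this point can be sketched as follows. This is only a minimal sketch, assuming the dirtyhref value has already been extracted into a string. The variable names dirty, path and ru_field are illustrative. The import fallback is there only so the snippet runs on both Python 2 and Python 3, where urlparse lives in urllib.parse.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

# The dirtyhref value from the question, assumed to be already extracted.
dirty = ("http://r.search.yahoo.com/_ylt=A0SO8zEuKGZUteYAEHRXNyoA;"
         "_ylu=X3oDMTEzODh2cDk0BHNlYwNzcgRwb3MDMQRjb2xvA2dxMQR2dGlkA1ZJUDI0NF8x"
         "/RV=2/RE=1416009903/RO=10/RU=http%3a%2f%2fstackoverflow.com%2f"
         "/RK=0/RS=a.mWRIy6IMjJQysgixByd8053hE-")

# Keep only the path component of the redirect URL.
path = urlparse(dirty).path

# The path starts with "/", so the first element of the list is an empty string.
fields = path.split("/")

# Sixth element (index 5), assuming the layout never changes.
ru_field = fields[5]
print(ru_field)  # RU=http%3a%2f%2fstackoverflow.com%2f
```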
Finally, strip the leading RU= and use unquote from the urllib package (https://docs.python.org/2/library/urllib.html#urllib.unquote) to decode the rest, which gives http://stackoverflow.com/. You can also avoid assuming that the URL is always in the sixth field by looping over fields and checking which one starts with RU=.
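Putting the decoding step and the RU= search together might look like the sketch below. The helper name original_url is hypothetical, and the function assumes the redirect URL keeps the /RU=.../ layout shown in the question. It returns None when no such field is present. The import fallback again covers both Python 2 and Python 3, where unquote lives in urllib.parse.

```python
try:
    from urllib.parse import urlparse, unquote  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2
    from urllib import unquote


def original_url(dirty):
    """Return the decoded RU= target of a Yahoo redirect URL, or None."""
    for field in urlparse(dirty).path.split("/"):
        # Do not rely on the position; look for the RU= prefix instead.
        if field.startswith("RU="):
            return unquote(field[len("RU="):])
    return None


dirty = ("http://r.search.yahoo.com/_ylt=A0SO8zEuKGZUteYAEHRXNyoA;"
         "_ylu=X3oDMTEzODh2cDk0BHNlYwNzcgRwb3MDMQRjb2xvA2dxMQR2dGlkA1ZJUDI0NF8x"
         "/RV=2/RE=1416009903/RO=10/RU=http%3a%2f%2fstackoverflow.com%2f"
         "/RK=0/RS=a.mWRIy6IMjJQysgixByd8053hE-")

print(original_url(dirty))  # http://stackoverflow.com/
```

Searching for the prefix costs one short loop per link and keeps working if Yahoo adds or removes a field before RU=.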