blacklist href in python to remove junk sites

197 Views

I want it to print every site that isn't blacklisted (that's what the code below is meant to do), but it doesn't work. If you change `pass` in the last if statement to `print(site)`, it prints everything in the blacklist, yet it won't print everything that isn't blacklisted, which is my goal.

import fnmatch

import requests
from bs4 import BeautifulSoup

url = "http://stackoverflow.com"
blacklist = ['*stackoverflow.com*', '*stackexchange.com*']

r = requests.get(url, timeout=6, verify=True)
soup = BeautifulSoup(r.content, 'html.parser')

for link in soup.select('a[href*="http"]'):
    site = str(link.get('href'))
    for filtering in blacklist:
        if fnmatch.fnmatch(site, filtering):
            pass
        else:
            print(site)

1 Answer


You want something like:

import fnmatch

import requests
from bs4 import BeautifulSoup

url = "http://stackoverflow.com"
blacklist = ['*stackoverflow.com*', '*stackexchange.com*']

r = requests.get(url, timeout=6, verify=True)
soup = BeautifulSoup(r.content, 'html.parser')

for link in soup.select('a[href*="http"]'):
    site = str(link.get('href'))
    # Skip the link if it matches any blacklist pattern
    if any(fnmatch.fnmatch(site, filtering) for filtering in blacklist):
        continue
    print(site)

The issue happens here (old code):

for filtering in blacklist:
    if fnmatch.fnmatch(site, filtering):
        pass
    else:
        print(site)

While you're iterating, a blacklisted site matches one pattern in the blacklist but not the other, so the else branch still runs for the pattern that didn't match and the site gets printed anyway. There are multiple solutions; mine uses any() to check whether the match result is True at least once, and if it is, continue the loop and skip the print :D
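
To see why any() fixes it, here's a minimal, network-free sketch of the same filtering logic (the sample URLs are made up for illustration):

```python
import fnmatch

blacklist = ['*stackoverflow.com*', '*stackexchange.com*']
links = [
    "https://stackoverflow.com/questions",
    "https://example.com/page",
    "https://meta.stackexchange.com/",
]

# A link is kept only if NO blacklist pattern matches it.
# any() short-circuits: as soon as one pattern matches, the link is skipped.
kept = [site for site in links
        if not any(fnmatch.fnmatch(site, pattern) for pattern in blacklist)]
print(kept)  # prints ['https://example.com/page']
```

With the original pass/else version, "https://stackoverflow.com/questions" would still be printed once, because it fails to match the `*stackexchange.com*` pattern and that iteration's else branch fires.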