What's the best way to scale millions of DNS queries? (Python)


I initially intended to write this in Python, but I am completely open to writing it in any language.

I'm working with the Alexa top 1 million right now, and I want to find out how many of these domains have DMARC enabled. I'm using dnspython to fetch the TXT record on the _dmarc subdomain of each domain. This is version 1; when I have enough time, I will eventually parse the results and load them into a DynamoDB table or Postgres database (TBD). Time is a big factor because this data set will grow in the future.
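For reference, the bare lookup with dnspython's resolver looks roughly like the sketch below (this assumes dnspython 2.x, where dns.resolver.resolve() replaced the older dns.resolver.query(); the has_dmarc helper name is just for illustration):

import dns.resolver
import dns.exception

def has_dmarc(domain):
    # True if _dmarc.<domain> has a TXT record starting with v=DMARC1
    try:
        answers = dns.resolver.resolve(f"_dmarc.{domain}", "TXT")
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer,
            dns.resolver.NoNameservers, dns.exception.Timeout):
        return False
    return any(str(rdata).strip('"').startswith("v=DMARC1") for rdata in answers)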

The fastest version of the code I have so far is below. It takes approximately 26 hours to run at roughly 440 queries per minute, and I want to bring that time down as much as I can.

I considered breaking this up into batches and starting those at the same time, but if I wanted to seriously cut down on time, I would have to run more than 100 batches. (Naturally, the number of batches would also grow as the data set expands.) A rough sketch of that idea follows the current code below.

import csv
import tqdm
from utils.dns import DNS
from concurrent import futures

# Load the Alexa top 1m CSV (rank, domain) and keep only the domain column
with open('../data/top-1m.csv', 'r') as f:
    reader = csv.reader(f)
    domains = [domain for _, domain in reader]

def runDNS(domain):
    # Fetch the TXT record on the _dmarc subdomain
    return DNS().query(f"_dmarc.{domain}", "TXT")

# Generate a tqdm progress bar for domains
with tqdm.tqdm(total=len(domains), desc="Check DMARC") as tqdm_domains:
    with futures.ProcessPoolExecutor(max_workers=600) as executor:
        # Submit one lookup per domain and advance the bar as each one completes
        for _ in executor.map(runDNS, domains):
            tqdm_domains.update(1)
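For what it's worth, the batching idea mentioned above could look something like the sketch below. It reuses the runDNS helper from the code above; the batch size and worker count are arbitrary assumptions, and depending on how the DNS utility signals missing records, each lookup may need its own try/except.

from concurrent import futures

def resolve_batch(batch):
    # Resolve one batch of domains serially inside a single worker process
    return [(domain, runDNS(domain)) for domain in batch]

batch_size = 1000  # assumed; tune based on memory and timeout behaviour
batches = [domains[i:i + batch_size] for i in range(0, len(domains), batch_size)]

with futures.ProcessPoolExecutor(max_workers=100) as executor:
    for results in executor.map(resolve_batch, batches):
        pass  # collect or parse the (domain, answer) pairs here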
