Rocks cluster head node DNS failure. Compute nodes unable to resolve hostnames


I've been tasked with maintaining a Rocks (CentOS 6.2-based) cluster where the head node has a static IP on the public network and acts as a NAT router for the compute nodes on the internal private network. The nodes are connected to the head node by standard Ethernet and also QDR InfiniBand.

Recently, the compute nodes have been unable to reach an external data source to begin computations: DNS lookup fails when they use wget to pull down publicly available datasets. When I use the IP address of some of the data sources for manually initiated transfers, data flows again, but some of the applications cannot fetch data by raw IP. The head node can access and resolve names fine; it's only the compute nodes behind the NAT head node that are failing.

What I've checked so far: all compute nodes have the head node's IP in their /etc/resolv.conf, and the iptables firewall on the head node is unchanged. SSH works between all nodes and the head node. I've tried restarting named and the iptables firewall, and so far nothing has fixed it. System logs (dmesg, /var/log/messages) show no sudden failures or error messages, I've made no recent configuration changes, and everything had worked fine for months until about two nights ago.
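To localize a failure like this, it helps to first confirm each compute node really points at the head node for DNS. A minimal sketch of that check, assuming a head-node private address of 10.1.1.1 (a placeholder, substitute your own) and demonstrated against a sample file rather than the live /etc/resolv.conf:

```shell
# Minimal sketch: confirm a compute node's resolv.conf lists the head
# node's private IP as its nameserver. 10.1.1.1 is an assumed address;
# on a real node, point this at /etc/resolv.conf instead.

check_resolv() {
  # $1 = path to a resolv.conf, $2 = expected head-node IP
  if grep -q "^nameserver[[:space:]]*$2" "$1"; then
    echo "OK: $1 points at $2"
  else
    echo "FAIL: $1 does not list $2"
  fi
}

# Demonstrate against a sample file:
printf 'search local\nnameserver 10.1.1.1\n' > /tmp/resolv.sample
check_resolv /tmp/resolv.sample 10.1.1.1
# prints: OK: /tmp/resolv.sample points at 10.1.1.1
```

On a live node the same check is just `check_resolv /etc/resolv.conf <head-node-ip>`, followed by `dig @<head-node-ip> google.com` to query the head node's resolver directly.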

I'm still unfamiliar with all the workings of Rocks and am not sure if there is some special rocks command(s) that I'm overlooking to get this to work again. What might I be missing to get DNS resolution working again?

Thanks in advance!

UPDATE: DNS is working internally between the compute nodes and the head node (e.g. compute-10-10 resolves to that node's IP from all other nodes), so the head node is functioning properly as the cluster DNS server. Requests for domains outside the local zone still fail (e.g. nslookup google.com) from all compute nodes.
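Since names inside the cluster zone resolve but queries for outside domains do not, the forwarders configured in the head node's named.conf are the natural next suspect: each listed upstream can be tested directly (e.g. `dig @<forwarder-ip> google.com`). A sketch that pulls the forwarder IPs out of the config, demonstrated on a sample file in place of /etc/named.conf (paths and addresses are illustrative):

```shell
# Sketch: list the forwarder IPs from a named.conf so each upstream
# can be tested individually. Sample file and IPs are placeholders.

list_forwarders() {
  # Print the forwarders { ...; }; block, then extract dotted-quad IPs
  sed -n '/forwarders[[:space:]]*{/,/}/p' "$1" |
    grep -oE '[0-9]+(\.[0-9]+){3}'
}

# Sample config standing in for /etc/named.conf:
cat > /tmp/named.sample <<'EOF'
options {
    forwarders { 192.168.100.1; 8.8.8.8; };
};
EOF

list_forwarders /tmp/named.sample
# prints:
# 192.168.100.1
# 8.8.8.8
```

Running `dig @<each-ip> google.com` against every address this prints will show immediately whether one of the upstreams has gone dead.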

ANSWER:
Root cause was a failed upstream DNS server. I reconfigured the forwarders in /etc/named.conf to point at other working servers, and all compute nodes could access external resources once again.
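For reference, the change amounts to editing the forwarders statement in the options block of /etc/named.conf on the head node and then restarting named (`service named restart` on CentOS 6, or `rndc reload`). The addresses below are placeholders, not the servers actually used here:

```
options {
    // Replace the failed upstream with reachable resolvers
    // (addresses are placeholders for illustration).
    forwarders { 8.8.8.8; 8.8.4.4; };
    // "forward first" (BIND's default) lets named fall back to the
    // root servers if all forwarders stop responding; "forward only"
    // would make a dead forwarder break resolution entirely.
    forward first;
};
```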