Ganglia Web does not show status of other nodes, only shows the meta node. Possible reasons?

1.1k Views Asked by At

I have 3 machines in my HDFS cluster: pr2 is the meta node, and pr1 and pr3 are the other 2 nodes. I want to monitor them with ganglia web. To that end, Ganglia (with both gmetad and gmond) has been installed on all 3 nodes, and ganglia web has been installed on the meta node, pr2.

I am running the following:

On pr2:

service gmetad start
service gmond start
service httpd start

On pr1 and pr3:

service gmond start

I expect to see in ganglia web information about all 3 nodes. But this is what I see:

enter image description here

Only 1 node seems to be running, which is pr2.

What could be the reason?

I have explored the following possibilities when configuring the setup:

  1. In the gmond configuration file, there is a place to specify the "host machine". What should be the "host" here? For example, for node pr1, should it be the local machine (pr1), or the meta node (pr2)? I have tried with both pr1 and pr2 separately, and I still can't see info of pr1 or pr3 on ganglia web. Also, do I need to add anything more in this section?

[root@pr1]# nano /usr/local/etc/gmond.conf

/* The host section describes attributes of the host, like the location */
host {
  location = "unspecified" ### Should I replace "unspecified" with pr2 (the meta node), or with pr1? ###
}
  1. Also, in the same configuration file, there is a section for configuring udp_send_channels. There is a parameter mcast_join, which I think is to specify the meta node. So I changed the IP to pr2. But since ganglia web is not showing info of pr1 and pr3, I tried changing it to pr1 and pr3 respectively, but to no avail.

[root@pr1]# nano /usr/local/etc/gmond.conf

udp_send_channel {
  #bind_hostname = yes 
  mcast_join = 239.2.11.71 ### Should I change this IP to pr2 or pr1? ###
  port = 8649
  ttl = 1
}
  1. Again, in the same file, there's a section for configuring udp_recv_channel. I have commented out the mcast_join and bind here, since I thought the pr1 and pr3 nodes do not need to receive any info from anything, they only need to send info to the meta node, pr2. Am I wrong?

[root@pr1]# nano /usr/local/etc/gmond.conf

udp_recv_channel {
  #mcast_join = 239.2.11.71
  port = 8649
  #bind = 239.2.11.71
  retry_bind = true
  # Size of the UDP buffer. If you are handling lots of metrics you really
  # should bump it up to e.g. 10MB or even higher.
  # buffer = 10485760
}
  1. Then I run the following smoke test:

gmond -d 5 -c /usr/local/etc/gmond.conf

and it returns the following (showing the tail part of the long output):

...
...
Counting device /dev/mapper/centos-root (2.69 %)
Counting device /dev/mapper/centos-home (0.02 %)
Counting device /dev/sda1 (18.49 %)
For all disks: 206.174 GB total, 204.498 GB free for users.
    metric 'part_max_used' has value_threshold 1.000000
    sent message 'heartbeat' of length 48 with 0 errors
Processing a metric value message from pr1.embs
Got a heartbeat message 1511171541

This means that gmond is working correctly, am I right?

  1. Then I configure the gmond init file (for calling the command), and make the following 2 changes: (i) point GMOND to the correct path, and (ii) point the daemon to the correct path of the gmond.conf file. Could I possibly be doing anything wrong here?

[root@pr1]# nano /etc/init.d/gmond

### Change here ###
#GMOND=/usr/sbin/gmond
GMOND=/usr/local/sbin/gmond

# TMPDIR set to SysV IPC ramdrive to avoid include processing failures
TMPDIR=/dev/shm
export TMPDIR

. /etc/rc.d/init.d/functions

start() {
    [ -x $GMOND ] || exit 5
    [ -f /etc/ganglia/gmond.conf ] || exit 6
    echo -n "Starting GANGLIA gmond: "

    ### Change here ###
    #daemon $GMOND
    daemon $GMOND -c /usr/local/etc/gmond.conf
  1. Then I modify the gmetad.conf file to change the datasource and add the cluster name and the user name. Here, should I change the datasource to the local machine (pr1 or pr3), or to the meta node (pr2)?

[root@pr1]# nano /usr/local/etc/gmetad.conf

===(change)
data_source "edhcluster" pr1 ### Should I change this to pr2, the meta node?
===

===(add)
gridname "edhcluster"
===

===(add)
setuid_username "ganglia"
===

Or, an even bigger question, should I not change anything in this file of the nodes pr1 and pr3 (as only pr2 is the meta node that will monitor everything), and instead change the gmetad.conf file of the meta node, pr2, only to include in its datasource all the 3 nodes, like so:

[root@pr**2**]# nano /usr/local/etc/gmetad.conf

data_source "edhcluster" pr1 pr2 pr3

But then, I read somewhere a little while ago while browsing to solve this problem, that writing the names of the nodes like this means that it'll first check pr1, if it is down it'll check pr2, and if that is down too, it'll check pr3. So, if we want to make it collect data from all 3 machines, we should write separate lines for each node, like so:

[root@pr2]# nano /usr/local/etc/gmetad.conf

data_source "edhcluster" pr1
data_source "edhcluster" pr2
data_source "edhcluster" pr3

I tried this as well, and then restarted all the services in all the nodes (gmetad, gmond and httpd in pr2, and gmond in pr1 and pr3), but ganglia web still doesn't show anything for the other 2 nodes.

  1. Then I modified the gmetad.init file (for calling the command) on pr1 to make the GMETAD daemon point to the correct path of the gmetad.conf file. However, I suspect this might be wrong, as the gmetad of pr2 should handle this. So, am I correct in identifying this as the mistake? If yes, how do I make the gmetad daemon point to the path of the config file in another node, the meta node, pr2?

[root@pr1]# nano /etc/init.d/gmetad

### Change here ###
#GMETAD=/usr/sbin/gmetad
GMETAD=/usr/local/sbin/gmetad

. /etc/rc.d/init.d/functions

test -f /etc/sysconfig/gmetad && . /etc/sysconfig/gmetad

export RRDCACHED_ADDRESS

start() {
    [ -x $GMETAD ] || exit 5
    [ -f /etc/ganglia/gmetad.conf ] || exit 6
    echo -n "Starting GANGLIA gmetad: "

    ### Change here ###
    #daemon $GMETAD
    daemon $GMETAD -c /usr/local/etc/gmetad.conf 
    ### If this means that the gmetad of the local system 
    ### will monitor this node, how do I make the daemon 
    ### point to the gmetad.conf file of the meta node, pr2?

    RETVAL=$?
    echo
    [ $RETVAL -eq 0 ] && touch /var/lock/subsys/gmetad
    return $RETVAL
}

Thanks for going through all the details of the configuration steps. If I missed or overlooked any other possible causes, please feel free to point them out.

0

There are 0 best solutions below