Monitoring Puppet – Part 1

Monitoring Puppet is an often discussed topic in user forums and mailing lists. Questions like “How can I ensure all my Puppet clients are running?” or “How can I be sure my Puppet master is always serving catalogues to its clients?” arise as soon as more than a handful of boxes have to be managed by Puppet. This is not because Puppet is badly written software – Puppet is great! These questions come up because Puppet has to run on tens and hundreds of different operating systems, versions, environments and circumstances, so there cannot be one single solution that fits everything. For example, I have encountered problems caused by the underlying virtualisation, and sometimes by some weird syslog issue which simply prevented the Puppet client from doing anything – the process was actually running, but did nothing. So a simple Icinga/Nagios process check is an obvious first step, but not at all sufficient to ensure a working Puppet client!
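Just for illustration, this is roughly what such a basic process check looks like – a minimal sketch using the standard check_procs plugin (the plugin path and the process name to match are assumptions and vary between distributions and Puppet versions):

#!/bin/bash
# Naive liveness check: CRITICAL if no puppet agent process is found at all.
# It says nothing about whether the agent actually applies its catalogue.
/usr/lib/nagios/plugins/check_procs -c 1: -C puppet

In the hung-client situation described above, a check like this would happily keep reporting OK – which is exactly why it is not enough on its own.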

We also have a very special case, since our servers, mainly running as test systems for QA, often perform time jumps. That is, in order to simulate some situation in the future, the date on the boxes is changed to, let's say, some day in December although it is just October. After the tests are finished, the machines switch back to real time. This is of course a very special case, and it seems to confuse the Puppet clients as well as Puppet's Dashboard, since the reports generated by the clients carry a date in the future. This can be totally confusing.

The regular runs of the Puppet clients also seem to get confused, since they cannot cope with those time jumps. There is no clear evidence for this and I don't know Puppet's source code, but it seemed to me that the clients sometimes simply stopped running at their regular intervals. So I had to find a solution, and I'd like to share my experiences here.

Keeping the clients alive
Since I saw the Puppet clients getting confused by the time jumps, I thought about whether there is a reasonable way to restart them. If I manage to restart them, no matter in which time they currently “live”, I ensure that at least on restart they will fetch their compiled catalogue from their master.

So I established a cron job – of course managed by Puppet itself – which restarts the client every two hours:

  $root_bin="/root/bin"
  $restart_script="puppet-restart.sh"
  $restart_path="${root_bin}/${restart_script}"

  file {"${root_bin}":
    ensure => "directory",
    owner => "root",
    group => "root",
    mode => "0755",
  }

  file { "${root_bin}/${restart_script}":
    ensure => "file",
    owner => "root",
    group => "root",
    mode => "0755",
    source => "puppet:///modules/puppet/${restart_script}",
    require =>  File["${root_bin}"],
  }

  cron { "restart-client":
    command => "${root_bin}/${restart_script}",
    hour => "*/2",
    minute => "0",
    user => "root",
    require => File["${root_bin}/${restart_script}"],
  }

The restart script triggered by the cron job looks like this:

#!/bin/bash
WAIT_TIME=`expr $RANDOM % 10`
logger "Restart of Puppet client scheduled. Will wait $WAIT_TIME minute(s) before."
sleep ${WAIT_TIME}m
service puppet restart

It was important that not all clients request their catalogues from the master at the same time, since the cron job fires at the same moment on every box. So I added a random wait before the actual restart takes place; in this case, the random wait time is between zero and nine minutes. I was inspired to do it this way by a post on Puppet's user list. Join this list if you're interested in all the important things about Puppet!
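As a side note, the delay does not have to be random. If you prefer every host to always wait the same, host-specific, amount of time, you can derive the offset from the hostname instead – a minimal sketch of the same restart script with a deterministic delay (the cksum-based hashing is just an arbitrary choice of mine):

#!/bin/bash
# Deterministic 0-9 minute offset derived from the hostname:
# the same host always waits the same time, different hosts usually differ.
WAIT_TIME=$(( $(hostname | cksum | cut -d' ' -f1) % 10 ))
logger "Restart of Puppet client scheduled. Will wait $WAIT_TIME minute(s) before."
sleep ${WAIT_TIME}m
service puppet restart

Either way works; the only thing that matters is that the catalogue requests triggered by the restarts are spread out instead of hitting the master in the same minute.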

It turned out that even if the “time-confused” clients do not run at their regular 30-minute interval, they will at least run every two hours – which is perfectly OK for us. So now we're able to keep them alive. But strange things may happen on any machine, and we want to be sure that they really are alive.


Observing the clients

Again, the problem was that just monitoring the reports generated by the clients is not reliable, since they report their (possibly shifted) time and not the real time. I found a good basis for our solution on this blog. The idea is to look at the Puppet master's log file, where every client request is listed. The date of every entry in that log file is generated on the Puppet master, so these dates are always real time – a good opportunity to measure how long ago each Puppet client last checked in.

The script I use is plain bash. It is executed regularly by Icinga on the Puppet master via NRPE (a sketch of the NRPE wiring follows after the walkthrough below). Let's look at it:

#!/bin/bash

LOGFILE=/var/log/puppet/masterhttp.log
MSG_CLIENTS_CRITICAL=""
MSG_CLIENTS_WARNING=""
TIME_CRITICAL=3600
TIME_WARNING=1800
CRITICAL_COUNT=0
CRITICAL_WARN=0
NOW=`date "+%s"`

# grant the nrpe user read access to the log file
sudo setfacl -m u:nrpe:r $LOGFILE

#
# Get the list of all nodes from Puppet's CA:
#
for node in `sudo /usr/sbin/puppetca -la | awk '/^\+/ {print $2}'`; do

LASTRUN=`grep $node $LOGFILE | tail -1 | awk '{ print $1 " " $2 }' | sed 's/\[//' | sed 's/\]//'`

# Check the time difference for this host if it
# appears in the log file at all. (Rolling logs!)
if [ -n "${LASTRUN}" ]; then

	LASTRUN=`date "+%s" -d "$LASTRUN"`
	TIMEDIFF=`expr $NOW - $LASTRUN`

	if [ $TIMEDIFF -gt $TIME_CRITICAL ]; then
		MSG_CLIENTS_CRITICAL+="${node}, "
		let "CRITICAL_COUNT += 1"
	elif [ $TIMEDIFF -gt $TIME_WARNING ]; then
		MSG_CLIENTS_WARNING+="${node}, "
		let "CRITICAL_WARN += 1"
	fi
fi

done


# Now evaluate the results.
# If critical hosts are present, the warning hosts are hidden (which is OK).
if [ $CRITICAL_COUNT -gt 0 ]; then
        echo "${CRITICAL_COUNT} Puppet client(s) not reporting for more than ${TIME_CRITICAL}s : $MSG_CLIENTS_CRITICAL"
        exit 2
elif [ $CRITICAL_WARN -gt 0 ]; then
        echo "${CRITICAL_WARN} Puppet client(s) not reporting for more than ${TIME_WARNING}s: $MSG_CLIENTS_WARNING"
        exit 1
else
        echo "All Puppet clients reporting as expected"
        exit 0
fi

Look at line 18. We use the puppetca -la command to get a list of all clients known to this master. Since the nrpe user is normally not allowed to read masterhttp.log, I grant it access using the setfacl command in line 13. Line 20 greps the last entry this host reported. It may happen that the log file has rolled, so line 24 checks whether the host actually appears in the file at all – otherwise hosts that simply dropped out of the rolled log would cause false alarms.
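One practical detail that is easy to forget: the script calls setfacl and puppetca via sudo, and NRPE executes it as the nrpe user, so this user needs passwordless sudo for exactly those two commands. A minimal sudoers sketch – the file name and the binary paths are assumptions and depend on your distribution:

# /etc/sudoers.d/nrpe-puppet (hypothetical file name)
nrpe ALL=(root) NOPASSWD: /usr/bin/setfacl, /usr/sbin/puppetca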

The actual calculation of the time difference happens in line 27, simply using the current system time and the host's last timestamp from the log file. We collect the names of all “late” hosts and count how many there are. When the for loop ends, the results are reported with the list of hosts and the appropriate exit code for NRPE.
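For completeness, this is roughly how the check is wired into Icinga via NRPE. The command name, script path and host name below are placeholders of mine, not taken verbatim from our setup:

# In nrpe.cfg on the Puppet master (script path and command name assumed):
#   command[check_puppet_clients]=/usr/local/bin/check_puppet_clients.sh
#
# From the Icinga server, the result is then fetched like any other NRPE check:
/usr/lib/nagios/plugins/check_nrpe -H puppetmaster.example.com -c check_puppet_clients

The exit codes 0, 1 and 2 produced by the script map to the OK, WARNING and CRITICAL states in Icinga/Nagios, which is why the script ends with exactly those values.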

Credits
Before establishing this solution, I of course had a look at some other resources on the web which inspired me, and you may find those monitoring resources useful as well.

Overall Concept
The Puppet check shown here is just one part of a larger concept we've introduced. Some more checks apply, and I will post another article describing those checks and giving a conceptual overview.
