Nagios Voice Alerts Using Gmail and Android

05 Jan 2012

Many sysadmins use Nagios; it's a way of life. I've been using it for over 4 years now and it's a very useful tool for host/service monitoring. Its one drawback is that when critical notifications hit my phone, a single sound plays regardless of the type or content of the notification. A disk capacity alert and a server outage make the same noise, which means I need to check my phone to see what the alert actually is. If your Nagios installation is noisy because it checks a large number of services, and not all criticals are in fact actionable alerts, this can be very aggravating.

Since v2.3.5, Gmail on Android can synchronize specific tags (Gmail calls them labels) to your phone and play a separate notification sound for each one. Therefore, if you craft a set of Gmail filters to classify your Nagios notifications into categories via tags, you can play a different sound for each alert type.

You can take it a step further by creating an mp3 file for each type using this Google Translate URL, then assigning each file to play for the corresponding tag notification. For example:

Now when you get paged in the middle of the night, you know exactly what's going on, not just that something is wrong.


pdsh - A Sysadmin's Secret Weapon

27 Jul 2011

In my experience working with the computing grid and the cloud, the ability to run commands across a large set of servers becomes a necessity. From forcing a puppet run to gathering hardware statistics, these tasks become non-trivial and even painful when your server count mounts into the hundreds and beyond. There comes a point where Bash loops will no longer suffice.

I spent a bit of time researching various solutions and came across Parallel Distributed Shell (pdsh), an open-source project from the Lawrence Livermore National Laboratory. It is available in most Linux distributions, and can easily be compiled from source otherwise. It lets you run commands on remote hosts in parallel, with the group of target hosts expressed via an external library such as libgenders.

I highly encourage taking a look at the documentation and seeing how powerful this little tool is: http://code.google.com/p/pdsh/wiki/UsingPDSH

I use it heavily in the operations team at Acquia, and it has served me extremely well when I'm in a tight spot and need to run a command across a large set of servers quickly. A quick tip on usage: I tend to run pdsh with this environment variable setting, especially since servers are commonly relaunched in a cloud environment, and I don't want to deal with my SSH known_hosts file being inaccurate:


PDSH_SSH_ARGS_APPEND="-q -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o PreferredAuthentications=publickey"
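As a rough sketch of how that setting might be put to use, the script below exports it and assembles a pdsh command line. The host range web-[001-003], the uptime command, and the HOSTS, REMOTE_CMD, and RUN variables are all assumptions made for illustration; by default the script only prints the command rather than contacting any hosts.

```shell
#!/bin/bash
# Sketch: export the SSH options from above and assemble a pdsh command.
# HOSTS, REMOTE_CMD and RUN are hypothetical knobs for this example only.
export PDSH_RCMD_TYPE=ssh
export PDSH_SSH_ARGS_APPEND="-q -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o PreferredAuthentications=publickey"

HOSTS="${HOSTS:-web-[001-003]}"      # hypothetical host range, expanded by pdsh itself
REMOTE_CMD="${REMOTE_CMD:-uptime}"   # command to run on every host

# -w takes an explicit host list
cmd=(pdsh -w "$HOSTS" "$REMOTE_CMD")

if [ "${RUN:-0}" = "1" ]; then
    # dshbak -c groups hosts that returned identical output
    "${cmd[@]}" | dshbak -c
else
    # Dry run: show what would be executed without touching any server
    echo "would run: ${cmd[*]}"
fi
```

Setting RUN=1 executes the command for real. If your pdsh build includes the genders module, the -w host list can be swapped for -g with a genders attribute.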

Ask me any questions about pdsh in the comments!



Nagios Error 127- An Unusual Solution

22 May 2011

I post this to benefit any Nagios admin who struggles to determine why their check and notification plugins fail with an error code 127, like this one:


Aug 12 07:01:39 nagios_host nagios: Warning: Return code of 127 for check of service 'HTTP' on host 'examplehost' was out of bounds. Make sure the plugin you're trying to run actually exists.

That error message gives sound advice for the 99% case. However, a misconfigured commands.cfg file is not the only cause of a 127 error message. After Googling and searching through many support forums, comparing working configs to my non-working one, setting LD_LIBRARY_PATH in my Nagios init script, and even writing test plugins, I found a big clue when I decided to strace the Nagios daemon looking for execve/execvp calls:


[pid 23350] execve("/bin/sh", ["sh", "-c", "/bin/echo -e \"***** Nagios *****\\n\\nNotification Type: PROBLEM\\n\\nService: Puppet Client\\nHost: web-001 \\nAddress: 192.168.1.2\\nState: UNKNOWN\\nLast State: UNKNOWN\\n\\nDate/Time: Sun May 22 18:56:28 UTC 2011\\n\\nAdditional Info:\\n\\nNRPE: Unable to read output\" | /usr/bin/mail -s \"** PROBLEM Service Alert: web-001/Puppet Client is UNKNOWN **\" -a \"Reply-to: alerts@host.com\" alerts@host.com"], [/* 210 vars */]) = -1 E2BIG (Argument list too long)

Aha! It wasn't the command definition after all, it was something entirely different! After Googling for 'nagios E2BIG', I discovered that for large installations, the config option 'enable_environment_macros' needed to be disabled, otherwise this condition would occur. At any rate, Nagios should handle this particular error condition with a much more informative error message. Please let me know in the comments if this helps!
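To check whether an installation is exposed to this, a small script along these lines could report the current setting. This is only a sketch: the default config path and the macro_setting helper name are assumptions, and the script reports the last uncommented assignment it finds in the file.

```shell
#!/bin/bash
# Sketch: report whether environment macros are enabled in a Nagios main config.
# The default path below is an assumption; pass your own as the first argument.

macro_setting() {
    local cfg="$1"
    # Print the value of the last uncommented enable_environment_macros line, if any
    sed -n 's/^[[:space:]]*enable_environment_macros[[:space:]]*=[[:space:]]*\([01]\).*$/\1/p' "$cfg" | tail -n 1
}

cfg="${1:-/usr/local/nagios/etc/nagios.cfg}"
if [ -r "$cfg" ]; then
    case "$(macro_setting "$cfg")" in
        1) echo "enable_environment_macros=1: macros are exported to every check and notification" ;;
        0) echo "enable_environment_macros=0: macros are not exported" ;;
        *) echo "enable_environment_macros not set in $cfg; the compiled-in default applies" ;;
    esac
fi

# The limit execve() enforces on arguments plus environment, in bytes
echo "ARG_MAX: $(getconf ARG_MAX)"
```

If the option is enabled and E2BIG shows up in strace output, setting enable_environment_macros=0 in the main config file and restarting Nagios is the change described above.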



A Season of Change

05 Feb 2011

When I studied at USF, the question 'What company do you want to work for when you graduate?' was always answered with 'For an open-source company.' Two years after graduation, I have finally managed to achieve my goal. I have moved from the comfortable weather of the Tampa Bay area to Woburn, Massachusetts to work for Acquia, a startup that focuses entirely on Drupal, just as Red Hat focuses on its flavor of Linux. I have been hired as a systems engineer, and I am really excited to apply my experience in HPC towards their Amazon EC2-based hosting platform. As I've discussed with colleagues in the past, HPC and cloud computing methodologies are starting to merge together, so I should be able to fall into place there just fine. I've already started to adapt tools for managing their servers as well as write various Nagios plugins for monitoring their infrastructure.

The weather here in Woburn is starkly different; we had blizzard conditions often and had to work from home sometimes. It's odd shoveling snow and scraping frost off of car windows as part of the morning commute. The benefits of living here outweigh the disadvantages though: there is an Italian bakery, a Thai-Vietnamese restaurant, a steakhouse, TWO Chinese restaurants, a Japanese restaurant, and a Brazilian restaurant all within walking distance of my apartment. If I wanted to drive, there's a really nice sushi place in Stoneham and an H-Mart in Burlington. It might be cold, but my tastebuds won't be bored anytime soon.

EDIT: Due to a combination of circumstances, I recently moved from Woburn to Burlington, which is the next town over. It's a mile away from H-Mart. I couldn't get enough of the place, I guess.



SGE Arraytasks and Matlab

29 Nov 2010

Recently, users of CIRCE have needed to take a set of input values and iterate over them in parallel in Matlab to save time. Most people would consider using the Parallel Computing Toolbox, but that would require modifying code, as well as a license for the toolbox itself. I have an alternative that lets users run their Matlab code in parallel with minimal modification, in conjunction with an arraytask in Sun Grid Engine. I'm sure other schedulers support arraytasks, so this can be adapted to whatever scheduler is in production.

To review: an arraytask is a type of HPC job that allows you to run a piece of software multiple times simultaneously, each time with a different set of inputs. This is usually done for compute tasks that are embarrassingly parallel in nature. Here's an example of a simple arraytask in SGE:


#!/bin/bash
#$ -N my_app_array_run
#$ -o output.$JOB_ID
#$ -cwd
#$ -pe smp 1
#$ -l h_rt=00:45:00
#$ -t 1-48

./my_app inputfile.$SGE_TASK_ID

This creates 48 tasks, each running the program ./my_app with a different argument, from inputfile.1 through inputfile.48. Now, how do we use this tool with Matlab?

Environment variables. First, use a job script like this one to select the input for each task and export it to the environment:


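#!/bin/bash
#$ -cwd
#$ -l h_rt=1:00:00
#$ -j y
#$ -N matlab_arraytask
#$ -o output.$JOB_ID
# HOW MANY TASKS? (give a range; "-t 6" alone would run only task 6)
#$ -t 1-6

task=1
# SPECIFY A LIST OF INPUTS EQUAL TO NUMBER OF TASKS
for i in 1 2 3 4 5 6; do
      # Keep the input whose position in the list matches this task's ID
      if [[ "$task" -eq "$SGE_TASK_ID" ]]; then
        INPUTVALUE=$i
      fi
      task=$((task + 1))
done

# Make the selected input visible to Matlab via getenv
export INPUTVALUE

# Replace "function" with the name of your own Matlab function or script
matlab -nodisplay -r function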

Next, add this to your Matlab code to convert the environment variable into a Matlab variable:


inputvalue = str2num(getenv('INPUTVALUE'));

With this, and sufficient computing resources, you can cut the time needed to run all of your inputs to roughly the time needed to run just one.
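The input-selection loop in the job script can also be written as a bash array lookup, which some may find easier to extend. The following is a sketch of that alternative, not the approach used above: the input values and the pick_input helper are made up for illustration, and it assumes task IDs start at 1 (as with -t 1-6).

```shell
#!/bin/bash
# Sketch: pick this task's input by indexing a bash array with SGE_TASK_ID.
# The values below are placeholders; list one per array task.
inputs=(0.1 0.5 1.0 2.5 5.0 10.0)

pick_input() {
    local task_id="$1"
    # Task IDs are assumed to start at 1; bash arrays start at 0
    if (( task_id < 1 || task_id > ${#inputs[@]} )); then
        echo "task id $task_id is outside 1-${#inputs[@]}" >&2
        return 1
    fi
    echo "${inputs[$((task_id - 1))]}"
}

# Outside of SGE the variable is unset, so default to task 1 for local testing
INPUTVALUE=$(pick_input "${SGE_TASK_ID:-1}") || exit 1
export INPUTVALUE
echo "task ${SGE_TASK_ID:-1} uses INPUTVALUE=$INPUTVALUE"
```

Running the script by hand with SGE_TASK_ID set to different values is a quick way to confirm the mapping before submitting the job with qsub.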


