I’m a systems engineer with over 5 years of experience in the cloud and HPC space, where I’ve been responsible for environments ranging from several hundred to thousands of hosts. I currently work at Acquia where I’m building a world-class Operations team.
Varnish is a very powerful caching reverse proxy that features its own configuration language (VCL) and tools for analysing traffic. I use it primarily to cache anonymous Drupal page requests so that a site can absorb a massive traffic spike. Once your application is configured correctly to work with Varnish, you will soon find that the next major bottleneck to deal with is your server's Internet connection :-).
In the last post we went over how to use GoAccess to analyse Apache logs for potentially malicious clients. Varnish can lend a hand here as well, and can even be used to thwart an attack.
The primary tool I use for this is varnishtop, in particular to see incoming request headers.
root@server:~# varnishtop -i RxHeader
An example of the output while running curl against my front page over and over again:
With this information I can see the user agent most prominently displayed. Now, assume a really unsophisticated script kiddie wants to take my site down. He just wants to generate a bunch of page loads using some random script he downloaded off a site somewhere:
for i in `seq 1 100`; do
    curl "$1" -H 'User-Agent: EvilBot 0.1 (Linux x86_64)'
done
Nagios wakes me up because my uptime is starting to suffer. Once I conclude that this is a traffic-level problem and analyse what's coming in, I will see this bubbling up to the top of the varnishtop output:
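Once the offending header is identified, Varnish can do the thwarting itself. The original VCL from this post isn't preserved here, but a minimal sketch in Varnish 3 syntax (matching the RxHeader tag above) that rejects the made-up EvilBot agent might look like:

```
sub vcl_recv {
    # Reject the hypothetical EvilBot agent at the edge,
    # before the request ever reaches Drupal
    if (req.http.User-Agent ~ "EvilBot") {
        error 403 "Forbidden";
    }
}
```

A regex against a single header is trivially evaded, but against an unsophisticated attacker it buys breathing room at near-zero cost.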
When dealing with performance problems in a web application, there's always the possibility that the root cause is malicious traffic. The difficulty lies in finding unique identifiers of the attack (addresses, HTTP headers, etc.). This series of posts will go over tools and techniques that I've depended on when I needed to quickly isolate attackers and take steps to mitigate.
The first tool is GoAccess, which is an ncurses-based log analyser. I use it for NCSA log formats generated by Apache, Nginx, and Varnish web servers.
You can point it directly at a log file, which lets you analyse the log as it is being written:
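A minimal invocation looks like this (the path is an example for Apache on Debian-family systems):

```
goaccess -f /var/log/apache2/access.log
```

GoAccess keeps reading as the server appends to the file, so the report stays current while the traffic is still arriving. A few panels are worth immediate attention: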
Top Connecting IPs: is one IP, or many, generating most of the traffic?
Top Requested Files: is there a page or set of pages being requested far more than the rest?
HTTP Status Codes: is the stack surviving the traffic (200) or timing out (503)?
Top 404 Requests: some web applications (e.g. Drupal) take a performance hit when serving a lot of 404s
With this information, you will gain a clearer understanding of the traffic hitting a single host. If you need to analyse multiple hosts, you can use a parallel shell like pdsh to stream all webhead logs into a single file as they are written:
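The aggregation command itself isn't preserved above; a sketch of the idea (hostnames and log paths are examples) might be:

```
# Stream every webhead's access log into one local file. pdsh prefixes
# each line with "hostname: ", so strip that before GoAccess parses it.
pdsh -w 'web[01-20]' 'tail -f /var/log/apache2/access.log' \
  | sed -u 's/^[^:]*: //' > /tmp/webheads.log &
goaccess -f /tmp/webheads.log
```

This gives a fleet-wide view from one terminal, at the cost of the aggregate file growing for as long as the tails run.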
I enjoyed blogging with Jekyll; however, the theme I had manually ported from my WordPress days was a constant pain, and I never managed to get syntax highlighting to work. I gave Octopress a shot and had a lot of success with the FoxSlide theme, with a few cosmetic changes of course.
I always wanted the ability to have a program call my cell phone and deliver an automated voice message, especially when something really goes awry in my infrastructure. Getting called and having to pick up the phone beats getting a page or text message, especially when you are asleep.
The pjsip SIP library comes with a CLI tool, pjsua, that you can use to place calls. Using Google Translate for text-to-speech and sox to convert the MP3 to WAV, I constructed the following:
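The original script isn't reproduced above, so the following is a sketch of how those pieces fit together. The TTS URL, SIP URI, and file paths are assumptions (and the unauthenticated translate_tts endpoint has since been locked down):

```shell
# Hypothetical call-out script; a reconstruction, not the post's original.
cat > /tmp/alertcall.sh <<'EOF'
#!/bin/sh
# Usage: alertcall.sh "message text" 15555550100@sip.example.com
MSG=$(printf '%s' "$1" | sed 's/ /%20/g')
# Fetch spoken audio from Google Translate's (then-public) TTS endpoint
curl -s "http://translate.google.com/translate_tts?tl=en&q=$MSG" -o /tmp/alert.mp3
# pjsua plays WAV only: downsample to 8 kHz mono telephone audio
sox /tmp/alert.mp3 -r 8000 -c 1 /tmp/alert.wav
# Place the call and play the message into it, hanging up after 30s
pjsua --play-file=/tmp/alert.wav --auto-play --duration=30 "sip:$2"
EOF
chmod +x /tmp/alertcall.sh
```

Hooked into a Nagios notification command, this turns any critical alert into a phone call you can't sleep through.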
One of my current interests is building tools for realtime traffic log analysis. I've used gltail and GoAccess, but recently I've wanted to visualise the rate of incidence of certain events, like HTTP 503s or 404s. The following solution is a bit low-tech, but very flexible:
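The original pipeline isn't preserved above; one low-tech approach in that spirit counts matching status codes per second. The sample log lines below are fabricated for demonstration; in production you would feed the pipeline `tail -f` on the live log instead:

```shell
# Made-up combined-format log lines, just to have something to count
printf '%s\n' \
  '10.0.0.1 - - [12/Aug/2013:07:01:39 +0000] "GET / HTTP/1.1" 503 212' \
  '10.0.0.2 - - [12/Aug/2013:07:01:39 +0000] "GET /a HTTP/1.1" 200 4096' \
  '10.0.0.3 - - [12/Aug/2013:07:01:40 +0000] "GET /b HTTP/1.1" 503 212' \
  > /tmp/sample.log

# $9 is the status code, $4 the bracketed timestamp (second resolution).
# Note: the END summary only fires on a finished file; on a live
# `tail -f` stream you would print counts as each second rolls over.
awk '$9 == 503 { n[$4]++ } END { for (t in n) print t, n[t] }' /tmp/sample.log | sort
```

Because it's just awk, swapping 503 for 404, or the timestamp for the requested path, is a one-character edit away.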
Many sysadmins use Nagios; it's a way of life. I've been using it for over 4 years now, and it's a very useful tool for host/service monitoring. The one drawback is that when critical notifications hit my phone, a single sound plays regardless of the type or content of the notification. A disk capacity alert and a server outage make the same noise, which means I need to check my phone to actually see what the alert is. If your Nagios installation is noisy because of a very large number of checked services, and not all criticals are in fact actionable, this can be very aggravating.
Since v2.3.5, Gmail on Android can synchronize specific labels to your phone and play a separate notification sound for each label. Therefore, if you craft a set of Gmail filters to classify your Nagios notifications into categories via labels, you can play a different sound for each alert type.
You can take it a step further by creating MP3 files using Google Translate's text-to-speech URL for each type, then assigning each one to play for the corresponding label's notification. For example:
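Something along these lines could generate the audio files. The category names here are examples, and the unauthenticated translate_tts endpoint shown worked at the time but has since been restricted:

```shell
# Build a spoken-alert URL per category via Google Translate TTS;
# this endpoint required no API key when this was written.
for category in "disk capacity warning" "host down"; do
  q=$(printf '%s' "$category" | sed 's/ /%20/g')
  url="http://translate.google.com/translate_tts?tl=en&q=$q"
  echo "$url"
  # fetch with: curl -s -A 'Mozilla/5.0' -o "$q.mp3" "$url"
done
```

Load the resulting MP3s onto the phone and map one to each label's notification sound.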
In my experience working with the computing grid and the cloud, the ability to run commands across a large set of servers becomes a necessity. From forcing a Puppet run to gathering hardware statistics, these tasks become non-trivial and even painful when your server count mounts into the hundreds and beyond. There comes a point where Bash loops no longer suffice.
I spent a bit of time researching various solutions and came across the Parallel Distributed Shell (pdsh), an open-source project from the Lawrence Livermore National Laboratory. It is available in most Linux distributions and can easily be compiled from source otherwise. It allows you to run commands on remote hosts in parallel, expressing hostgroups via an external library such as libgenders.
I use it heavily on the operations team at Acquia, and it has served me extremely well when I'm in a tight spot and need to run a command across a large set of servers quickly. A quick usage tip: I tend to run pdsh with this environment variable setting, since servers are commonly relaunched in a cloud environment and I don't want to deal with my SSH known_hosts file being inaccurate:
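The exact value isn't preserved above; the likely candidate is pdsh's pass-through variable for extra ssh options, which stops stale host keys from aborting the run:

```shell
# Assumed reconstruction: append options to the ssh commands pdsh spawns,
# so relaunched cloud instances with new host keys don't cause failures.
export PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null"
# then, for example: pdsh -w 'web[01-20]' 'uptime'
```

Disabling host-key checking trades away a MITM safeguard, so it's a judgment call that makes most sense inside a private network where hosts churn constantly.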
I post this to benefit any Nagios admin who struggles to determine why their check and notification plugins fail with error code 127, like this one:
Aug 12 07:01:39 nagios_host nagios: Warning: Return code of 127 for check of service 'HTTP' on host 'examplehost' was out of bounds. Make sure the plugin you're trying to run actually exists.
That error message gives sound advice for the 99% case. However, a misconfigured commands.cfg file is not the only cause of a 127 error. After Googling, searching through many support forums, comparing working configs to my non-working one, setting LD_LIBRARY_PATH in my Nagios init script, and even writing test plugins, I found a big clue when I decided to strace the Nagios daemon looking for execve/execvp calls:
[pid 23350] execve("/bin/sh", ["sh", "-c", "/bin/echo -e \"***** Nagios *****\\n\\nNotification Type: PROBLEM\\n\\nService: Puppet Client\\nHost: web-001 \\nAddress: 192.168.1.2\\nState: UNKNOWN\\nLast State: UNKNOWN\\n\\nDate/Time: Sun May 22 18:56:28 UTC 2011\\n\\nAdditional Info:\\n\\nNRPE: Unable to read output\" | /usr/bin/mail -s \"** PROBLEM Service Alert: web-001/Puppet Client is UNKNOWN **\" -a \"Reply-to: firstname.lastname@example.org\" email@example.com"], [/* 210 vars */]) = -1 E2BIG (Argument list too long)
Aha! It wasn't the command definition after all; it was something entirely different. After Googling for 'nagios E2BIG', I discovered that for large installations the config option 'enable_environment_macros' needs to be disabled. With it enabled, Nagios exports every macro into the environment of each command it forks; on a large installation the combined size of the arguments and environment exceeds the kernel's limit, and execve fails with E2BIG. At any rate, Nagios should report this particular error condition with a much more informative message. Please let me know in the comments if this helps!
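Assuming the stock configuration layout, the fix is a one-line change in nagios.cfg, followed by a restart:

```
# nagios.cfg: don't export all macros into every forked command's
# environment; on large installs this overflows the execve size limit
enable_environment_macros=0
```

Plugins that relied on environment macros will need their arguments passed explicitly in the command definition instead.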
When I studied at USF, the question ‘What company do you want to work for when you graduate?’ was always answered with ‘For an open-source company.’ Two years after graduation, I have finally managed to achieve my goal. I have moved from the comfortable weather of the Tampa Bay area to Woburn, Massachusetts to work for Acquia, a startup that focuses entirely on Drupal, just as Red Hat focuses on its flavor of Linux. I have been hired as a systems engineer, and I am really excited to apply my experience in HPC towards their Amazon EC2-based hosting platform. As I’ve discussed with colleagues in the past, HPC and cloud computing methodologies are starting to merge together, so I should be able to fall into place there just fine. I’ve already started to adapt tools for managing their servers as well as write various Nagios plugins for monitoring their infrastructure.
The weather here in Woburn is starkly different; we had blizzard conditions often and had to work from home sometimes. It's odd shoveling snow and scraping frost off of car windows as part of the morning commute. The benefits of living here outweigh the disadvantages, though: there is an Italian bakery, a Thai-Vietnamese restaurant, a steakhouse, TWO Chinese restaurants, a Japanese restaurant, and a Brazilian restaurant all within walking distance of my apartment. If I wanted to drive, there's a really nice sushi place in Stoneham and an H-Mart in Burlington. It might be cold, but my tastebuds won't be bored anytime soon.
EDIT: Due to a combination of circumstances, I recently moved from Woburn to Burlington, which is the next town over. It’s a mile away from H-Mart. I couldn’t get enough of the place, I guess.