//amin astaneh

Pushing the limits of DevOps and the Cloud

//about me

I’m a systems engineer with over 5 years of experience in the cloud and HPC space, where I’ve been responsible for environments ranging from several hundred to thousands of hosts. I currently work at Acquia where I’m building a world-class Operations team.

//contact at

amin@aminastaneh.net

Easy Multi-User S3 Policy

- -

I recently had to set up multiple users and buckets in the Amazon Simple Storage Service (S3) and I wanted an easy way to set up permissions. This IAM group policy does the following:

Group members can:

  • list all buckets;
  • have full access to buckets whose names are prefixed with their username (e.g. user amin can access amin-data, amin-backup, etc.);
  • not access any other buckets.

This achieves a homedir-style system with very little effort. I hope this helps someone!

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": ["s3:ListAllMyBuckets"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::*"]
    },
    {
      "Action": ["s3:ListBucket", "s3:GetBucketLocation", "s3:GetObject",
                 "s3:PutObject", "s3:DeleteObject",
                 "s3:GetBucketVersioning", "s3:PutBucketVersioning"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::${aws:username}*"]
    }
  ]
}
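If you use the AWS CLI, attaching the policy to a group is a one-liner. This is a minimal sketch: it assumes you've saved the policy above as /tmp/s3-homedir-policy.json, and the group name "s3-users" is my own invention, not something from the post.

```shell
# Attach the policy above to an IAM group. Assumes the policy is saved
# as /tmp/s3-homedir-policy.json; the group name "s3-users" is an assumption.
POLICY=/tmp/s3-homedir-policy.json
test -r "$POLICY" || exit 0          # nothing to do until the file exists
# Sanity-check the JSON before uploading
python3 -m json.tool "$POLICY" > /dev/null && echo "policy OK"

# Requires AWS credentials; uncomment to actually attach:
# aws iam put-group-policy --group-name s3-users \
#   --policy-name s3-homedir --policy-document "file://$POLICY"
```

Every member of the group then automatically gets their own bucket namespace, with no per-user policy to maintain.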

Real-Time Traffic Analysis 2: Varnish

- -

Varnish is a very powerful caching reverse proxy which features a configuration language (VCL) and tools to analyse traffic. I use it primarily to cache anonymous Drupal page requests so that a site can handle a massive spike. When your application is configured correctly to work with Varnish, you will soon find that the next major bottleneck to deal with is your server's Internet connection :-).

In the last post we went over how to use GoAccess to analyse Apache logs to find potentially malicious clients. Varnish can lend a hand towards this purpose as well, and can even be used to thwart an attack.

The primary tool I use for this is varnishtop, in particular to see incoming request headers.

root@server:~# varnishtop -i RxHeader

Example output, with me just running curl against my frontpage over and over:

list length 14                                                     aminastaneh.net

    14.86 RxHeader       Host: aminastaneh.net
    14.86 RxHeader       Accept: */*
    13.86 RxHeader       User-Agent: curl/7.29.0
     1.97 RxHeader       Server: nginx/0.7.65
     1.97 RxHeader       Content-Type: text/html
     1.97 RxHeader       Last-Modified: Sun, 28 Jul 2013 21:45:26 GMT
     1.97 RxHeader       Connection: keep-alive
     1.97 RxHeader       Accept-Ranges: bytes
     1.00 RxHeader       User-agent: Mozilla/5.0 (compatible; Ezooms/1.0; ezooms
     1.00 RxHeader       Accept-Charset: utf-8;q=0.7,iso-8859-1;q=0.2,*;q=0.1
     1.00 RxHeader       Date: Thu, 22 Aug 2013 01:47:30 GMT
     1.00 RxHeader       Content-Length: 11270
     0.98 RxHeader       Date: Thu, 22 Aug 2013 01:47:08 GMT
     0.98 RxHeader       Content-Length: 37999

From this output I can clearly see the user agents hitting the site. Now, assume a really unsophisticated script kiddie wants to take my site down. He just wants to generate a bunch of pageloads using some random script he downloaded off a site somewhere:

#!/bin/bash
# Hammer the target URL ($1) 100 times with a distinctive User-Agent.
for i in $(seq 1 100); do
  curl "$1" -H 'User-Agent: EvilBot 0.1 (Linux x86_64)'
done

Nagios wakes me up because my uptime is starting to suffer. Once I conclude that the problem is traffic-related and analyse what's coming in, I see this bubbling up to the top of the varnishtop output:

95.50 RxHeader       User-Agent: EvilBot 0.1 (Linux x86_64)

Sweet! Now I can use VCL to block the offending user-agent.

sub vcl_recv {
...
  if (req.http.User-Agent ~ "EvilBot") {
    error 404 "Not Found";
  }
...
}
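To verify a rule like this is live, you can replay the attacker's request yourself and check the status code (the hostname here is just my own site as an example):

```shell
# Replay the blocked User-Agent and print only the HTTP status code;
# once the VCL rule is active this comes back as 404.
curl -s -o /dev/null -w '%{http_code}\n' \
  -H 'User-Agent: EvilBot 0.1 (Linux x86_64)' http://aminastaneh.net/
```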

I reload Varnish, and the attacker's terrible script starts returning this:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
  <head>
    <title>404 Not Found</title>
  </head>
  <body>
    <h1>Error 404 Not Found</h1>
    <p>Not Found</p>
    <h3>Guru Meditation:</h3>
    <p>XID: 1958276348</p>
    <hr>
    <address>
       <a href="http://www.varnish-cache.org/">Varnish cache server</a>
    </address>
  </body>
</html>

Of course, this is a very simplistic example. However, I have used varnishtop and VCL-driven 404s in this fashion to great success against several patterns:

  • Accept-Language headers unique to a certain region
  • Obviously nonexistent URLs, protecting the application from needlessly running code only to spit out a 404
  • Videos embedded in a frontpage instead of using a streaming service

Some guidelines:

  • This technique shouldn't be used as a long-term fix, but it can be a lifesaver in the middle of an outage. Consider moving to a CDN if these issues become common.
  • Look for out-of-the-ordinary but prominent headers in incoming requests.
  • Be careful to not accidentally block legitimate traffic.

Got a question or an interesting Varnish-related story to tell? Please let me know in the comments!

Real-Time Traffic Analysis 1: GoAccess

- -

When dealing with performance problems in a web application, there's always the possibility that the root cause is malicious traffic. The difficulty lies in finding unique identifiers of the attack (addresses, HTTP headers, etc). This series of posts will go over tools and techniques that I've depended on when I needed to quickly isolate attackers and take steps to mitigate.

The first tool is GoAccess, which is an ncurses-based log analyser. I use it for NCSA log formats generated by Apache, Nginx, and Varnish web servers.

You can point it directly at a log file, which allows you to analyse the log as it is being written:

root@host:~# ./goaccess -f /var/log/nginx/access.log

You can even generate an html report by redirecting output to a file:

root@host:~# ./goaccess -f /var/log/nginx/access.log > report.html

Here’s an example report.

Key information to look at would be:

  • Top Connecting IPs: is there one IP, or many, generating most of the traffic?
  • Top Requested Files: are one or more pages being requested far more often than the rest?
  • HTTP Status Codes: is the stack surviving the traffic (200) or timing out (503)?
  • Top 404 requests: some web applications (e.g. Drupal) take a performance hit when emitting a lot of 404s

With this information, you will gain a clearer understanding of the traffic hitting a single host. If you need to analyse multiple hosts, you can use a parallel shell like pdsh to aggregate all webhead logs into a single file as they are written:

pdsh -w web1,web2,web3 "sudo tail -f /var/log/nginx/access.log" > /tmp/analysis.log

Then run goaccess against the file. I use this technique all the time against groups of webservers.
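When all you need is a quick list of the worst offenders, a plain shell pipeline over the same aggregated file works too. A minimal sketch, assuming the NCSA common/combined format where the status code is field 9:

```shell
# Count 503 responses per client IP in the aggregated log and show the
# top ten offenders (status code is field 9 in NCSA common/combined format).
LOG=/tmp/analysis.log
test -r "$LOG" || exit 0   # nothing aggregated yet
awk '$9 == 503 { hits[$1]++ } END { for (ip in hits) print hits[ip], ip }' "$LOG" \
  | sort -rn | head
```

Swap the `503` for `404`, or `$1` for `$7` (the request path), depending on what you're hunting.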

Next time we’ll go over the tool suite that the Varnish caching reverse proxy offers to isolate and block pesky crawlers and botnets. You are using Varnish, aren’t you? :D

Moved to Octopress

- -

I enjoyed blogging with Jekyll, but I had the biggest pain with the theme I manually ported from my WordPress days, not to mention that I never managed to get syntax highlighting to work. I gave Octopress a shot and had a lot of success with the FoxSlide theme, with a few cosmetic changes of course.

Anyway, enjoy the new look!

Fun With Command-Line SIP

- -

I've always wanted the ability to have a program call my cell phone and deliver an automated voice message, especially if something really goes awry in my infrastructure. Getting called and having to pick up the phone is far more effective than a page or text message, especially when you are sleeping.

The pjsip SIP library comes with a CLI tool, pjsua, that you can use to place calls. Using Google Translate for text-to-speech and sox to convert MP3 to WAV, I constructed the following:


#!/bin/bash

if [ "$#" -ne "2" ]; then
  echo "usage: $(basename $0) 'text message' phone_number"
  exit 1
fi

MESSAGE=$1
NUMBER=$2

# URL-encode spaces, fetch the spoken message, convert it, then place the call
URLMESSAGE=$( echo $MESSAGE | sed 's/ /+/g' )
wget -q -U Mozilla -O /tmp/message.mp3 \
"http://translate.google.com/translate_tts?ie=UTF-8&tl=en&q=$URLMESSAGE"
sox /tmp/message.mp3 /tmp/message.wav
( sleep 45; echo q ) | $HOME/bin/pjsua --config-file $HOME/.pjsuarc \
sip:$NUMBER@pbx.domain.com

And the config file:


--id sip:1234@pbx.domain.com
--registrar sip:pbx.domain.com
--username 1234
--password password
--realm asterisk
--null-audio
--auto-play
--play-file /tmp/message.wav

If you want to send longer messages, you can't use the Google API, since the message cannot exceed 100 characters. Use Festival instead:


echo $MESSAGE | text2wave -o /tmp/message.wav

Have fun! With such power comes great responsibility!

Loghist

- -

One of my current interests is creating tools for realtime traffic log analysis. I've used gltail and GoAccess, but I've recently been wanting to visualize the rate of incidence of certain events, like HTTP 503s or 404s. The following solution is a bit low-tech, but very flexible:


root@machine:~$ loghist /var/log/apache2/access.log 5 "HTTP/1.0\" 200"
2012/03/25 21:03:08 | 18 	==================
2012/03/25 21:03:13 | 30 	==============================
2012/03/25 21:03:18 | 26 	==========================
2012/03/25 21:03:23 | 17 	=================
2012/03/25 21:03:28 | 25 	=========================
2012/03/25 21:03:33 | 28 	============================
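The core idea can be sketched in a few lines of shell. This is a hypothetical reimplementation, not the actual loghist source: it buckets matching NCSA log lines per second (rather than per configurable interval) and draws an `=` bar for each bucket. The log path is illustrative.

```shell
# Hypothetical sketch of the idea behind loghist (not its actual source):
# count matching NCSA log lines per timestamp second, bar per second.
LOG=/var/log/apache2/access.log
test -r "$LOG" || exit 0
grep 'HTTP/1.0" 200' "$LOG" \
  | awk -F'[][]' '{ split($2, t, " "); print t[1] }' \
  | uniq -c \
  | awk '{ printf "%s | %d\t", $2, $1; for (i = 0; i < $1; i++) printf "="; print "" }'
```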

You can get loghist here. Enjoy!

Nagios Voice Alerts Using Gmail and Android

- -

Many sysadmins use Nagios; it's a way of life. I've been using it for over 4 years now and it's a very useful tool for host/service monitoring. The one drawback is that when critical notifications hit my phone, a single sound plays regardless of the type or content of the notification. A disk capacity alert or a server outage makes the same noise, which means I need to check my phone in order to actually see what the alert is. If your Nagios installation is noisy because of a very large number of checked services, and not all criticals are in fact actionable alerts, this can be very aggravating.

Since v2.3.5, Gmail on Android can synchronize specific labels to your phone and play a separate notification for each label. Therefore, if you craft a set of Gmail filters to classify your Nagios notifications into categories via labels, you can play a different sound for each alert type.

You can take it a step further by creating MP3 files using this Google Translate URL for each alert type, then assigning each one to play for the corresponding label notification.

Now when you get paged in the middle of the night, you know exactly what’s going on, not just that something is wrong.


Pdsh - a Sysadmin’s Secret Weapon

- -

In my experience working with the computing grid and the cloud, the ability to run commands across a large set of servers becomes quite necessary. From forcing a Puppet run to gathering hardware statistics, these tasks become non-trivial and even painful when your server count climbs into the hundreds and beyond. There comes a point where Bash loops will no longer suffice.

I spent a bit of time researching various solutions and came across Parallel Distributed Shell (pdsh), an open-source project from the Lawrence Livermore National Laboratory. It is available in most Linux distributions and can easily be compiled from source otherwise. It allows you to run commands on remote hosts in parallel, expressing a hostgroup via an external library such as libgenders.

I highly encourage taking a look at the documentation and seeing how powerful this little tool is: http://code.google.com/p/pdsh/wiki/UsingPDSH

I use it heavily in the operations team at Acquia, and it has served me extremely well when I'm in a tight spot and need to run a command across a large set of servers quickly. A quick usage tip: I tend to run pdsh with this environment variable set, since servers are commonly relaunched in a cloud environment and I don't want to deal with my SSH known_hosts file being inaccurate:


PDSH_SSH_ARGS_APPEND="-q -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o PreferredAuthentications=publickey"
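Another trick worth knowing: dshbak, which ships with pdsh, folds identical per-host output together so a hundred matching answers read as one block. The host list here is illustrative:

```shell
# Run a command across three hosts and coalesce identical output with
# dshbak -c, so hosts that answer the same thing are grouped together.
pdsh -w web1,web2,web3 'cat /proc/loadavg' | dshbak -c
```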

Ask me any questions about pdsh in the comments!

Nagios Error 127- an Unusual Solution

- -

I post this to benefit any Nagios admin who struggles to determine why their check and notification plugins fail with error code 127, like this one:


Aug 12 07:01:39 nagios_host nagios: Warning: Return code of 127 for check of service 'HTTP' on host 'examplehost' was out of bounds. Make sure the plugin you're trying to run actually exists.

That error message gives sound advice for the 99% case. However, a misconfigured commands.cfg file is not the only cause of a 127 error. After Googling and searching through many support forums, comparing working configs to my non-working one, setting LD_LIBRARY_PATH in my Nagios init script, and even writing test plugins, I found a big clue when I decided to strace the Nagios daemon looking for execve/execvp calls:


[pid 23350] execve("/bin/sh", ["sh", "-c", "/bin/echo -e \"***** Nagios *****\\n\\nNotification Type: PROBLEM\\n\\nService: Puppet Client\\nHost: web-001 \\nAddress: 192.168.1.2\\nState: UNKNOWN\\nLast State: UNKNOWN\\n\\nDate/Time: Sun May 22 18:56:28 UTC 2011\\n\\nAdditional Info:\\n\\nNRPE: Unable to read output\" | /usr/bin/mail -s \"** PROBLEM Service Alert: web-001/Puppet Client is UNKNOWN **\" -a \"Reply-to: alerts@host.com\" alerts@host.com"], [/* 210 vars */]) = -1 E2BIG (Argument list too long)

Aha! It wasn't the command definition after all; it was something entirely different. After Googling for 'nagios E2BIG', I discovered that for large installations the config option 'enable_environment_macros' needs to be disabled, otherwise this condition occurs. At any rate, Nagios should handle this particular error condition with a much more informative error message. Please let me know in the comments if this helps!
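For reference, the fix described above is a one-line change in nagios.cfg:

```
# nagios.cfg: stop exporting every macro into the environment of each
# check and notification command (the source of the E2BIG above)
enable_environment_macros=0
```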

A Season of Change

- -

When I studied at USF, the question ‘What company do you want to work for when you graduate?’ was always answered with ‘For an open-source company.’ Two years after graduation, I have finally managed to achieve my goal. I have moved from the comfortable weather of the Tampa Bay area to Woburn, Massachusetts to work for Acquia, a startup that focuses entirely on Drupal, just as Red Hat focuses on its flavor of Linux. I have been hired as a systems engineer, and I am really excited to apply my experience in HPC towards their Amazon EC2-based hosting platform. As I’ve discussed with colleagues in the past, HPC and cloud computing methodologies are starting to merge together, so I should be able to fall into place there just fine. I’ve already started to adapt tools for managing their servers as well as write various Nagios plugins for monitoring their infrastructure.

The weather here in Woburn is starkly different; we had blizzard conditions often and sometimes had to work from home. It's odd shoveling snow and scraping frost off car windows as part of the morning commute. The benefits of living here outweigh the disadvantages, though: there is an Italian bakery, a Thai-Vietnamese restaurant, a steakhouse, TWO Chinese restaurants, a Japanese restaurant, and a Brazilian restaurant all within walking distance of my apartment. If I wanted to drive, there's a really nice sushi place in Stoneham and an H-Mart in Burlington. It might be cold, but my tastebuds won't be bored anytime soon.

EDIT: Due to a combination of circumstances, I recently moved from Woburn to Burlington, which is the next town over. It’s a mile away from H-Mart. I couldn’t get enough of the place, I guess.