Very recently, one of the platforms that we develop and host for a client was the subject of a take-down attempt via a DDOS (Distributed Denial of Service) attack – and as it later transpired – with the goal of extorting money. Thankfully, the attack was more “Made by Mattel” than “Made by MI5” – but this incident gave us a chance to revalidate and prove out the tools and processes to identify what happened – and more critically, put in place some measures to improve things in the event that the individual with the “my first hacker kit” came back for a second attempt.
At around 15:50, we noticed that our application server cluster CPU utilisation had gone through the roof. A quick message from one of our DevOps team confirmed that things didn’t quite look right. As you can see in the images below, despite the ‘maxed out’ CPU of the Auto-scale group, things actually continued to work pretty well. Target response time took a little hit, so queries to the platform took a little longer than we would have liked, but so far it was not the end of the world.
Although these kinds of things are a little unnerving and sometimes painful, we figured this was a great opportunity not only to test our playbook again, but also to take a moment to tell you about some of the things we have set up that make troubleshooting and resolving these scenarios a lot easier.
We develop nearly exclusively inside of AWS (disclaimer: other cloud providers are available) and as a result are big users of some of the out-of-the-box AWS tools like CloudWatch. When we start to deploy infrastructure, we also deploy a few lightweight monitoring elements. There are a number of standard metrics that we configure on our dashboards: for example, CPU utilisation, ELB/ALB active connection and request counts, target (application) response times, application service errors (4xx/5xx) and WAF counts (we’ll come to that later). Our monitoring setup again proved pretty useful and highlighted the problem immediately, letting us not only narrow activity down to date / time windows, services and subsystems, but also perform an initial impact analysis. Without this visibility, even that initial impact assessment would have been far more complex and time consuming.
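As a rough illustration of the kind of dashboard described above, here is a minimal Python sketch that builds a CloudWatch dashboard body covering those metrics. It is not our actual configuration: the function names, the Auto Scaling group name, the load balancer dimension value and the region are all made up for the example, and the widget layout is just one reasonable choice.

```python
import json


def metric_widget(title, metrics, region, x, y, stat="Average", period=60):
    """Build one CloudWatch 'metric' widget (helper name is hypothetical)."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": 12, "height": 6,
        "properties": {
            "title": title,
            "metrics": metrics,
            "stat": stat,
            "period": period,
            "region": region,
        },
    }


def dashboard_body(asg_name, alb_dimension, region):
    """Return a JSON dashboard body with CPU, traffic, latency and error widgets."""
    alb = "AWS/ApplicationELB"
    widgets = [
        metric_widget(
            "ASG CPU utilisation",
            [["AWS/EC2", "CPUUtilization", "AutoScalingGroupName", asg_name]],
            region, 0, 0),
        metric_widget(
            "ALB requests and active connections",
            [[alb, "RequestCount", "LoadBalancer", alb_dimension],
             [alb, "ActiveConnectionCount", "LoadBalancer", alb_dimension]],
            region, 12, 0, stat="Sum"),
        metric_widget(
            "Target response time",
            [[alb, "TargetResponseTime", "LoadBalancer", alb_dimension]],
            region, 0, 6),
        metric_widget(
            "Target 4xx / 5xx errors",
            [[alb, "HTTPCode_Target_4XX_Count", "LoadBalancer", alb_dimension],
             [alb, "HTTPCode_Target_5XX_Count", "LoadBalancer", alb_dimension]],
            region, 12, 6, stat="Sum"),
    ]
    return json.dumps({"widgets": widgets})


if __name__ == "__main__":
    # All three arguments are placeholder values, not real resources.
    print(dashboard_body("demo-asg", "app/demo-alb/0123456789abcdef", "eu-west-1"))
```

The resulting string could then be supplied as the DashboardBody when creating the dashboard, for example via boto3's CloudWatch put_dashboard call or an infrastructure-as-code template.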
Logging is generally enabled for all components we deploy as part of a project or ecosystem. Every log event is dumped into an S3 bucket and we leverage Athena to build tables against the log structures to support analysis (yes, Kinesis is perhaps a more streamlined way without the latency, but it’s also more complicated and in some projects it’s way more than we need).
Centralised logging gives us a way to quickly analyse traffic patterns and look for anything that stands out at an aggregated high level. For example – we can break down data by response code (Error / LimitExceeded etc) and compare this with IP counts and traffic that is passed through from CloudFront to the ALB.
A quick scan of traffic over the duration of the attack indicated that this was a pretty rudimentary DDOS attempt. The incredibly high request numbers from single IPs over a short period are evident, and represent way more traffic than the platform usually sees.
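We did this analysis in Athena, but the same breakdown can be sketched in a few lines of Python against raw ALB access log lines. This is only an illustration under some assumptions: it relies on the standard space-delimited ALB access log layout (client:port as the fourth field, ELB status code as the ninth), the function names are hypothetical, and the sample entries are invented and truncated after the user agent.

```python
import shlex
from collections import Counter

# Positions of the fields we need in an ALB access log entry:
# type, time, elb, client:port, target:port, three timings, elb_status_code, ...
CLIENT_FIELD = 3
ELB_STATUS_FIELD = 8


def summarise_alb_log(lines):
    """Count requests per client IP and per ELB status code."""
    per_ip, per_status = Counter(), Counter()
    for line in lines:
        try:
            # shlex keeps the quoted request and user-agent strings together
            fields = shlex.split(line)
        except ValueError:
            continue  # unbalanced quotes: treat the entry as malformed
        if len(fields) <= ELB_STATUS_FIELD:
            continue  # blank or truncated entry
        client_ip = fields[CLIENT_FIELD].rsplit(":", 1)[0]  # drop the port
        per_ip[client_ip] += 1
        per_status[fields[ELB_STATUS_FIELD]] += 1
    return per_ip, per_status


def noisy_clients(per_ip, threshold):
    """Return (ip, count) pairs at or above the threshold, busiest first."""
    return [(ip, n) for ip, n in per_ip.most_common() if n >= threshold]


def sample_entry(client, status, path):
    """Build an invented log entry for demonstration purposes only."""
    return ('https 2020-01-01T15:50:01.000000Z app/demo-alb/0123456789abcdef '
            f'{client} 10.0.0.5:80 0.001 0.250 0.000 {status} {status} 120 512 '
            f'"GET https://example.com:443{path} HTTP/1.1" "curl/7.58.0"')


SAMPLE_LOG = [
    sample_entry("203.0.113.10:40001", 200, "/a1?x=1"),
    sample_entry("203.0.113.10:40002", 404, "/b2?x=2"),
    sample_entry("203.0.113.10:40003", 404, "/c3?x=3"),
    sample_entry("198.51.100.7:51000", 200, "/"),
]

if __name__ == "__main__":
    per_ip, per_status = summarise_alb_log(SAMPLE_LOG)
    print(noisy_clients(per_ip, threshold=3))
    print(per_status.most_common())
```

The threshold is something you would tune against the platform's normal traffic; in an attack like this one, the offending IPs sit far above anything a legitimate client generates.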
The various IaaS and PaaS components in AWS generate different degrees of logging. For example, ALB and ELB logs do not capture referrer headers, but CloudFront logs do. However, in our DDOS scenario the individual(s) responsible were targeting the ALB directly, not the CloudFront entry point. To make further headway, what we needed to see was the referrer value, which is often a static string or URL structure and is typically consistent across a targeted DDOS. Not having it is a bit of a limitation of relying on the ALB logs in S3 and Athena alone. Cue WAF.
WAF – Web Application Firewall
WAF not only gives us the ability to restrict access, but critically also allows running in ‘silent’ mode. This means that rather than enabling it with default blocks, you can enable it passively to monitor. This generates a great set of data that you can analyse, showing a bunch of useful information like Source IP, URI, Match Rules (in this case DefaultAction, as we were in passive mode), Action (Block/Allow) and Time, as requests hit the edge of a CloudFront or Application Load Balancer instance.
The example below shows what we saw during the DDOS, with the common patterns that emerged. The URLs being requested seem to be spuriously generated, non-existent paths. This is a frequently employed tactic, as query strings often bypass any caching logic and go straight to the ALB / Instances, generating additional load.
It’s also worth noting that the referrer header was consistent in almost every line item over the period of high activity (https://www.google.com/search?q=domainaddress).
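To show how this kind of pattern can be pulled out of WAF logs, here is a small Python sketch that tallies referrer values, client IPs and actions. It assumes the logs are JSON records, one per line, with the request under httpRequest (clientIp, uri and a headers list of name/value pairs) and a top-level action, which is how we understand the AWS WAF log format; check the field names against your own logs. The function names and sample records are invented for illustration.

```python
import json
from collections import Counter


def header_value(record, name):
    """Return the named request header's value (case-insensitive), or None."""
    for header in record.get("httpRequest", {}).get("headers", []):
        if header.get("name", "").lower() == name.lower():
            return header.get("value")
    return None


def summarise_waf_log(lines):
    """Tally Referer values, client IPs and actions from JSON-lines WAF records."""
    referers, clients, actions = Counter(), Counter(), Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        request = record.get("httpRequest", {})
        referers[header_value(record, "referer") or "(none)"] += 1
        clients[request.get("clientIp", "(unknown)")] += 1
        actions[record.get("action", "(unknown)")] += 1
    return referers, clients, actions


def sample_record(client_ip, uri, referer=None):
    """Build an invented WAF record for demonstration purposes only."""
    headers = [{"name": "Host", "value": "example.com"}]
    if referer is not None:
        headers.append({"name": "Referer", "value": referer})
    return json.dumps({
        "action": "ALLOW",  # passive mode: nothing is blocked
        "httpRequest": {"clientIp": client_ip, "uri": uri, "headers": headers},
    })


SUSPECT_REFERER = "https://www.google.com/search?q=domainaddress"
SAMPLE_WAF_LOG = [
    sample_record("203.0.113.10", "/qwe123", SUSPECT_REFERER),
    sample_record("203.0.113.11", "/zxc456", SUSPECT_REFERER),
    sample_record("203.0.113.10", "/rty789", SUSPECT_REFERER),
    sample_record("198.51.100.7", "/"),
]

if __name__ == "__main__":
    referers, clients, actions = summarise_waf_log(SAMPLE_WAF_LOG)
    print(referers.most_common(3))
    print(clients.most_common(3))
```

A single referrer value dominating the tally, spread across a handful of very busy IPs, is exactly the sort of signature that can later be turned into a blocking rule.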
Regardless of whether you intend to use WAF to block, we’d generally suggest you run it in passive mode, giving you the ability to monitor activity. A quick look back every few days to check for traffic patterns will likely show that more is going on than you think.
Auto Scale Group Capping
As a final note, one thing we learned that relates not to monitoring but to impact and damage control is ASG capping. The DDOS that we experienced was designed to bypass caching logic and generate additional load. In a traditional uncapped ASG architecture, the result would potentially have been hundreds of instances spooling up and assigning themselves to the ALB to account for the increased CPU and additional latency in the response times.
In many customer environments – we hear that the key autoscale logic to spool up and down resources is in place – but when asked about caps or limits – many say “none”. “Why would we? We have unlimited resources with the cloud and just need to manage the scale down process correctly”.
This is wholly true – but in this scenario – had we not noticed when we did – and with no ASG limits imposed – those additional instances would have burned through some serious money.
An Auto Scale Group should have limits imposed that account for peak calculated traffic, even in scenarios like live sporting platforms, where rudimentary estimations can be made against subscriber numbers and historical trends. This capping ensures that you’ll maintain more than enough capacity for the peak demand, but not experience a nasty shock over your 8am coffee when you arrive at the office to 150 self-spawning instances.
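As a back-of-the-envelope illustration of how such a cap might be derived, the Python sketch below turns a rough peak-traffic estimate into a maximum instance count. Every number and name in it is hypothetical (the concurrent user count, the per-user request rate, the per-instance capacity and the 25% headroom); the real inputs would come from your own subscriber figures, historical trends and load tests.

```python
import math


def peak_rps(concurrent_users, requests_per_user_per_minute):
    """Rough peak requests per second from an audience estimate."""
    return concurrent_users * requests_per_user_per_minute / 60


def asg_max_size(peak_requests_per_sec, rps_per_instance, headroom=1.25, min_size=2):
    """Estimate an ASG cap: enough instances for peak load plus headroom."""
    if peak_requests_per_sec < 0 or rps_per_instance <= 0 or headroom < 1:
        raise ValueError("expected non-negative peak, positive capacity, headroom >= 1")
    needed = math.ceil(peak_requests_per_sec * headroom / rps_per_instance)
    return max(min_size, needed)  # never cap below the group's minimum size


if __name__ == "__main__":
    # Hypothetical live-event estimate: 60,000 concurrent viewers,
    # each making about 2 requests per minute, 150 requests/s per instance.
    estimate = peak_rps(60_000, 2)
    print(asg_max_size(estimate, rps_per_instance=150))
```

The resulting figure would be set as the group's maximum size, whether through the console, an infrastructure-as-code template or an API call such as boto3's update_auto_scaling_group. The point is less the exact formula than having a deliberate ceiling at all.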
There is a great quote that was attributed to a former employer of mine:
“There are two types of companies: those who have been hacked, and those who don’t yet know they have been hacked.”
I’m proud to say we’re now in the former group. I’ll await my badge and membership pack.
Chief Technology Officer.