
I Accidentally Sent 7 Million SQS Messages

A few nights ago I woke up to an email I had never received before: an alert from Amazon SQS informing me that I had used up my 1 million free monthly SQS requests. I thought that was weird and jumped onto AWS, assuming Amazon had gotten confused. Instead, I found my worker instance pegged at 97% CPU, out of memory, and processing ~13,000 SQS messages every five minutes.

I restarted my worker instance, purged my SQS queue (which only had 7 messages in it) and then shut down my SQS logic. (I have a flag in my application code that lets me “pause” publishing to SQS.)
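For the curious, that pause flag is nothing fancy: it’s just a guard in front of every publish call. Here’s a minimal sketch in Python with boto3, with invented names throughout (the real flag could just as easily live in a database-backed feature flag instead of an environment variable):

```python
import os
import boto3

# Minimal sketch of an SQS publish "kill switch." All names here are
# invented for illustration; this is not Linkidex's actual code.
sqs = boto3.client("sqs")
QUEUE_URL = os.environ["BATCH_QUEUE_URL"]

def sqs_publishing_enabled() -> bool:
    # An env var keeps the sketch self-contained; a DB-backed flag
    # would let you flip it without restarting the worker.
    return os.environ.get("SQS_PUBLISHING_ENABLED", "true") == "true"

def publish(message_body: str) -> None:
    if not sqs_publishing_enabled():
        return  # "paused": skip the publish entirely
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=message_body)
```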

Then I looked into billing. Luckily, SQS messages are less than $0.50 per million. With some confidence that I hadn’t just bankrupted myself, I started investigating.

What Actually Happened

I run a nightly batch process in Linkidex. At a high level (a rough sketch in code follows the list):

  • A header represents a batch of “stuff” that needs to happen.
  • The header has many lines (each line is one instance of “stuff” to process).
  • Processing a batch means: process each line one by one.
  • When there are no lines left, you’re done. Simple. What could go wrong.
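In rough Python, with names invented for illustration (this is not the actual Linkidex schema):

```python
from dataclasses import dataclass, field

# Hypothetical model, reconstructed from the description above.
@dataclass
class Line:
    id: int
    state: str = "created"  # created -> running -> completed

@dataclass
class Header:
    lines: list[Line] = field(default_factory=list)

    def next_created_line(self) -> Line | None:
        # "Grab the next created line to process."
        return next((l for l in self.lines if l.state == "created"), None)

def do_work(line: Line) -> None:
    ...  # placeholder for the actual per-line work

def process_batch(header: Header) -> None:
    # Happy path: keep going until no created lines remain.
    while (line := header.next_created_line()) is not None:
        line.state = "running"
        do_work(line)
        line.state = "completed"
```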

The Edge Case: A State Transition Failed

In normal life, a line transitions through states like:

created → running → completed

But in this case, the transition to running failed due to a validation on the state transition. This was a valid failure. The state transition was doing exactly what it should do.

Buuuut the logic is:

  1. The batch worker grabs the next created line to process.
  2. It tries to transition the line into running.
  3. The transition fails.
  4. The worker says, “Cool, moving on to the next created line!”
  5. But because the state never changed, the next created line is the same line again.
  6. Repeat at maximum speed, forever.

The state transition failure is one of the first validations in this code path. Maybe seven lines of code ran per message before the logic said, “OK, let’s send another SQS message to process the ‘next’ line.”
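Reusing the hypothetical sketch from above (plus its publish helper), the buggy handler boils down to something like this:

```python
def validations_pass(line: Line, to: str) -> bool:
    # Stand-in for the model validations. For this sketch, pretend one
    # of them legitimately rejects the transition, as in the incident.
    return False

def try_transition(line: Line, to: str) -> bool:
    # Only change state when the validations pass.
    if not validations_pass(line, to):
        return False
    line.state = to
    return True

def handle_message(header: Header) -> None:
    # Buggy handler: each call stands in for handling one SQS message.
    line = header.next_created_line()
    if line is None:
        return  # no created lines left: the batch is done

    if try_transition(line, to="running"):
        do_work(line)
        try_transition(line, to="completed")
    # else: the transition failed, so the line is STILL "created"

    # Either way, publish a message to process the "next" line. After a
    # failed transition, next_created_line() returns the SAME line, so
    # the worker republishes at full speed, forever.
    publish("process-next-line")
```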

And that’s how you get seven million SQS messages published in about 9 hours from a tiny EC2 worker instance.

The Fix

I fixed the state transition flow properly. If a line can’t transition into running, it now transitions into a terminal state. I also added tests that attack this specific issue. All of my services already have good test coverage. It just wasn’t good enough in this case.
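In terms of the same sketch, the fix looks roughly like this:

```python
def handle_message_fixed(header: Header) -> None:
    line = header.next_created_line()
    if line is None:
        return  # no created lines left: the chain stops here

    if try_transition(line, to="running"):
        do_work(line)
        try_transition(line, to="completed")
    else:
        # The fix: park the line in a terminal state so it can never
        # be selected as the "next" created line again.
        line.state = "failed"

    publish("process-next-line")
```

A terminal failed state also leaves a visible record of exactly which lines were rejected, instead of hiding them inside an infinite loop.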

The Monitoring Lesson

Monitoring this isn’t as simple as “alert if SQS has > 1,000,000 messages.” Linkidex can handle 1 million SQS messages (the other night proved as much). The queue also never held more than a few messages at a time. The issue was that the message throughput was insane for hours on end.

So the more useful signals would be:

  • message publish rate / throughput (e.g., “why are we suddenly processing 13,000 messages every 5 minutes?”; one possible alarm for this is sketched after the list),
  • sustained spikes relative to baseline,
  • and/or alerts tied to “the same job repeating the same work.”
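As one concrete sketch of that first signal (the queue name, threshold, and SNS topic below are made up, though the metric and API are real): a CloudWatch alarm on the queue’s NumberOfMessagesSent, summed per five-minute window, that has to breach for a few periods in a row before it fires.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical alarm: fire if far more messages are published to the
# queue in 5 minutes than the nightly batch should ever produce.
cloudwatch.put_metric_alarm(
    AlarmName="batch-queue-publish-rate",
    Namespace="AWS/SQS",
    MetricName="NumberOfMessagesSent",
    Dimensions=[{"Name": "QueueName", "Value": "batch-queue"}],
    Statistic="Sum",
    Period=300,                       # 5-minute windows
    EvaluationPeriods=3,              # sustained spike, not a blip
    Threshold=1000,                   # well above the normal baseline
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",  # the queue is idle most of the day
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall"],
)
```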

I’m still dialing in an approach that I will be happy with long-term.

The Moral of the Story

SQS messages that can send additional SQS messages are dangerous. There is a reason I do it this way instead of publishing one message per line, for every line in the batch, all at once. But I may have stumbled into “clever” territory, and in my experience “clever code” is never a good thing. Simple code is a good thing. Clever code leads to unexpected consequences.

Such as publishing 7 million SQS messages a night.

~ David