Linkidex - First Major Outage
Accidentally Bringing Down the Site
A week or so ago I was wrapping up one of the features I have been most excited about for Linkidex - AI Suggestions. I had just gotten back from a bunch of international travel. I had been monitoring Linkidex and I use it more or less every day, but I paused feature development in favor of spending more time exploring various countries. It felt good to be building again, and plugging into OpenAI is something I have been planning on for a while.
I wrap up testing locally, merge to master, and deploy. Elastic Beanstalk does its magic, and boom we are live. I forgot to add my OpenAI production API key to the EB environment. Not a big deal. I add it via the AWS UI, which restarts the web server. While in the AWS UI, I see that the platform version is out of date. Also not a big deal. At least it never has been in the past.
Elastic Beanstalk periodically updates its supported Platform Versions. In my case, updating the platform can mean I need to update the version of ruby or puma that the Linkidex monolith is using. If I don’t, the Rails Gemfile won’t match what is actually installed on the EC2 box and the Rails application will fail to start.
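To make that concrete, the constraint is roughly this (a Gemfile sketch, not the actual Linkidex Gemfile; the 3.0.6 number is the one from this incident):

```ruby
# Gemfile (excerpt) - sketch only.
source 'https://rubygems.org'

# This declaration has to match the ruby interpreter the Elastic Beanstalk
# platform version actually installs on the EC2 instances, or the app won't boot.
ruby '3.0.6'

gem 'rails'
gem 'puma'   # same idea: keep this compatible with what the platform runs
```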
I did not realize this at the time. Even now I am only assuming that this was the original core issue. All I know for sure is that updating the platform failed and brought the production Linkidex environment down.
It was immediately obvious that something had gone wrong. I pulled logs and noticed in the eb-engine logs that deploys were failing with an error about using the wrong ruby version. The EC2 instances were using ruby 3.0.6, but the Rails application was on 3.0.4. I bumped Linkidex to 3.0.6. I had to deal with some dependency issues when doing this, but nothing major.
Updating Linkidex to 3.0.6 and redeploying resulted in the exact same error. It seemed EB wasn’t respecting the contents of my Gemfile. I could SSH onto the machine and cat the Gemfile to confirm 3.0.6 was indeed in there, but Elastic Beanstalk was still unhappy. The internet suggested uploading a zip file to EB instead. With Elastic Beanstalk you can either deploy via the CLI like a normal person, or you can zip up your web application and manually upload the zip to Elastic Beanstalk via the AWS UI like a savage. Unfortunately this did not do the trick either.
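For completeness, the zip route can also be driven with the SDK instead of the console, assuming the bundle is already sitting in S3. A sketch with the aws-sdk-elasticbeanstalk gem; the region, bucket, key, version label, and application name are placeholders:

```ruby
require 'aws-sdk-elasticbeanstalk'

eb = Aws::ElasticBeanstalk::Client.new(region: 'us-east-1')  # region is an assumption

# Register the uploaded zip as a new application version...
eb.create_application_version(
  application_name: 'linkidex',                                   # placeholder
  version_label:    'manual-zip-2024-05-08',
  source_bundle:    { s3_bucket: 'my-deploy-bucket', s3_key: 'linkidex.zip' },
  process:          true
)

# ...then point the environment at it, which kicks off a deploy.
eb.update_environment(
  environment_name: 'ldx20-v2-production',
  version_label:    'manual-zip-2024-05-08'
)
```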
At this point I was thinking something must be cached. I tried rolling back to a previous healthy build, but the rollback process on Elastic Beanstalk doesn’t include downgrading the environment to its prior platform version, so this did not help.
So I decided to rebuild the environment. This is an option built into Elastic Beanstalk that you can just select via a dropdown. I figured that if a restart didn’t clear the caching issue, or whatever it was that was making EB ignore the Gemfile, a rebuild might.
Making It Worse
It did not.
Rebuilding the instance failed about halfway through. Linkidex was now ‘not in a ready state’. Further internet research suggested that this is a terminal state Elastic Beanstalk can get into. If it does get into this state, your only option is to find what part of the rebuild failed via the AWS CloudFormation logs, manually fix it, then reach out to AWS customer service and have them update the state of your Elastic Beanstalk environment so you can try again.
I was less than joyous discovering this. I have worked with AWS support before. I don’t spend enough money on AWS to get a timely response and Linkidex was completely down. I elected to try something different.
I cloned the production Linkidex environment. This is another feature built into Elastic Beanstalk. Then I swapped the domains of the terminal prod environment and the new environment. Linkidex was still down, but at least it was now debuggable by me. I redeployed, and began getting errors that var/app/current/development.rb did not exist. I updated .ebextensions to handle creating this file if it did not exist, just to get past this step.
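The swap itself is a single API call. Here is a minimal sketch with the aws-sdk-elasticbeanstalk gem, using the environment names from my notes below (the region is an assumption, and the clone I did through the console):

```ruby
require 'aws-sdk-elasticbeanstalk'

eb = Aws::ElasticBeanstalk::Client.new(region: 'us-east-1')

# Swap the environments' CNAMEs so traffic that resolved to the stuck
# environment (ldx20) now lands on the fresh clone (ldx21), and vice versa.
eb.swap_environment_cnames(
  source_environment_name:      'ldx20-v2-production',
  destination_environment_name: 'ldx21'
)
```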
Deployments were now succeeding and passing the Elastic Beanstalk health check, but Linkidex was throwing 504 Gateway Timeout errors. I searched logs and identified that puma was complaining: “You have already activated nio4r 2.7.1, but your Gemfile requires nio4r 2.5.8. Prepending bundle exec to your command may solve this.” nio4r 2.7.1 was coming from Elastic Beanstalk; nio4r 2.5.8 was coming from the Linkidex Gemfile. I updated Linkidex to use nio4r 2.7.1 instead. Puma next complained about using the wrong puma version. I bumped that up too.
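The fix both times was aligning the Gemfile with the versions the platform actually ships. Roughly (the puma constraint below is a placeholder; I matched whatever version the platform complained about):

```ruby
# Gemfile (excerpt) - align gem versions with what the platform preinstalls,
# so bundler doesn't see two conflicting activated copies.
gem 'nio4r', '2.7.1'   # was 2.5.8; Elastic Beanstalk ships 2.7.1
gem 'puma',  '~> 5.6'  # placeholder constraint: match the puma version EB loads

# Then regenerate the lockfile before deploying:
#   bundle update nio4r puma
```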
Elastic Beanstalk was passing health checks and the Linkidex public assets (the blog) were loading, but the React application was not. Chrome and Firefox were both blocking it due to ‘unknown headers’, which led me to believe a Route 53 issue was happening. I checked everything in Route 53 and identified that the load balancer URL had changed. I updated this, and after waiting for the change to propagate, Linkidex was finally back online.
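If you ever need to do the same, the record change looks roughly like this with the aws-sdk-route53 gem. Every zone ID and hostname below is a placeholder, and note that the alias target’s hosted zone ID is the load balancer’s canonical zone for its region, not your domain’s zone:

```ruby
require 'aws-sdk-route53'

r53 = Aws::Route53::Client.new

r53.change_resource_record_sets(
  hosted_zone_id: 'Z_PLACEHOLDER_ZONE',          # the hosted zone for your domain
  change_batch: {
    changes: [{
      action: 'UPSERT',
      resource_record_set: {
        name: 'example.com.',                    # the record that fronts the app
        type: 'A',
        alias_target: {
          # The load balancer's canonical hosted zone ID for its region,
          # not the hosted zone ID above.
          hosted_zone_id:          'Z_PLACEHOLDER_ELB_ZONE',
          dns_name:                'awseb-new-env-123456789.us-east-1.elb.amazonaws.com',
          evaluate_target_health:  false
        }
      }
    }]
  }
)
```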
Learnings & Suggestions
Infrastructure issues are challenging to solve. If an error gets thrown in application code, it’s typically straightforward to identify where one should at least start an investigation. With Elastic Beanstalk, I find it challenging to figure out where to start. For example, when pulling logs for Elastic Beanstalk, there isn’t one log file that may have the error. There are a lot. And just because there is an error doesn’t mean it’s the error. eb-engine.log and puma.log tend to be the most helpful for me thus far, but who knows what will happen next.
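One thing that helps is grabbing every log in one go instead of hunting file by file. eb logs --all does that from the CLI; the equivalent request looks roughly like this in Ruby (region and environment name are placeholders):

```ruby
require 'aws-sdk-elasticbeanstalk'

eb  = Aws::ElasticBeanstalk::Client.new(region: 'us-east-1')
env = 'ldx21'   # placeholder environment name

# Ask each instance to publish its full log bundle, wait briefly, then fetch
# the pre-signed S3 URLs where the bundles were uploaded.
eb.request_environment_info(environment_name: env, info_type: 'bundle')
sleep 30
info = eb.retrieve_environment_info(environment_name: env, info_type: 'bundle')
info.environment_info.each { |i| puts i.message }   # each message is a download URL
```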
Moving forward I’ll do Blue Green Deployments whenever making platform updates. In hindsight this is obvious.
For anyone else dealing with infrastructure things as part of a small or one-person team, I highly recommend you write everything down when dealing with issues like this. I have a growing playbook about how to configure / rebuild / debug Linkidex infrastructure. Every time something bad happens I document everything I can while debugging and then organize it afterwards. I also recommend looking into Terraform.
Infrastructure as code would theoretically allow someone to execute a single command and rebuild all of their infrastructure from scratch. I am not sure how well this would play with Elastic Beanstalk, as things like cloning environments and swapping URLs negate some of the value Terraform would provide. But having one ultimate set of instructions defining everything your infrastructure relies on is very valuable. It also makes it easier to remove dead infrastructure from your environment. In my case, I have resources left over from half-terminated environments, and making sure those resources are indeed dead and safe to remove can be tedious. Disclaimer: I have not used Terraform in a meaningful way yet. I do, however, know smart people who like it.
Anyway, this wraps up how I brought down, and brought back up, Linkidex production. Below are the actual notes I took while dealing with all of this, in case whoever is reading this right now is also currently lacking a living production Elastic Beanstalk environment and isn’t in the mood to parse a 1,200 word essay. Good luck!
Notes from untenable crash state of ldx20-v2-production May 8th 2024
- Updated platform, this was a mistake.
- Noticed in eb-engine logs that while deploying we were getting an error that we were using the wrong ruby version. New platform wanted to use ruby 3.0.6, we were on 3.0.4
- Updating gemfile to 3.0.6 and redeploying has no effect.
- Internet suggests just uploading a zip file. Redeploy via zip has no effect.
- Restarting instances via AWS UI has no effect
- Tried rebuilding instance. This was a mistake. Can no longer deploy as EB is ‘not in a ready state’, this requires AWS support to resolve which we don’t have access to.
- Cloned ldx20, made ldx21
- swapped environment domains via AWS EB UI
- Updated .elasticbeanstalk/config.yml to have master branch point to new ldx21 environment.
- Redeployed ldx21 via CLI
- Was getting errors in eb-engine.log that var/app/current/development.rb did not exist. Created it in setvars.config in .ebextensions. NOTE: this step should not be necessary, deleted this later on.
- Deployment now succeeds, but we are still getting 504 gateway errors
- Puma logs now complaining: You have already activated nio4r 2.7.1, but your Gemfile requires nio4r 2.5.8. Prepending 'bundle exec' to your command may solve this. nio4r 2.7.1 is coming from Elastic Beanstalk. Upgraded ruby application code to use nio4r 2.7.1 instead
- Puma now complaining about using the wrong puma version. Upgraded ruby application code to use the puma version Elastic Beanstalk was loading.
- Elastic Beanstalk was passing healthchecks, and /blog was loading. React application was not, chrome and firefox were both blocking it due to ‘unknown headers’ which led me to believe a Route53 issue was happening.
- Load balancer URL had changed. Updated Route 53 record to point to the new load balancer URL. Had to do this for 2 records within the Linkidex Route 53 settings
- I believe the above eventually worked. There was a delay as route 53 settings propagated throughout the internet (this delay was independent of the Route53 60 seconds to push stuff live)