

New Relic has been running important, high-throughput services written in Elixir for a few years. The stability and performance of Erlang's virtual machine (VM), known as BEAM, have impressed us on the New Relic Unified API team. We've gained some expertise along the way, but expertise requires more than knowing how your tools can perform well. You must also understand the ways your tools can fail.

Two years ago, our team's desire to understand the failure scenarios for BEAM-based services drove us to do thorough load testing on an API gateway that uses GraphQL. We wanted to force the gateway into a failure state so we could better prepare for the future. What exactly was happening to the BEAM VM under stress? How could we mitigate risks or recover rapidly from a severe incident?

When CPU-bound, we feel confident that throughput will hold steady, but latency will increase. When memory-bound, the gateway will eventually be killed because it's out of memory (OOM), like any other service. While we didn't trigger this in our load tests, we have seen it in production and know it can happen.

Two years later, this assessment has proved correct. We've experienced traffic spikes of more than 10x that increased CPU usage, with little impact other than a small increase in latency. But sporadic OOM kills sometimes occurred, often under puzzling circumstances. They were difficult to reproduce, and they were never frequent or sustained enough to cause a disruption. A recent increase in our memory usage warnings prompted us to investigate once again.

This blog post covers what we learned, how we learned it, and how we ultimately addressed the underlying problem. My hope is there's a little something for everyone here. If you're unfamiliar with Elixir and Erlang, I'll talk a bit about the features of the VM that they share and how those features factored into our investigation. If you're a novice Elixir user, I'll introduce some handy tools that may be new to you. If you're a pro, I think you'll still find the outcome of our investigation interesting! In the end, we reduced the API gateway's memory usage during expensive queries by 60-70%.
