Elastic Stack May Be Costing You More Than TCO

The Elastic Stack has real value to the software development community, but it is known for being hard to maintain. Many developers look past that, arguing that its total cost of ownership is still lower than that of a paid package. However, beyond these costs, it has fundamental flaws that keep it from meeting the service-level objectives our development teams expect. It reminds me of the scene in Aladdin inside the Cave of Wonders. Many teams think of the Elastic Stack as the magic lamp: just grab it and your monitoring wishes will come true, even if the cost of those wishes runs deep. In reality, the Elastic Stack is the ruby that Abu nabs: it appears large, shiny, and easy to grab. But once grabbed, it melts away into nothing, failing to deliver the riches Abu hoped for.

In this post, I will explain why people like the Elastic Stack. I will show how it doesn’t quite live up to the value it promises because it can’t always meet the service levels we expect for our development teams. I will then show how Scalyr does meet these service levels without compromising the flexibility a dev team needs to meet its unique needs.

Why Do People Like the Elastic Stack?

The Elastic Stack definitely has appeal. As an overview, the Elastic Stack is composed of the following components:

  • Elasticsearch: this is the search and analytics engine that stores and indexes your logs so you can query across all of them.
  • Logstash: this is the data processing pipeline that ingests logs, transforms them, and sends them to Elasticsearch for indexing.
  • Beats: these are lightweight data shippers that collect data from your servers and send it to Logstash or Elasticsearch.
  • Kibana: this is the user interface for visualizing your logs and metrics as graphs and charts.

Each one of these components is open source: no paid licensing, no subscription costs.

The Elastic Stack claims to make it easy to search through your logs and give you the operational data you need when you need it. It also offers monitoring tooling: Want to know why your customers say your website is slow to place orders? Kibana will show you a graph of how long it takes to place an order across multiple instances of your servers. It also claims to be good for looking at machine health: What is your server’s uptime? How much memory does it consume?

Many tools claim to do similar things, including Scalyr. However, many feel that the Elastic Stack has advantages beyond the lack of license fees. Many developers believe that it will give them more control. Most developers I know highly value having control over their software and infrastructure. With the Elastic Stack, you have to control everything. You have to provision your own containers or servers and install the software itself. Of course, with control comes flexibility: you can change what you need to in both the software and the infrastructure when you need to, how you need to. You also get to avoid vendors. Many developers seem to have a mistrust of most software vendors, making open-source tools look attractive by comparison.

Hosted Solutions

An obvious question that comes up is, “Do I have to control and set up everything myself?” You don’t have to, these days. There are providers that can host your stack for you, including Elastic themselves. Amazon Web Services also hosts a solution. And other competitors like Logz.io and Logsense have carved out a customer base for themselves.

However, hosted solutions only solve part of the problem. They trade away some of the control, flexibility, and pricing benefits that the Elastic Stack originally offers. And even with a hosted solution, you still have to contend with the problems fundamental to the stack. These problems go beyond the cost of maintenance or hosting, and you should care about how this ruby can melt.

Why Should You Care?

The Elastic Stack has some fundamental problems that keep it from delivering the service levels a development team commonly expects. If your team ignores these problems, you will likely pay the price for it.

Service-Level Objectives

When you use monitoring software, you implicitly expect certain levels of service. You want it to be available when you need to use it. It should work relatively quickly, giving you search results when you need them. This is especially pertinent when you are using it to troubleshoot a sticky issue in production. Yet the Elastic Stack continually fails to meet such expectations.

It has very limited scalability. As your systems grow in traffic and complexity, the Elastic Stack will struggle to keep up. You have to put in a lot of up-front work to mitigate this, like deciding which applications write to which fields. Once such restrictions are in place, they’re hard to change down the road.

The Elastic Stack suffers from poor log availability because of how it indexes. Your system is unique, and it can be hard to predict exactly how you will need to shape your logs for searching. But Elasticsearch works against evolving needs. Many log changes require re-indexing. This re-indexing is slow and can take your entire monitoring service offline for long periods of time. Even when it does not, the re-indexing can make all your searches lag. Further, any change to your log formats can suddenly break dashboards and other UI pieces.
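
To see why a format change can force a re-index, consider a minimal sketch in Python. It is a toy model, not Elasticsearch code: ToyIndex, reindex, and the order_id and status fields are hypothetical names made up for illustration. It assumes, as Elasticsearch does for existing fields, that a field's type is fixed once the index knows about it.

```python
class ToyIndex:
    """Toy stand-in for an index whose field types are fixed once set."""

    def __init__(self, mapping):
        self.mapping = dict(mapping)  # field name -> Python type
        self.docs = []

    def add(self, doc):
        for field, value in doc.items():
            # Unknown fields are mapped on first sight; known fields must match.
            expected = self.mapping.setdefault(field, type(value))
            if not isinstance(value, expected):
                raise TypeError(
                    f"{field}: expected {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
        self.docs.append(doc)


def reindex(source, new_mapping, convert):
    """Copy every document into a fresh index with a different mapping."""
    target = ToyIndex(new_mapping)
    for doc in source.docs:
        target.add(convert(doc))
    return target


logs = ToyIndex({"order_id": int})
logs.add({"order_id": 1001, "status": "placed"})
logs.add({"order_id": 1002, "status": "paid"})

# A new log format emits order IDs as strings; the old index rejects it.
try:
    logs.add({"order_id": "A-1003", "status": "placed"})
    rejected = False
except TypeError:
    rejected = True

# The way out is a new index plus a full copy of the existing documents.
migrated = reindex(
    logs,
    {"order_id": str},
    lambda d: {**d, "order_id": str(d["order_id"])},
)
migrated.add({"order_id": "A-1003", "status": "placed"})
```

The copy step touches every stored document, which is why the cost of a re-index grows with the size of your logs.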

Even without re-indexing, the Elastic Stack is not optimized for high-cardinality searches. If I have a complex purchase order process with more than a dozen different fields I want to query across half a dozen different log entities, I can expect timeouts after 30 seconds of attempting to search.
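
For a sense of what such a search looks like, here is a minimal sketch that builds a request body in Elasticsearch's query DSL as a plain Python dictionary. The field names and values are hypothetical, and the list is trimmed to four filters to keep it short. The timeout value mirrors the 30 seconds mentioned above; treat this as an illustration rather than a tuned query.

```python
import json

# Hypothetical fields for a purchase-order log; none come from a real schema.
FILTERS = {
    "customer_id": "c-48213",
    "order_status": "payment_failed",
    "region": "eu-west",
    "payment_provider": "acme-pay",
}


def build_search_body(filters, timeout="30s"):
    """Build a bool query with one exact-match term filter per field."""
    return {
        "timeout": timeout,
        "query": {
            "bool": {
                "filter": [
                    {"term": {field: value}} for field, value in filters.items()
                ]
            }
        },
    }


body = build_search_body(FILTERS)
print(json.dumps(body, indent=2))
```

Every extra field adds another clause to the filter list, so the real query from a dozen-field purchase order process would be roughly three times this size.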

Finally, the Elastic Stack is old enough to carry a lot of backward-compatibility support. This makes it hard for it to innovate its way past existing problems. It also means there are hard-to-predict reliability issues lurking in it.

Support Costs

All of these problems are in addition to the other costs of ownership teams willingly take on:

  • Infrastructure acquisition. To deal with the Elastic Stack’s limited scalability, you often have to add more hardware to your setup. This, in turn, ramps up your infrastructure costs.
  • Infrastructure support. Your team will have to dig deep into certain infrastructure issues. Elasticsearch has a lot of options and very little guidance on how to select the right options for your team.
  • Ongoing operations. Your team will have to continually carve out capacity to deal with the inevitable challenges that will pop up with your Elastic Stack.
  • Training. The members of your team may have to delve into tooling that they are unfamiliar with. To avoid learning on the fly, you would need to invest in up-front training. Even with training, some elements could take your team months to truly master.

With all this in mind, let’s look at how Scalyr helps you avoid these problems while going beyond your expected service levels.

Scalyr Benefits in Contrast

Scalyr gives you many benefits that you can directly compare against the Elastic Stack:

  • Software as a service. Scalyr hosts your monitoring by default. This means that all the software and infrastructure behind it is built first class for handling varied, large loads of data. Scalyr also does not need to worry about a long tail of backward compatibility. It continually refines its software to let go of unneeded features and make room for new ones.
  • Optimized for high-cardinality data. Feel free to pump in as many fields per log as you can get away with. It will all be searchable quickly, without needing to invest lots of time designing them up front. You will not need to leave fields behind because they don’t “fit the schema,” are too large, or take the service down while it re-indexes.
  • Scale up/down without concern. Being built first class for hosting, Scalyr can easily scale up and down with no downtime or significant lag on your searches.
  • High-speed queries. Scalyr has been able to build its queries to be fast with little up-front design needed from you as a user. You will rarely see timeouts, even when new fields are added to your logs.
  • Real-time problem resolution model. This ties back to up-front design. You need not worry ahead of time about what you may need to search on. Scalyr makes it easy to develop new searches, and points out data related to things you are already searching on. This makes problem resolution fast.

Scalyr Gives You Control

In addition to the above benefits, Scalyr gives your development team control of the things they care about.

  • Filters and rules. You can customize filters and rules to control how data gets categorized and what data gets retained over time. Scalyr comes with smart defaults for these if you don’t want to mess with them too much.
  • Parsers. Scalyr can parse arbitrary formats in the output of your logs. If you have more complex formatting, you can customize these parsers.
  • Visualization. Your team members can leverage useful built-in visualizations. Or they can easily create their own visualizations of the data. They can also use Grafana if they are more comfortable with that UI.
  • Power queries. Scalyr’s power queries are designed to handle the most complex questions you can ask of your data. If your team needs to find obscure links across three or more different log sources and aggregate the results in five different ways, power queries are for you.
  • Plugins. Scalyr has a varied set of plugins used to collect data from various sources. Does your team have a unique data source? You can make your own plugins, too.

All of the above customizations are reusable and savable. Beyond all of this, Scalyr automatically handles most things you would otherwise have to worry about. In an Elastic Stack, you may have to think about how many nodes to add to a cluster. Not with Scalyr. That’s all handled for you.

Conclusion

Hopefully, it’s clear that in many cases the Elastic Stack does not give you the scalability, latency, and availability you need from a first-class monitoring system. On the other hand, Scalyr can give you much of the value promised by the Elastic Stack, but with the service levels you need. Scalyr is no magic lamp, but it is a ruby that guarantees not to melt when you touch it.