Elasticsearch Query Testing

Elasticsearch Query Speed

In modern software operations, logs are essential, but storing, indexing, and searching them can be time-consuming. There are many decisions to make, and one of the most important is the tradeoff between running an in-house system like Elasticsearch or Splunk and using a hosted log management SaaS like Scalyr.

Many factors go into the decision of which tool to use: cost, log volume, your in-house skills, and whether you can dedicate engineering time and resources to managing a complex search system. As a Site Reliability Engineer, my primary concerns when choosing a log management system are speed and reliability, particularly when the system is being used to debug operational issues.

In this post I give an overview of some of the tradeoffs between running Elasticsearch yourself and using Scalyr to host and manage your log data. We’ll walk through the configuration of an Elasticsearch cluster, load some log data, and then run queries over this data to compare the query performance of Elasticsearch and Scalyr.

Elasticsearch vs Scalyr Architecture

Elasticsearch is a search engine built on top of Apache Lucene. When used together with Logstash and Kibana for storing and searching log files, it’s known as the Elastic Stack (also called ELK). When Logstash is replaced with Fluentd, an increasingly common substitution, the stack is called EFK. For more on the Elastic Stack, see the Elastic website.

ELK/EFK is flexible and can be configured in many different ways. However, this flexibility comes with a cost: significant effort can be required for configuration, maintenance, and troubleshooting.

Scalyr is a purpose-built signal store that has been highly optimized for storing, processing, and searching log data. To ingest logs and metrics, Scalyr provides its own open-source log agent, scalyr-agent, and also supports popular log shippers like Syslog, Fluentd, Logstash, and Kafka (coming soon!).

Scalyr is provided as a hosted SaaS solution, which frees you from the responsibility of configuring and tuning a search cluster. Scalyr’s architecture is a good fit for investigating operational problems, because it maintains fast query performance for both simple queries and queries that require scanning large amounts of data, such as aggregations. Scalyr also manages the complexity of the ingest, storage, and search systems behind the scenes, and maintains a fast ingestion time of 1-2 seconds.

Elasticsearch Configuration

Elasticsearch has extensive configuration options. I found one of the most time-consuming aspects of setting up ELK to be understanding the different ways in which a cluster can be designed, and how these choices interact with each other and impact the outcome. For example, there are several node roles:

  • Master-eligible Nodes
  • Data Nodes
  • Ingest Nodes
  • Machine Learning Nodes
  • Coordinating Nodes

Many clusters run with all of these roles on every node. However, with large clusters and high data volumes, it can be necessary to run these roles on separate servers. For this test I decided to run with the default configuration: every role on each node.
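If you are unsure which roles a cluster is actually running, Elasticsearch will tell you. Below is a minimal Python sketch against the standard _cat/nodes API; the endpoint and credentials are placeholders standing in for my test cluster, not anything universal.

```
import requests

# Placeholder endpoint for the test cluster; substitute your own host and auth.
ES_URL = "https://localhost:9200"

# _cat/nodes prints one line per node, including the roles assigned to it.
resp = requests.get(
    f"{ES_URL}/_cat/nodes",
    params={"v": "true", "h": "name,node.role,heap.percent,cpu"},
    auth=("elastic", "changeme"),  # placeholder credentials
    verify=False,                  # operator-managed clusters often use self-signed certs
)
print(resp.text)
```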

Elasticsearch Cluster Sizing

A common question with Elasticsearch is: how big should my cluster be?

Unfortunately, the answer I repeatedly found is: “It Depends”.

Elastic says on their Quantitative Cluster Sizing page:

> “How many shards should I have? How many nodes should I have? What about replicas? Do these questions sound familiar? The answer is often ‘it depends’”

The advice given is to start an Elasticsearch cluster, load your log data, and then run queries and measure the performance. If the performance is too slow, then begin evaluating the different options for making cluster changes, including larger nodes with more CPU or memory, a greater number of nodes, faster storage, a different number of shards, separating the cluster roles, and tuning parameters such as the index refresh interval.
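As one concrete example of the kind of tuning mentioned above, the index refresh interval can be raised through the index settings API to trade search freshness for ingest throughput. This is a hedged sketch in Python; the endpoint, credentials, and index name are assumptions from my setup.

```
import requests

ES_URL = "https://localhost:9200"   # assumed endpoint
INDEX = "filebeat-logs"             # hypothetical index name

# Raising refresh_interval from the default 1s reduces refresh overhead
# while bulk-loading log data; 30s is a common starting point.
resp = requests.put(
    f"{ES_URL}/{INDEX}/_settings",
    json={"index": {"refresh_interval": "30s"}},
    auth=("elastic", "changeme"),   # placeholder credentials
    verify=False,
)
print(resp.json())
```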

When starting to run Elasticsearch, it seems there is a large number of unknown unknowns, which can only be understood by operating a cluster and encountering problems. One issue with this approach is that when performance limits are hit, the ability to effectively query your logs can be disrupted.

Based on discussions with colleagues who have run large Elastic clusters, I started an Elasticsearch cluster on AWS using five m5.xlarge nodes, which each have 16 GiB of memory and 4 vCPUs.

For storage I attached 5 TB GP2 EBS volumes, which is the typical method of attaching storage to Elasticsearch nodes in AWS. Elastic has new preliminary documentation that suggests using EC2 instance store; however, this introduces additional complexity and is not easily supported with the Kubernetes Operator. There seem to be differing opinions on using EBS with Elasticsearch, and Elastic prefers on-host instance store.

GP2 EBS volumes scale their performance with size, and according to the AWS documentation a 5 TB volume should have approximately 15,000 IOPS of baseline performance.
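The arithmetic behind that figure is simple, assuming GP2’s documented baseline of 3 IOPS per provisioned GiB with a 16,000 IOPS cap, and a volume provisioned as roughly 5,000 GiB:

```
# GP2 baseline performance: 3 IOPS per provisioned GiB, capped at 16,000 IOPS.
volume_gib = 5_000                           # a "5 TB" volume provisioned as ~5,000 GiB
baseline_iops = min(3 * volume_gib, 16_000)
print(baseline_iops)                         # 15000, matching the ~15,000 IOPS cited above
```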

In this test I monitored the EBS volumes through CloudWatch during ingest and did not observe any throttling. If anyone has experience directly comparing an Elasticsearch cluster using GP2 EBS with one using instance store, I’d love to hear from you in the comments below.

Elastic on Kubernetes

To actually run Elasticsearch, I first started configuring it directly on EC2 instances, both because I wanted to understand the configuration and installation, and to keep things simple.

What I found was that as I tried different options across the cluster, such as growing and shrinking the number of nodes, this became much too time-consuming to manage. I looked at the options and decided to use the Elastic Kubernetes Operator: cloud-on-k8s. I considered using Ansible or Puppet to manage the configuration, but as I’m very familiar with the care and feeding of Kubernetes, I decided to go with what I know.

The Elasticsearch operator does simplify the most tedious aspects of starting a cluster, but it’s still important to understand the different components, the underlying architecture, and the settings.

Scalyr Configuration

As Scalyr is a fully managed SaaS, I didn’t have to do any cluster configuration or setup. I just signed up on Scalyr.com and used the open-source scalyr-agent to send the log data to the service.

Log Data

For this test, I used the flog fake log generator to produce apache_combined access logs. I made a small change to the generator so that it emits these logs in JSON for simple parsing.

Here is a sample log line:

```
{"Hostname":"181.73.177.67","UserIdentifier":"effertz1003","Datetime":"02/Mar/2020:08:21:04 -0800","Method":"POST","Request":"/infomediaries","Protocol":"HTTP/2.0","ResponseCode":200,"Bytes":82449,"Referrer":"http://www.directrepurpose.com/action-items/sexy/engage","Agent":"Opera/10.17 (Macintosh; U; PPC Mac OS X 10_6_0; en-US) Presto/2.11.263 Version/10.00","TraceId":"82c4ff30-ce6f-426d-9ddc-be31a033300e"}
```

To simulate 400 GB of daily log volume, an estimated log volume for a mid-size organization, I created a Kubernetes deployment with the log generator and scaled it to generate 400 GB over 24 hours.
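Sizing that deployment is back-of-the-envelope arithmetic. In the sketch below, only the 400 GB/day target comes from the test itself; the average line size and per-pod generation rate are illustrative assumptions.

```
# Target: 400 GB of logs spread evenly over 24 hours.
target_bytes_per_day = 400 * 10**9
bytes_per_second = target_bytes_per_day / 86_400             # ~4.6 MB/s overall

avg_line_bytes = 400                                          # assumed, close to the sample line above
lines_per_second = bytes_per_second / avg_line_bytes          # ~11,600 lines/s
per_pod_lines_per_second = 1_000                              # hypothetical per-pod flog rate
replicas = -(-lines_per_second // per_pod_lines_per_second)   # ceiling division

print(f"{bytes_per_second / 1e6:.1f} MB/s, {lines_per_second:,.0f} lines/s, {replicas:.0f} replicas")
```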

I then used Filebeat to send the data to Elasticsearch, and the scalyr-agent to send the data to Scalyr.

Queries

Once I had a full day of data loaded, I decided to simulate a typical troubleshooting experience. Let’s imagine that we have received reports of some users having problems in one of our applications, and we are trying to pinpoint the problem. Our metrics dashboard indicates an increased level of 5xx responses, but it is not clear what is going wrong because the responses are pre-aggregated. To investigate this we will use logs from the load balancer.

5xx Responses

The first query I ran simply looks at requests with a 5xx status code over the last 4 hours. On both Scalyr and Elasticsearch the results were limited to a maximum of 100 log lines.
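For reference, the Elasticsearch side of this query looked roughly like the sketch below: standard query DSL sent to the _search API, using the field names from the sample log line. The endpoint, credentials, index name, and the @timestamp field are assumptions from my setup; the Scalyr equivalent was a one-line filter expression in its query UI.

```
import requests

ES_URL = "https://localhost:9200"   # assumed endpoint
INDEX = "filebeat-logs"             # hypothetical index name

query = {
    "size": 100,                    # both systems were capped at 100 results
    "query": {
        "bool": {
            "filter": [
                {"range": {"ResponseCode": {"gte": 500, "lt": 600}}},  # 5xx only
                {"range": {"@timestamp": {"gte": "now-4h"}}},          # last 4 hours
            ]
        }
    },
}

resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query,
                     auth=("elastic", "changeme"), verify=False)
print(resp.json()["hits"]["total"])
```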

On Scalyr this query consistently returned in less than one second.

On Elasticsearch this query also returned in less than one second.

* Note: For these measurements, I used the Elasticsearch and Scalyr query APIs and sent requests from Postman in order to collect accurate timings. The perceived speed in the browser is representative of these numbers.
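The measurement itself was nothing exotic: issue the same query several times over HTTP and record wall-clock latency. A minimal sketch of such a harness, assuming the endpoint and query body from the previous snippet:

```
import time
import requests

def time_query(url, body, auth, runs=5):
    """POST the same search body several times and return the latencies in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        resp = requests.post(url, json=body, auth=auth, verify=False)
        resp.raise_for_status()
        latencies.append(time.perf_counter() - start)
    return latencies

# Example usage with the hypothetical index and query from the previous sketch:
# lats = time_query("https://localhost:9200/filebeat-logs/_search", query,
#                   auth=("elastic", "changeme"))
# print(min(lats), max(lats))
```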

5xx Responses for e-commerce requests

In this imaginary debugging scenario, the e-commerce team says that their users are reporting problems. It is a legacy application, and there are multiple URL paths that route to e-commerce.

Because of this, I next used a regex to filter for “.*e-commerce.*” in the path and return 5xx responses over the past 4 hours.
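On the Elasticsearch side this adds a regexp clause on the request path to the previous filter. The Request field name comes from the sample log line; the .keyword sub-field is an assumption about how the string was mapped.

```
query = {
    "size": 100,
    "query": {
        "bool": {
            "filter": [
                {"range": {"ResponseCode": {"gte": 500, "lt": 600}}},
                {"range": {"@timestamp": {"gte": "now-4h"}}},
                # Lucene regular expression over the un-analyzed path field.
                {"regexp": {"Request.keyword": ".*e-commerce.*"}},
            ]
        }
    },
}
```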

On Scalyr this query consistently returned in less than one second.

On Elasticsearch this query took between 1.5 and 5 seconds.

This approach is one I use often when debugging production issues: starting with a broad query, then adding more and more search filters to break down and filter the data in different ways while narrowing in on the problem.

Because the Scalyr architecture is designed to rapidly scan large volumes of data, it is able to maintain high performance as queries increase in complexity. To learn more about how it’s done, check out the Scalyr Blog on Searching 1.5TB/sec.

Percentile Aggregation for e-commerce requests

The e-commerce team suspects that something has gone wrong with a new compression scheme they have deployed.

To investigate this, I wanted to see the 99th, 95th, and 90th percentiles of the request sizes over the last 4 hours, and compare them with the past 24 hours.
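In Elasticsearch this maps to a percentiles aggregation over the Bytes field, layered on the same filters; the sketch below again assumes the field names from the sample log line. Changing now-4h to now-24h gives the second run.

```
query = {
    "size": 0,   # we only want the aggregation, not the matching documents
    "query": {
        "bool": {
            "filter": [
                {"range": {"@timestamp": {"gte": "now-4h"}}},        # now-24h for the second run
                {"regexp": {"Request.keyword": ".*e-commerce.*"}},
            ]
        }
    },
    "aggs": {
        "request_size_percentiles": {
            "percentiles": {"field": "Bytes", "percents": [90, 95, 99]}
        }
    },
}
```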

First I ran the aggregation for the past 4 hours:

On Scalyr this consistently returned in less than one second.

On Elasticsearch this took between 6 and 8 seconds.

Next I ran the same query for the past 24 hours:

On Scalyr this took 2 seconds.

On Elasticsearch this took between 29 and 40 seconds.

As I tried different time ranges and queries, Scalyr’s response time remained nearly flat, while on Elasticsearch it appeared to grow linearly. This is likely due to the performance advantage Scalyr gains by horizontally distributing the load of querying the log data across a massive multi-tenant cluster. Once the log data has been loaded for querying, more complex queries do not incur a significant performance penalty. In contrast, queries that take advantage of indexed fields in Elasticsearch return with good performance; however, as we dig further into the incident and issue queries that require scanning large amounts of log data, Elasticsearch slows significantly because these types of queries are not amenable to indexing.

Query Summary

| Query | Query Results: Scalyr | Query Results: Elasticsearch |
| --- | --- | --- |
| 5xx query (4 hours) | < 1 second | < 1 second |
| Regex filter for “.*e-commerce.*” in the path, returning 5xx (4 hours) | < 1 second | 1.5–5 seconds |
| Percentile ranking of request sizes (4 hours) | < 1 second | 6–8 seconds |
| Percentile ranking of request sizes (24 hours) | 2 seconds | 29–40 seconds |

From the query results above, we can see both Elasticsearch and Scalyr deliver reasonable performance for queries that work well with indexing, such as returning log lines with a range of status codes.

As we move toward needing to answer questions about our application and system that were not anticipated in advance and stored in an index, Elasticsearch becomes significantly slower because its design is not optimized for rapidly searching over large volumes of log data. Scalyr, on the other hand, has designed a horizontally scalable architecture that maintains performance even when scanning and aggregating hundreds of gigabytes of data on the fly at query time.

As systems become more and more complex, with an explosion of microservices, the ability to perform queries over large volumes of log data becomes more critical for Software Engineering and SRE teams to work effectively. With systems composed of dozens or hundreds of services, observability tools like Scalyr allow running complex queries that are not constrained by pre-indexed data.

In this test I queried over a single 400 GB day of data. We can imagine scenarios where we would want to query over a week or a month of historical data, for example to see whether the 99th percentile has trended up or down over the last several deploys of an application.
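A trend like that maps naturally to a date_histogram with the same percentiles aggregation nested inside, sketched below under the same field-name assumptions as the earlier queries.

```
query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-7d"}}},   # a week of history
    "aggs": {
        "per_day": {
            "date_histogram": {"field": "@timestamp", "calendar_interval": "1d"},
            "aggs": {
                "p99_request_size": {
                    "percentiles": {"field": "Bytes", "percents": [99]}
                }
            },
        }
    },
}
```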

Based on this testing, it appears that Elasticsearch would become significantly slower when performing this aggregation over a greater time span, while Scalyr’s performance would be impacted much less.

If you have questions or comments about any of this you can find me on Twitter: https://twitter.com/thebrc007

To try Scalyr for yourself, check out http://www.scalyr.com for a free 30-day trial. If you are interested in the performance of Scalyr over multiple terabytes of data, leave a comment below and let’s chat.

Support

Elasticsearch is flexible, with a wide range of configuration options. However, this flexibility comes with a cost: significant effort can be required for configuration, maintenance, and troubleshooting. It took considerable time and investigation to explore, learn, make choices, and experience pitfalls. There is no magic formula or online case study that will give you all the answers ahead of time. Even after the initial configuration is determined, users should consider the ongoing effort required to manage growth and changes, solve problems, and support a user population. Indices can be hard to size, and optimizing them can involve costly trial and error.

Elasticsearch users need to weigh the total cost of ownership (TCO) of running the system themselves against the cost of a paid service, such as Scalyr, that removes most of the configuration and support burden.

References:

Node | Elasticsearch Reference [7.6]

Quantitative Cluster Sizing

Amazon EC2 Instance Comparison

Sizing Hot-Warm Architectures for Logging and Metrics in the Elasticsearch Service on Elastic Cloud