How Observe.AI Optimized Elasticsearch Aggregations for Nested Fields

We use Elasticsearch as our secondary datastore for running queries and searches on call transcriptions. In this article, we’ll discuss how we optimized the schema of our Elasticsearch index to suit our reporting workload.

At Observe.AI, we’ve built a platform that allows contact centers of any size to evaluate and score 100% of calls, analyze agent and team performance trends, and enhance agent training and coaching. Our typical workload includes pulling voice call recordings from customers, processing the calls, and reporting the call insights to end users.

Problem Context

When processing a call, we calculate call signals as “present” or “not present.” We report on those findings as the percentage of calls that have a given call signal “present,” along with many other call filter parameters. We use Elasticsearch for this reporting because it also serves additional use cases (like phrase match searches). In Elasticsearch, each document is mapped to a customer service call. To save the call signal information for a call, we use the following nested schema in the Elasticsearch index:

// ES document structure
meeting : {
  "criteria": {
    "type": "nested",
    "properties": {
      "id": {
        "type": "keyword"
      },
      "present": {
        "type": "integer"
      }
    }
  },
  ...

To report the percentage of calls with a certain call signal on the above schema, we use the following aggregate query:

"aggregations": {
  "criteria": {
    "nested": {
      "path": "criteria"
    },
    "aggregations": {
      "criteria_aggs": {
        "terms": {
          "field": "criteria.id"
        },
        "aggregations": {
          "avg_score": {
            "avg": {
              "field": "criteria.present"
            }
          }
        }
      }
    }
  }
}
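To make the reporting step concrete, here is a minimal Python sketch of turning the response of the query above into per-signal percentages. It assumes that "present" is stored as 0 or 1 (so the average is the fraction of calls with the signal present) and uses made-up signal ids and counts. The helper name signal_percentages is hypothetical and not part of our codebase.

```python
def signal_percentages(response):
    """Map each call-signal id to the percentage of calls where it was present.

    Assumes criteria.present is 0 or 1, so avg_score is a fraction in [0, 1].
    """
    buckets = response["aggregations"]["criteria"]["criteria_aggs"]["buckets"]
    return {b["key"]: round(b["avg_score"]["value"] * 100, 2) for b in buckets}


# Illustrative response in the shape returned by a nested > terms > avg
# aggregation; the signal ids and numbers are invented for this example.
sample_response = {
    "aggregations": {
        "criteria": {
            "doc_count": 6,
            "criteria_aggs": {
                "buckets": [
                    {"key": "greeting", "doc_count": 4, "avg_score": {"value": 0.75}},
                    {"key": "empathy", "doc_count": 2, "avg_score": {"value": 0.5}},
                ]
            },
        }
    }
}

print(signal_percentages(sample_response))
```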

This nested structure worked well for us when we were running aggregations on thousands of meetings. The page load times were less than a second for customers with 100k meetings in a month.

However, as our platform scaled, we had accounts with over 5 million meetings in a month, and page loads started to take longer, especially when the system was under processing load (with a large number of phrase search matches running on the Elasticsearch service). These aggregation queries became a bottleneck for the reporting pages, with observable slowness in page load. The aggregation over around 5 million documents typically took 3–4 seconds, but when the system was under load, this increased to 15–20 seconds.

When profiling our query, we found most of the time was taken in the Aggregation Collector:

Solution

Separate Indexes

We index two types of data for each “call” in Elasticsearch:

Call processing data (like call signals) and call metadata on which aggregations and filters are run.
Call transcripts on which text searches (phrase search matches) are done.

One easy optimization was dividing the index into two based on the above data. The bulk of the document size comes from the transcription of the call, so the first index (that runs aggregations) becomes relatively small and is only used for reporting workflows.

The other index, which contains the call transcripts, is larger and is normally the one under load while calls are being processed. Splitting the data also allowed us to host the transcription index on a separate cluster. As a result, our aggregation APIs are no longer affected when the system is under heavy load. (There is still substantial indexing on the reporting index, but that volume is about 100x smaller than the number of search requests, which now go to the separate transcription index.)
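As a rough illustration of the split, the sketch below divides one call document into a small reporting document and a larger transcript document. The field names ("call_id", "transcript", and so on) are hypothetical, since the actual field names are not listed in this article.

```python
# Hypothetical field name: the real transcript field(s) are not named here.
TRANSCRIPT_FIELDS = {"transcript"}


def split_call(call):
    """Split one call document into (reporting_doc, transcript_doc)."""
    reporting_doc = {k: v for k, v in call.items() if k not in TRANSCRIPT_FIELDS}
    transcript_doc = {k: v for k, v in call.items() if k in TRANSCRIPT_FIELDS}
    # Keep the call id on both sides so a transcript search hit can be
    # matched back to its reporting document by the application.
    transcript_doc["call_id"] = call["call_id"]
    return reporting_doc, transcript_doc


call = {
    "call_id": "c1",
    "agent": "a7",
    "criteria_present": ["greeting"],
    "transcript": "Hello, thank you for calling.",
}
reporting_doc, transcript_doc = split_call(call)
print(reporting_doc)
print(transcript_doc)
```

Each half would then be written to its own index, with the transcript index free to live on a different cluster.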

Moving to Non-Nested Structure

Even when not under load, the aggregation queries still took 4–5 seconds. We ran the profiler on the aggregation query and determined that most of the time was spent in the query collector, joining nested documents. This led us to experiment with a non-nested structure, in which the fields we run aggregations on are indexed directly on the call document.

The new flat document structure:

// ES document structure after
{
  "criteria_present": {
    "type": "keyword"
  },
  "criteria_processed": {
    "type": "keyword"
  },

}
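One way to read this flat schema is that each keyword field holds the list of call-signal ids for the call: "criteria_processed" for every signal that was calculated, and "criteria_present" for the subset found to be present. That reading is an assumption made for illustration. Under it, a nested "criteria" list could be flattened at index time roughly as follows (flatten_criteria is a hypothetical helper):

```python
def flatten_criteria(criteria):
    """Convert nested criteria entries into the two flat keyword fields.

    Assumes each entry looks like {"id": <signal id>, "present": 0 or 1}.
    """
    processed = [c["id"] for c in criteria]
    present = [c["id"] for c in criteria if c["present"] == 1]
    return {"criteria_present": present, "criteria_processed": processed}


nested = [
    {"id": "greeting", "present": 1},
    {"id": "empathy", "present": 0},
    {"id": "upsell", "present": 1},
]
print(flatten_criteria(nested))
```

Because both fields live directly on the call document, aggregating on them does not require joining nested documents.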

To calculate the same percentage of calls with a call signal, the nested aggregation query was changed to a composite bucket aggregation based on the above schema.

Below is the new query:

"aggregations": {
  "criteria_present": {
    "composite": {
      "sources": [
        {
          "criteria_present": {
            "terms": {
              "field": "criteria_present"
            }
          }
        }
      ]
    }
  },
  "criteria_processed": {
    "composite": {
      "sources": [
        {
          "criteria_processed": {
            "terms": {
              "field": "criteria_processed"
            }
          }
        }
      ]
    }
  }
}
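Composite aggregations return their buckets in pages: each response carries an "after_key" that can be passed back as "after" to fetch the next page. The sketch below shows one way to walk all buckets for a single source. The search argument is a hypothetical callable that takes a request body and returns the response as a dict (for example, a thin wrapper around an Elasticsearch client), and fake_search is an in-memory stand-in with invented counts, used only so the example runs on its own.

```python
def iter_composite_buckets(search, agg_name, field, page_size=1000):
    """Yield (key, doc_count) for every bucket of a one-source composite agg."""
    after = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{agg_name: {"terms": {"field": field}}}],
        }
        if after is not None:
            composite["after"] = after  # resume after the last key seen
        body = {"size": 0, "aggregations": {agg_name: {"composite": composite}}}
        result = search(body)["aggregations"][agg_name]
        if not result["buckets"]:
            return
        for bucket in result["buckets"]:
            yield bucket["key"][agg_name], bucket["doc_count"]
        after = result.get("after_key")
        if after is None:
            return


def fake_search(body):
    """In-memory stand-in for a search call, with invented data."""
    data = [("empathy", 2), ("greeting", 4), ("upsell", 1)]
    composite = body["aggregations"]["criteria_present"]["composite"]
    after = composite.get("after")
    start = 0
    if after:
        start = [k for k, _ in data].index(after["criteria_present"]) + 1
    page = data[start:start + composite["size"]]
    result = {
        "buckets": [{"key": {"criteria_present": k}, "doc_count": n} for k, n in page]
    }
    if page:
        result["after_key"] = {"criteria_present": page[-1][0]}
    return {"aggregations": {"criteria_present": result}}


counts = dict(
    iter_composite_buckets(fake_search, "criteria_present", "criteria_present", page_size=2)
)
print(counts)
```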

The above query gives us the number of calls where a call signal was “present” and the number of calls where it was “calculated.” From these two counts, we can calculate the percentage of calls with the call signal “present.” The new non-nested aggregation query turned out to perform significantly better. The performance of the aggregation query is shown below:

With this change, the page load time for aggregations on around three million documents dropped from 4 seconds to around 0.6 seconds.
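The percentage calculation from the two sets of counts can be sketched in a few lines of Python. The dictionaries below stand in for the per-signal doc counts returned by the "criteria_present" and "criteria_processed" aggregations. The signal ids and numbers are invented, and present_percentages is a hypothetical helper name.

```python
def present_percentages(present_counts, processed_counts):
    """Percentage of processed calls in which each signal was present."""
    return {
        signal: round(100.0 * present_counts.get(signal, 0) / total, 2)
        for signal, total in processed_counts.items()
        if total > 0  # skip signals that were never calculated
    }


present_counts = {"greeting": 3}
processed_counts = {"greeting": 4, "empathy": 2}
print(present_percentages(present_counts, processed_counts))
```

A signal that was calculated but never present has no bucket in the “present” counts, so it is reported as 0%.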

Looking Forward

Overall, the solution worked great for us. However, as our platform continues to scale (and the time interval of the above reports increases), we’ll continue to look for additional ways to optimize the aggregation query times, including Elasticsearch percolators and rolling time-based indexes.

If you’re interested, check out my other Elasticsearch article on how we scaled Elasticsearch throughput for searches in individual documents.