Elasticsearch as Big Data analytics tool


Every day, approximately 2.5 quintillion bytes of data are generated, coming from a vast number of different sources. This data is referred to as Big Data. Most of the time, Big Data is unstructured and makes little sense when presented raw. Once raw data is presented graphically, patterns can be spotted without difficulty, and exploration and analytics can be applied. An analytics tool like Elasticsearch can make this much easier for us.

Elasticsearch is a distributed, RESTful, open-source engine for searching and analyzing all types of data, including textual, numerical, geospatial, structured, and unstructured. It is built on Apache Lucene and is part of the ELK Stack (Elasticsearch, Logstash, Kibana). The Elasticsearch RESTful API provides a large number of options for searching and analyzing data. The current version at the time of writing is 7.8.

Data can be imported into Elasticsearch in many ways. Here, we will describe how to use Logstash for this purpose. Logstash processes data before it is indexed in Elasticsearch. Because all of our data is in CSV format, and Elasticsearch accepts only typed JSON documents, it was natural for us to choose Logstash with the Logstash CSV filter plugin (check out: Logstash CSV filter). We have also used the mutate filter plugin to set the types of the document fields. A list of all Logstash filter plugins can be found at the following link: Logstash Filter Plugins.

Our logstash.conf file looks like this:
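(The original configuration is not reproduced here; the sketch below is a minimal equivalent, assuming hypothetical CSV columns `timestamp`, `device`, and `bytes`, a local Elasticsearch instance, and a hypothetical index name `my_index`.)

```conf
input {
  file {
    path => "/path/to/data.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  # Split each CSV line into named fields
  csv {
    separator => ","
    columns => ["timestamp", "device", "bytes"]
  }
  # Set the type of the numeric field so Elasticsearch indexes it as a number
  mutate {
    convert => { "bytes" => "integer" }
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "my_index"
  }
}
```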

After the data is imported, it is time for analytics and visualization. To speed things up and develop a functional web application in a short time, we chose Spring Boot as the backend technology. Spring Boot is an open-source, Java-based framework commonly used to create microservices.

The backend part of the application calls the Elasticsearch Java Search API and sends the gathered data to the frontend, where it is displayed in the form of charts (area, line, pie, and others). The Search API allows users to execute queries and obtain the hits that match them. The queries are created with the Query DSL.

The Query DSL (Domain Specific Language) is a JSON-based mechanism for creating queries, and the Java class for building them is QueryBuilder. A query can be formed from one or more clauses, divided into two groups: leaf clauses (match, term, range) and compound clauses (bool, dis_max, etc.). Queries are great for search, but the real power of Elasticsearch as an analytics tool lies in aggregations.
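As an illustration (with hypothetical field names), a compound bool query that wraps two leaf clauses, a match and a range, looks like this in the Query DSL:

```json
{
  "query": {
    "bool": {
      "must": [
        { "match": { "device": "router-1" } }
      ],
      "filter": [
        { "range": { "bytes": { "gte": 1000 } } }
      ]
    }
  }
}
```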

Aggregations are constructed similarly to queries, and the Java class for creating them is AggregationBuilders. They are grouped in the following manner: metric aggregations (min, max, avg, sum, etc.) and bucket aggregations (terms, histogram, etc.). Metric aggregations take a set of documents as input, compute a metric on a specified field, and return the result.

For example, a search request that computes the average of a field across all documents can look like this:
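(A sketch in the Query DSL, assuming a hypothetical index `my_index` and numeric field `bytes`; in Java, the same aggregation can be built with `AggregationBuilders.avg("avg_bytes").field("bytes")`.)

```json
POST /my_index/_search
{
  "size": 0,
  "aggs": {
    "avg_bytes": {
      "avg": { "field": "bytes" }
    }
  }
}
```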

With the following code, we can obtain data from the response:
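(Assuming an avg aggregation named `avg_bytes` as above, the relevant excerpt of the JSON response looks like the sketch below, with an illustrative value. Using the Java Search API, the same value can be read with `Avg avg = response.getAggregations().get("avg_bytes");` followed by `avg.getValue()`.)

```json
{
  "aggregations": {
    "avg_bytes": {
      "value": 12345.6
    }
  }
}
```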

The example above shows a basic analysis of the data.

More advanced analysis can be done by using bucket aggregations, and by combining bucket (sub-bucketing) and metric aggregations. Each bucket aggregation defines a criterion, and every document is checked against it to determine which bucket it falls into. Some bucket aggregations create a fixed number of buckets and some create buckets dynamically. Examples of bucket aggregations are the terms, date histogram, and date range aggregations.

To make a search request that creates buckets from the text written in the field "my Field", the following has to be done:
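(A sketch, keeping the article's field name "my Field" and assuming it is a text field mapped with a keyword sub-field, which the terms aggregation needs for bucketing; the index name `my_index` is hypothetical. In Java, the equivalent is `AggregationBuilders.terms("my_buckets").field("my Field.keyword")`.)

```json
POST /my_index/_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "terms": { "field": "my Field.keyword" }
    }
  }
}
```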

To obtain returned values from the response, use the code below:
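(An illustrative excerpt of the response for a terms aggregation named `my_buckets`, with made-up keys and counts. With the Java Search API, the buckets can be iterated via `Terms terms = response.getAggregations().get("my_buckets");` and then `terms.getBuckets()`, reading each bucket's `getKeyAsString()` and `getDocCount()`.)

```json
{
  "aggregations": {
    "my_buckets": {
      "buckets": [
        { "key": "router-1", "doc_count": 120 },
        { "key": "router-2", "doc_count": 87 }
      ]
    }
  }
}
```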

Another interesting thing we can do is to combine aggregations. In the next example, we can see a combination of Terms, Date Histogram, and Average aggregations:
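(A sketch of such a request in the Query DSL, with hypothetical field names: documents are first bucketed by `device.keyword` with a terms aggregation, each bucket is then sliced per day with a date histogram on `timestamp`, and an average of `bytes` is computed inside each slice.)

```json
POST /my_index/_search
{
  "size": 0,
  "aggs": {
    "per_device": {
      "terms": { "field": "device.keyword" },
      "aggs": {
        "over_time": {
          "date_histogram": {
            "field": "timestamp",
            "calendar_interval": "day"
          },
          "aggs": {
            "avg_bytes": {
              "avg": { "field": "bytes" }
            }
          }
        }
      }
    }
  }
}
```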

Data obtained with the aforementioned request can then be used to draw a chart.

The example above shows how to group documents by a specified textual field, and then calculate the average value of a specified numerical field for each bucket over time.

Here we have described what Elasticsearch is and how to use it. We have explored the usage of queries and aggregations, both metric and bucket. We have seen a couple of examples, from applying metric aggregations to combining sub-aggregations. And yet, only a small piece of Elasticsearch's analytical power has been presented.