
Data Lakes and Elasticsearch

So, you have set up your new application and all the infrastructure that supports it. Maybe you did it the old-fashioned way and purchased some servers and network equipment, or maybe you set it up in the cloud (AWS, Azure or GCP). However you have deployed it, the application and its infrastructure will output logs.

Logs are great for telling us what is happening in our applications and platforms, from whether the application is working as expected down to who has been using them and what actions have been carried out. 

There is an absolute wealth of information that is generated and held by the applications we run. We may need this information for many things, such as troubleshooting why the application has failed to process a payment or to help us detect unauthorised access attempts. It’s all there in the logs, just waiting for us to read.

Now all that sounds good, right? Except, for the most part, logs are stored at the source on our servers or network appliances, typically in text files that are almost impossible to read in their natural habitat. So, what good are they, you may be asking yourself? Let’s see how we can take all these logs from different places and bring them somewhere useful.

What is a Data Lake?

A quick Google for “what is a Data Lake?” and up pops good old Wikipedia with:

“A Data Lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A Data Lake is usually a single store of data, including raw copies of source system data, sensor data, social data etc. and transformed data used for reporting, visualisation, advanced analytics and machine learning. A Data Lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).  A Data Lake can be established “on premises” (within an organisation’s data centres) or “in the cloud” (using cloud services from vendors such as Amazon, Microsoft, or Google).”

So, in a nutshell, a Data Lake is precisely that: a central repository for data. But does that help us with the problem of having a lot of log data? Just by sending the logs to your Data Lake, what exactly do you get?

Well, you get all of your server logs (operating system and application), and possibly all of your firewall and network logs, held in a single place. But while all the logs are now in one place, everything is mixed together. That doesn’t help when it comes to making the data easily accessible, does it? You could argue we have made it worse by sending all the logs to a single, unstructured place.

Enter Elastic

Elasticsearch is a powerful search engine based on the Lucene library. It allows us to index our logs into a structured format. More than that, Elastic also gives us the means to ingest logs from many sources by using Filebeat. This lightweight data shipper can send logs directly to Elasticsearch and supports other outputs too, such as Logstash or queuing solutions such as Kafka. Filebeat has numerous modules that allow it to collect logs from many sources, including Office 365, AWS, operating systems, and firewalls from vendors such as Cisco and Palo Alto.

When logs are parsed by Filebeat and sent on to Elasticsearch, they are stored as JSON documents, with each field from the log file mapped to the correct type. So straight away, Elastic is helping us to make sense of all these logs by taking them from their raw format and transforming them into documents that we can use and store in an organised manner.
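To make the idea of "raw line in, typed JSON document out" concrete, here is a minimal Python sketch. It is not how Filebeat itself is implemented; the log line format, the field names and the `parse_line` helper are all assumptions made up for illustration.

```python
import json
import re
from datetime import datetime

# Hypothetical log format, assumed for illustration only:
# "<ISO timestamp> <LEVEL> <service> status=<int> duration_ms=<number> <message>"
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<service>\S+) "
    r"status=(?P<status>\d+) duration_ms=(?P<duration_ms>\d+(?:\.\d+)?) "
    r"(?P<message>.*)$"
)


def parse_line(line: str) -> dict:
    """Turn one raw log line into a dict with typed fields."""
    text = line.strip()
    match = LINE_PATTERN.match(text)
    if match is None:
        # Keep unparsed lines rather than dropping them.
        return {"message": text}
    fields = match.groupdict()
    return {
        # Parse and re-serialise the timestamp so invalid values fail early.
        "@timestamp": datetime.fromisoformat(fields["timestamp"]).isoformat(),
        "level": fields["level"].lower(),
        "service": fields["service"],
        "status": int(fields["status"]),              # number, not text
        "duration_ms": float(fields["duration_ms"]),  # number, not text
        "message": fields["message"],
    }


if __name__ == "__main__":
    raw = ("2024-01-15T10:32:07+00:00 INFO payments "
           "status=200 duration_ms=12.5 payment accepted")
    print(json.dumps(parse_line(raw), indent=2))
```

Because `status` and `duration_ms` end up as numbers rather than strings, a search engine can sort, filter by range and average over them, which is what makes the structured form so much more useful than the original text file.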

But this is only half the issue. Now that we have the logs coming in, how do we see these logs and work with them? Well, there is a tool for that too. Kibana, another part of the Elastic Stack, is a web frontend for Elasticsearch. Kibana allows us to visualise the data that we are sending to Elasticsearch.

Through these visualisations, we can build dashboards that show us what is happening in our applications. For example, we could set up dashboards to show us the number of transactions being processed per hour and the outcomes of those transactions. If it’s in the logs, we can query and visualise it. 
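Behind a chart like "transactions per hour by outcome" sits an aggregation query. The sketch below builds one possible request body using Elasticsearch's `date_histogram` and `terms` aggregations. The field names (`@timestamp`, `transaction.outcome`) and the 24-hour lookback are assumptions for illustration; your own documents may be shaped differently, and the outcome field would need to be mapped as a keyword for the `terms` aggregation to work.

```python
import json


def transactions_per_hour_query(timestamp_field="@timestamp",
                                outcome_field="transaction.outcome",
                                lookback="now-24h"):
    """Build a search body that counts documents per hour, split by outcome.

    The default field names are hypothetical; adjust them to your mapping.
    """
    return {
        "size": 0,  # return only aggregation results, no individual documents
        "query": {"range": {timestamp_field: {"gte": lookback}}},
        "aggs": {
            "per_hour": {
                "date_histogram": {
                    "field": timestamp_field,
                    "calendar_interval": "hour",
                },
                "aggs": {
                    # One bucket per distinct outcome inside each hour.
                    "by_outcome": {"terms": {"field": outcome_field}},
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(transactions_per_hour_query(), indent=2))
```

A body like this could be sent to the `_search` endpoint of the index that holds the transaction logs. Kibana builds comparable queries for you when you configure a visualisation, so you rarely have to write them by hand.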

But it doesn’t stop there. With machine learning, we can build in anomaly detection and set up alerts that give us a heads-up when something has gone wrong in the application, helping support teams to be proactive rather than waiting for someone to raise an issue. Suddenly, through the Elastic Stack, your Data Lake of logs is working for you, helping you make the most of data that you might otherwise have turned to only after the event.
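To give a feel for what anomaly detection means here, the sketch below flags hours whose transaction count is far from the recent baseline. This is a deliberately simple rolling z-score, used as a stand-in; it is not the model that Elastic's machine learning features use. The function name, window size, threshold and sample counts are all assumptions for illustration.

```python
from statistics import mean, stdev


def find_anomalies(counts, window=6, threshold=3.0):
    """Return (index, count) pairs that deviate strongly from the
    preceding `window` values.

    Simplified stand-in for anomaly detection: a rolling z-score.
    """
    anomalies = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline gives no scale to score against.
            continue
        if abs(counts[i] - mu) / sigma > threshold:
            anomalies.append((i, counts[i]))
    return anomalies


if __name__ == "__main__":
    # Made-up hourly transaction counts with a sudden drop at index 8.
    hourly_counts = [100, 102, 98, 101, 99, 103, 97, 100, 5, 101]
    for index, count in find_anomalies(hourly_counts):
        print(f"hour {index}: {count} transactions looks unusual")
```

In practice, an alert would be wired to a result like this, so the support team hears about the drop in processed payments before a customer does.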
