What is a Data Lake and how can organisations combine Data Lakes and Amazon Web Services (AWS) to store their data?
What is a Data Lake?
A Data Lake is a storage repository that holds data without a predefined schema, in structured or unstructured formats and at any scale. More than 90% of business data is unstructured, which makes a Data Lake a strong storage option for many organisations.
This is in contrast to a Data Warehouse, where data is cleansed, processed and indexed.
|Data Lake|Data Warehouse|
|Highly Accessible & Changeable|More Costly & Difficult to Change|
|Keeps Data in Original Format|Data Already Transformed for Usage|
As storage costs have decreased over the last couple of years, Data Lakes have increased in popularity. This is due not only to the lower cost but also to the availability of new cloud services, such as analytics, that are offered on a pay-per-request pricing model.
Data Lake Pros and Cons
Data Lake Pros ✓
Speed and scalability: As Data Lakes store data in its original format, no preprocessing is required, meaning data can be quickly ingested and stored at any scale.
Cost: Due to the decreasing cost of storage, Data Lakes are low cost compared to a traditional data warehouse.
Data: Data that might have been discarded under a traditional data model can be retained thanks to decreasing storage costs ($0.02 per GB per month). It is also simple to control access to that data, and to subsets of it, based on different user requirements.
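To put the storage figure above in perspective, here is a minimal sketch that estimates a monthly bill at the quoted $0.02 per GB. The function name and the 10,000 GB volume are illustrative assumptions, and real prices vary by provider, region and storage tier.

```python
from decimal import Decimal

# Price quoted in the text; real prices vary by provider, region and tier.
PRICE_PER_GB_MONTH = Decimal("0.02")


def monthly_storage_cost(size_gb: int, price_per_gb: Decimal = PRICE_PER_GB_MONTH) -> Decimal:
    """Return the estimated monthly storage cost in dollars for size_gb gigabytes."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    return Decimal(size_gb) * price_per_gb


# Hypothetical example: 10,000 GB (roughly 10 TB) of raw data.
print(monthly_storage_cost(10_000))  # 200.00
```

At that rate, keeping data that would previously have been thrown away costs a couple of hundred dollars a month for roughly 10 TB, before any request or analytics charges.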
Data Lake Cons X
Skills: Data Lakes involve skills and toolsets that data teams may not have used before, requiring investment in training and development.
Infrastructure: Data Lakes lean toward cloud providers for the flexibility of their storage, which may not fit with a traditional on-premise strategy.
Data Swamp: As the amount of data deposited into a Data Lake increases, there is a risk that much of it has no value, is not governed and will never be used. This is referred to as a ‘Data Swamp’.
Amazon Web Services (AWS) Components
There are different offerings from the Cloud Services providers for Data Lakes and a thorough comparison should be made between the services available and the associated costs before committing to a provider.
There are also options for on-premise solutions. However, for the rest of this discussion, we will look at the components offered by Amazon Web Services (AWS) to set up a secure Data Lake.
AWS Glue is Amazon’s managed extract, transform and load (ETL) service, which can be used to cleanse, transform and migrate data between different data stores. Glue includes a Data Catalogue, which stores the metadata of your data in tables and databases. The Data Catalogue also stores the metadata for transformations, jobs and targets.
It is the Crawlers’ job to interrogate a data store and determine the properties, type and schema of the data. A Crawler is configured to scan a data store and can only scan the data that it has been given permission to ‘see’ by an IAM policy.
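As a rough illustration of how an IAM policy limits what a Crawler can ‘see’, the sketch below builds a read-only policy document for a single S3 prefix. The bucket name, prefix and function name are hypothetical, and in practice the Crawler’s role would also need the Glue service permissions themselves.

```python
import json


def crawler_read_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy document allowing read-only access to one S3 prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow listing only the keys under the chosen prefix.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Allow reading the objects stored under that prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
        ],
    }


# Hypothetical bucket and prefix, used purely for illustration.
policy = crawler_read_policy("example-data-lake", "raw/sales")
print(json.dumps(policy, indent=2))
```

Anything outside `raw/sales/` in the bucket would be invisible to a Crawler running under a role that carries only this policy.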
A crawler is pointed at one or more data stores and uses classifiers to determine the type and format of the data. AWS provides a large number of built-in classifiers to determine the data types. However, if the built-in classifiers are not suitable, custom classifiers can be created to define the format and schema of the data.
Crawlers are one method available within AWS Glue to populate the Data Catalogue with tables that store metadata.
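A crawler definition can be sketched as the set of keyword arguments that the boto3 Glue client’s `create_crawler` call accepts. The helper function and every name, ARN and path below are hypothetical placeholders. The code only builds the request, so it runs without an AWS account.

```python
from typing import Optional


def build_crawler_request(
    name: str,
    role_arn: str,
    database: str,
    s3_path: str,
    schedule: Optional[str] = None,
) -> dict:
    """Assemble keyword arguments for the Glue CreateCrawler API call."""
    request = {
        "Name": name,
        "Role": role_arn,          # IAM role the crawler assumes
        "DatabaseName": database,  # Data Catalogue database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule is not None:
        request["Schedule"] = schedule  # cron expression, e.g. "cron(0 2 * * ? *)"
    return request


# All names below are hypothetical placeholders.
request = build_crawler_request(
    name="sales-raw-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="sales_lake",
    s3_path="s3://example-data-lake/raw/sales/",
)
print(request["Targets"])

# With boto3 installed and AWS credentials configured, this could be sent as:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_crawler(**request)
#   glue.start_crawler(Name=request["Name"])
```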
Extract, Transform, Load (ETL)
Extract, Transform and Load (ETL), as the name suggests, is the process of reading the data from a data store and converting that data into the form required for the end-user. The load part of the process is writing that data into the target datastore.
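The three stages can be illustrated with a small, self-contained sketch in plain Python. This is not Glue code: the CSV sample, the field names and the cleaning rules are all made up for illustration, and an in-memory buffer stands in for the target data store.

```python
import csv
import io
import json

# Hypothetical raw input: stray whitespace, mixed case and a missing amount.
RAW_CSV = "customer,amount\n alice ,10.50\nbob,\nCAROL,3\n"


def extract(csv_text: str) -> list:
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows: list) -> list:
    """Transform: drop rows without an amount, tidy names, convert types."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue  # skip incomplete records
        cleaned.append(
            {"customer": row["customer"].strip().title(), "amount": float(row["amount"])}
        )
    return cleaned


def load(rows: list, target) -> int:
    """Load: write one JSON document per line to the target, return the row count."""
    for row in rows:
        target.write(json.dumps(row) + "\n")
    return len(rows)


sink = io.StringIO()  # stands in for the target data store
loaded = load(transform(extract(RAW_CSV)), sink)
print(loaded, "rows loaded")
print(sink.getvalue())
```

Keeping each stage as its own function makes it easy to test the transformation rules separately from the reading and writing code.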
In the world of AWS, this is performed by an AWS Glue job, which runs a script written in Python or Scala, typically on Apache Spark. AWS provides a number of built-in transformations as well as the option to create and supply your own custom scripts.
Crawlers and jobs can be started manually, or triggers can be created to run them automatically at scheduled times or when a set of conditions has been satisfied.
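The two kinds of automated trigger can be sketched as the keyword arguments accepted by the boto3 Glue client’s `create_trigger` call. The helper functions and the trigger and job names are hypothetical. As written, the code only builds the requests and does not contact AWS.

```python
def scheduled_trigger(name: str, job_name: str, cron: str) -> dict:
    """Keyword arguments for a trigger that starts a job on a schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": cron,  # Glue cron syntax, evaluated in UTC
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def conditional_trigger(name: str, upstream_job: str, job_name: str) -> dict:
    """Keyword arguments for a trigger that starts a job after another job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


# Hypothetical job names: run the cleansing job nightly at 02:00 UTC,
# then run the aggregation job once cleansing has succeeded.
nightly = scheduled_trigger("nightly-cleanse", "cleanse-sales", "cron(0 2 * * ? *)")
chained = conditional_trigger("after-cleanse", "cleanse-sales", "aggregate-sales")
print(nightly["Type"], chained["Type"])

# With boto3 installed and AWS credentials configured:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_trigger(**nightly)
#   glue.create_trigger(**chained)
```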
AWS Lake Formation
The AWS Lake Formation Service has been created to integrate the AWS Glue services mentioned previously with security and access permissions (IAM) as well as analysis and machine learning services. This enables users to create a secure Data Lake in less time than manually setting up individual components.
Lake Formation and Glue share the same Data Catalogue, which allows the services to integrate easily and makes creating a Data Lake as simple as possible.
Amazon advertises the service as being able to ‘create a Data Lake in days’, although it is possible to create a basic Data Lake setup in a number of hours.
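Because permissions are attached to entries in the shared Data Catalogue, granting access can be sketched as a request for the boto3 Lake Formation client’s `grant_permissions` call. The helper function, the analyst role ARN and the database and table names are hypothetical, and the code only builds the request.

```python
def table_grant(principal_arn: str, database: str, table: str, permissions=("SELECT",)) -> dict:
    """Keyword arguments granting a principal permissions on one catalogue table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions),
    }


# Hypothetical analyst role that may only read the sales table.
grant = table_grant(
    principal_arn="arn:aws:iam::123456789012:role/ExampleAnalystRole",
    database="sales_lake",
    table="sales",
)
print(grant["Permissions"])

# With boto3 installed and AWS credentials configured:
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**grant)
```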
Querying the Data
Once we have created a Data Lake and populated it with data from a number of different sources, what do we do with it?
Amazon provides us with a service called Athena, an interactive query service that allows us to analyse the data in the Data Lake using standard SQL.
Athena is serverless, meaning there is no need to set up and manage any servers, and users are charged only for the queries that are run. It integrates with the AWS Glue Data Catalogue, allowing us to query the Data Lake directly from the Athena console.
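Queries can also be submitted from code rather than the console. The sketch below assembles the keyword arguments for the boto3 Athena client’s `start_query_execution` call. The helper function, the `sales_lake` database, the `sales` table, its columns and the results bucket are all hypothetical.

```python
def build_athena_query(sql: str, database: str, output_location: str) -> dict:
    """Keyword arguments for submitting a SQL query to Athena."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},  # Data Catalogue database
        "ResultConfiguration": {"OutputLocation": output_location},  # S3 path for results
    }


# Hypothetical table and columns: top ten customers by total spend.
SQL = """
SELECT customer, SUM(amount) AS total
FROM sales
GROUP BY customer
ORDER BY total DESC
LIMIT 10
"""

query = build_athena_query(
    sql=SQL.strip(),
    database="sales_lake",
    output_location="s3://example-athena-results/queries/",
)
print(query["QueryExecutionContext"])

# With boto3 installed and AWS credentials configured:
#   import boto3
#   athena = boto3.client("athena")
#   response = athena.start_query_execution(**query)
#   print(response["QueryExecutionId"])
```

Athena runs the query asynchronously, so the returned execution ID would then be used to poll for completion and fetch the results.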
Amazon’s visualisation and business intelligence tool is known as QuickSight. It enables users to create dashboards that combine data from a Data Lake, spreadsheets, streamed data, SaaS services and other data sources, including a number of AWS services.
QuickSight includes a large number of chart, graph, table and diagram visualisations to allow data to be displayed in a multi-tabbed dashboard.
New visualisations start with an AutoGraph representation of your data, which uses algorithms to choose the most suitable way to display it. QuickSight also includes a feature called Insights, a machine-learning-driven presentation of data that produces visualisations you can include in your dashboard.
QuickSight allows users to create impressive dashboards that incorporate data from a range of different data sources and can be displayed on desktop and mobile devices. Access to dashboards and the data within them is controlled by permissions applied within the QuickSight service.
Organisations that have a large amount of complex data in different formats and from different sources should consider the benefits Data Lakes bring. Data Lakes are capable of combining separate data sources into a single source, where new machine learning and analytical tools can give insight into that data and deliver benefits to the business.
Amazon Web Services have developed a number of services that can help organisations move to a Data Lake and then analyse and visualise that data. Security and permissions are built into the services at all levels, allowing strict control of who is allowed to see the data and report on it.