
Data Lake Architecture

The term ‘Big Data’ refers to data that is so large, fast or complex that it becomes difficult to process using traditional methods. With 5G and the Internet of Things on the rise, the amount of data that businesses are collecting is set to skyrocket.

But Big Data isn’t about how much data you can collect; it’s about how you process, analyse and interpret it. By combining data from multiple sources within your organisation, you can leverage it to produce outcomes such as:

  • Cost reductions
  • Product development and innovation
  • Informed decision making on marketing campaigns
  • Time savings on business processes

Big Data Strategy

In order to take advantage of Big Data, datasets from across an organisation must be brought together and combined so engineers can analyse ‘the big picture’ – looking for trends that could not be identified in isolated data silos. This is where the largest challenge for an organisation appears: how do we combine and aggregate data from multiple sources, built on different technologies and using contrasting data types?

It’s on this journey that a Data Strategy should be established – to drive the business from its current data architecture to the desired state. That desired state typically showcases the latest services and tools offered by cloud providers, utilising machine learning to analyse and identify trends. However, it is the approach taken to get there that often proves the most challenging aspect.

The most common blueprint is to use stepping stones to reach the desired state in target-driven increments. By carefully migrating workloads and processes whilst adopting new technologies and services, the data architecture can grow in a sustainable manner. This allows existing analysis to continue whilst selected parts of the business are on-boarded.

Common Challenges

Security Concerns: The main principle behind big data is to combine data from across an organisation into a single data source, often in the cloud. This can lead to concerns about data access, authorisation and auditability.

Networking: Often an organisation’s datasets are stored on legacy infrastructure that is buried away behind several layers of security in isolated data centres. Architecting the links between these silos and the cloud can be complex and time-consuming.

ETL: What extra tooling needs to be installed or configured to allow data to be extracted from the source, transformed and loaded into The Cloud?

All of these concerns can be addressed with a robust data strategy, a well-architected approach and the right cloud services and tooling.

The first decision to make is how to store all this data in The Cloud so that it can be leveraged by all parts of the business in the future…

Data Lakes

A Data Lake is a centralised repository that allows you to store all of your structured and unstructured data at any scale. The main benefit of a Data Lake is the ability to store data as-is, without performing any kind of transformation. This ensures that the data is retained in its raw format, leaving business departments free to transform it in the future based on their own needs.


It’s important to call out the distinction between a Data Lake and a Data Warehouse. In some cases, the choice is not a simple either/or decision but could involve the use of both technologies based on business requirements.

A Data Warehouse is a database optimised to analyse relational data coming from transactional systems. The data structure and schema are defined upfront and data is cleaned, enriched and transformed so that it can act as the single source of truth.

A Data Lake, by contrast, stores both the relational data from business applications and unstructured data from sources such as IoT devices and mobile apps. The structure of the data is unknown when it is captured, which means all data can be stored without defining requirements up front – leaving open possibilities for future data manipulation.

A common pattern is to use a data lake to capture all data in its raw format before a subset is transformed and loaded into a data warehouse to allow for further processing and analysis. This ensures that, should a different approach be taken in the future, the raw data is available to be transformed from scratch.

By storing data in a data lake in The Cloud, such as Microsoft Azure, organisations can take advantage of cloud analytical services to automate the processing of data, utilising machine learning to analyse trends and forecast outcomes.

Azure Data Lake Storage

Azure Data Lake Storage is Microsoft’s answer to Data Lakes in The Cloud. Built on the Azure platform and utilising existing, proven services such as Azure storage accounts, it is an ideal technology to underpin a data strategy.


  • Limitless scale and 16 nines of data durability with automatic geo-replication.
  • Secure by default, enabling encryption, network-level access controls and role-based access control that can integrate with existing Active Directory platforms.
  • Support for a wide variety of analytical frameworks and integrations.
  • Flexible cost optimisation via independent scaling of storage and compute including lifecycle policy management and object level tiering.
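
As a rough illustration of how such an account might be provisioned programmatically, the sketch below uses the azure-mgmt-storage Python SDK. The subscription ID, resource group, account name and region are placeholder assumptions; enabling the hierarchical namespace is what turns a standard StorageV2 account into Data Lake Storage Gen2.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

# Placeholder identifiers - substitute your own.
subscription_id = "<subscription-id>"
resource_group = "rg-data-platform"
account_name = "contosodatalake"

client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

# A StorageV2 account with the hierarchical namespace enabled becomes
# Data Lake Storage Gen2; RA-GRS provides read-access geo-replication.
poller = client.storage_accounts.begin_create(
    resource_group,
    account_name,
    StorageAccountCreateParameters(
        location="uksouth",
        sku=Sku(name="Standard_RAGRS"),
        kind="StorageV2",
        is_hns_enabled=True,              # hierarchical namespace = Data Lake Gen2
        enable_https_traffic_only=True,   # enforce encryption in transit
    ),
)
account = poller.result()
print(account.primary_endpoints.dfs)
```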

Governance

It is important that data within your Data Lake is governed and managed effectively or it could turn into a ‘Data Swamp’ – an unorganised grouping of data with no metadata or curation.

The first thing to do is to establish a pattern for organising your data. A common approach is to split the data into zones based on its classification, with each zone mapping to an Azure container that holds a set of folders for further organisation.

RAW: This is where all RAW data is ingested and stored. It is vital that this data remains unchanged so it can be reused by future projects. Access to this container should be restricted to automated processes, services and Cloud Administrators.

Temporary: As the name suggests, this container is used to store temporary files during the processing and curation of data. Access to this container should be restricted to the automated processes that are performing data transformation.

Trusted: After data has been taken from the RAW container and has been validated/cleansed/masked/obfuscated, it is placed into the trusted container where analytical services can begin interrogating it. This validation step (performed by services such as Azure Data Factory) adds metadata, performs quality checks or changes the format of the dataset.

Transformed: Transformed and enriched data is kept in this container and utilised by delivery teams and key stakeholders to deliver reports and further analysis.
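
A minimal sketch of how these zones could be created with the azure-storage-file-datalake Python SDK is shown below. The account URL, zone names and per-source folders are assumptions used to illustrate the layout; adapt them to your own naming standards.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Assumed account URL; authenticate with Azure AD.
service = DataLakeServiceClient(
    account_url="https://contosodatalake.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# One container per zone, each seeded with a folder per source system.
zones = {
    "raw": ["sales", "finance", "iot"],
    "temporary": ["staging"],
    "trusted": ["sales", "finance"],
    "transformed": ["reporting"],
}

for zone, folders in zones.items():
    file_system = service.create_file_system(file_system=zone)
    for folder in folders:
        file_system.create_directory(folder)
```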

Security

Security is one of the top concerns for an organisation migrating its data to The Cloud. Ensuring customers’ data remains protected is a primary goal for every business working with IT, and Microsoft Azure has built a series of tools and processes that make it easy to establish a strong security posture.

Access

Following the pattern above, granular security controls will need to be placed on each different container type to match the type of access that is needed and enforce a least privilege access policy. Within Azure Data Lakes, there are two primary access control mechanisms: Role-Based Access Control and Access Control Lists.

Using Azure Active Directory, role-based access control allows organisations to manage identity – who or what can access a dataset. This can be either a person or a service principal assigned to a service such as Azure Data Factory. However, role assignments are only applied at the container level and may therefore break the least-privilege access model. To define granular permissions on specific folders and files, Access Control Lists should be applied at the folder and file level. Using Active Directory groups, access for a specific business function, e.g. developers, can be targeted at a particular folder within the Data Lake container.
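
To make this concrete, here is a minimal sketch using the azure-storage-file-datalake Python SDK to grant an assumed ‘Developers’ Azure AD group read access to a single folder in the trusted zone. The group object ID, container and folder names are hypothetical.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical object ID of the 'Developers' Azure AD group.
developers_group_id = "00000000-0000-0000-0000-000000000000"

service = DataLakeServiceClient(
    account_url="https://contosodatalake.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

directory = service.get_file_system_client("trusted").get_directory_client("sales")

# POSIX-style ACL: owning user full access, the Developers group read/execute,
# everyone else denied. The execute bit is needed to traverse a folder.
directory.set_access_control(
    acl=f"user::rwx,group::r-x,other::---,group:{developers_group_id}:r-x"
)
```

Note that an ACL set this way applies to the existing folder only; default ACLs would also need to be set on the folder so that newly created files inherit the same permissions.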

Networking

An additional layer of defence lies in the network restrictions that can be placed on the Azure Data Lake. This includes the use of firewalls to restrict access to only Azure services or an organisation’s address range – users will still need to be authenticated using roles as above.
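
As an illustration of the firewall approach, the following sketch (using the azure-mgmt-storage Python SDK, with placeholder names and an example address range) denies all traffic by default while allowing trusted Azure services and the organisation’s own range through.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import (
    StorageAccountUpdateParameters, NetworkRuleSet, IPRule,
)

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Deny everything by default, then allow trusted Azure services and the
# organisation's address range (example CIDR) to reach the account.
client.storage_accounts.update(
    "rg-data-platform",
    "contosodatalake",
    StorageAccountUpdateParameters(
        network_rule_set=NetworkRuleSet(
            default_action="Deny",
            bypass="AzureServices",
            ip_rules=[IPRule(ip_address_or_range="203.0.113.0/24")],
        )
    ),
)
```

These rules only restrict where requests can originate from; as noted above, callers still need to authenticate and hold the appropriate role or ACL.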

Encryption

Data at rest is encrypted automatically by Azure Data Lake Storage using Microsoft-managed keys, and it is possible to supply customer-managed keys instead. Encryption in transit should be enforced for all communication coming in and out of the Data Lake. This can be made easier by using services such as Azure Data Factory, which handle encrypted transfer as part of the service.

Alerting and Auditing

Enabling Microsoft Azure’s advanced security features allows abnormal access and potential risks to be tracked, with alerts raised via Azure Threat Detection. The Activity Log stores all access requests, providing a source of data for investigation and regulatory purposes.

Cost

Azure Data Lake Storage is priced in line with Azure’s already low-cost Blob storage – you pay only for the storage that you use, and there is no concept of reserving a specific amount of storage up front.

Pricing starts at around £0.02 per GB of data stored per month, and with all of the analytical and ETL services available in the UK regions, there is no need to pay transfer costs between regions.
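
As a back-of-the-envelope illustration of what that rate means (the figure above is approximate and excludes transaction, analytics and compute charges), a hypothetical 5 TB RAW zone works out as follows:

```python
# Illustrative storage cost estimate only - not an Azure price quote.
price_per_gb_month = 0.02   # approximate GBP per GB per month, as quoted above
raw_zone_tb = 5             # assumed size of the RAW zone

monthly_cost = raw_zone_tb * 1024 * price_per_gb_month
print(f"~GBP {monthly_cost:.2f} per month")   # ~GBP 102.40 per month
```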

The majority of Data Lake costs will reside in using analytical, compute and ETL services to curate and transform data.

Cataloguing Data

In order to discover data sources and understand their context, Azure has a fully managed Azure Data Catalog service which helps organisations get more value from their existing investments.

By registering data sources from within the Data Lake with Azure Data Catalog, a copy of their metadata is stored and indexed to enable quick searching and discovery. This metadata can be annotated and enriched to include information such as authors, how to request access, versions etc.

Scalability

Azure Data Lake supports instant, highly scalable data processing. The processing power required to execute concurrent analytic jobs can be made available within seconds, without any effort spent managing or tuning infrastructure. This out-of-the-box, instant scalability is the key to achieving the massive throughput needed to support large analytical workloads.

The service auto-scales to meet increased demand from Azure and external services at no extra cost.

Azure Data Lake Analytics

Once your data is stored in the Azure Data Lake, it is possible to take advantage of Azure’s analytical services. These services allow you to easily develop and run massively parallel data transformation and processing programs in U-SQL, R, Python and .NET over petabytes of data.

All of this is done with no infrastructure to manage, allowing you to process on demand and scale instantly, paying only for what you run. Microsoft has also introduced Azure Synapse, a service which aims to bridge the gap between Data Warehouses and Data Analytics by combining the technologies under one service.


Azure Synapse is an evolution of Azure SQL Data Warehouse that bridges the gap between Data Lakes and Data Warehousing. With built-in analytics, the Power BI suite can be used within the Azure Synapse platform, and there is tight integration with Azure Active Directory. Queries against data can be performed using serverless, on-demand resources or provisioned resources, and the results can then be analysed by Azure Machine Learning for advanced analytics.
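
As a flavour of how this looks in practice, the sketch below queries CSV files in the trusted zone directly with Synapse’s serverless SQL pool via pyodbc. The workspace name, file path and ODBC driver are assumptions, and the column handling will depend on your data.

```python
import pyodbc

# Assumptions: a Synapse workspace named 'contoso-synapse', the ODBC Driver 17
# for SQL Server installed locally, and interactive Azure AD sign-in.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=contoso-synapse-ondemand.sql.azuresynapse.net;"
    "Database=master;"
    "Authentication=ActiveDirectoryInteractive;"
)

# Serverless OPENROWSET query directly over CSV files held in the data lake.
query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://contosodatalake.dfs.core.windows.net/trusted/sales/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS sales;
"""

for row in conn.cursor().execute(query):
    print(row)
```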

Data Extraction

Now that we know where we want to store our data, we must look at how we will extract it from the various data sources held throughout an organisation. 

The goal here is to migrate data in incremental batches, usually overnight, to keep data in our cloud data lake up to date. This must be done securely and ideally, with minimal change to on-premise workloads/design patterns.

As such, a good starting point would be to utilise tools such as Microsoft’s AZCopy utility to move data automatically from on-premise storage to The Cloud. Data would first have to be extracted from the data source using proprietary methods before being transferred to the data lake using AZCopy.
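
A minimal sketch of that first stepping stone, assuming the AZCopy binary is on the PATH and using a placeholder export folder and SAS token with write access to the RAW zone:

```python
import subprocess

# Placeholder source folder and destination URL (SAS token not shown).
source = r"D:\exports\sales"
destination = "https://contosodatalake.dfs.core.windows.net/raw/sales?<sas-token>"

# 'azcopy copy --recursive' transfers the nightly export, including every
# file under the source folder, into the RAW zone of the data lake.
subprocess.run(["azcopy", "copy", source, destination, "--recursive"], check=True)
```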

Once this pattern has been proven and is delivering good results, a move to a more permanent solution such as Azure Data Factory would be advised. Azure Data Factory is Microsoft’s answer to the ETL problem of how to get data into Azure. It is a fully managed service that can be easily configured to pull from a variety of data sources, perform transformations and push to a variety of Azure services.


Furthermore, solutions such as Azure ExpressRoute and VPN Gateway can form permanent connections between your on-premise workloads and Azure’s data lake to secure traffic and increase performance and availability.

If your data sources are already in a Microsoft Azure environment, Azure Data Factory will provide all the necessary connectors and access controls to ship your data from source to data lake.

Data Transformation

As previously hinted, Azure Data Factory is the service of choice for manipulating data within Microsoft Azure. It is possible to construct ETL processes code-free within an intuitive visual environment or by writing your own code. There are over 90 connectors that allow you to connect to various data sources at no extra cost.

The fully managed service can scale on demand to process requests and has built-in Azure security controls to allow it to integrate with existing Azure services and on-premise workloads.

What this service allows us to do is take the data we have copied to our data lake and transform it so that it is ready to be analysed or combined with existing datasets.

The following example outlines how the existing technologies we have discussed would be utilised as a first stepping stone:

  • An on-premise SQL database exports a nightly CSV file to on-premise storage.
  • Microsoft’s AZCopy utility copies the export to Azure Data Lake Storage.
  • An event is triggered to start an Azure Data Factory pipeline which transforms the CSV – removing unneeded columns and rows with dates older than a month, and adding some calculated columns – before exporting the processed CSV file back to Azure Data Lake Storage.
  • Microsoft Power BI Dataflows pick up the newly processed CSV file and update a visual report based on the new data.
  • Users can view, edit and share the report within the Power BI ecosystem.

What’s important to note in this example is that the original RAW data from the SQL database is still available for future use. By producing a new dataset from the original, we have not altered the raw data, and it can be reused by any number of teams in the future to create new datasets based on their requirements.
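
For illustration, the core of the transformation step might look something like the pandas sketch below. This is a stand-in for the Azure Data Factory logic, not the pipeline itself; the column names, file paths and one-month cut-off are assumptions.

```python
from datetime import datetime, timedelta

import pandas as pd

# Read the raw nightly export (assumed columns and paths).
raw = pd.read_csv("raw/sales/sales_export.csv", parse_dates=["order_date"])

# Remove unneeded columns and rows with dates older than a month.
cutoff = datetime.utcnow() - timedelta(days=30)
trimmed = raw.drop(columns=["internal_notes"])
trimmed = trimmed[trimmed["order_date"] >= cutoff]

# Add a calculated column, then write the processed file back to the lake layout.
trimmed["line_total"] = trimmed["quantity"] * trimmed["unit_price"]
trimmed.to_csv("transformed/reporting/sales_processed.csv", index=False)
```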

Data Visualisation

Visualising data within a reporting suite is beneficial to key stakeholders within an organisation as it allows managers to quickly identify trends in performance, business areas of concern and how the organisation compares with the outside world.


Microsoft’s Power BI platform is the de facto tool for visualising your data and sharing insights across your organisation. It is a self-service, unified enterprise analytics solution that allows you to develop the data-driven culture needed to thrive in a fast-paced, competitive environment.

Once again, Microsoft provides a large set of prebuilt connectors allowing Power BI to connect to hundreds of data sources. In our example, we would use the Azure Data Lake Storage Gen2 connector to pick up the processed CSV file. Wrapping this up into a Power BI Dataflow allows automatic refreshes of the data so that our reports are always up to date.

Summary

Data Lakes allow organisations to combine datasets and glean business insights from trends and amalgamated data that would otherwise remain isolated.

By utilising Data Lakes, organisations can future-proof their analytical capabilities by ensuring data is democratised and available in its raw state.

Microsoft Azure provides a rich set of services that help organisations extract, transform, load and present data. New services like Azure Synapse aim to bridge the gap between data warehouses and data lakes, bringing a group of tools together under one unified service with machine learning analytics built in.