Cloud vs on-premise data lakes

by Joseph K. Clark

Handling large amounts of data is a prerequisite of digital transformation, and key to this are the concepts of data lakes, warehouses, hubs, and data marts. In this article, we’ll start at the top of that hierarchy and look at data lakes. As organizations try to get a grip on their data and wring as much value from it as possible, the data lake is a core concept. It’s an area of data management and analysis that depends on storage – sometimes lots of it – and it’s an activity ripe for a move to the cloud but can also be handled on-premise. We’ll also look at the type of storage needed for a data lake – often object storage – and the pros and cons of building in-house or using the cloud.

Data lake vs. data warehouse

The data lake is the first place an organization’s data flows to. It is the repository for all data collected from the organization’s operations, where it will reside in a more or less raw format. There may be some metadata tagging to facilitate searches of data elements, but the intention is that the data lake will be accessed by specialists, such as data scientists and those who develop touchpoints downstream of the lake.

Downstream is appropriate because the data lake is seen, like an actual lake, as something into which all data sources flow, and those sources are potentially many, varied, and unprocessed. From the lake, data would go downstream to the data warehouse, which implies something more processed, packaged, and ready for consumption.


While the data lake contains multiple stores of data in formats not easily accessible or readable by the vast majority of employees – unstructured, semi-structured, and structured – the data warehouse comprises structured data in databases to which applications and employees are afforded access. A data mart or hub may allow for data that is even more easily consumed by departments. So, a data lake holds large quantities of data in its original form. Unlike queries to the data warehouse or mart, interrogating the data lake requires a schema-on-read approach.
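As a rough illustration of schema-on-read, the sketch below keeps raw records exactly as they arrived and only imposes a structure when they are read. The sample records, the SCHEMA mapping, and the read_with_schema helper are all hypothetical names invented for this example; a real data lake would more likely rely on a query engine than on hand-written code like this.

```python
import json

# Raw records as they might land in the lake: nothing is enforced on write.
RAW_LINES = [
    '{"user": "alice", "amount": "19.99", "ts": "2023-01-05"}',
    '{"user": "bob", "amount": 5, "note": "refund"}',
    'not valid json at all',
]

# Hypothetical schema chosen by the reader at query time: field -> (type, default).
SCHEMA = {"user": (str, ""), "amount": (float, 0.0), "ts": (str, None)}


def read_with_schema(lines, schema):
    """Yield dicts shaped by `schema`; skip lines that cannot be parsed."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # the raw line stays in the lake; this reader ignores it
        if not isinstance(record, dict):
            continue
        row = {}
        for field, (cast, default) in schema.items():
            value = record.get(field, default)
            # Cast present values to the requested type; keep missing ones as None.
            row[field] = cast(value) if value is not None else None
        yield row


rows = list(read_with_schema(RAW_LINES, SCHEMA))
for row in rows:
    print(row)
```

A different consumer could read the same raw lines with a different schema, which is the point of deferring the structure to read time.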

Data lake: Data types and access methods

Data sources in a data lake will include all data from an organization or one of its divisions. It might consist of structured data from relational databases, semi-structured data such as CSV and log files, data in XML and JSON formats, unstructured data like emails, documents, and PDFs, and binary data such as images, audio, and video.

In terms of storage protocol, the data lake will need to store data that originated in file, block, and object storage. Of those, object storage is a common choice for the data lake itself. Bear in mind that access will not be to the data itself but to the metadata headers that describe it, which could be attached to anything from a database to a photo. Complex data querying often happens elsewhere, not in the data lake.

Object storage is well-suited to storing vast amounts of data, including unstructured data. You can’t query it like you can a database in block storage, but you can store multiple object types in a large flat structure and find out what’s there. Object storage is generally not designed for high performance, and that’s fine for data lake use cases, where queries are more complex to construct and process than in a relational database in a data warehouse. It is also acceptable because much of the querying at the data lake stage serves to produce more easily queryable data stores for the downstream data warehouse.
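To make the idea of a large flat structure with descriptive metadata more concrete, here is a minimal in-memory sketch. FlatObjectStore, its methods, and the sample keys and metadata are hypothetical and do not correspond to any real object storage product or API; the sketch only shows that objects of any type sit side by side under plain keys and are discovered through their metadata rather than by querying their contents.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    data: bytes
    metadata: dict = field(default_factory=dict)


class FlatObjectStore:
    """Toy in-memory stand-in for an object store (hypothetical, not a real API)."""

    def __init__(self):
        self._objects = {}  # one flat namespace: key -> StoredObject

    def put(self, key, data, **metadata):
        self._objects[key] = StoredObject(data, dict(metadata))

    def head(self, key):
        """Return only the metadata, without reading the payload."""
        return dict(self._objects[key].metadata)

    def list_keys(self, prefix=""):
        """Slashes in keys are just characters; a prefix filter mimics folders."""
        return sorted(k for k in self._objects if k.startswith(prefix))

    def find(self, **wanted):
        """Return keys whose metadata matches every given key/value pair."""
        return sorted(
            key
            for key, obj in self._objects.items()
            if all(obj.metadata.get(k) == v for k, v in wanted.items())
        )


store = FlatObjectStore()
store.put("sales/2023/orders.csv", b"id,total\n1,9.50\n", kind="csv", source="erp")
store.put("images/store-front.jpg", b"\xff\xd8\xff", kind="image", source="marketing")
store.put("sales/2023/export.json", b'{"id": 1}', kind="json", source="erp")

erp_keys = store.find(source="erp")
sales_keys = store.list_keys("sales/")
print(erp_keys)
print(store.head("images/store-front.jpg"))
```

Note that nothing here can filter on the rows inside orders.csv; that kind of querying is left to whatever processes the object after it is fetched.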

Data lake on-prem vs. cloud

All the usual on-premise vs. cloud arguments apply to data lake operations. On-prem data lake deployment has to consider space and power requirements, design, hardware and software procurement, management, the skills to run it, and ongoing costs in all these areas. It may also have to consider issues such as connectivity, beyond storage and data lake architecture. Outsourcing the data lake to the cloud has the advantage of converting capital expenditure (capex) on infrastructure into operational expenditure (opex) in the form of payments to the cloud provider. That, however, could result in unexpected costs as data volumes scale and as data flows to and from the cloud, for which you will also be charged. So, a careful analysis of the benefits and drawbacks of each is needed.
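One simple way to start such an analysis is to model both options over the same period. The sketch below is only that: the function names, the cost model, and every figure in it are made-up assumptions for illustration, not real prices from any vendor or provider. It does show how storage growth and data transfer out of the cloud both feed into the opex total.

```python
def on_prem_cost(months, capex, monthly_running_cost):
    """Upfront hardware and software spend plus ongoing power, space, and staff."""
    return capex + months * monthly_running_cost


def cloud_cost(months, start_tb, monthly_growth_tb, storage_per_tb,
               egress_tb_per_month, egress_per_tb):
    """Sum monthly storage fees on a growing data set plus data transfer charges."""
    total = 0.0
    stored_tb = start_tb
    for _ in range(months):
        total += stored_tb * storage_per_tb + egress_tb_per_month * egress_per_tb
        stored_tb += monthly_growth_tb  # the lake keeps growing each month
    return total


# All figures below are hypothetical placeholders; substitute your own quotes.
MONTHS = 36
on_prem_total = on_prem_cost(MONTHS, capex=300_000, monthly_running_cost=4_000)
cloud_total = cloud_cost(
    MONTHS,
    start_tb=100,
    monthly_growth_tb=10,
    storage_per_tb=20,
    egress_tb_per_month=5,
    egress_per_tb=90,
)

print(f"on-prem over {MONTHS} months: {on_prem_total:,.0f}")
print(f"cloud over {MONTHS} months:   {cloud_total:,.0f}")
```

Changing the growth rate or the monthly transfer volume shifts the cloud total considerably, which is why those two inputs deserve the most scrutiny.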
