From the oil rig to the lake: a shift in perspective on data
If you’ve spent much time around CIOs, you’ve probably come across the concept of data lakes. Simply put, data lakes are repositories of raw data designed to be used across an organisation. Their purpose is to allow companies to take advantage of big data and avoid the pitfalls of data silos.
But like any major technological initiative, it’s all about implementation. Data lakes are not the ideal solution for every business. Where data lakes fail to achieve their goal is often down to factors around the lake, so I’ll be discussing these challenges and some ways that managers can proactively anticipate them.
The primary challenges are:
- Disconnecting data from business processes
- Organisational issues and working in silos
- Impatience, and a lack of tenacity and resilience
How do we address these challenges?
Know your data. Gather a complete picture of data lineage: what were the original business processes and interactions that generated it? Where can it be found? Who in the business really knows about it?
Form follows information. Data-driven teams must constantly reassess and reorganise their work, not around company function, but in a way that reflects valuable information processes.
First, a quick discussion of digitalisation
Beyond the hype, the skillful exploitation of data, analytics, and AI has well-documented benefits. And because these fields are still fairly young, their effects have not yet been fully felt. But without the proper tools to monitor it, this process of digitalisation can actually obscure what’s going on in the business.
In a brick and mortar shop, to understand customer behaviour, you could interview customers, salespeople, and managers to get an idea of what’s going on. By contrast, the customer experience in the early days of e-commerce clearly lacked the personal nature of brick and mortar shopping. Amazon’s convention of showing other items you might like to buy began this shift towards more personalisation.
Modern consumers demand increasingly convenient and personalised services. These are made possible by the data collected on their behaviour and the use of analytics tools and AI to understand and predict preferences. So, while digitalisation can initially result in less personalisation and observability, with proper implementation these qualities eventually surpass those of traditional models.
AI and silos
So, it’s clear that data is necessary to enable personalisation. However, this raises another issue: data collected in one part of the organisation often isn’t available or accessible to another part of the business. These data silos can prevent proper communication and may lead to suboptimal decisions, despite the best intentions and efforts of managers.
With each department collecting its own data, decisions that optimise for certain factors in one part of the business may hurt other areas. If we want a 360 degree view of customers, we therefore need a way to handle heterogeneous and disparate data systems. So how do we do this?
How about a data lake?
A data lake is an umbrella term for technologies that solve a very practical problem: How can we get a view of all the data in our enterprise with a reasonable amount of effort? Instead of carefully first harmonising and transforming all the data into a normal form, in a data lake, everything is extracted in its original format and loaded into a single environment, often in the cloud. All sorts of modern, high-performance tools can then be used to transform and work with the data. The data lake allows you to generate many possible views on your data, rather than the narrower approach of dealing with each department’s data separately.
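The extract-first, transform-later (ELT) idea can be sketched in a few lines. Here an in-memory SQLite database stands in for the lake, and the table and column names are purely illustrative: two source systems are loaded as-is, mismatched formats and all, and one of many possible cross-system views is built only when it is needed.

```python
import sqlite3

# A toy stand-in for a data lake: raw extracts from two systems are loaded
# as-is into one environment; transformation happens afterwards, per use
# case (ELT rather than ETL). All names here are illustrative.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE crm_raw (cust_id INTEGER, name TEXT)")
con.execute("CREATE TABLE shop_raw (customer INTEGER, total_eur TEXT)")  # amounts as text, as exported
con.executemany("INSERT INTO crm_raw VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
con.executemany("INSERT INTO shop_raw VALUES (?, ?)",
                [(1, "10.50"), (2, "7.00"), (2, "3.25")])

# One of many possible views, built on demand across both raw sources:
rows = con.execute("""
    SELECT c.name, SUM(CAST(s.total_eur AS REAL)) AS revenue
    FROM shop_raw s JOIN crm_raw c ON c.cust_id = s.customer
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(rows)  # → [('Alice', 10.5), ('Bob', 10.25)]
```

The point is that the cleaning (here, the `CAST`) lives in the query, not in a mandatory up-front harmonisation step, so each team can derive the view it needs from the same raw data.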
However, the data lake isn’t a magical place where everyone can dump their data and leave it to data scientists like me to figure out a plan. We call that a ‘data swamp’, and it feeds into a misguided idea of data scientists digging for the data equivalent of gold.
Data is not oil
The concept of big data arrived with digitalisation, and it is certainly a tantalising buzzword for stakeholders, especially given the success of companies like Amazon and Facebook in exploiting it. ‘Data is the new oil’ is a slogan that works when we talk about data being a catalyst for a new industrial era. However, when one talks about dealing with data to gain benefit in practice, the metaphor breaks down.
Unlike oil, refining data isn’t the only step necessary to create value, and the value of data is extremely context-dependent. Data is not a specific raw material. Instead, I like to think of data as being traces that a hunter uses to find prey, rather than the data being the prey itself. In business, the traces are left by numerous business processes, oftentimes confined to areas like marketing or procurement. Instead of being obsessed with the data, we must aim to understand the processes that generate the data.
Deconstructing the processes
Data is a byproduct of the systems on which the business runs. However, one often hears dejected analysts peer into the data lake and declare that ‘our data is bad’ or ‘messy’. Yet the data is often technically valid, and for many valuable business processes, the storage or computation needs are not particularly challenging. So what’s the problem then? The biggest concern for analytics – and, by extension, AI – is that the data is either insufficient or conceptually or semantically broken.
Many problems in analytics crop up when the data is taken out of its original context.
Here’s a very simple example: simple reporting with no AI involved. Reporting tasks can already reveal how data generation processes require refining work that ends up being quite time-consuming. For example, reporting sales by marketing categories may turn out to be tricky if the product categorisation originally comes from a sales system. More questions arise: What do the categories mean? Who owns them? How and when are they updated? Are the categories exhaustive, or are there some missing?
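To make the reporting example concrete, here is a minimal sketch with made-up data and category names: sales rows carry the sales system’s codes, and a separate, hand-maintained table maps them to marketing categories. The gaps in the mapping surface immediately – which is precisely the ‘are there some missing?’ question.

```python
from collections import defaultdict

# Illustrative sales rows carrying the *sales* system's category codes.
sales = [
    {"sku": "A1", "revenue": 120.0, "sales_category": "SHOES"},
    {"sku": "B2", "revenue": 80.0,  "sales_category": "SHOES"},
    {"sku": "C3", "revenue": 50.0,  "sales_category": "MISC"},
]

# Mapping from sales categories to marketing categories. Somebody in the
# business has to own this table and keep it up to date.
to_marketing = {"SHOES": "Footwear"}

report = defaultdict(float)
unmapped = set()
for row in sales:
    category = to_marketing.get(row["sales_category"])
    if category is None:
        unmapped.add(row["sales_category"])  # surfaces gaps in the mapping
    else:
        report[category] += row["revenue"]

print(dict(report))  # {'Footwear': 200.0}
print(unmapped)      # {'MISC'}
```

The code is trivial; the hard part – deciding who owns `to_marketing` and what `MISC` should become – is cross-functional work that no tooling can do for you.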
Answering these questions requires working across functions. A data lake won’t miraculously fix this, so it shouldn’t be considered a panacea that will solve your business’s problems. It does not understand the business, or even bring direct business value. It simply widens the options by making data more readily available.
Importance of data lineage & feedback loops
To reconstruct a valid business process from the data, we must understand where and how the data is being generated. Drawing business-relevant observations from the data lake requires a combination of expert work, communication, and iteration.
Understanding the lineage of your data is a paramount task and the source of most of the work in a typical analytics or AI case. It allows for the construction of semantic data models that show how data relates to the business processes that generated it. However, when there is no information from a particular process, despite what you may have seen on CSI, no amount of analytical magic can reconstruct it.
Therefore, effort must be put into collecting informative data. That crucial data is missing is often discovered only after some concrete business-optimisation case has already started. The information content could be absent altogether (no data), or simply not understood because it is out of context. This means we must untangle the process that generates the data and figure out where to collect more of it.
There must be a reliable feedback loop that records in context the sequence of events. If we take the customer journey as an example, the data is about interactions between the customer and the company. If a message was sent to customer service, what was the response to it? We need to understand the feedback loop between the customer and company’s actions – whether it’s a marketing message, sales transaction, notification, location, etc. This must be done systematically, in a form that suits automated processing, in order to make good use of the data lake.
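Recording events ‘systematically, in a form that suits automated processing’ can be as simple as an append-only log where each event can reference the event it answers. The sketch below is one possible shape, with illustrative field names rather than any standard schema; the `in_reply_to` link is what closes the feedback loop between the customer’s and the company’s actions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Minimal sketch of a machine-readable interaction log. Field names are
# illustrative, not a standard schema.
@dataclass(frozen=True)
class Interaction:
    event_id: str
    customer_id: str
    channel: str               # e.g. "email", "customer_service", "webshop"
    event: str                 # e.g. "message_sent", "response_received"
    occurred_at: datetime
    in_reply_to: Optional[str] = None  # links a response to the event it answers

log = [
    Interaction("e1", "c42", "customer_service", "message_sent",
                datetime(2023, 5, 1, 9, 0, tzinfo=timezone.utc)),
    Interaction("e2", "c42", "customer_service", "response_received",
                datetime(2023, 5, 1, 11, 30, tzinfo=timezone.utc),
                in_reply_to="e1"),
]

# Because events reference each other, response times fall straight out:
by_id = {e.event_id: e for e in log}
responses = [(e, by_id[e.in_reply_to]) for e in log if e.in_reply_to]
for resp, msg in responses:
    print(resp.customer_id, resp.occurred_at - msg.occurred_at)  # c42 2:30:00
```

Without the `in_reply_to` link – without the sequence recorded in context – the same two rows would be just two disconnected timestamps, and the question ‘what was the response to it?’ would be unanswerable.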
Taking data out of the silo
Let’s use a simple example. An analytics team is asked to optimise targeted marketing lists for a particular product. There are several ways this could be optimised. For brevity, let’s concentrate on one detail.
Having one item sold to a customer generates a certain revenue, but if the customer returns the product, it causes costs (e.g. transport, re-stocking, customer service engagement, etc.). Knowing the actual revenue and costs is instrumental for properly modelling the expected gains from using an AI model. If the costs incurred by returned products are negligible, we might not need to bother optimising this.
Because of the data lake, the team may notice conflicting incentives between departments. For example, the sales department could be rewarded only in relation to the units sold. They aren’t seeing the lost revenues from returns. In AI parlance, their utility function is optimised by selling as many units as possible, not by minimising returns. It’s therefore important that the model takes into account different data points relevant to the health of the whole business, rather than one department’s very narrow utility function. This may imply making use of appropriate APIs and other integrations from across the business. But more fundamentally, it requires getting the relevant people involved around the same table much more often.
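The expected-gain reasoning above can be written down directly. The numbers below are purely illustrative, not figures from any real business; the point is only that the return probability and return cost enter the calculation, so a model rewarded on units sold alone is optimising a different quantity.

```python
# Sketch of the expected net gain from contacting one customer.
# All parameter values below are illustrative assumptions.
def expected_gain(p_buy, unit_revenue, p_return_given_buy, return_cost):
    """Expected net gain from targeting one customer."""
    # Revenue materialises only on a sale that is kept; a return instead
    # incurs costs (transport, re-stocking, customer service engagement).
    return p_buy * ((1 - p_return_given_buy) * unit_revenue
                    + p_return_given_buy * (-return_cost))

# If returns are rare and cheap, optimising for them barely moves the needle:
print(expected_gain(p_buy=0.05, unit_revenue=40.0,
                    p_return_given_buy=0.02, return_cost=5.0))  # ≈ 1.955
```

Comparing this value with and without the return terms is exactly the check suggested in the text: if the difference is negligible, return costs may not be worth modelling; if it is large, a units-sold utility function is quietly hurting the whole business.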
Cross-functional learning, case by case
The topics described in the previous section are cross-functional. Not only do they all matter when designing an AI solution that brings real value, but they also affect the prioritising of technology and data development. Repeatedly prioritising and mapping out end-to-end cases is important for several reasons:
- Extract value out of the data in the first place;
- Get a direction for the development work;
- Find bottlenecks in business processes;
- Build new capabilities in order to learn a new way of genuine, cross-functional teamwork.
Many cases will fail because of a lack of information content available at that moment. Persistently trying to implement end-to-end business cases will help you to discover new and important considerations. This process must be embraced and relevant findings systematically shared for learning and adaptation across the business.
Cooperation is also needed to avoid false assumptions on what other people and systems actually do. For example, a manager may assume that the data lake will provide access to dashboards or insights for their specific needs. However, a data lake is a generic concept, not a promise of user interfaces and reporting tools that would fit just about any wish. A data lake might make things easier, but it still requires BI expertise.
Therefore, effort should be made to extend the data lake into a centre of excellence in analytics, eventually becoming a business-critical operating system. Because many of the relevant pieces of the puzzle are run by other parts of the organisation, leaders must think critically about operations and divide responsibilities between marketing automation, customer care, production, data systems, and so forth.
Deconstructing and reconstructing teams
Optimising the design of the data lake can’t occur without instruction and iteration from real business cases. Building a data-driven business requires company-wide transparency – both into the data being generated and the way the work is carried out.
Analysing and utilising data by function or by product line just reinforces siloing, giving a fragmented view of operations and customer interactions. This may lead to suboptimal decisions, overloaded messaging, fragmented service, and ultimately brand dilution.
It is not uncommon to see well-intentioned business experts making technology decisions (see: shadow IT), while analysts or tech people inadvertently make business decisions. Business, marketing, AI, and data experts must learn to communicate by working through these problems together.
The data lake is just one way of getting all of these people together. Deciding to build a data lake can be seen as both a symptom of a business that knows it needs to work together better, and the first step towards making this happen.
One final thought
Customers do not care about how a business is organised. They just want personalisation, speed, and seamlessness. These can all be facilitated by understanding and handling the information flow in the business. The data lake can therefore be a catalyst in changing how the organisation works.