How to keep your data lake fresh

Think Progress Team

Monday 12 February 2018

There’s a reason enterprise-wide data management platforms lend themselves so well to watery metaphors. When the amount of information your company collects and stores becomes so vast that it is inaccessible, that inflow of data becomes stagnant. And when the information can’t move anywhere, your data lake becomes a swamp.

Part of the problem is the ease and speed with which a data lake can be established. Data lakes have popped up quickly throughout organisations, often with no clear curation plan. There’s lots of data there, but too few people know how to use it – or even get to it. The result? Another slightly fetid trope: standing pools of data, isolated bodies of information that cannot readily be integrated or interpreted.

As Ken Tsai, head of cloud platform and data management at SAP, told TechRepublic: “We call this phenomenon ‘data dissonance’ because the data can’t be brought into a harmonic and compatible state without preparing it so it can work with other types of data.”

Because so much data is funnelled through in its raw state, it often arrives without helpful metadata, such as the date it was last modified or accessed. This makes data very hard to trace. It’s the equivalent of hunting for a particular needle in a needle stack.

There’s so much of it too. Companies are prone to hanging on to every last scrap of data just in case – in case an audit needs to be carried out, or in case it might prove useful in some future analysis. Yet without any meaningful integration, data just clogs up servers.

What can your business do to ensure its data lake is a gleaming repository of useful information?

1. Start with the goal

What’s the business problem you’re trying to solve? When you understand this, it’ll be easier to home in on the data you’ll need to collect, as well as the way to interpret that data. Setting out with a goal in mind will help you contextualise the information you gather. This will mean you only gather the information you need.

2. Cut the amount of data you collect

It’s so cheap to collect information that it’s essentially free, which is why it’s easy for businesses to collect too much of it. It’s also likely that someone is thinking, “I’ll sort through that later” – just like we all do with items in storage.

Identifying the problem to solve at the outset will help you gather specific datasets and turn the overwhelming deluge into a helpful trickle.

3. Automate the data-trawling

Once you’ve identified the kind of data you need, you then have to figure out how to process it. Here, it pays to put an automated system in place. With the correct metadata attached to your datasets, you can set an artificial intelligence tool to work, trawling your data and retrieving findings. Machine learning in particular can be an effective way to sort data into chunks ready for your team to analyse and interpret.
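To make the metadata point concrete, here is a minimal sketch in Python of what an automated trawl might start from. It is a simple rule-based filter, not machine learning, and it assumes the data lake is a directory of files. The names CatalogEntry, build_catalog and find_stale are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from pathlib import Path


@dataclass
class CatalogEntry:
    """Metadata recorded for one file (hypothetical structure)."""
    path: str
    size_bytes: int
    modified: datetime
    accessed: datetime


def build_catalog(root):
    """Walk a directory tree and record basic metadata for every file."""
    entries = []
    for p in sorted(Path(root).rglob("*")):
        if p.is_file():
            st = p.stat()
            entries.append(CatalogEntry(
                path=str(p),
                size_bytes=st.st_size,
                modified=datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
                accessed=datetime.fromtimestamp(st.st_atime, tz=timezone.utc),
            ))
    return entries


def find_stale(entries, max_age_days, now=None):
    """Return entries whose last modification is older than max_age_days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [e for e in entries if e.modified < cutoff]
```

Calling find_stale(build_catalog("/path/to/lake"), max_age_days=365) would list files untouched for a year, which are candidates for review or archiving. The path and the one-year threshold are placeholders, and a real system would more likely read metadata from a catalogue service than from the file system.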

It’s okay to be highly selective about the information you gather. In fact, it’s essential if you want to glean anything of value from it.

