Data Lake Management: How to Prevent a Data Swamp

A lake conjures up imagery of cool, inviting waters that beckon you for a refreshing swim. A swamp, on the other hand, feels ominous. Most would be reluctant to even dip a hand into the cloudy waters since you never know what creatures could arise from the murky depths.  

James Dixon, CTO at Pentaho, was inspired by images of clear lake water when he coined the term “data lake” for a new kind of storage system. As it turns out, the term is an apt description of how data flows into these repositories. A data lake can accommodate both structured and unstructured data. However, the sheer volume of data can transform a data lake into a dark, uninviting data swamp. Instead of fishing for new insights, businesses end up scratching their heads as they muddle through piles of data in search of their files. Fortunately, you’ll never find yourself in this position if you build your data lake on a robust, logical architecture.

Always Be Selective When Creating a Data Lake

Effective data lake management requires a degree of selectivity to maintain a collection of information that’s well-organized and accessible. Data lakes have become important to businesses because of the vast amounts of information they can store, coupled with the low cost of collecting and channeling this data into storage. Modern data streams originate from a variety of sources, from IoT sensors to conventional structured operational data. This breadth of access and storage allows for the discovery of insights like never before.

For a logistics company, vehicle speed analysis, weather forecasts and road condition updates can lead to optimized travel routes. This can save both time and money while increasing customer satisfaction. The reality, however, is that these multiple data channels bring their own complexities, so businesses must collect data selectively. Extraneous data will only turn a pristine, usable data lake into a data swamp.

Businesses need to apply the same principles of demand-driven supply chains to their data. First, decide what business problems you want the data to solve. Once these are established, determine the cost of retaining (or not retaining) the data. This process lets you decide what data needs to be captured and stored, so irrelevant data doesn’t clutter the data lake and distract from the essential data points that are crucial for forming key business insights.

Organize Your Data With Metadata Tagging

How many of us have moved stuff into the attic, thinking that we’ll need those items again someday? And before you know it, your attic is cluttered with objects that haven’t been used in decades. When that day comes to finally retrieve an item from the attic, you find yourself searching for hours because that object is buried amidst all those other “useful” objects. Your data lake often gets the same treatment. Lots of data won’t be used for weeks or even months, but when you need that information, it’s vital that you know precisely where and how to find it.

Keeping your data well-organized within a data lake ensures that it remains intelligible and easily accessible. Without good governance, data will never be properly accessible: the benefits of a central repository are nullified if data is thrown in without care, creating a bottomless pit. Appropriate governance and cataloging of raw data are a must.

If you don’t tag your data, there’s a good chance you’ll be hard pressed to find a specific piece of it later. You may ask all the right questions, but the data you need to answer them could be lost in a data swamp. Metadata tagging is an essential part of data lake management because it enables high-speed data discovery: once data is tagged, users can search datasets by keyword. Since large volumes of data are continually being added, this process might sound tedious. The good news is that numerous automated processes can take over, tagging each piece of incoming data as it arrives.

Raw data is usually tagged with three types of metadata:

  • Technical metadata includes file size, format (such as text, image, or audio), structure, and creation date and time.
  • Operational metadata includes information about where the data came from, referred to as data lineage. Operational metadata also tags data when it is updated so that version history is maintained. Finally, it can track whether records were rejected and the outcome of each job run.
  • Business metadata comprises the common terms assigned to data fields to make them easily searchable. It also covers business rules, such as masking credit card numbers and other personal data. Companies in the retail/e-commerce and commercial sectors commonly deal with large volumes of this type of data.
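As an illustration, here is a minimal sketch of how these three tag categories might be represented and searched. The field names, `tag_incoming_file` helper, and catalog structure are hypothetical; real catalog tools such as AWS Glue or Apache Atlas define their own schemas.

```python
import os
import mimetypes
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DatasetRecord:
    """One hypothetical catalog entry for a raw file landing in the lake."""
    path: str
    # Technical metadata: size, format, creation time.
    size_bytes: int = 0
    file_format: str = "unknown"
    created_at: str = ""
    # Operational metadata: lineage (source system) and version history.
    source_system: str = ""
    version: int = 1
    # Business metadata: searchable keywords and business rules.
    business_tags: list = field(default_factory=list)
    mask_pii: bool = False

def tag_incoming_file(path, source_system, business_tags, mask_pii=False):
    """Automatically attach technical metadata when a file is ingested."""
    fmt, _ = mimetypes.guess_type(path)
    return DatasetRecord(
        path=path,
        size_bytes=os.path.getsize(path) if os.path.exists(path) else 0,
        file_format=fmt or "unknown",
        created_at=datetime.now(timezone.utc).isoformat(),
        source_system=source_system,
        business_tags=business_tags,
        mask_pii=mask_pii,
    )

def search_catalog(catalog, keyword):
    """Keyword search over business tags -- the payoff of tagging."""
    kw = keyword.lower()
    return [r for r in catalog if any(kw in t.lower() for t in r.business_tags)]
```

With records tagged on ingest, `search_catalog(catalog, "sales")` returns every dataset carrying a matching business tag, which is exactly the high-speed discovery that keeps a lake from becoming a swamp.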

A big benefit of data lake governance is long-term storage: companies can securely stow data that could prove useful in the future. This helps prevent businesses from throwing away valuable data while keeping the data lake from gradually turning into a murky data swamp.

Automate Your Data Lake

Data lakes are critical for companies seeking to pull information from disparate sources into a single, central location for use in business intelligence. They are also the foundation on which technologies like artificial intelligence (AI) and machine learning are built, sorting, analyzing and spotting patterns at a speed that yields faster intelligence and, in turn, more efficient business decisions. While massive storage capacity is a key benefit of a data lake, the weight of all this data truly pays off when you use it to bolster emerging technologies and new company initiatives.

Sertics is a software as a service provider for data lake creation, data visualization, data lake management, and predictive analytics. Sertics also utilizes other emerging technologies like machine learning and artificial intelligence. Learn more about Sertics by contacting our team and scheduling a product demo today.

Harshit Gupta has more than eight years of experience as a Senior Developer. He holds a bachelor’s degree in Computer Software Engineering and an MS degree in Computer Science from Arkansas State University. Some of his most notable accomplishments include his dissertation on 3-Tier MapReduce, being named runner-up in the SevenTablets Annual Hack-a-thon, and receiving the “think/WOW” award for his work on Lockton-Dunning Benefits.