3 Misconceptions About Data Lakes

Data lakes are a daring new storage approach that pulls in data from a potentially vast number of sources while offering the agility of self-service analytics and insights. However, as with any new concept, data lakes come with their fair share of misconceptions. To implement and utilize data lakes effectively, we must dispel some of these myths. After all, with great power comes great responsibility, and data lakes are the superheroes of the Big Data world. They empower business users to eliminate disparate data silos while discovering correlations between data sources. Still, a data lake can do only so much, and knowing its limits makes for better expectations and results.

Misconception #1: Data Lakes are a Dumping Ground for All Data

A data lake is like a fathomless repository that can store virtually any amount of structured, unstructured and historical data a business may require. Businesses tend towards a “let’s store everything” strategy, since there’s no way to know precisely which bits of data may be required in the future. This is the exact pitfall you must avoid. It’s like a museum that collects just about anything; that incredible 1,000-year-old artifact might never get the attention it deserves because it lies forgotten in a dusty corner. While massive storage capability is a significant benefit of data lakes, you must be careful to avoid collecting extraneous data that turns into excess clutter. You don’t want to compromise the value and usefulness of your data simply because your data lake is overloaded with information that’s unlikely to serve a purpose now or in the future.

Data volumes can seem endless, so it’s vital that your company establishes a storage strategy to ensure proper data utilization in the future. Businesses need to decide which metrics are most important and which data points should be analyzed to bring about the company’s desired outcomes. Be sure to outline the analytical processes that will draw from the data lake, because those processes will shape data collection practices as well. Here are a few key considerations:

  • What metrics are most important to your business?
  • What data is needed for accurate decision-making?
  • How will various pieces of data be analyzed or utilized as your company seeks to achieve a specific goal? (e.g., data visualization, predictive analytics, machine learning, IoT)
  • How often will you need to analyze this information? And how much data will you need to arrive at an accurate conclusion? (e.g., the past six months’ worth of data)
  • Which data streams will feed into the data lake?
  • How often should each data stream send information to the data lake?
  • How will data-driven insights be implemented? What process will you use?

Data lakes are flexible, so you can change your data requirements at any point as the needs of your business evolve. But starting off by storing only data that you know will be useful prevents data lake clutter and deterioration. It’s often best to start slow as you learn the ropes, thereby setting up your data lake for success.

Knowing the purpose a data set will serve has driven companies to devise use cases across various industries. Organizations are using voice data analysis to improve their customer support services. Retailers are gathering data for customer sentiment analysis, which can be applied to personalize the shopping experience and ensure that businesses are meeting the needs of their customer base. The sports industry is pursuing video analysis techniques not only to improve athletic training and performance but also to provide commentary and analysis for viewers during broadcasts. These are just a few of the many ways in which well-organized data lakes can be leveraged.

Misconception #2: Data Lakes are Unstructured

Data lakes offer an ocean of possibilities, storing both structured and unstructured data from different sources. With roughly 2.5 quintillion bytes of data generated every day, organizations can soon find that their data lake has degraded into a giant swamp of raw data. When analysts need to access that data, they cannot wade through the murky waters of undefined data sets. A lack of governance is a data lake’s kryptonite. While the data itself may start off unstructured, the data lake needs structure and data governance in order to be effective.

Data cannot simply be dumped into a data lake; you must have processes and rules in place to make the data meaningful and accessible. Every bit of data flowing in should be stamped with its source, lineage and file format. A data catalog system then tags everything in business terms and establishes relationships and associations between different data sets. None of this needs to be done manually; machine learning is often incorporated into data catalogs to automate the process. Business users can then use common business terms and other filters to easily search for and access the data they need.
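The stamping-and-search workflow described above can be sketched in a few lines. The entry fields, class names and example data sets below are hypothetical illustrations of the idea, not any particular vendor’s catalog API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical catalog entry: each incoming data set is stamped with
# its source, lineage and file format, plus business-term tags.
@dataclass
class CatalogEntry:
    name: str
    source: str                                   # originating system
    file_format: str                              # e.g. "parquet", "csv"
    lineage: list = field(default_factory=list)   # upstream data sets
    tags: set = field(default_factory=set)        # business terms
    ingested_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

class DataCatalog:
    """Toy catalog that lets business users search by business terms."""
    def __init__(self):
        self._entries = []

    def register(self, entry: CatalogEntry):
        self._entries.append(entry)

    def search(self, term: str):
        # Match on the data set name or any of its business tags.
        term = term.lower()
        return [e for e in self._entries
                if term in e.name.lower()
                or term in {t.lower() for t in e.tags}]

catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="q3_sales_transactions",
    source="pos_system",
    file_format="parquet",
    lineage=["raw_pos_events"],
    tags={"Sales", "Revenue"},
))

# A business user searching by the term "revenue" finds the data set
# even though the word never appears in its file name.
matches = catalog.search("revenue")
```

In a production catalog, the tagging step itself would typically be automated with machine learning rather than filled in by hand, but the lookup pattern is the same.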

It’s important to note that not all of the information in your data lake will have an immediate use. However, because it remains in its native format and is neatly tagged, the stored data can be accessed at any time and used for data visualization or advanced analytics.

Misconception #3: The More Users Who Access the Data Lake, the Better the Insights

A data lake is only as good as the insights it can provide. It’s easy to assume that the more users who access the data lake, the more useful it will be. For instance, data analysts could use the data to perform advanced analytics, business users could pull data for reporting, and the security team could store security data and use the data lake to identify possible threats. In theory, more users do mean more insights. In practice, however, you need robust utilization and management of resources to maintain a sense of order.

If too many people with different agendas are at work, users can quickly become lost in the sheer breadth of the data lake. User access permissions and data governance ensure that everyone can reach what they need but cannot see what they shouldn’t. This keeps data access secure and helps users find what they’re looking for more efficiently. Limiting each user to the relevant parts of the data lake also aids organization and keeps the lake from feeling overwhelming.

Apart from data specialists, some users may need training to use a data lake, especially if they are expecting something akin to a data warehouse (which contains structured data, unlike a data lake, which primarily houses unstructured data). User adoption must be actively encouraged where appropriate. Depending on users’ varying skill levels, a company may need to establish different protocols for accessing the data lake. Data visualization tools provide ease of use for knowledge workers, while data scientists might need a greater degree of direct access, with limited ability to delete or update data.

The amount of data contained in one of these systems can overwhelm an ordinary user. Furthermore, in a centralized data repository, not all data is for everyone. GDPR compliance has also placed the onus of security on businesses: personal information (such as credit card details) is classified as high-risk data that must be protected so it doesn’t fall into the wrong hands. Data lake permissions must be crafted so that the right people have the right access. User access levels based on previously defined security rules can be used to grant permissions to specific portions of the data lake, making it far more manageable and easier to operate.
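Rule-based permissions of this kind can be sketched as a simple lookup from role to the lake zones it may read. The role names and zone names below are hypothetical assumptions for illustration, not a real product’s configuration:

```python
# Minimal sketch of rule-based access control for data lake zones.
# Roles and zones are illustrative assumptions, not an actual API.
ACCESS_RULES = {
    "data_scientist": {"raw", "curated", "analytics"},
    "business_user": {"curated", "analytics"},
    "security_team": {"raw", "security_logs"},
}

# Zones holding personal or security data (e.g. high-risk data
# under GDPR) are simply never listed for roles that don't need them.
def can_access(role: str, zone: str) -> bool:
    """Grant access only if the role's predefined rules include the zone."""
    return zone in ACCESS_RULES.get(role, set())

# A business user can query curated data but not raw security logs,
# and an unknown role gets no access at all by default.
```

Denying by default, as the empty-set fallback does here, is the usual design choice: a role that was never explicitly granted a zone sees nothing.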

Understanding these three misconceptions will help you set up and approach your data lake successfully. To simplify the use and adoption of data lakes, Sertics offers a software-as-a-service, low-code approach to data lake creation and predictive analytics. This allows all types of business users to understand and implement data lake capabilities without needing data scientists. If you are interested in learning more about the benefits of data lake creation for your business, reach out to the team at Sertics today.

Harshit Gupta has more than eight years of experience as a Senior Developer. He holds a bachelor’s degree in Computer Software Engineering and an MS degree in Computer Science from Arkansas State University. Some of his most notable accomplishments include his dissertation on 3-Tier MapReduce, being named runner-up in the SevenTablets Annual Hack-a-thon, and receiving the “think/WOW” award for his work on Lockton-Dunning Benefits.