What is a Data Lake and what does it mean to you?
EMC defines a data lake by five major principles, which are easy to remember with the acronym ISASA. ISASA stands for “Ingest, Store, Analyze, Surface, Act”. Without the ability to support these functions, there is no data lake strategy.
Let’s break down each of these principles to understand them better.
Ingest is the ability to collect all the data you care about, making sure your systems can correctly and frequently ingest that data through APIs or batch processes. Doing so increases the capabilities of your data lake.
Store means getting all the data in one place. Breaking down silos is the first and most important step. The store is also more useful if it provides scalable storage and multi-protocol access to all that data. Examples include NFS, CIFS, FTP and newer file systems like HDFS.
Analyze means matching up the correct data points, which can be a work of art. Having the correct systems and the correct talent is the key to finding the relationships among all the data you’re gathering.
Surface means there needs to be a simple method to display all of the analysis, because the data needs to be understood. The easier it is to see the results of the analysis, the easier it is to take action.
Act is summed up by four M’s: “Make Me More Money”. A plan has to be put into place to take the results of the data analysis and fit them into the operating business model.
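The five principles above can be read as stages of a pipeline. The following is only a minimal sketch under stated assumptions: the function names, the record fields (`source`, `customer`, `action`) and the sample batches are all hypothetical, and an in-memory list stands in for real scalable storage.

```python
from collections import Counter

# Hypothetical batches, standing in for data arriving via an API or a batch job.
WEB_BATCH = [
    {"source": "web", "customer": "c1", "action": "visit"},
    {"source": "web", "customer": "c2", "action": "visit"},
]
FLOOR_BATCH = [
    {"source": "floor", "customer": "c1", "action": "visit"},
    {"source": "floor", "customer": "c3", "action": "purchase"},
]


def ingest(batches):
    """Ingest: flatten incoming batches into one list of records."""
    return [record for batch in batches for record in batch]


def store(lake, records):
    """Store: keep every record in one shared place (a plain list here)."""
    lake.extend(records)
    return lake


def analyze(lake):
    """Analyze: count visits per customer across all sources."""
    return Counter(r["customer"] for r in lake if r["action"] == "visit")


def surface(visit_counts, top_n=2):
    """Surface: present the analysis as a short ranked list."""
    return visit_counts.most_common(top_n)


def act(ranked, threshold=2):
    """Act: pick customers with enough visits to justify an offer."""
    return [customer for customer, visits in ranked if visits >= threshold]


lake = store([], ingest([WEB_BATCH, FLOOR_BATCH]))
targets = act(surface(analyze(lake)))
```

Each stage only consumes the output of the previous one, which is why a missing stage breaks the whole chain.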
Let’s take a real-world example: we’ll study a casino and see how a data lake can benefit the organization. A data lake is useless unless you understand the desired results.
- Determine the business objective
- Collect the appropriate data to help obtain the business objective
- Identify what success looks like
The casino’s business objective is to improve its customer experience. The data lake will help it target the right customers, and success will be measured by an increase in customer visits. The casino has already started a Big Data initiative and is successfully ingesting various data sets chosen to support that objective.
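Since success is defined as an increase in customer visits, the measure itself is simple arithmetic. The sketch below is illustrative only: the function name `visit_increase` and the visit figures are made up for the example and do not come from any real casino data.

```python
def visit_increase(visits_before, visits_after):
    """Return the percent change in visits between two periods."""
    if visits_before <= 0:
        raise ValueError("baseline visits must be positive")
    return (visits_after - visits_before) * 100.0 / visits_before


# Hypothetical figures: visits in the period before and after the initiative.
change = visit_increase(1000, 1150)
```

Agreeing on this kind of measure up front is what lets the casino tell whether the data it collects is actually serving the objective.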
Organizations’ efforts to make better use of their information have resulted in the current enterprise data landscape:
Most of these systems were single-vendor solutions, from application to database and even hardware, which placed limitations on interoperability and created costly upgrade scenarios.
Enterprise applications include HR systems, accounting and billing systems, CRM, and supply chain management, among others. These systems form the heart of any modern business. They contain and manage the organization’s most critical business data. Businesses have many options for products in this category, and the products do their jobs quite well. But when it comes to data analytics, they are rigid by design to ensure strict enforcement of established business rules. As a result, they feed data to more analytically inclined systems to operationalize data.
Knowledge Management (KM) Systems
While BI addresses the highly structured data problem, KM addresses the unstructured problem. KM products are oriented around user-created data, including email messages and documents. Rather than driving analytical decisions, the data in the KM system is used for information sharing and subjective decision making. There is broad agreement that, in an ideal world, user-created data would be used seamlessly alongside structured data to make business decisions, but there is still much work to do to accomplish this goal.
Business Intelligence Systems
BI technologies were developed to take a step in the direction of analytic flexibility and away from the rigidity of enterprise applications. BI is a powerful capability, but as with enterprise applications, the features that fundamentally make it powerful also limit its use. BI requires a significant amount of planning and knowledge of the underlying structure of the data. It therefore proves inflexible when adapting to rapidly changing or transient data, and it struggles to handle large volumes of data.
Log Management and Analysis
In the last decade or so other applications – often referred to as “intelligence applications” – have emerged. Similar to BI they are built to help businesses understand their data; however, they do not use the core business data that is in the BI system. Instead, these applications operate with other data relevant to the business, including server logs, web site activity, and social media data. The importance of these systems is growing but they still have the limitation that they are designed around a specific use case.
Evolution of Data
Formerly, data challenges were a frustration, but as data volumes have grown and become more complex, and as organizations have begun to recognize the value of data from new sources, the frustration has turned into a real source of pain.
In addition to the limitations described above, businesses have begun to encounter a new problem: Big Data Pain. The challenge of deriving value from existing and new sources of data has been made more complex as these data have increased in scale, frequency, and complexity.
The Data Lake is the solution to this big data pain. It is a complement to existing intelligence applications, business intelligence capabilities, and enterprise applications. The Data Lake is a repository that can store these data sets, regardless of size or complexity, and quickly extract insights, sharing these with any application or user. The Data Lake provides information in two fundamental ways:
Data Discovery: all data in the Data Lake can be searched within seconds and this search capability is provided to users and applications across the enterprise.
Pre-computed Analytics: targeted insights for specific business needs are derived from data in the Data Lake. These analytics are computed in advance for all possible situations so that the Data Lake can instantaneously provide insights such as pattern recognition, anomaly detection, categorization, and recommendation analytics.
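Both ways of providing information rely on doing work ahead of time so that later lookups are fast. The sketch below is a toy illustration, not how any particular data lake product works: the record layout, the `note` and `customer` fields, and the function names are assumptions made for the example. It builds a simple inverted index for discovery and a per-customer count as a stand-in for a pre-computed analytic.

```python
from collections import Counter, defaultdict

# Hypothetical records; the field names are assumptions for illustration.
RECORDS = {
    "r1": {"customer": "c1", "note": "slot machine visit"},
    "r2": {"customer": "c2", "note": "hotel booking"},
    "r3": {"customer": "c1", "note": "restaurant visit"},
}


def build_index(records):
    """Data discovery: map each word to the ids of records containing it."""
    index = defaultdict(set)
    for record_id, record in records.items():
        for word in record["note"].lower().split():
            index[word].add(record_id)
    return index


def search(index, query):
    """Return the ids of records that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())
    return matches


def precompute_activity(records):
    """Pre-computed analytics: count records per customer ahead of time."""
    return Counter(record["customer"] for record in records.values())


INDEX = build_index(RECORDS)             # built once, queried many times
ACTIVITY = precompute_activity(RECORDS)  # answers become simple lookups
```

With the index and the counts already built, `search(INDEX, "visit")` and `ACTIVITY["c1"]` are dictionary lookups rather than scans over the raw data.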
Data exists in silos because of previous technical limitations that drove decisions about which data types were hosted in which repositories. Those same limitations prevent organizations from responding to rapidly changing business needs and from taking advantage of data from new sources, some of which were unknown a few years ago. The Data Lake eliminates these limitations by providing a new data infrastructure, acting in concert with the organization’s existing data stores and applications while adding support for new data and rapidly changing data types.