Introduction
Data lake architecture is designed to help businesses use big data to generate new growth opportunities, outperform existing competitors, and provide a seamless customer experience. However, to get the most out of their data and thrive in the digital world, enterprises need well-curated, high-quality data lakes that can power digital transformation across the organization.
What Is a Data Lake?
A data lake is a storage repository that holds large amounts of structured, semi-structured, and unstructured data. Data can be stored in its native format, with no fixed limits on account or file size. A data lake offers large data volumes and native integration, which improve analytic performance.
A Data Lake is like a large container, very similar to real lakes and rivers. Just as a lake has multiple tributaries coming in, a data lake has structured data, unstructured data, machine-to-machine data, and logs flowing through in real time.
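To make the "native format" point concrete, here is a minimal sketch in which a local directory stands in for the lake's object storage and every payload lands unchanged. The `land_raw` function and the paths are illustrative, not a real API:

```python
import json
from pathlib import Path

def land_raw(lake_root, source, filename, payload: bytes):
    """Write a payload to the landing zone exactly as received (native format)."""
    dest = Path(lake_root) / "raw" / source / filename
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(payload)
    return dest

# Structured, semi-structured, and unstructured data all land unchanged.
land_raw("/tmp/lake", "crm", "accounts.csv", b"id,name\n1,Acme\n")
land_raw("/tmp/lake", "web", "clicks.json", json.dumps({"page": "/home"}).encode())
land_raw("/tmp/lake", "scans", "invoice.pdf", b"%PDF-1.7 ...")
```

Because nothing is transformed on the way in, no information is lost before anyone has decided how the data will be analyzed.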
Data Lake Architecture
The figure shows the architecture of a Business Data Lake. The lower levels represent data that is mostly at rest, while the upper levels show real-time transactional data, which flows through the system with little or no latency. The following are the important tiers in Data Lake architecture:
- Ingestion Tier: The tiers on the left depict the data sources. Data can be loaded into the data lake in batches or in real time.
- Insights Tier: The tiers on the right represent the research side, where insights from the system are used. SQL queries, NoSQL queries, or even Excel can be used for data analysis.
- HDFS is a cost-effective solution for both structured and unstructured data. It is the landing zone for all data that is at rest in the system.
- Distillation Tier takes data from the storage tier and converts it to structured data for easier analysis.
- Processing Tier runs analytical algorithms and user queries, in real-time, interactive, or batch modes, to generate structured data for easier analysis.
- Unified Operations Tier governs system management and monitoring. It includes auditing and proficiency management, data management, and workflow management.
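The flow through these tiers can be sketched in plain Python. This is a toy illustration: the function names and the in-memory "landing zone" are stand-ins for real ingestion, storage, and processing components, not part of any actual data lake stack:

```python
raw_landing_zone = []          # stands in for HDFS storage at rest

def ingest(record: str):
    """Ingestion tier: load data as-is, whether batch or real-time."""
    raw_landing_zone.append(record)

def distill():
    """Distillation tier: convert raw text into structured rows."""
    rows = []
    for line in raw_landing_zone:
        user, action = line.split(",")
        rows.append({"user": user, "action": action})
    return rows

def process(rows, action):
    """Processing tier: answer an analytical query over structured data."""
    return sum(1 for r in rows if r["action"] == action)

ingest("alice,login")
ingest("bob,login")
ingest("alice,logout")
print(process(distill(), "login"))  # 2
```

The key design point the sketch preserves is that raw data lands first and structure is imposed later, at distillation and processing time.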
Important Aspects
A well-planned approach to designing these areas is essential to any Data Lake implementation. I highly encourage everyone to think through the structure they would like to work with. On the other hand, being too strict in these areas will cause a Data Desert (the opposite of a Data Swamp). The Data Lake itself should be more about empowering people than overregulating them.
Most of the above problems may be solved by planning the desired structure inside your Data Lake Layers and by putting reliable owners in charge.
From our experience, we see that the organization of Data Lakes can be influenced by:
- Time partitioning
- Data load patterns (Real-time, Streaming, Incremental, Full load, One time)
- Subject areas/source
- Security boundaries
- Downstream app/purpose/uses
- Owner/stewardship
- Retention policies (temporary, permanent, time-fixed)
- Business impact (Critical, High, Medium, Low)
- Confidentiality classification (Public information, Internal use only, Supplier/partner confidential, Personally identifiable information, Sensitive – financial)
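Several of these influences are often encoded directly in storage paths. Below is a minimal sketch assuming Hive-style `year=/month=/day=` time partitioning; the zone, subject, and classification names are illustrative, not a standard:

```python
from datetime import date

def lake_path(zone, subject, load_pattern, day: date, classification="internal"):
    """Build a partitioned path encoding zone, subject area, load pattern,
    time partitioning, and a confidentiality tag (naming is illustrative)."""
    return (f"{zone}/{classification}/{subject}/{load_pattern}/"
            f"year={day.year}/month={day.month:02d}/day={day.day:02d}")

print(lake_path("raw", "sales", "incremental", date(2024, 3, 7)))
# raw/internal/sales/incremental/year=2024/month=03/day=07
```

Putting security boundaries and classification near the top of the path makes it easier to grant access at the folder level, while the date partitions at the bottom support retention policies and incremental loads.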
A Data Lake also provides several benefits:
- Centralization of disparate content sources. Once gathered together (out of their "information silos"), these sources can be combined and processed using big data, search, and analytics techniques that would otherwise have been impossible.
- Security measures in the data lake may be assigned in a way that grants certain information to users of the data lake who do not have access to the original content source. These users are entitled to the information, yet unable to access it in its source for some reason.
- Content can be normalized and enriched. This can include metadata extraction, format conversion, augmentation, entity extraction, cross-linking, aggregation, de-normalization, or indexing.
- Flexible access to the data lake and its content from anywhere. This increases the re-use of the content and helps the organization more easily collect the data required to drive business decisions.
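As a rough illustration of normalization and metadata extraction, the hypothetical `enrich` step below normalizes field names and attaches source, checksum, and ingestion-time metadata to each record (the field layout is an assumption for the example, not a standard):

```python
import hashlib
import json
from datetime import datetime, timezone

def enrich(raw: bytes, source: str):
    """Normalize a raw JSON record and attach extracted metadata
    (a simplified illustration of the enrichment step)."""
    record = json.loads(raw)
    record = {k.lower(): v for k, v in record.items()}   # format normalization
    record["_meta"] = {                                   # metadata extraction
        "source": source,
        "checksum": hashlib.sha256(raw).hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    return record

doc = enrich(b'{"User": "alice", "Page": "/home"}', source="weblogs")
```

Real pipelines would add the other operations listed above (entity extraction, cross-linking, indexing), but the pattern is the same: the original bytes stay in the raw zone, and enrichment produces a new, more queryable copy.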