A data lake is a centralized repository that ingests and stores large volumes of data in its original form. The data files are typically organized into staged zones (raw, cleansed, and curated) so that different types of users can work with the data in the form that best meets their needs.
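The staged zones described above can be illustrated with a minimal sketch. It assumes a local folder stands in for the lake and uses hypothetical names throughout (the `readings.jsonl` file, the `sensor` and `temp_c` fields, and the `run_pipeline` function are not from any particular product). Data lands untouched in the raw zone, is parsed and filtered into the cleansed zone, and is aggregated into the curated zone.

```python
import json
import tempfile
from pathlib import Path

ZONES = ("raw", "cleansed", "curated")  # hypothetical zone folder names


def build_zones(root: Path) -> dict:
    """Create one directory per zone under the lake root."""
    zones = {name: root / name for name in ZONES}
    for path in zones.values():
        path.mkdir(parents=True, exist_ok=True)
    return zones


def run_pipeline(root: Path, raw_lines: list) -> dict:
    """Move sample sensor readings through the raw, cleansed, and curated zones."""
    zones = build_zones(root)

    # Raw zone: land the data exactly as it arrived, malformed rows included.
    raw_file = zones["raw"] / "readings.jsonl"
    raw_file.write_text("\n".join(raw_lines), encoding="utf-8")

    # Cleansed zone: parse each line, drop rows that cannot be repaired,
    # and normalize field types.
    cleansed = []
    for line in raw_file.read_text(encoding="utf-8").splitlines():
        try:
            rec = json.loads(line)
            cleansed.append(
                {"sensor": str(rec["sensor"]), "temp_c": float(rec["temp_c"])}
            )
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            continue
    (zones["cleansed"] / "readings.json").write_text(
        json.dumps(cleansed), encoding="utf-8"
    )

    # Curated zone: an analysis-ready aggregate (average temperature per sensor).
    by_sensor = {}
    for rec in cleansed:
        by_sensor.setdefault(rec["sensor"], []).append(rec["temp_c"])
    curated = {s: sum(v) / len(v) for s, v in by_sensor.items()}
    (zones["curated"] / "avg_temp_by_sensor.json").write_text(
        json.dumps(curated), encoding="utf-8"
    )
    return curated


if __name__ == "__main__":
    sample = [
        '{"sensor": "a", "temp_c": "20.0"}',
        '{"sensor": "a", "temp_c": 22}',
        "not json",
        '{"sensor": "b"}',
    ]
    with tempfile.TemporaryDirectory() as tmp:
        print(run_pipeline(Path(tmp), sample))
```

Keeping the raw file unmodified means the cleansing rules can be changed and rerun later without going back to the original source systems.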
Data Lake vs. Data Warehouse
| | Data lake | Data warehouse | Data lakehouse |
|---|---|---|---|
| Type | Structured, semi-structured, unstructured | Structured | Structured, semi-structured, unstructured |
| | Relational, non-relational | Relational | Relational, non-relational |
| Schema | Schema on read | Schema on write | Schema on read, schema on write |
| Format | Raw, unfiltered | Processed, vetted | Raw, unfiltered, processed, curated, delta format files |
| Sources | Big data, IoT, social media, streaming data | Application, business, transactional data, batch reporting | Big data, IoT, social media, streaming data, application, business, transactional data, batch reporting |
| Scalability | Easy to scale at a low cost | Difficult and expensive to scale | Easy to scale at a low cost |
| Users | Data scientists, data engineers | Data warehouse professionals, business analysts | Business analysts, data engineers, data scientists |
| Use cases | Machine learning, predictive analytics, real-time analytics | Core reporting, BI | Core reporting, BI, machine learning, predictive analytics |
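The "Schema" row is easier to see in code. The following is a minimal in-memory sketch, not any vendor's implementation: `SCHEMA`, `WarehouseTable`, and `LakeStore` are hypothetical names chosen for illustration. With schema on write, a record that does not fit is rejected at load time. With schema on read, everything is stored as it arrives and the schema is applied only when the data is queried.

```python
import json

# Hypothetical schema: field name -> type used to coerce the value.
SCHEMA = {"user_id": int, "amount": float}


def validate(record) -> dict:
    """Coerce a record to SCHEMA, raising ValueError if it does not fit."""
    try:
        return {field: cast(record[field]) for field, cast in SCHEMA.items()}
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"record does not match schema: {record!r}") from exc


class WarehouseTable:
    """Schema on write: records are validated before they are stored."""

    def __init__(self):
        self.rows = []

    def write(self, line: str) -> None:
        # Malformed JSON or a schema mismatch raises ValueError here,
        # so bad data never enters the table.
        self.rows.append(validate(json.loads(line)))

    def read(self) -> list:
        return list(self.rows)


class LakeStore:
    """Schema on read: raw lines are stored as-is and validated when queried."""

    def __init__(self):
        self.lines = []

    def write(self, line: str) -> None:
        self.lines.append(line)  # no checks at ingest time

    def read(self) -> list:
        rows = []
        for line in self.lines:
            try:
                rows.append(validate(json.loads(line)))
            except ValueError:
                continue  # the raw line stays stored; it is skipped for this query
        return rows


if __name__ == "__main__":
    good = '{"user_id": "7", "amount": "12.5"}'
    bad = '{"user_id": "abc", "amount": 1}'

    lake = LakeStore()
    lake.write(good)
    lake.write(bad)  # accepted at ingest
    print(lake.read())

    warehouse = WarehouseTable()
    warehouse.write(good)
    try:
        warehouse.write(bad)
    except ValueError as err:
        print("rejected:", err)
```

The trade-off in this sketch mirrors the table: the lake accepts any input cheaply and defers the cost of interpretation to each reader, while the warehouse pays that cost once at load time and guarantees that every stored row is already clean.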