data lake

A centralized storage environment that holds raw, large-scale data from many sources in its native format for later processing and analytics.

A data lake keeps large volumes of data in raw or minimally processed form. It is typically implemented on scalable file or object storage and serves as a foundation for analytics, reporting, data science, and AI.

Key characteristics

In industrial and manufacturing contexts, a data lake commonly:

  • Ingests data from OT systems (PLCs, historians, SCADA), MES, ERP, QMS, LIMS, and other applications
  • Stores structured, semi-structured, and unstructured data together (for example, sensor time series, batch records, PDFs, and logs)
  • Preserves data in its original format rather than enforcing a single schema on write
  • Supports multiple downstream uses, such as dashboards, advanced analytics, machine learning, and ad hoc investigations
  • Is often part of an Industry 4.0 or enterprise analytics architecture, alongside data warehouses and operational databases
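Preserving data in its original format means that structure is applied when the data is read, not when it is written. The following is a minimal Python sketch of that idea, not a reference implementation. The object keys, field names, and sample payloads are all hypothetical, and an in-memory dictionary stands in for real object storage.

```python
import csv
import io
import json

# Hypothetical raw payloads, kept exactly as the source systems emitted them.
# A dict stands in for object storage; keys mimic object paths.
RAW_OBJECTS = {
    "raw/plc/line1.jsonl": (
        '{"tag": "temp_c", "ts": "2024-01-15T08:00:00", "value": "71.5"}\n'
        '{"tag": "temp_c", "ts": "2024-01-15T08:01:00", "value": "72.1", "quality": "good"}\n'
    ),
    "raw/lims/results.csv": "sample_id,analyte,result\nS-001,pH,6.9\nS-002,pH,7.2\n",
}


def read_plc(text):
    """Apply a schema to JSON Lines telemetry at read time."""
    for line in text.splitlines():
        rec = json.loads(line)
        # Records may differ in shape; missing fields get a default here.
        yield {
            "tag": rec["tag"],
            "ts": rec["ts"],
            "value": float(rec["value"]),
            "quality": rec.get("quality", "unknown"),
        }


def read_lims(text):
    """Apply a different schema to CSV lab results at read time."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "sample_id": row["sample_id"],
            "analyte": row["analyte"],
            "result": float(row["result"]),
        }


plc_rows = list(read_plc(RAW_OBJECTS["raw/plc/line1.jsonl"]))
lims_rows = list(read_lims(RAW_OBJECTS["raw/lims/results.csv"]))
```

The raw objects are never modified. Each consumer decides how to interpret them, so a later use case can read the same files with a different schema.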

How a data lake is used operationally

Within manufacturing operations, a data lake commonly serves as:

  • A central collection point for high-volume sources such as machine telemetry, quality measurements, and event logs
  • A historical repository that retains data over long time horizons to support trend analysis, process optimization, and investigation of deviations
  • An integration layer where data from MES, ERP, maintenance, and laboratory systems can be combined for cross-functional analytics
  • A source of curated data sets that are refined and then exposed to BI tools, data warehouses, or model training pipelines
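The integration and curation roles above can be illustrated with a short Python sketch. It assumes telemetry records already carry a batch identifier and that an MES extract provides batch context. The record layouts and the `build_curated` helper are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw extracts already landed in the lake.
telemetry = [
    {"batch_id": "B-100", "tag": "temp_c", "value": 71.5},
    {"batch_id": "B-100", "tag": "temp_c", "value": 72.5},
    {"batch_id": "B-101", "tag": "temp_c", "value": 69.0},
]
mes_batches = [
    {"batch_id": "B-100", "product": "P-A", "line": "L1"},
    {"batch_id": "B-101", "product": "P-B", "line": "L2"},
]


def build_curated(telemetry, mes_batches, tag="temp_c"):
    """Join telemetry to MES batch context and aggregate per batch."""
    values = defaultdict(list)
    for row in telemetry:
        if row["tag"] == tag:
            values[row["batch_id"]].append(row["value"])

    curated = []
    for batch in mes_batches:
        readings = values.get(batch["batch_id"], [])
        curated.append({
            **batch,
            f"mean_{tag}": mean(readings) if readings else None,
            "n_readings": len(readings),
        })
    return curated


curated = build_curated(telemetry, mes_batches)
```

The resulting rows have one entry per batch with a fixed set of columns, which is the kind of refined data set that could then be loaded into a warehouse table or handed to a BI tool.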

In regulated environments, the data lake may need to support traceability, data lineage, controlled access, and retention rules, but it does not by itself constitute a validated system of record.
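One way to support traceability is to record metadata for every raw object at ingestion time. The Python sketch below shows a possible lineage entry with a checksum, source system, and retention period. The field names, the `lineage_record` and `verify` helpers, and the retention value are assumptions made for illustration, not a prescribed format, and such a record does not by itself make the lake a validated system of record.

```python
import hashlib
import json
from datetime import datetime, timezone


def lineage_record(object_key, payload, source_system, retention_days):
    """Build a metadata entry describing one raw object in the lake."""
    return {
        "object_key": object_key,
        "source_system": source_system,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "size_bytes": len(payload),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "retention_days": retention_days,
    }


def verify(record, payload):
    """Return True if the payload still matches its recorded checksum."""
    return hashlib.sha256(payload).hexdigest() == record["sha256"]


# Hypothetical object and retention period.
payload = b"sample_id,analyte,result\nS-001,pH,6.9\n"
record = lineage_record("raw/lims/results.csv", payload, "LIMS", retention_days=3650)
print(json.dumps(record, indent=2))
```

Re-running `verify` against the stored bytes later gives a simple check that the raw object has not changed since ingestion.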

What a data lake is not

  • It is not the same as a transactional database used by MES, ERP, or SCADA for day-to-day operations.
  • It is not automatically governed, curated, or quality-checked; separate data management processes are required.
  • It is not necessarily a data warehouse, although a warehouse may be built on top of or sourced from a data lake.

Common confusion

  • Data lake vs data warehouse: A data warehouse typically stores cleaned, modeled, and structured data optimized for reporting and standardized analytics. A data lake stores raw or lightly processed data and can support many different schemas and use cases.
  • Data lake vs data lakehouse: A data lakehouse is a newer architectural pattern that combines data lake-style storage with data warehouse-like management and query features. A data lake on its own does not guarantee those warehouse characteristics.

Relation to Industry 4.0 architectures

In Industry 4.0 architectures, the data lake often sits above plant-floor control systems and MES, collecting data from multiple sites and systems. It provides a shared data foundation for enterprise analytics, predictive maintenance models, digital twins, and cross-plant performance analyses, while operational control and compliance records remain in their source systems.
