Data Engineering Cheat Sheet
The core ideas of Data Engineering distilled into a single, scannable reference — perfect for review or quick lookup.
Quick Reference
ETL (Extract, Transform, Load)
A data integration pattern where data is first extracted from source systems, then transformed into a suitable format (cleaning, aggregating, enriching), and finally loaded into a target data store such as a data warehouse.
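The three stages can be sketched in a few lines of Python. This is a minimal illustration, not a real integration tool: the source and target are in-memory stand-ins, and all function and field names are made up for the example.

```python
# Minimal ETL sketch: extract rows from an in-memory "source",
# transform them (clean strings, cast types), and load the result
# into a target list standing in for a warehouse table.

def extract():
    # Stand-in for reading from a source system (API, database, file).
    return [
        {"user": " Alice ", "amount": "10.50"},
        {"user": "bob", "amount": "3.25"},
    ]

def transform(rows):
    # Cleaning and type casting before the load step.
    return [
        {"user": r["user"].strip().title(), "amount": float(r["amount"])}
        for r in rows
    ]

def load(rows, target):
    # Stand-in for INSERTs into a warehouse table.
    target.extend(rows)

warehouse_table = []
load(transform(extract()), warehouse_table)
```

In a real pipeline each stage would talk to an external system, but the shape stays the same: extract produces raw records, transform returns cleaned ones, load writes them out.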
Data Pipeline
An automated sequence of processing steps that moves data from one or more sources through transformations to a destination system. Pipelines can run on a schedule (batch) or continuously (streaming) and are the backbone of data engineering workflows.
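One common way to model a pipeline is as an ordered list of steps, each a function from records to records; a scheduler would invoke the runner on a cadence (batch) or per event (streaming). The step names and the temperature example below are invented for illustration.

```python
# A pipeline as an ordered sequence of transformation steps.

def drop_nulls(records):
    # Filter out records with a missing value.
    return [r for r in records if r.get("value") is not None]

def to_celsius(records):
    # Convert Fahrenheit readings to Celsius.
    return [{**r, "value": round((r["value"] - 32) * 5 / 9, 1)} for r in records]

PIPELINE = [drop_nulls, to_celsius]

def run_pipeline(records, steps=PIPELINE):
    # Each step's output feeds the next step's input.
    for step in steps:
        records = step(records)
    return records

result = run_pipeline([{"value": 212}, {"value": None}, {"value": 32}])
```

Keeping steps as plain functions makes them individually testable and easy to reorder, which is the property real pipeline frameworks build on.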
Data Warehouse
A centralized, schema-enforced repository optimized for analytical queries and reporting. Data warehouses use columnar storage, indexing, and pre-aggregation to deliver fast read performance on structured, historical data.
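The read-speed benefit of columnar storage can be seen with a toy example: an aggregate query only needs to scan the columns it touches, not entire rows. The table and column names here are illustrative.

```python
# Row vs columnar layout, the storage idea behind warehouse read speed.

rows = [
    {"region": "EU", "sales": 100},
    {"region": "US", "sales": 250},
    {"region": "EU", "sales": 50},
]

# Columnar layout: one contiguous list per column.
columns = {
    "region": [r["region"] for r in rows],
    "sales": [r["sales"] for r in rows],
}

# SUM(sales) GROUP BY region reads just these two columns,
# skipping any other columns the table might have.
totals = {}
for region, sales in zip(columns["region"], columns["sales"]):
    totals[region] = totals.get(region, 0) + sales
```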
Data Lake
A large-scale storage system that holds raw data in its native format, including structured, semi-structured, and unstructured data. Data lakes use cheap object storage and defer schema enforcement to the time of reading (schema-on-read).
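Schema-on-read can be sketched as follows: raw JSON lines land in the lake unvalidated, and a schema is applied only when a consumer reads them. The field names and defaults are assumptions for the example.

```python
# Schema-on-read sketch: writes accept anything; reads enforce structure.
import json

raw_objects = [  # stand-in for raw files in object storage
    '{"id": 1, "name": "sensor-a", "temp": 21.5}',
    '{"id": 2, "name": "sensor-b"}',  # missing field: fine at write time
]

def read_with_schema(raw):
    # The schema is enforced here, at read time: required fields are
    # cast, and optional fields missing from older data get defaults.
    obj = json.loads(raw)
    return {"id": int(obj["id"]), "name": str(obj["name"]), "temp": obj.get("temp", 0.0)}

records = [read_with_schema(r) for r in raw_objects]
```

Contrast this with a warehouse, which would have rejected the second record at load time (schema-on-write).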
Batch Processing
A data processing model in which data is collected over a period of time and then processed as a group. Batch jobs typically run on a schedule and are well-suited for large-volume, latency-tolerant workloads.
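The pattern reduces to: accumulate records, then process them in groups. A minimal sketch, where the events and batch size are invented for the example:

```python
# Batch processing sketch: process accumulated records in fixed-size
# chunks, as a scheduled job would over a day's collected data.

def batches(records, size):
    # Yield successive chunks of the given size.
    for i in range(0, len(records), size):
        yield records[i:i + size]

events = list(range(10))  # stand-in for a day's accumulated events
batch_sums = [sum(b) for b in batches(events, 4)]
```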
Stream Processing
A data processing model where records are processed individually or in micro-batches as soon as they arrive, enabling near-real-time analytics and event-driven architectures. Technologies like Apache Kafka, Apache Flink, and Spark Structured Streaming power this paradigm.
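The contrast with batch processing is that state is updated per record as it arrives rather than after a full collection period. A minimal sketch, with an invented event shape and a loop standing in for a real consumer (e.g. reading from a Kafka topic):

```python
# Stream processing sketch: each arriving record immediately updates
# running state (here a per-key count); no waiting for a full batch.
from collections import defaultdict

counts = defaultdict(int)  # running state, keyed by event type

def on_event(event):
    # Called once per arriving record by the consumer loop.
    counts[event["type"]] += 1

for ev in [{"type": "click"}, {"type": "view"}, {"type": "click"}]:
    on_event(ev)  # stand-in for records arriving over time
```

At any moment `counts` reflects everything seen so far, which is what makes near-real-time dashboards and alerts possible.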
Data Modeling
The process of defining how data is structured, related, and stored in a database or warehouse. Common approaches include dimensional modeling (star and snowflake schemas), normalized modeling (3NF), and Data Vault, each optimized for different query and ingestion patterns.
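A star schema, the most common dimensional model, can be sketched with plain dictionaries: a fact table holds measures plus foreign keys into dimension tables. All table and column names below are illustrative.

```python
# Star-schema sketch: one fact table referencing two dimension tables.

dim_product = {1: {"name": "Widget", "category": "Tools"}}
dim_date = {20240101: {"year": 2024, "month": 1}}

fact_sales = [
    # Measures (quantity, revenue) plus foreign keys into the dimensions.
    {"product_id": 1, "date_id": 20240101, "quantity": 3, "revenue": 29.97},
]

# A typical analytical query: join fact to dimension, filter, aggregate.
tools_revenue = sum(
    f["revenue"]
    for f in fact_sales
    if dim_product[f["product_id"]]["category"] == "Tools"
)
```

Keeping descriptive attributes in small dimension tables and high-volume measures in the fact table is what makes star schemas fast to aggregate and easy to query.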
Data Orchestration
The automated coordination and scheduling of data workflows across multiple systems and tasks, including dependency management, retries, alerting, and monitoring. Orchestrators ensure pipelines run in the correct order and handle failures gracefully.
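The core of dependency management is running tasks in topological order of their dependencies. A minimal sketch with invented task names; a real orchestrator layers retries, scheduling, and alerting on top of this idea:

```python
# Orchestration sketch: run tasks only after their dependencies finish.

# Each task maps to the tasks it depends on (a DAG).
tasks = {"load": ["transform"], "transform": ["extract"], "extract": []}

def topo_order(tasks):
    # Depth-first visit: dependencies are emitted before dependents.
    order, done = [], set()

    def visit(name):
        if name in done:
            return
        for dep in tasks[name]:
            visit(dep)
        done.add(name)
        order.append(name)  # stand-in for actually executing the task

    for name in tasks:
        visit(name)
    return order

run_order = topo_order(tasks)
```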
Schema Evolution
The ability to modify a data schema over time, such as adding, removing, or renaming columns, without breaking downstream consumers. File formats such as Avro and Parquet, along with schema registries (e.g. Confluent Schema Registry), support forward and backward compatibility.
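Backward compatibility, in the Avro sense, means a reader built for a newer schema can still read records written with an older one, because every field added since carries a default. A minimal sketch with invented field names:

```python
# Schema-evolution sketch: a v2 reader fills defaults for fields
# that did not exist when older (v1) records were written.

SCHEMA_V2_DEFAULTS = {"email": None}  # fields added since v1, with defaults

def read_record(raw, defaults=SCHEMA_V2_DEFAULTS):
    # Values present in the record win; missing new fields fall back
    # to their declared defaults instead of causing a read failure.
    return {**defaults, **raw}

v1_record = {"id": 7, "name": "Ada"}  # written before "email" existed
record = read_record(v1_record)
```

Forward compatibility is the mirror case: an old reader simply ignores fields it does not know about.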
Data Governance
The set of policies, standards, and processes that ensure data is accurate, consistent, secure, and used responsibly across an organization. It encompasses data quality, metadata management, access control, lineage tracking, and regulatory compliance.