Data Engineering Cheat Sheet
The core ideas of Data Engineering distilled into a single, scannable reference — perfect for review or quick lookup.
Quick Reference
ETL (Extract, Transform, Load)
A data integration pattern where data is first extracted from source systems, then transformed into a suitable format (cleaning, aggregating, enriching), and finally loaded into a target data store such as a data warehouse.
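The three stages can be sketched in a few lines of Python. This is a minimal illustration, not a real integration tool: the source and target are in-memory stand-ins, and all function and field names are made up for the example.

```python
# Minimal ETL sketch: extract rows from an in-memory "source",
# transform them (clean strings, cast types), and load the result
# into a target list standing in for a warehouse table.

def extract():
    # Stand-in for reading from a source system (API, database, file).
    return [
        {"user": " Alice ", "amount": "10.50"},
        {"user": "bob", "amount": "3.25"},
    ]

def transform(rows):
    # Cleaning and type casting before the load step.
    return [
        {"user": r["user"].strip().title(), "amount": float(r["amount"])}
        for r in rows
    ]

def load(rows, target):
    # Stand-in for INSERTs into a warehouse table.
    target.extend(rows)

warehouse_table = []
load(transform(extract()), warehouse_table)
```

In a real pipeline each stage would talk to an external system, but the shape stays the same: extract produces raw records, transform returns cleaned ones, load writes them out.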
Data Pipeline
An automated sequence of processing steps that moves data from one or more sources through transformations to a destination system. Pipelines can run on a schedule (batch) or continuously (streaming) and are the backbone of data engineering workflows.
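One common way to model a pipeline is as an ordered list of steps, each a function from records to records; a scheduler would invoke the runner on a cadence (batch) or per event (streaming). The step names and the temperature example below are invented for illustration.

```python
# A pipeline as an ordered sequence of transformation steps.

def drop_nulls(records):
    # Filter out records with a missing value.
    return [r for r in records if r.get("value") is not None]

def to_celsius(records):
    # Convert Fahrenheit readings to Celsius.
    return [{**r, "value": round((r["value"] - 32) * 5 / 9, 1)} for r in records]

PIPELINE = [drop_nulls, to_celsius]

def run_pipeline(records, steps=PIPELINE):
    # Each step's output feeds the next step's input.
    for step in steps:
        records = step(records)
    return records

result = run_pipeline([{"value": 212}, {"value": None}, {"value": 32}])
```

Keeping steps as plain functions makes them individually testable and easy to reorder, which is the property real pipeline frameworks build on.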
Data Warehouse
A centralized, schema-enforced repository optimized for analytical queries and reporting. Data warehouses use columnar storage, indexing, and pre-aggregation to deliver fast read performance on structured, historical data.
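The read-speed benefit of columnar storage can be seen with a toy example: an aggregate query only needs to scan the columns it touches, not entire rows. The table and column names here are illustrative.

```python
# Row vs columnar layout, the storage idea behind warehouse read speed.

rows = [
    {"region": "EU", "sales": 100},
    {"region": "US", "sales": 250},
    {"region": "EU", "sales": 50},
]

# Columnar layout: one contiguous list per column.
columns = {
    "region": [r["region"] for r in rows],
    "sales": [r["sales"] for r in rows],
}

# SUM(sales) GROUP BY region reads just these two columns,
# skipping any other columns the table might have.
totals = {}
for region, sales in zip(columns["region"], columns["sales"]):
    totals[region] = totals.get(region, 0) + sales
```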
Data Lake
A large-scale storage system that holds raw data in its native format, including structured, semi-structured, and unstructured data. Data lakes use cheap object storage and defer schema enforcement to the time of reading (schema-on-read).
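Schema-on-read can be sketched as follows: raw JSON lines land in the lake unvalidated, and a schema is applied only when a consumer reads them. The field names and defaults are assumptions for the example.

```python
# Schema-on-read sketch: writes accept anything; reads enforce structure.
import json

raw_objects = [  # stand-in for raw files in object storage
    '{"id": 1, "name": "sensor-a", "temp": 21.5}',
    '{"id": 2, "name": "sensor-b"}',  # missing field: fine at write time
]

def read_with_schema(raw):
    # The schema is enforced here, at read time: required fields are
    # cast, and optional fields missing from older data get defaults.
    obj = json.loads(raw)
    return {"id": int(obj["id"]), "name": str(obj["name"]), "temp": obj.get("temp", 0.0)}

records = [read_with_schema(r) for r in raw_objects]
```

Contrast this with a warehouse, which would have rejected the second record at load time (schema-on-write).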
Batch Processing
A data processing model in which data is collected over a period of time and then processed as a group. Batch jobs typically run on a schedule and are well-suited for large-volume, latency-tolerant workloads.
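The pattern reduces to: accumulate records, then process them in groups. A minimal sketch, where the events and batch size are invented for the example:

```python
# Batch processing sketch: process accumulated records in fixed-size
# chunks, as a scheduled job would over a day's collected data.

def batches(records, size):
    # Yield successive chunks of the given size.
    for i in range(0, len(records), size):
        yield records[i:i + size]

events = list(range(10))  # stand-in for a day's accumulated events
batch_sums = [sum(b) for b in batches(events, 4)]
```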
Stream Processing
A data processing model where records are processed individually or in micro-batches as soon as they arrive, enabling near-real-time analytics and event-driven architectures. Technologies like Apache Kafka, Apache Flink, and Spark Structured Streaming power this paradigm.
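The contrast with batch processing is that state is updated per record as it arrives rather than after a full collection period. A minimal sketch, with an invented event shape and a loop standing in for a real consumer (e.g. reading from a Kafka topic):

```python
# Stream processing sketch: each arriving record immediately updates
# running state (here a per-key count); no waiting for a full batch.
from collections import defaultdict

counts = defaultdict(int)  # running state, keyed by event type

def on_event(event):
    # Called once per arriving record by the consumer loop.
    counts[event["type"]] += 1

for ev in [{"type": "click"}, {"type": "view"}, {"type": "click"}]:
    on_event(ev)  # stand-in for records arriving over time
```

At any moment `counts` reflects everything seen so far, which is what makes near-real-time dashboards and alerts possible.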
Data Modeling
The process of defining how data is structured, related, and stored in a database or warehouse. Common approaches include dimensional modeling (star and snowflake schemas), normalized modeling (3NF), and Data Vault, each optimized for different query and ingestion patterns.
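A star schema, the most common dimensional model, can be sketched with plain dictionaries: a fact table holds measures plus foreign keys into dimension tables. All table and column names below are illustrative.

```python
# Star-schema sketch: one fact table referencing two dimension tables.

dim_product = {1: {"name": "Widget", "category": "Tools"}}
dim_date = {20240101: {"year": 2024, "month": 1}}

fact_sales = [
    # Measures (quantity, revenue) plus foreign keys into the dimensions.
    {"product_id": 1, "date_id": 20240101, "quantity": 3, "revenue": 29.97},
]

# A typical analytical query: join fact to dimension, filter, aggregate.
tools_revenue = sum(
    f["revenue"]
    for f in fact_sales
    if dim_product[f["product_id"]]["category"] == "Tools"
)
```

Keeping descriptive attributes in small dimension tables and high-volume measures in the fact table is what makes star schemas fast to aggregate and easy to query.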
Data Orchestration
The automated coordination and scheduling of data workflows across multiple systems and tasks, including dependency management, retries, alerting, and monitoring. Orchestrators ensure pipelines run in the correct order and handle failures gracefully.
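The core of dependency management is running tasks in topological order of their dependencies. A minimal sketch with invented task names; a real orchestrator layers retries, scheduling, and alerting on top of this idea:

```python
# Orchestration sketch: run tasks only after their dependencies finish.

# Each task maps to the tasks it depends on (a DAG).
tasks = {"load": ["transform"], "transform": ["extract"], "extract": []}

def topo_order(tasks):
    # Depth-first visit: dependencies are emitted before dependents.
    order, done = [], set()

    def visit(name):
        if name in done:
            return
        for dep in tasks[name]:
            visit(dep)
        done.add(name)
        order.append(name)  # stand-in for actually executing the task

    for name in tasks:
        visit(name)
    return order

run_order = topo_order(tasks)
```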
Schema Evolution
The ability to modify a data schema over time, such as adding, removing, or renaming columns, without breaking downstream consumers. File formats such as Avro and Parquet, along with schema registries (e.g. Confluent Schema Registry), support forward and backward compatibility.
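Backward compatibility, in the Avro sense, means a reader built for a newer schema can still read records written with an older one, because every field added since carries a default. A minimal sketch with invented field names:

```python
# Schema-evolution sketch: a v2 reader fills defaults for fields
# that did not exist when older (v1) records were written.

SCHEMA_V2_DEFAULTS = {"email": None}  # fields added since v1, with defaults

def read_record(raw, defaults=SCHEMA_V2_DEFAULTS):
    # Values present in the record win; missing new fields fall back
    # to their declared defaults instead of causing a read failure.
    return {**defaults, **raw}

v1_record = {"id": 7, "name": "Ada"}  # written before "email" existed
record = read_record(v1_record)
```

Forward compatibility is the mirror case: an old reader simply ignores fields it does not know about.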
Data Governance
The set of policies, standards, and processes that ensure data is accurate, consistent, secure, and used responsibly across an organization. It encompasses data quality, metadata management, access control, lineage tracking, and regulatory compliance.