Information Theory Cheat Sheet
The core ideas of Information Theory distilled into a single, scannable reference — perfect for review or quick lookup.
Quick Reference
Shannon Entropy
A measure of the average uncertainty or information content in a random variable, defined as H(X) = -sum of p(x) log2 p(x) over all outcomes x. Higher entropy means greater unpredictability and more bits needed on average to encode each outcome.
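A minimal sketch of this formula in Python (the function name is illustrative):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H(X) = -sum of p(x) * log2 p(x)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin is maximally unpredictable for two outcomes: 1 bit.
fair = entropy([0.5, 0.5])    # 1.0
# A biased coin is more predictable, so its entropy is lower.
biased = entropy([0.9, 0.1])  # about 0.469
```

Note the `if p > 0` guard: outcomes with zero probability contribute nothing to the sum, by the convention 0 log 0 = 0.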
Mutual Information
A measure of the amount of information that one random variable contains about another, quantifying the reduction in uncertainty about one variable given knowledge of the other. It is symmetric: I(X;Y) = I(Y;X).
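One way to compute it, sketched in Python from a joint distribution given as a 2-D table of p(x, y) (the function name is illustrative):

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum over x, y of p(x,y) * log2( p(x,y) / (p(x)p(y)) ), in bits."""
    px = [sum(row) for row in joint]            # marginal of X
    py = [sum(col) for col in zip(*joint)]      # marginal of Y
    return sum(
        p * math.log2(p / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )

# Perfectly correlated variables: knowing X removes all uncertainty about Y.
mutual_information([[0.5, 0.0], [0.0, 0.5]])      # 1.0 bit
# Independent variables: knowing X tells us nothing about Y.
mutual_information([[0.25, 0.25], [0.25, 0.25]])  # 0.0 bits
```

Swapping rows and columns of the joint table leaves the result unchanged, which is the symmetry I(X;Y) = I(Y;X) in action.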
Channel Capacity
The maximum rate at which information can be reliably transmitted over a communication channel, measured in bits per channel use. Shannon's noisy-channel coding theorem proves that communication with arbitrarily small error probability is achievable at any rate below channel capacity.
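For one concrete channel, the binary symmetric channel with crossover probability p, the capacity has a closed form, C = 1 - H(p), where H is the binary entropy function. A short Python sketch (function names are illustrative):

```python
import math

def h2(p):
    """Binary entropy function H(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity of a binary symmetric channel with crossover probability p:
    C = 1 - H(p) bits per channel use."""
    return 1.0 - h2(p)

bsc_capacity(0.0)   # 1.0 — a noiseless channel carries a full bit per use
bsc_capacity(0.5)   # 0.0 — pure noise carries nothing
bsc_capacity(0.11)  # about 0.5
```

Note that capacity is zero at p = 0.5, not p = 1: a channel that always flips the bit is just as informative as a perfect one, since the receiver can invert everything.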
Data Compression (Source Coding)
The process of encoding information using fewer bits than the original representation. Shannon's source coding theorem establishes that the entropy of a source is the fundamental lower limit on the average number of bits per symbol achievable by any lossless compression scheme.
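The entropy bound can be checked numerically. The sketch below uses a hand-picked source with dyadic (power-of-two) probabilities and a matching prefix code, so the average code length meets the entropy bound exactly; for general sources, any uniquely decodable code satisfies average length >= entropy:

```python
import math

# Source symbols with dyadic probabilities and a matching prefix code
# (illustrative example; likelier symbols get shorter codewords).
probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
code = {"A": "0", "B": "10", "C": "110", "D": "111"}

src_entropy = -sum(p * math.log2(p) for p in probs.values())  # 1.75 bits/symbol
avg_len = sum(probs[s] * len(code[s]) for s in probs)         # 1.75 bits/symbol
```

A naive fixed-length code would spend 2 bits per symbol on this four-symbol alphabet, so the variable-length code saves 0.25 bits per symbol on average, and Shannon's theorem says no lossless code can do better than 1.75.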
Kullback-Leibler Divergence
A non-symmetric measure of how one probability distribution P diverges from a reference distribution Q, defined as D_KL(P||Q) = sum of P(x) log(P(x)/Q(x)). It is always non-negative and equals zero only when the distributions are identical.
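A minimal Python sketch of the formula, using base-2 logarithms so the result is in bits (the function name is illustrative):

```python
import math

def kl_divergence(p, q):
    """D_KL(P||Q) = sum of P(x) * log2(P(x)/Q(x)), in bits.
    Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
kl_divergence(p, p)  # 0.0 — identical distributions
kl_divergence(p, q)  # positive, and differs from kl_divergence(q, p)
```

Comparing `kl_divergence(p, q)` with `kl_divergence(q, p)` makes the non-symmetry concrete: the two directions generally give different values.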
Error-Correcting Codes
Techniques for adding structured redundancy to transmitted data so that the receiver can detect and correct errors introduced by a noisy channel. Shannon proved that codes exist which achieve vanishing error probability at any rate below channel capacity.
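The simplest example of structured redundancy is the 3x repetition code, sketched below (it is far from capacity-achieving, but it shows the detect-and-correct idea; function names are illustrative):

```python
def encode(bits):
    """3x repetition code: each bit is transmitted three times."""
    return [b for bit in bits for b in (bit, bit, bit)]

def decode(received):
    """Majority vote over each block of three corrects any single bit flip
    within that block."""
    return [1 if sum(received[i:i + 3]) >= 2 else 0
            for i in range(0, len(received), 3)]

sent = encode([1, 0, 1])  # [1, 1, 1, 0, 0, 0, 1, 1, 1]
sent[1] ^= 1              # noisy channel flips one bit
decode(sent)              # [1, 0, 1] — the error is corrected
```

The price is rate: this code sends 3 channel bits per data bit (rate 1/3). Practical codes such as Hamming, Reed-Solomon, and LDPC codes add redundancy far more efficiently.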
Cross-Entropy
A measure from information theory that quantifies the average number of bits needed to encode data from distribution P when using a code optimized for distribution Q. It equals H(P) + D_KL(P||Q), combining true entropy with the divergence penalty.
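The decomposition H(P) + D_KL(P||Q) can be verified numerically with a short Python sketch (function names are illustrative):

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = -sum of P(x) * log2 Q(x): average bits to encode samples
    from P using a code optimized for Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p, q = [0.5, 0.5], [0.9, 0.1]
cross_entropy(p, q)    # about 1.737 bits
entropy(p) + kl(p, q)  # same value: the decomposition holds
```

The divergence term is the penalty for coding with the wrong distribution: here the mismatched code costs about 0.737 bits per symbol more than the optimal 1 bit. This quantity is also the standard loss function for training classifiers in machine learning.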
Joint and Conditional Entropy
Joint entropy H(X,Y) measures the total uncertainty of two variables considered together, while conditional entropy H(Y|X) measures the remaining uncertainty in Y after observing X. The chain rule relates them: H(X,Y) = H(X) + H(Y|X).
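The chain rule can be checked on a small joint distribution (a hand-picked example; the helper name is illustrative):

```python
import math

def H(probs):
    """Entropy in bits of a list of probabilities (zeros are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution p(x, y): rows index x, columns index y.
joint = [[0.25, 0.25],
         [0.50, 0.00]]

h_xy = H([p for row in joint for p in row])  # H(X,Y) = 1.5 bits
h_x = H([sum(row) for row in joint])         # H(X)   = 1.0 bit

# H(Y|X) = sum over x of p(x) * H(Y | X = x)
h_y_given_x = 0.0
for row in joint:
    px = sum(row)
    if px > 0:
        h_y_given_x += px * H([p / px for p in row])  # 0.5 bits

# Chain rule: H(X,Y) == H(X) + H(Y|X)
```

Here observing X = 1 removes all remaining uncertainty about Y (that row has a single nonzero entry), while X = 0 leaves a full bit, averaging out to H(Y|X) = 0.5.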
Redundancy
The difference between the maximum possible entropy of a source (if all symbols were equally likely) and its actual entropy. Redundancy represents the exploitable structure in data that makes compression possible and also provides natural error resilience.
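A rough Python sketch of this difference, measured per symbol from observed frequencies (a simplification: it takes the alphabet to be the set of symbols that actually occur, and the function name is illustrative):

```python
import math
from collections import Counter

def redundancy(text):
    """Redundancy per symbol = log2(alphabet size) - actual entropy of the
    observed symbol frequencies, in bits. Alphabet = symbols seen in text."""
    counts = Counter(text)
    n = len(text)
    actual = -sum(c / n * math.log2(c / n) for c in counts.values())
    return math.log2(len(counts)) - actual

redundancy("abab")  # 0.0 — both symbols equally likely, nothing to exploit
redundancy("aaab")  # about 0.189 — skewed frequencies leave room to compress
```

Real text has far more redundancy than this single-symbol view captures, because dependencies between neighboring symbols (like "q" being followed by "u") add further exploitable structure.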
Noisy-Channel Coding Theorem
Shannon's foundational result proving that for any communication channel with capacity C, there exist coding schemes that allow transmission at rates up to C with arbitrarily small error probability, but no scheme can reliably exceed rate C.