
Learn Information Theory

Read the notes, then try the practice. It adapts as you go.

Session Length

~17 min

Adaptive Checks

15 questions

Transfer Probes

8

Lesson Notes

Information theory is a mathematical framework for quantifying, storing, and communicating information, originally developed by Claude Shannon in his landmark 1948 paper 'A Mathematical Theory of Communication.' At its core, the theory provides precise definitions for intuitive concepts like information content, uncertainty, and redundancy using the language of probability and logarithms. Shannon demonstrated that information can be measured in bits, established fundamental limits on data compression (the source coding theorem), and proved that reliable communication is possible over noisy channels up to a calculable maximum rate (the channel capacity theorem). These results laid the theoretical groundwork for the entire digital age.

The central quantity in information theory is entropy, which measures the average uncertainty or surprise associated with a random variable's outcomes. High entropy indicates greater unpredictability and thus more information content per message, while low entropy signals redundancy and predictability. Related measures such as mutual information, relative entropy (Kullback-Leibler divergence), and conditional entropy allow researchers to quantify how much one variable reveals about another, how different two probability distributions are, and how much uncertainty remains after partial observation. These tools have proven indispensable not only in communications engineering but also in statistics, machine learning, neuroscience, and physics.

Beyond its origins in electrical engineering, information theory has become a unifying language across the sciences. In machine learning, cross-entropy and KL divergence are standard loss functions for training classification models and variational autoencoders. In biology, information-theoretic measures help quantify genetic diversity and neural coding efficiency. In physics, the Bekenstein-Hawking entropy connects information to black hole thermodynamics, while quantum information theory extends Shannon's framework to qubits and entanglement. Whether one is designing a 5G cellular network, compressing video, training a large language model, or studying the thermodynamics of computation, information theory provides the essential quantitative toolkit.

You'll be able to:

  • Analyze Shannon entropy, mutual information, and channel capacity to quantify information content and transmission limits
  • Apply source coding theorems including Huffman coding and arithmetic coding to achieve optimal data compression rates
  • Evaluate error-correcting codes including Hamming, Reed-Solomon, and turbo codes for reliable communication over noisy channels
  • Distinguish between lossless and lossy compression techniques and their information-theoretic bounds for practical applications

One step at a time.

Key Concepts

Shannon Entropy

A measure of the average uncertainty or information content in a random variable, defined as H(X) = -sum of p(x) log2 p(x) over all outcomes x. Higher entropy means greater unpredictability and more bits needed on average to encode each outcome.

Example: A fair coin has entropy of 1 bit (maximum uncertainty for two outcomes), while a biased coin that lands heads 99% of the time has entropy near 0.08 bits because the outcome is highly predictable.
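The coin-flip numbers above can be checked directly from the definition. A minimal sketch in Python (the function name `entropy` is ours, not from any particular library):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H(X) = -sum of p(x) * log2 p(x)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair = entropy([0.5, 0.5])      # fair coin: 1.0 bit
biased = entropy([0.99, 0.01])  # 99% heads: about 0.08 bits
print(fair, round(biased, 4))
```

The `if p > 0` guard follows the convention that 0 log 0 = 0, so impossible outcomes contribute no uncertainty.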

Mutual Information

A measure of the amount of information that one random variable contains about another, quantifying the reduction in uncertainty about one variable given knowledge of the other. It is symmetric: I(X;Y) = I(Y;X).

Example: Knowing today's temperature (X) reduces uncertainty about ice cream sales (Y). The mutual information I(X;Y) quantifies exactly how many bits of sales information are revealed by the temperature reading.
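With a small joint distribution in hand, mutual information can be computed from the identity I(X;Y) = H(X) + H(Y) - H(X,Y). The numbers below are hypothetical, chosen only to illustrate the calculation:

```python
import math

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y):
# rows = temperature (hot, cold), cols = sales (high, low)
joint = [[0.4, 0.1],
         [0.1, 0.4]]

px = [sum(row) for row in joint]        # marginal of X
py = [sum(col) for col in zip(*joint)]  # marginal of Y
pxy = [p for row in joint for p in row]

mi = H(px) + H(py) - H(pxy)  # I(X;Y) in bits
print(round(mi, 4))
```

Symmetry is built into this identity: swapping the roles of X and Y (transposing the table) leaves the result unchanged.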

Channel Capacity

The maximum rate at which information can be reliably transmitted over a communication channel, measured in bits per channel use. Shannon's noisy-channel coding theorem proves that error-free communication is achievable at any rate below channel capacity.

Example: The capacity of a bandwidth-limited Gaussian channel is given by C = B log2(1 + S/N), where B is bandwidth, S is signal power, and N is noise power. This formula sets the ultimate speed limit for Wi-Fi, cellular, and satellite links.
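The Shannon-Hartley formula is a one-liner to evaluate. The bandwidth and SNR below are illustrative values loosely resembling a Wi-Fi channel, not a specification:

```python
import math

def capacity(bandwidth_hz, snr_linear):
    """Shannon-Hartley capacity in bits/s: C = B * log2(1 + S/N)."""
    return bandwidth_hz * math.log2(1 + snr_linear)

# Illustrative: a 20 MHz channel at 30 dB SNR (S/N = 1000)
c = capacity(20e6, 1000)
print(f"{c / 1e6:.1f} Mbit/s")
```

Note the formula takes the linear signal-to-noise ratio; an SNR quoted in dB must first be converted via S/N = 10^(dB/10).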

Data Compression (Source Coding)

The process of encoding information using fewer bits than the original representation. Shannon's source coding theorem establishes that the entropy of a source is the fundamental lower limit on the average number of bits per symbol achievable by any lossless compression scheme.

Example: Huffman coding assigns shorter codewords to more frequent letters and longer codewords to rare letters, approaching the entropy limit. English text, with entropy around 1-1.5 bits per character, can be compressed well below its 8-bit ASCII representation.
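A minimal sketch of Huffman code construction using a heap, assuming only the standard library. Repeatedly merging the two lowest-frequency nodes is what pushes rare symbols deeper in the tree:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix-free Huffman code from symbol frequencies."""
    freqs = Counter(text)
    # Heap entry: [weight, tiebreak, list of [symbol, code] pairs]
    heap = [[w, i, [[s, ""]]] for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)  # two least frequent subtrees
        hi = heapq.heappop(heap)
        for pair in lo[2]:        # extend codes as the subtrees merge
            pair[1] = "0" + pair[1]
        for pair in hi[2]:
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0], count, lo[2] + hi[2]])
        count += 1
    return {s: c for s, c in heap[0][2]}

codes = huffman_codes("abracadabra")
# The frequent symbol 'a' gets a shorter codeword than the rare 'c' and 'd'
print(codes)
```

The integer tiebreak keeps heap comparisons deterministic; real compressors also need to transmit the code table (or the frequencies) alongside the encoded bits.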

Kullback-Leibler Divergence

A non-symmetric measure of how one probability distribution P diverges from a reference distribution Q, defined as D_KL(P||Q) = sum of P(x) log(P(x)/Q(x)). It is always non-negative and equals zero only when the distributions are identical.

Example: In machine learning, KL divergence measures how far a model's predicted probability distribution is from the true data distribution and is used as the loss function in variational autoencoders.
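Both defining properties, non-negativity with equality only for identical distributions, and asymmetry, are easy to observe numerically. A minimal sketch with illustrative distributions:

```python
import math

def kl_divergence(p, q):
    """D_KL(P||Q) = sum of p(x) * log2(p(x)/q(x)), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, p))  # identical distributions: 0.0
print(kl_divergence(p, q))  # positive
print(kl_divergence(q, p))  # a different positive value: not symmetric
```

Because D_KL(P||Q) differs from D_KL(Q||P), it is a divergence rather than a true distance metric, and the choice of direction matters in applications.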

Error-Correcting Codes

Techniques for adding structured redundancy to transmitted data so that the receiver can detect and correct errors introduced by a noisy channel. Shannon proved that codes exist which achieve vanishing error probability at any rate below channel capacity.

Example: Reed-Solomon codes are used in CDs, DVDs, and QR codes. Even when part of a QR code is scratched or obscured, the redundant information allows the original message to be fully recovered.
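Reed-Solomon is too involved for a short listing, but the same correct-after-corruption behavior can be shown with the classic Hamming(7,4) code, which protects 4 data bits with 3 parity bits and corrects any single bit flip. A minimal sketch:

```python
def hamming74_encode(d):
    """Encode 4 data bits as 7 bits: codeword positions 1..7 hold
    p1, p2, d1, p3, d2, d3, d4, where each parity bit covers the
    positions whose binary index has the corresponding bit set."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    """Recompute the parity checks; the syndrome spells out the
    1-based position of a single flipped bit (0 means no error)."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the offending bit back
    return [c[2], c[4], c[5], c[6]]  # extract d1..d4

data = [1, 0, 1, 1]
code = hamming74_encode(data)
code[4] ^= 1  # simulate a single-bit channel error
recovered = hamming74_correct(code)
print(recovered == data)
```

The elegant trick is that the three parity checks are arranged so their failures, read as a binary number, directly address the corrupted position.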

Cross-Entropy

A measure from information theory that quantifies the average number of bits needed to encode data from distribution P when using a code optimized for distribution Q. It equals H(P) + D_KL(P||Q), combining true entropy with the divergence penalty.

Example: In training neural network classifiers, cross-entropy loss compares the model's predicted class probabilities against the true one-hot labels, guiding the optimizer to minimize the gap between predicted and actual distributions.
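The decomposition H(P, Q) = H(P) + D_KL(P||Q) follows directly from the definitions and can be verified numerically. The distributions below are illustrative only:

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = -sum of p(x) * log2 q(x): average bits to encode
    data from P using a code optimized for Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return cross_entropy(p, p)

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # "true" distribution (illustrative)
q = [0.5, 0.3, 0.2]  # model's predicted distribution
print(round(cross_entropy(p, q), 4))
print(round(entropy(p) + kl(p, q), 4))  # same value: H(P) + D_KL(P||Q)
```

Since H(P) is fixed by the data, minimizing cross-entropy during training is equivalent to minimizing the KL divergence from the model to the true distribution.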

Joint and Conditional Entropy

Joint entropy H(X,Y) measures the total uncertainty of two variables considered together, while conditional entropy H(Y|X) measures the remaining uncertainty in Y after observing X. The chain rule relates them: H(X,Y) = H(X) + H(Y|X).

Example: If X is a city and Y is its weather, knowing the city (X) reduces weather uncertainty. H(Weather|City) is less than H(Weather) alone because different cities have different climate patterns.
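The chain rule, and the fact that conditioning never increases entropy, can both be checked on a small joint table. The city/weather probabilities below are hypothetical, chosen only for illustration:

```python
import math

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(city, weather):
# rows = city (coastal, desert), cols = weather (rain, sun)
joint = [[0.35, 0.15],
         [0.05, 0.45]]

p_city = [sum(row) for row in joint]
p_weather = [sum(col) for col in zip(*joint)]
H_joint = H([p for row in joint for p in row])

# Chain rule: H(City, Weather) = H(City) + H(Weather | City)
H_weather_given_city = H_joint - H(p_city)
print(round(H_weather_given_city, 4), round(H(p_weather), 4))
```

Here H(Weather | City) comes out smaller than H(Weather): knowing the city removes some, but not all, of the uncertainty about the weather.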

More terms are available in the glossary.

Explore your way

Choose a different way to engage with this topic: no grading, just richer thinking.

Choose one:

Explore with AI →

Concept Map

See how the key ideas connect. Nodes color in as you practice.

Worked Example

Walk through a solved problem step-by-step. Try predicting each step before revealing it.

Adaptive Practice

This is guided practice, not just a quiz. Hints and pacing adjust in real time.

Small steps add up.

What you get while practicing:

  • Math Lens cues for what to look for and what to ignore.
  • Progressive hints (direction, rule, then apply).
  • Targeted feedback when a common misconception appears.

Teach It Back

The best way to know if you understand something: explain it in your own words.

Keep Practicing

More ways to strengthen what you just learned.
