Information Retrieval Cheat Sheet

The core ideas of Information Retrieval distilled into a single, scannable reference — perfect for review or quick lookup.

PiqCue — piqcue.com/information-retrieval/cheatsheet

Quick Reference

Inverted Index

A data structure that maps each term in a vocabulary to a list of documents (or positions within documents) where that term appears, enabling fast full-text search. It is the fundamental building block of most modern search engines.
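
As a rough illustration of the mapping described above, the following sketch builds a small positional inverted index. The function name `build_inverted_index`, the sample documents, and the whitespace tokenization are assumptions made for this example; real engines use far more careful tokenization and compressed postings lists.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions where the term occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive tokenization: lowercase and split on whitespace.
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


# Hypothetical toy collection keyed by document ID.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick quick dog",
}
index = build_inverted_index(docs)

print(index["quick"])  # {1: [1], 3: [0, 1]}
print(index["dog"])    # {2: [2], 3: [2]}
```

Looking up a term is then a single dictionary access rather than a scan over every document.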

TF-IDF (Term Frequency-Inverse Document Frequency)

A term weight that reflects how important a term is to a document relative to the whole collection. Term frequency (TF) measures how often the term appears in the document, while inverse document frequency (IDF), commonly computed as log(N / df) for a collection of N documents of which df contain the term, reduces the weight of terms that appear in many documents.
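
A minimal sketch of one common TF-IDF variant, assuming raw counts for term frequency and log(N / df) for inverse document frequency. Many other weighting variants exist; the function name `tf_idf` and the sample documents are illustrative only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (raw tf * log(N / df))."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights


docs = ["the cat sat", "the dog sat", "the cat ran"]
weights = tf_idf(docs)

# "the" occurs in every document, so its idf (and weight) is log(3/3) = 0.
# "dog" occurs in only one document, so it gets the largest weight in doc 1.
print(weights[1])
```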

Precision and Recall

Two fundamental evaluation metrics in IR. Precision is the fraction of retrieved documents that are relevant, while recall is the fraction of all relevant documents that are retrieved. Together they capture the trade-off between returning only relevant results and returning all relevant results.
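
The two definitions above can be sketched for a single query as follows. The helper name `precision_recall` and the document IDs are made up for the example; returning 0.0 for an empty set is a convention chosen here, not a standard.

```python
def precision_recall(retrieved, relevant):
    """Compute set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# The system returned 4 documents; 5 documents in the collection are relevant.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)  # 0.5 0.4
```

Here 2 of the 4 retrieved documents are relevant (precision 0.5), and 2 of the 5 relevant documents were found (recall 0.4).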

Vector Space Model

A mathematical model for representing documents and queries as vectors in a high-dimensional space, where each dimension corresponds to a term. Relevance is computed as the similarity (often cosine similarity) between the query vector and document vectors.
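
A small sketch of ranking by cosine similarity, assuming sparse term-count vectors and whitespace tokenization. In practice the vector components are usually TF-IDF weights rather than raw counts; `to_vector` and `cosine_similarity` are names chosen for this example.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent text as a sparse term-count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = ["the quick brown fox", "the lazy dog", "quick quick fox"]
query = to_vector("quick fox")

scores = [cosine_similarity(query, to_vector(d)) for d in docs]
# Document indices ordered from most to least similar to the query.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [2, 0, 1]
```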

Boolean Retrieval Model

The simplest retrieval model, which treats queries as Boolean expressions (AND, OR, NOT) and returns documents that exactly satisfy the logical conditions. Because a document either matches or does not, the result is an unranked set.
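
Boolean operators map naturally onto set operations over postings. The sketch below assumes a tiny in-memory collection and a hypothetical helper `docs_with`; it is meant only to show the idea, not an efficient postings-merge implementation.

```python
# Hypothetical toy collection keyed by document ID.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick quick dog",
}

# Postings: term -> set of IDs of documents containing the term.
postings = {}
for doc_id, text in docs.items():
    for term in set(text.lower().split()):
        postings.setdefault(term, set()).add(doc_id)


def docs_with(term):
    """Return the set of documents containing the term (empty if unseen)."""
    return postings.get(term, set())


result_and = docs_with("quick") & docs_with("dog")  # quick AND dog
result_or = docs_with("fox") | docs_with("lazy")    # fox OR lazy
result_not = docs_with("the") - docs_with("fox")    # the AND NOT fox

print(result_and, result_or, result_not)  # {3} {1, 2} {2}
```

Each result is a plain set, which reflects the absence of ranking in this model.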

BM25 (Best Matching 25)

A probabilistic ranking function used to estimate the relevance of documents to a given query. It extends TF-IDF by incorporating document length normalization and term frequency saturation (each additional occurrence of a term adds less to the score), and is widely used as a strong baseline in modern search systems.
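
The following is a sketch of one widely used BM25 formulation, not a reference implementation. It assumes the smoothed IDF form log((N - df + 0.5) / (df + 0.5) + 1) and the parameter values k1 = 1.5 and b = 0.75, which are typical defaults rather than anything the text above prescribes. The function name `bm25_scores` is illustrative.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a common BM25 variant."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency of each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Length normalization: longer-than-average documents are penalized.
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            # Saturation: this fraction approaches (k1 + 1) as tf grows.
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = ["the quick brown fox", "the lazy dog", "quick quick quick fox"]
scores = bm25_scores("quick fox", docs)
print(scores)
```

The third document scores highest because it repeats "quick", but the saturation term keeps the gain from repetition well below a threefold increase.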

Relevance Feedback

A technique where the system uses user judgments on initially retrieved documents to refine the query and improve subsequent retrieval results. It can be explicit (user marks relevant documents) or implicit (inferred from click behavior).
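
One classical way to turn explicit judgments into a refined query is the Rocchio algorithm, which the text above does not name; it is used here only as an illustration. The sketch moves the query vector toward the average relevant document and away from the average non-relevant one. The weights alpha, beta, and gamma are commonly cited defaults and should be treated as assumptions, as should the toy vectors.

```python
def rocchio(query_vec, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an updated sparse query vector; negative weights are dropped."""
    terms = set(query_vec)
    for vec in relevant + non_relevant:
        terms |= set(vec)

    updated = {}
    for term in terms:
        # Average weight of the term in each judged set (0.0 if the set is empty).
        pos = sum(v.get(term, 0.0) for v in relevant) / len(relevant) if relevant else 0.0
        neg = (
            sum(v.get(term, 0.0) for v in non_relevant) / len(non_relevant)
            if non_relevant
            else 0.0
        )
        weight = alpha * query_vec.get(term, 0.0) + beta * pos - gamma * neg
        if weight > 0:
            updated[term] = weight
    return updated


query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "cat": 1.0}, {"jaguar": 1.0, "wildlife": 1.0}]
non_relevant = [{"jaguar": 1.0, "car": 1.0}]

new_query = rocchio(query, relevant, non_relevant)
print(sorted(new_query))  # ['cat', 'jaguar', 'wildlife']
```

The refined query gains "cat" and "wildlife" from the documents marked relevant, while "car" ends up with a negative weight and is discarded.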

Query Expansion

The process of automatically adding additional terms to a user's original query to improve retrieval effectiveness. Terms can be drawn from thesauri, user feedback, or co-occurrence statistics in the document collection.
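
A minimal sketch of thesaurus-based expansion, assuming a small hand-written synonym dictionary. The `thesaurus` contents and the function name `expand_query` are hypothetical; production systems typically also weight the added terms lower than the original ones.

```python
def expand_query(query, thesaurus):
    """Append synonyms of each query term, keeping the original terms first."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for synonym in thesaurus.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded


# Hypothetical hand-built thesaurus.
thesaurus = {
    "car": ["automobile", "vehicle"],
    "cheap": ["affordable"],
}

expanded = expand_query("cheap car", thesaurus)
print(expanded)  # ['cheap', 'car', 'affordable', 'automobile', 'vehicle']
```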

PageRank

An algorithm developed by Larry Page and Sergey Brin that ranks web pages based on the structure of hyperlinks. A page receives a higher score if it is linked to by many pages, especially by pages that themselves have high PageRank scores.
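
The recursive idea above can be approximated with power iteration. This sketch assumes a damping factor of 0.85 (a commonly quoted value), a fixed number of iterations instead of a convergence test, and a link graph in which every linked page also appears as a key. The graph itself is invented for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration over a {page: [outlinks]} graph."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: C is linked to by A, B, and D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)

print(max(ranks, key=ranks.get))  # C
print(min(ranks, key=ranks.get))  # D
```

Page C ranks highest because three pages link to it, and A ranks second because its single inbound link comes from the high-scoring C. D has no inbound links and keeps only the baseline share.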

Neural Information Retrieval

The application of deep learning and neural network models to information retrieval tasks, including learned dense representations (embeddings), cross-encoders for re-ranking, and end-to-end retrieval models that move beyond traditional term-matching approaches.

Key Terms at a Glance

Information Retrieval: The science of searching for and obtaining relevant information from large data collections, encompassing the algorithms and systems behind search engines and digital libraries.

Inverted Index: A data structure mapping terms to the documents and positions where they occur, enabling efficient full-text search.

TF-IDF: A term weighting scheme combining term frequency (how often a term appears in a document) and inverse document frequency (how rare the term is across the collection).

Precision: The proportion of retrieved documents that are relevant to the user's query.

Recall: The proportion of all relevant documents in the collection that are successfully retrieved.

Relevance: The degree to which a retrieved document satisfies the user's information need. Can be binary or graded.

Boolean Retrieval: A retrieval model where queries are expressed as Boolean combinations of terms (AND, OR, NOT) and documents either match or do not.

Vector Space Model: A model representing documents and queries as vectors in term space, typically using cosine similarity to measure relevance.

BM25: Best Matching 25, a probabilistic ranking function incorporating term frequency saturation and document length normalization.

PageRank: A link analysis algorithm that assigns importance scores to web pages based on the quantity and quality of incoming hyperlinks.

Cosine Similarity: A similarity measure between two vectors computed as the cosine of the angle between them, commonly used to compare document and query vectors.

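Stemming: The process of reducing words to a common stem, usually by heuristically stripping suffixes, to improve term matching across inflectional variants. The resulting stem need not be a valid word.

Lemmatization: The process of reducing words to their dictionary base form (lemma) using linguistic analysis. It is generally more accurate than stemming but more computationally expensive.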

Stop Words: Highly frequent function words (e.g., 'the', 'and', 'is') often removed during indexing to reduce noise and index size.

Query Expansion: The process of adding related terms to a query to improve recall by bridging vocabulary mismatches.
