Data Science Glossary
25 essential terms — because precise language is the foundation of clear thinking in Data Science.
A/B Testing: A controlled experiment that compares two variants to determine which performs better on a specified metric, using random assignment and statistical hypothesis testing.
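The comparison described above is often evaluated with a two-proportion z-test on conversion counts. A minimal stdlib sketch (the function name and example counts are illustrative, not from the original):

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test for an A/B experiment.

    conv_a / conv_b: conversion counts; n_a / n_b: sample sizes.
    Returns (z statistic, two-sided p-value).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF (via erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical experiment: 12% vs. 15% conversion on 1000 users each
z, p = two_proportion_z(conv_a=120, n_a=1000, conv_b=150, n_b=1000)
```

A small p-value here would suggest the difference between variants is unlikely under the null hypothesis of equal conversion rates.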
Bagging: Short for Bootstrap Aggregating; an ensemble method that trains multiple models on random subsets of the training data (drawn with replacement) and combines their predictions to reduce variance.
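The bootstrap-and-vote procedure can be sketched in a few lines. This is a toy illustration, not a production implementation: the base learner is a deliberately weak threshold rule, and all names are invented for the example.

```python
import random

def bootstrap_sample(data):
    """Draw a sample of the same size, with replacement."""
    return [random.choice(data) for _ in data]

def fit_stump(points):
    """Toy base learner: predict 1 if x exceeds the sample's mean x, else 0."""
    threshold = sum(x for x, _ in points) / len(points)
    return lambda x: 1 if x > threshold else 0

def bagged_predict(models, x):
    """Majority vote across the ensemble."""
    votes = sum(m(x) for m in models)
    return 1 if votes > len(models) / 2 else 0

random.seed(0)
train = [(x, int(x > 5)) for x in range(10)]  # synthetic labels: x > 5
models = [fit_stump(bootstrap_sample(train)) for _ in range(25)]
```

Each stump sees a slightly different bootstrap sample, so its threshold varies; averaging their votes smooths out that variance.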
Bias: The error introduced by approximating a complex real-world problem with a simplified model. High bias leads to underfitting.
Classification: A supervised learning task that assigns input data points to predefined categorical labels based on patterns learned from training data.
Clustering: An unsupervised learning technique that groups data points by similarity, without using predefined labels.
Confidence Interval: A range of values, derived from sample statistics, that is likely to contain the true population parameter at a specified confidence level (e.g., 95%).
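For the mean of a sample, a 95% interval can be built from the standard error. A minimal sketch using the normal critical value 1.96 (for small samples a t critical value would be more appropriate; the sample data are made up):

```python
import math
import statistics

def mean_ci_95(sample):
    """Approximate 95% confidence interval for the mean.

    Uses the normal critical value 1.96; a t-distribution value is
    preferable for small samples.
    """
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return m - 1.96 * se, m + 1.96 * se

low, high = mean_ci_95([12, 15, 14, 10, 13, 14, 16, 12, 13, 15])
```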
Cross-Validation: A resampling method that repeatedly partitions data into training and validation subsets to estimate model performance on unseen data.
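The k-fold variant of this partitioning can be sketched without any libraries; each index serves in the validation set exactly once (the function name is illustrative):

```python
def k_fold_splits(n, k):
    """Yield (train_indices, val_indices) for k roughly equal folds."""
    indices = list(range(n))
    fold_size, extra = divmod(n, k)
    start = 0
    for i in range(k):
        # the first `extra` folds absorb the remainder when n % k != 0
        stop = start + fold_size + (1 if i < extra else 0)
        val = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, val
        start = stop

splits = list(k_fold_splits(n=10, k=5))
```

A model is then trained and scored once per split, and the k scores are averaged.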
Data Pipeline: An automated series of processes that extracts, transforms, and loads data from source systems into analytical destinations.
Data Wrangling: The process of cleaning, restructuring, and enriching raw data into a format suitable for analysis and modeling.
Dimensionality Reduction: Techniques, such as PCA and t-SNE, that reduce the number of input variables in a dataset while preserving as much information as possible.
Ensemble Methods: Techniques, including bagging, boosting, and stacking, that combine multiple models to produce predictions more accurate and robust than any individual model's.
Feature Engineering: The process of creating, transforming, and selecting input variables to improve the predictive performance of machine learning models.
Gradient Descent: An iterative optimization algorithm that adjusts model parameters by stepping in the direction of steepest decrease of the loss function.
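The update rule is simply "step against the gradient, scaled by a learning rate." A minimal sketch on a one-parameter loss (the function and values are illustrative):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly move the parameter against the gradient of the loss."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Loss f(x) = (x - 3)^2 has gradient 2 * (x - 3); its minimum is at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

In practice the parameter is a vector of model weights and the gradient is computed over (batches of) training data, but the loop is the same.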
Hypothesis Testing: A statistical method for making inferences about a population by testing an assumption (the null hypothesis) against observed sample data.
Imputation: The process of replacing missing data with substituted values, using methods such as the mean, median, mode, or model-based prediction.
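The simplest of these strategies, mean imputation, takes only a few lines (missing values are represented here as None; the function name is illustrative):

```python
import statistics

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

filled = impute_mean([2.0, None, 4.0, None, 6.0])
```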
Null Hypothesis: The default assumption in a statistical test that there is no significant effect or difference between the groups being compared.
Outlier: A data point that differs markedly from other observations, potentially indicating measurement error, a data-entry mistake, or a genuine extreme value.
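A common first-pass detector flags points far from the mean in standard-deviation units (the z-score rule). A stdlib sketch with made-up data:

```python
import statistics

def z_score_outliers(values, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]

data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 98]
outliers = z_score_outliers(data, threshold=2.0)
```

Note that extreme values inflate the mean and standard deviation themselves, so robust alternatives (e.g., the IQR rule) are often preferred.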
Overfitting: A modeling error in which a model captures noise in the training data rather than the underlying pattern, leading to poor generalization.
P-value: The probability of obtaining test results at least as extreme as those observed, assuming the null hypothesis is true.
Random Forest: An ensemble learning method that constructs many decision trees during training and outputs the mode (classification) or mean (regression) of the individual trees' predictions.
Regression: A supervised learning technique that models the relationship between dependent and independent variables to predict continuous numerical outcomes.
Regularization: A technique that adds a penalty to a model's cost function to discourage overly complex parameter values and thereby prevent overfitting.
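The shrinkage effect of such a penalty can be seen in the simplest case: ridge regression on a one-dimensional, no-intercept linear model, which has a closed-form solution. The function and data below are illustrative:

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge solution for y = w * x (no intercept):
    minimize sum((y - w*x)^2) + lam * w^2  =>  w = sum(x*y) / (sum(x^2) + lam)
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs, ys = [1, 2, 3], [2, 4, 6]          # exact slope 2 with no penalty
w_plain = ridge_slope(xs, ys, lam=0.0)  # ordinary least squares
w_ridge = ridge_slope(xs, ys, lam=2.0)  # penalized: shrunk toward zero
```

Increasing the penalty weight shrinks the coefficient toward zero, trading a little bias for lower variance.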
ROC Curve: A Receiver Operating Characteristic curve; a plot of the true positive rate against the false positive rate at various classification thresholds, used to evaluate model performance.
Standard Deviation: A measure of the dispersion of a dataset relative to its mean, calculated as the square root of the variance.
Variance: In statistics, the average of squared deviations from the mean. In machine learning, the sensitivity of a model's predictions to fluctuations in the training data.
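The statistical definitions of the last two entries, written out from scratch (population formulas; sample versions would divide by n - 1):

```python
def variance(values):
    """Population variance: average squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def std_dev(values):
    """Standard deviation: square root of the variance."""
    return variance(values) ** 0.5

vals = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, variance 4, std dev 2
```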