Data, variation, sampling, inference, regression, uncertainty, statistical models, and evidence

Statistics

Statistics turns data into evidence by describing variation, estimating unknown quantities, testing claims, and measuring uncertainty.

Main purpose

Statistics helps describe data, estimate unknowns, compare groups, and judge how much uncertainty remains.

Key ingredient

Sampling matters because data usually represents only part of a larger population or process.

Where it appears

Statistics supports science, medicine, public policy, business, sports, economics, quality control, and machine learning.

Scatter plot with a fitted linear regression line through a cloud of data points. Regression is one statistical tool for describing relationships and uncertainty in observed data.

What statistics studies

Statistics is the discipline of learning from data. It asks how data was collected, what patterns appear, how much variation is present, and what conclusions are justified. Unlike raw arithmetic, statistics keeps uncertainty visible, because data is often incomplete, noisy, biased, or drawn from a changing world.

Describing data

Descriptive statistics summarize what has been observed. Measures such as mean, median, range, variance, and standard deviation describe center and spread. Charts such as histograms, scatter plots, and box plots reveal shape, clusters, outliers, and relationships that a single number can hide.
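As a small illustration of these summaries, the sketch below computes them with Python's standard `statistics` module. The list `values` is made-up data chosen to include one outlier; it is an assumption for illustration, not data from any study.

```python
import statistics

# Hypothetical sample: nine typical values plus one outlier (illustrative only).
values = [12, 15, 11, 14, 13, 15, 12, 14, 13, 60]

mean = statistics.mean(values)            # pulled upward by the outlier
median = statistics.median(values)        # barely affected by the outlier
value_range = max(values) - min(values)   # distance between the extremes
variance = statistics.variance(values)    # sample variance (n - 1 denominator)
std_dev = statistics.stdev(values)        # square root of the sample variance

print(f"mean={mean}, median={median}, range={value_range}")
print(f"variance={variance:.2f}, standard deviation={std_dev:.2f}")
```

The mean (17.9) sits well above the median (13.5) because a single extreme value drags it upward, which is the kind of detail a histogram or box plot would also expose.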

Samples and populations

A population is the larger group or process of interest, while a sample is the data actually observed. A good sample is chosen so that it can support conclusions about the population. Poor sampling can make a precise calculation misleading, because the numbers may reflect selection bias more than the underlying reality.
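The effect of selection bias can be sketched with a simulation. Everything here is an assumption made for illustration: the population is synthetic (normal, mean 50, standard deviation 10), and the "biased" procedure is a hypothetical rule that can only select members above 55.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 values (synthetic, for illustration only).
population = [rng.gauss(50, 10) for _ in range(10_000)]
true_mean = statistics.fmean(population)

# Simple random sample: every member has the same chance of selection.
random_sample = rng.sample(population, 200)

# Biased sample: only members above 55 are eligible to be selected.
eligible = [x for x in population if x > 55]
biased_sample = rng.sample(eligible, 200)

random_error = abs(statistics.fmean(random_sample) - true_mean)
biased_error = abs(statistics.fmean(biased_sample) - true_mean)

print(f"population mean:     {true_mean:.2f}")
print(f"random sample error: {random_error:.2f}")
print(f"biased sample error: {biased_error:.2f}")
```

Both samples have the same size, so the biased estimate is computed just as precisely as the random one. It is simply precise about the wrong group.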

Inference

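Statistical inference uses sample data to estimate unknown quantities or evaluate claims. A confidence interval gives a range of values for an unknown quantity that are compatible with the data under a model; a 95% interval comes from a procedure that would capture the true value in about 95% of repeated samples. A hypothesis test asks whether the observed data would be surprising if a stated assumption were true. These tools require careful interpretation, especially when many comparisons are made, because each additional test raises the chance that at least one result looks surprising purely by chance.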
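The two ideas can be sketched in a few lines of Python. The data in `group_a` and `group_b` are invented for illustration, and the helper names `normal_ci` and `permutation_p_value` are hypothetical. The interval uses a normal approximation, which is an assumption; with only ten observations a t-based interval would usually be somewhat wider. The test is a permutation test, one of several possible hypothesis tests.

```python
import math
import random
import statistics

# Hypothetical measurements for two groups (illustrative data only).
group_a = [5.1, 4.9, 5.6, 5.8, 5.2, 5.4, 5.0, 5.7, 5.3, 5.5]
group_b = [5.9, 6.1, 5.7, 6.3, 5.8, 6.0, 6.2, 5.6, 6.4, 5.9]

def normal_ci(sample, level=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    centre = statistics.fmean(sample)
    margin = z * statistics.stdev(sample) / math.sqrt(len(sample))
    return centre - margin, centre + margin

def permutation_p_value(a, b, n_shuffles=2_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_shuffles):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.fmean(left) - statistics.fmean(right)) >= observed:
            extreme += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (extreme + 1) / (n_shuffles + 1)

ci_low, ci_high = normal_ci(group_a)
p_value = permutation_p_value(group_a, group_b)

print(f"95% interval for the mean of group A: ({ci_low:.2f}, {ci_high:.2f})")
print(f"permutation p-value for A vs B: {p_value:.4f}")
```

A small p-value here says only that a difference this large would be unusual if the labels were interchangeable. It does not say how large or how important the difference is.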

Models and assumptions

Statistical models simplify reality so data can be analyzed. A model might assume independence, a particular distribution, a linear relationship, or similar variability across groups. These assumptions are not just technical details; they shape what the results mean and whether the analysis is trustworthy.

Regression and relationships

Regression describes how the typical value of an outcome variable changes with one or more predictor variables, treating the scatter around that trend as random variation. Linear regression fits a straight-line relationship, but regression can also handle curves, categories, counts, and many predictors. A fitted relationship can be useful for prediction, but it does not automatically prove cause and effect.
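For a single predictor, a straight-line fit can be sketched with ordinary least squares, which picks the slope and intercept that minimize the sum of squared vertical distances to the line. The function name `fit_line` and the data in `xs` and `ys` are hypothetical, chosen only to illustrate the calculation.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical observations (illustrative data only).
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

slope, intercept = fit_line(xs, ys)

# Residuals are the vertical gaps between each observation and the line.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print("residuals:", [round(r, 3) for r in residuals])
```

The residuals are the variation the line does not explain. Looking at them, rather than only at the slope, is one way to check whether a straight line was a reasonable choice.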

Statistics and probability

Probability and statistics work in opposite but connected directions. Probability starts with a model and asks what data might look like. Statistics starts with data and asks what model or explanation is plausible. Modern analysis often combines both, especially in Bayesian methods, simulations, and machine learning.
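The two directions can be contrasted in a short sketch. The coin-flip setting, the value 0.3, and the helper name `binomial_pmf` are all assumptions made for illustration.

```python
import math
import random

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability direction: the model is known (n = 10, p = 0.3),
# and the question is what data it tends to produce.
pmf = [binomial_pmf(k, 10, 0.3) for k in range(11)]
most_likely_count = pmf.index(max(pmf))

# Statistics direction: the data are known, and p must be estimated.
rng = random.Random(1)
true_p = 0.3  # used only to simulate data; the analyst would not see it
flips = [rng.random() < true_p for _ in range(2_000)]
p_hat = sum(flips) / len(flips)

print(f"most likely number of successes in 10 trials: {most_likely_count}")
print(f"estimated success probability from 2,000 trials: {p_hat:.3f}")
```

The first half computes consequences of a model exactly. The second half produces an estimate that would change somewhat with a different sample, which is why statistical results are reported with a measure of uncertainty.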

Why it matters

Statistics matters because data does not speak for itself. The same dataset can support strong evidence, weak evidence, or a misleading story depending on how it was collected and analyzed. Statistical thinking helps people judge claims, measure risk, design better studies, and make decisions under uncertainty.

Key concepts

A dataset contains observed values collected from a study, process, or experiment.
A population is the larger group or process a study wants to understand.
A sample is the observed subset used to make estimates or comparisons.
A statistic is a number computed from sample data, such as a sample mean, and is often used to estimate an unknown feature of the population.

Common methods

Descriptive statistics summarize center, spread, and shape.
Confidence intervals express uncertainty around an estimate.
Hypothesis tests compare observed data with a model-based expectation.
Regression models relationships between variables while accounting for variation.

Common misconceptions

Correlation does not by itself prove causation.
A statistically significant result is not always practically important.
A larger dataset does not fix biased sampling or poor measurement.
A model can be mathematically correct but still inappropriate for the question.
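The first misconception in this list can be illustrated with a simulation in which two variables are strongly correlated even though neither influences the other. The setup is entirely hypothetical: a hidden variable `z` drives both `x` and `y`, and `pearson_r` is a hand-written helper for the Pearson correlation coefficient.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(42)

# Hidden common cause: z drives both x and y; x never affects y.
z = [rng.gauss(0, 1) for _ in range(2_000)]
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

r_xy = pearson_r(x, y)

# Remove the common cause. Subtracting z directly works only because
# this simulation gives z a coefficient of exactly 1 in both x and y.
x_resid = [xi - zi for xi, zi in zip(x, z)]
y_resid = [yi - zi for yi, zi in zip(y, z)]
r_after_removing_z = pearson_r(x_resid, y_resid)

print(f"correlation of x and y:            {r_xy:.2f}")
print(f"correlation after accounting for z: {r_after_removing_z:.2f}")
```

With real data the common cause is often unmeasured, so it cannot simply be subtracted. That is why study design, not only calculation, determines whether a causal claim is justified.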

Open questions

How can statistical results be communicated without overstating certainty?
Which methods best handle biased, missing, or changing real-world data?
How should statistical evidence be combined with domain expertise in policy and medicine?
How can automated analysis tools make assumptions visible rather than hiding them?