statistical tests, null hypotheses, tail probability, evidence, and uncertainty

P-value

A p-value is a probability calculated under a null model for results at least as extreme as the observed data.

Core meaning

A p-value is calculated assuming the null hypothesis and statistical model are true.

What it measures

It measures how unusual the observed result, or a more extreme one, would be under that model.

Common warning

A p-value is not the probability that the null hypothesis is true.

Figure: A standard normal distribution diagram with shaded tail regions around critical values. P-values are tail probabilities calculated under a specified null model.

What a p-value is

A p-value is a probability from a statistical test. It is calculated under a model where the null hypothesis is assumed true. In plain terms, it asks how often a test would produce data as extreme as, or more extreme than, the data actually observed if the null model were correct.
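The definition above can be made concrete with a small sketch. It assumes a hypothetical experiment (60 heads in 100 coin flips), a fair-coin null model, and a one-sided question ("this many heads or more"). The function name `binomial_upper_tail` and the data are illustrative assumptions, not part of the original text.

```python
from math import comb


def binomial_upper_tail(k_observed: int, n: int, p_null: float = 0.5) -> float:
    """Return P(X >= k_observed) for X ~ Binomial(n, p_null)."""
    return sum(
        comb(n, k) * p_null**k * (1 - p_null) ** (n - k)
        for k in range(k_observed, n + 1)
    )


# Hypothetical data: 60 heads in 100 flips, tested against a fair-coin null.
p_value = binomial_upper_tail(60, 100)
print(f"one-sided p-value: {p_value:.4f}")
```

The sum runs over every outcome at least as extreme as the observed one, which is the "as extreme as, or more extreme than" part of the definition.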

The null model matters

The p-value is not a property of the data alone. It depends on the null hypothesis, test statistic, sampling design, distributional assumptions, and whether the test is one-sided or two-sided. Changing those choices can change the p-value.
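As a rough illustration of this dependence, the sketch below evaluates the same hypothetical sample against two different null values. It assumes a z-test with a known standard deviation; the function name `two_sided_z_p_value` and all numbers are made up for the example.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_z_p_value(sample_mean: float, mu_null: float,
                        sigma: float, n: int) -> float:
    """Two-sided p-value for a z-test with known standard deviation sigma."""
    z = (sample_mean - mu_null) / (sigma / sqrt(n))
    # Probability of a statistic at least as far from zero, in either direction.
    return 2 * NormalDist().cdf(-abs(z))


# Same hypothetical data (mean 103, sigma 15, n = 25), two different null values.
p_null_100 = two_sided_z_p_value(103, mu_null=100, sigma=15, n=25)
p_null_95 = two_sided_z_p_value(103, mu_null=95, sigma=15, n=25)
print(f"null mean 100: p = {p_null_100:.4f}")
print(f"null mean 95:  p = {p_null_95:.4f}")
```

The data are identical in both calls; only the null model differs, and the p-value changes with it.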

Small p-values

A small p-value means the observed result would be relatively unusual under the null model. Researchers often compare it with a chosen alpha level, such as 0.05, to decide whether to call a result statistically significant. The threshold is a convention or decision rule, not a boundary between truth and falsehood.

Large p-values

A large p-value means the data are not strongly incompatible with the null model, as judged by that test. It does not prove the null hypothesis. The study may be underpowered, noisy, poorly measured, or simply unable to distinguish small effects from no effect.
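One way to see why a large p-value can coexist with a real effect is to compute statistical power. The sketch below assumes a two-sided one-sample z-test with known standard deviation; `z_test_power`, the effect size of 0.2 standard deviations, and the sample sizes are hypothetical choices for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(true_effect: float, sigma: float, n: int,
                 alpha: float = 0.05) -> float:
    """Probability that a two-sided z-test rejects when the true mean shift is true_effect."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = true_effect * sqrt(n) / sigma  # mean of the z statistic under the true effect
    # Rejection happens in either tail of the null distribution.
    return std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)


small_study = z_test_power(true_effect=0.2, sigma=1.0, n=20)
large_study = z_test_power(true_effect=0.2, sigma=1.0, n=200)
print(f"power with n = 20:  {small_study:.2f}")
print(f"power with n = 200: {large_study:.2f}")
```

Under these assumptions the smaller study usually fails to reject even though the effect is real, so a large p-value from it says little about whether the null is true.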

What it does not say

A p-value does not tell the probability that a hypothesis is true, the probability that results were due to chance alone, the size of an effect, or whether a finding is practically important. Those questions require other evidence, such as effect sizes, confidence intervals, study design, prior knowledge, and replication.

One-sided and two-sided tests

A one-sided p-value looks for extremeness in a specified direction. A two-sided p-value looks for results unusually far from the null in either direction. The choice should usually be made before seeing the data, because choosing afterward can make evidence look stronger than it is.
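The difference can be sketched for a standard normal test statistic. The helper names and the observed value z = 1.8 are hypothetical; the example assumes the statistic follows a standard normal distribution under the null.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def upper_tail_p(z: float) -> float:
    """One-sided p-value: probability of a statistic at least as large as z."""
    return 1 - STD_NORMAL.cdf(z)


def lower_tail_p(z: float) -> float:
    """One-sided p-value: probability of a statistic at least as small as z."""
    return STD_NORMAL.cdf(z)


def two_sided_p(z: float) -> float:
    """Two-sided p-value: probability of a statistic at least as far from zero as z."""
    return 2 * STD_NORMAL.cdf(-abs(z))


z_observed = 1.8  # hypothetical test statistic
print(f"one-sided (upper): {upper_tail_p(z_observed):.4f}")
print(f"one-sided (lower): {lower_tail_p(z_observed):.4f}")
print(f"two-sided:         {two_sided_p(z_observed):.4f}")
```

For this value the upper-tail p-value falls below 0.05 while the two-sided one does not, which is why picking the direction after seeing the data can overstate the evidence.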

Misuse and p-hacking

P-values are vulnerable to misuse when researchers try many analyses, outcomes, subgroups, or stopping rules and report only the smallest or most convenient result. Without transparent reporting, a p-value can hide the amount of searching that happened before the final analysis was selected.
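A minimal simulation can show how searching inflates false positives. It rests on simplifying assumptions that are not in the original text: every null hypothesis is true, the tests are independent, and each p-value is uniformly distributed on [0, 1] under the null. The function names are hypothetical.

```python
import random


def analytic_false_positive_rate(k: int, alpha: float = 0.05) -> float:
    """Chance that at least one of k independent null tests has p < alpha."""
    return 1 - (1 - alpha) ** k


def simulated_false_positive_rate(k: int, alpha: float = 0.05,
                                  n_studies: int = 10_000, seed: int = 0) -> float:
    """Simulate studies that run k null tests and report only the smallest p-value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        smallest_p = min(rng.random() for _ in range(k))
        if smallest_p < alpha:
            hits += 1
    return hits / n_studies


for k in (1, 5, 20):
    print(k, round(analytic_false_positive_rate(k), 3),
          round(simulated_false_positive_rate(k), 3))
```

Under these assumptions, reporting only the smallest of 20 p-values yields a "significant" result in well over half of the simulated studies, even though no effect exists in any of them.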

Why it matters

P-values remain common in science, medicine, policy analysis, and quality control, so understanding them matters even for readers who prefer estimation or Bayesian methods. Used carefully, a p-value can summarize conflict between data and a model. Used carelessly, it can turn uncertainty into an overconfident label.

Key concepts

Null hypothesis: the baseline claim used to calculate the p-value.
Test statistic: the numerical summary used to measure how far the data are from the null expectation.
Tail probability: the probability of results as extreme as, or more extreme than, the observed result under the null model.

What affects it

Sample size, variability, measurement quality, and model assumptions.
The selected test statistic and the one-sided or two-sided framing. The alpha threshold changes the significance decision but not the p-value itself.
Analysis choices such as exclusions, transformations, covariates, and multiple comparisons.

Common misconceptions

A p-value of 0.03 does not mean there is a 3 percent chance the null hypothesis is true.
A lower p-value does not automatically mean an effect is large or important.
A p-value just below 0.05 is not categorically different from one just above 0.05.
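The second misconception in the list above can be illustrated with a short sketch: a tiny effect measured in a very large sample can produce a far smaller p-value than a large effect measured in a small sample. The example assumes a one-sample z-test on standardized data (standard deviation 1); the function name and the numbers are hypothetical.

```python
from math import erfc, sqrt


def p_value_for_effect(standardized_effect: float, n: int) -> float:
    """Two-sided z-test p-value when the observed mean equals standardized_effect."""
    z = standardized_effect * sqrt(n)
    # erfc(|z| / sqrt(2)) equals the two-sided normal tail probability
    # and stays accurate for very small tails.
    return erfc(abs(z) / sqrt(2))


tiny_effect_big_sample = p_value_for_effect(0.05, n=10_000)
large_effect_small_sample = p_value_for_effect(0.8, n=5)
print(f"effect 0.05, n = 10000: p = {tiny_effect_big_sample:.2e}")
print(f"effect 0.80, n = 5:     p = {large_effect_small_sample:.3f}")
```

The p-value mixes effect size with sample size, so it cannot be read as a measure of how large or important an effect is.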

Open questions

How should fields balance p-values with estimation, prediction, decision analysis, and replication?
When should journals discourage bright-line significance thresholds?
How can statistical software and methods training make p-values harder to misinterpret and misuse?