p-values, null hypotheses, and alpha levels

Statistical significance

Statistical significance is a rule-based judgment that data at least as extreme as those observed would be unlikely if a specified null hypothesis were true.

Core meaning

A result is statistically significant when its p-value falls below a chosen significance level.

Common threshold

The conventional alpha level is 0.05, but the right threshold depends on context and consequences.

Major caution

Statistical significance does not prove that an effect is true, important, large, or reproducible.

Figure: A curve showing a shaded p-value tail used in statistical significance testing. Statistical significance usually compares a p-value with a chosen threshold under a null model.

What statistical significance means

Statistical significance is a decision rule used in hypothesis testing. Researchers compare a p-value with a chosen significance level, often called alpha. If the p-value is smaller than alpha, the result is called statistically significant under that test and model.

The null hypothesis

The null hypothesis is the baseline claim being tested, such as no difference, no association, or no treatment effect. A significance test asks how unusual the observed data, or more extreme data, would be if that null hypothesis and the statistical assumptions were true.
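As a minimal sketch of that question, assume a hypothetical coin-flip experiment in which the null hypothesis is a fair coin. The function name `binomial_two_sided_p` and the data (60 heads in 100 flips) are illustrative assumptions, not taken from any study. The code adds up the null probability of every outcome that is no more likely than the observed one, which is one common way to define "at least as extreme" for an exact two-sided test.

```python
from math import comb


def binomial_two_sided_p(k: int, n: int, p0: float = 0.5) -> float:
    """Exact two-sided p-value for k successes in n trials under H0: p = p0."""
    # Probability of each possible outcome 0..n if the null model is true.
    pmf = [comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # "At least as extreme" here means "no more probable than what we saw".
    # The small tolerance guards against floating-point ties.
    tail = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, tail)


# Hypothetical data: 60 heads in 100 flips of a coin assumed fair under H0.
p_value = binomial_two_sided_p(60, 100)
print(f"p = {p_value:.4f}")
```

With these made-up numbers the p-value comes out at roughly 0.057. The calculation depends on the assumed model as well as the null value: if the flips were not independent, the same arithmetic would answer a different question.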

P-values and thresholds

A p-value is not the probability that the null hypothesis is true. It is the probability, calculated under the null model, of observing data at least as extreme as the data in hand. The threshold, such as 0.05, is chosen by convention, design, or decision context. Crossing that threshold changes the label, but it does not create a sharp boundary between truth and falsehood.
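The labeling rule itself is simple enough to write down. The sketch below uses a hypothetical helper, `significance_label`, and invented p-values to show how two nearly identical results receive different labels, and how one result changes label when alpha changes.

```python
def significance_label(p_value: float, alpha: float = 0.05) -> str:
    """Apply the bright-line rule: 'significant' only if p < alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    return "significant" if p_value < alpha else "not significant"


# Two hypothetical p-values that carry almost the same evidence.
for p in (0.049, 0.051):
    print(p, significance_label(p))

# The same p-value can change labels when the threshold changes.
print(significance_label(0.02, alpha=0.05))
print(significance_label(0.02, alpha=0.01))
```

The function returns only a label. It says nothing about how large the effect is or how well the model's assumptions hold.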

Practical importance

A tiny effect can be statistically significant in a very large dataset. A meaningful effect can fail to reach significance in a small or noisy study. This is why significance should be read alongside effect size, confidence intervals, study design, measurement quality, and prior evidence.
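A rough way to see this is to hold the effect fixed and vary only the sample size. The sketch below assumes a two-sample z-test with a known, common standard deviation, which is a simplification chosen to keep the arithmetic visible. The function name and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_p(mean_diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided p-value for a difference in means, assuming a known common SD."""
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    return 2 * NormalDist().cdf(-abs(z))


# Hypothetical scenarios; SD is fixed at 1, so differences are in SD units.
tiny_effect_huge_n = two_sample_z_p(mean_diff=0.02, sd=1.0, n_per_group=1_000_000)
tiny_effect_small_n = two_sample_z_p(mean_diff=0.02, sd=1.0, n_per_group=100)
large_effect_small_n = two_sample_z_p(mean_diff=0.5, sd=1.0, n_per_group=10)

print(f"0.02 SD, n = 1,000,000 per group: p = {tiny_effect_huge_n:.3g}")
print(f"0.02 SD, n = 100 per group:       p = {tiny_effect_small_n:.3g}")
print(f"0.50 SD, n = 10 per group:        p = {large_effect_small_n:.3g}")
```

Under these assumptions the 0.02 SD difference is far below 0.05 with a million observations per group and far above it with a hundred, while the much larger 0.5 SD difference stays above 0.05 with only ten per group.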

Type I and type II errors

A type I error occurs when a test rejects a true null hypothesis. The alpha level controls the long-run type I error rate for a specified testing procedure. A type II error occurs when a test fails to reject a false null hypothesis, that is, when it misses a real effect. Statistical power is the probability of detecting a specified effect when it exists, which equals one minus the type II error rate for that effect.
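These long-run rates can be approximated by simulation. The sketch below repeats a hypothetical study many times, drawing normal data with a known standard deviation of 1 and applying a two-sided z-test of the null "mean = 0". The function name, sample size, effect size, and seed are all assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def rejection_rate(true_mean: float, n: int, alpha: float = 0.05,
                   trials: int = 4000, seed: int = 0) -> float:
    """Share of simulated studies whose two-sided z-test rejects H0: mean = 0."""
    rng = random.Random(seed)
    critical_z = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        z = fmean(sample) * sqrt(n)  # SD is known to be 1 in this sketch
        if abs(z) > critical_z:
            rejections += 1
    return rejections / trials


type_i_rate = rejection_rate(true_mean=0.0, n=25)  # the null is true
power = rejection_rate(true_mean=0.5, n=25)        # a real effect exists
type_ii_rate = 1 - power

print(f"type I rate ~ {type_i_rate:.3f}, power ~ {power:.3f}, "
      f"type II rate ~ {type_ii_rate:.3f}")
```

When the null is true, the rejection rate should land near alpha. When the true mean is 0.5, the rejection rate estimates power for that specific effect and sample size, and it would change if either were different.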

Misuse and overinterpretation

Statistical significance is often overused as a badge of discovery. Selective reporting, p-hacking, multiple comparisons, flexible modeling, and publication bias can all make significant results look more persuasive than they are. In a 2016 statement, the American Statistical Association warned against basing scientific conclusions only on whether a p-value passes a threshold.
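The multiple-comparisons problem in particular has a simple arithmetic core. As a sketch, assume a set of independent tests in which every null hypothesis is true. The chance of at least one false positive then grows quickly with the number of tests. The function names below are hypothetical, and the Bonferroni adjustment is shown as one familiar correction, not as a recommendation.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


n_tests = 20
uncorrected = familywise_error_rate(0.05, n_tests)
corrected = familywise_error_rate(bonferroni_alpha(0.05, n_tests), n_tests)

print(f"{n_tests} tests at alpha = 0.05: family-wise rate = {uncorrected:.3f}")
print(f"{n_tests} tests with Bonferroni: family-wise rate = {corrected:.3f}")
```

With 20 independent tests at alpha = 0.05, the family-wise rate is about 0.64, so a lone "significant" result among many unreported comparisons is weak evidence. Real analyses often involve correlated tests, where the independence assumption used here does not hold exactly.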

Better reporting

Stronger reporting states the research question, planned analyses, sample size, effect sizes, confidence intervals, assumptions, exclusions, and any exploratory work. Replication, preregistration, registered reports, open data, and transparent code can make significance claims easier to evaluate.

Why it matters

Statistical significance influences which results are published, funded, repeated, or acted on. Used carefully, it can summarize incompatibility between data and a model. Used carelessly, it can turn uncertain evidence into an overconfident yes-or-no story.

Key concepts

Alpha level: the chosen cutoff for calling a result statistically significant.
P-value: the probability of data at least as extreme as observed, assuming the null model is true.
Null hypothesis: the baseline model or claim against which evidence is assessed.

What to report with it

Effect sizes and confidence intervals to show magnitude and uncertainty.
Sample size, study design, assumptions, exclusions, and analysis choices.
Whether the analysis was planned in advance or discovered after looking at the data.
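As a sketch of what the first two items can look like in practice, the code below computes a mean difference, a standardized effect size (Cohen's d), and an approximate 95% confidence interval for two small groups. The measurements are invented for illustration. The interval uses a normal approximation, which is an assumption made here for simplicity; with samples this small, a t-based interval would be wider and more appropriate.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical measurements, invented purely for illustration.
treatment = [5.1, 4.9, 5.6, 5.8, 5.2, 5.5, 5.0, 5.7]
control = [4.8, 4.7, 5.0, 5.1, 4.6, 5.2, 4.9, 4.5]

diff = mean(treatment) - mean(control)

# Cohen's d: mean difference divided by the pooled standard deviation.
n_t, n_c = len(treatment), len(control)
pooled_var = ((n_t - 1) * stdev(treatment) ** 2
              + (n_c - 1) * stdev(control) ** 2) / (n_t + n_c - 2)
cohens_d = diff / sqrt(pooled_var)

# Approximate 95% interval for the difference (normal approximation).
standard_error = sqrt(stdev(treatment) ** 2 / n_t + stdev(control) ** 2 / n_c)
z = NormalDist().inv_cdf(0.975)
ci_low, ci_high = diff - z * standard_error, diff + z * standard_error

print(f"n = {n_t} + {n_c}, difference = {diff:.2f}, d = {cohens_d:.2f}, "
      f"approx. 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

Reporting the sample sizes, the estimate, and its interval together lets a reader judge magnitude and uncertainty directly, whichever side of a threshold the p-value falls on.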

Common misconceptions

Statistical significance does not prove causation.
A nonsignificant result does not prove there is no effect.
A p-value just below 0.05 is not fundamentally different from one just above 0.05.

Open questions

When should fields move away from bright-line p-value thresholds?
How should journals reward careful uncertainty reporting rather than only positive findings?
Which combinations of estimation, decision theory, Bayesian methods, and replication best support real-world decisions?