
Hypothesis testing

Hypothesis testing is a statistical workflow for comparing observed data with a stated baseline claim.

Core purpose

It asks whether observed data are sufficiently inconsistent with a specified null hypothesis.

Main output

Many tests report a test statistic, p-value, and decision about whether to reject the null.

Key caution

A test result depends on design, assumptions, sample size, and the analysis plan.

Figure: A t-test distribution curve with a shaded rejection region. Hypothesis tests compare a test statistic with a reference distribution and a rejection rule.

What hypothesis testing is

Hypothesis testing is a structured way to compare data with a statistical claim. A researcher states a null hypothesis, chooses a test and significance level, collects or analyzes data, and decides whether the observed result is unusual enough under the null model to reject that baseline claim.

Null and alternative hypotheses

The null hypothesis is the baseline, often no difference, no association, or no effect. The alternative hypothesis describes the kind of departure the researcher is looking for. Clear hypotheses matter because the same dataset can support different tests depending on the question being asked.

Test statistic

A test statistic compresses the data into a number that measures departure from the null expectation. Examples include t statistics, z statistics, chi-square statistics, and F statistics. The test statistic is interpreted using a reference distribution that comes from the chosen model and assumptions.
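As a minimal sketch of how a test statistic compresses data into one number, the snippet below computes a one-sample t statistic by hand. The function name `one_sample_t`, the measurements, and the null value of 5.0 are all hypothetical and chosen only for illustration.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return the one-sample t statistic and its degrees of freedom."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)        # sample standard deviation (n - 1 denominator)
    standard_error = sd / math.sqrt(n)   # estimated standard deviation of the sample mean
    return (mean - mu0) / standard_error, n - 1

# Hypothetical measurements; the null value mu0 = 5.0 is an assumption.
sample = [5.1, 4.9, 5.6, 5.8, 5.2, 5.4, 5.0, 5.7]
t_stat, df = one_sample_t(sample, mu0=5.0)
print(f"t = {t_stat:.3f}, df = {df}")
```

The resulting number only becomes interpretable once it is compared with a reference distribution, here a t distribution with `df` degrees of freedom, and that comparison is valid only under the test's assumptions.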

P-values and alpha

A p-value is the probability, calculated under the assumption that the null model is true, of obtaining a test statistic at least as extreme as the one actually observed. It is not the probability that the null hypothesis is true. The alpha level is the cutoff, chosen in advance, for rejecting the null. A p-value below alpha is often called statistically significant, but that label does not prove the alternative hypothesis.
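One way to make the "at least as extreme" idea concrete is an exact binomial calculation. The sketch below assumes a hypothetical experiment of 60 heads in 100 coin flips, a null model of a fair coin, and an alpha of 0.05; the helper `binomial_upper_tail` is an illustrative name, not a library function.

```python
from math import comb

def binomial_upper_tail(successes, trials, p_null=0.5):
    """P(X >= successes) when X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

alpha = 0.05  # chosen before looking at the data (assumption for this example)

# One-sided question: is the coin biased toward heads?
p_one_sided = binomial_upper_tail(60, 100)

# Two-sided question: is the coin biased in either direction?
# With p_null = 0.5 the null distribution is symmetric, so one tail can be doubled.
p_two_sided = min(1.0, 2 * p_one_sided)

print(f"one-sided p = {p_one_sided:.4f}, reject at alpha: {p_one_sided < alpha}")
print(f"two-sided p = {p_two_sided:.4f}, reject at alpha: {p_two_sided < alpha}")
```

In this example the same data fall on different sides of the cutoff depending on whether the alternative is one-sided or two-sided, which is one reason the hypotheses should be fixed before the data are examined.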

Errors and power

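A type I error occurs when a true null hypothesis is rejected; when the test's assumptions hold, alpha is the long-run rate of this error. A type II error occurs when a false null hypothesis is not rejected, so a real effect is missed. Statistical power is the probability that a test will detect a specified effect when that effect truly exists, which is one minus the type II error rate for that effect. Power depends on sample size, effect size, variability, design, and the chosen alpha level.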
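The dependence of power on sample size, effect size, variability, and alpha can be sketched for the simplest case: a one-sided, one-sample z-test with a known standard deviation. That setting is an assumption made here for tractability; real studies usually need a power calculation matched to their actual test. The function name `z_test_power` is illustrative.

```python
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided, one-sample z-test with known sigma.

    effect is the true mean minus the null mean (the alternative is "greater").
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)   # rejection threshold on the z scale
    shift = effect * n ** 0.5 / sigma        # mean of the z statistic under the true effect
    return 1 - std_normal.cdf(z_crit - shift)

# Hypothetical scenario: true effect of half a standard deviation.
for n in (10, 25, 50):
    print(f"n = {n:>2}: power = {z_test_power(0.5, 1.0, n):.3f}")

# With no true effect, the rejection probability equals alpha (the type I error rate).
print(f"effect = 0: rejection rate = {z_test_power(0.0, 1.0, 25):.3f}")
```

Under these assumptions, power rises with sample size and effect size and falls as variability grows or alpha is made stricter.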

Assumptions and design

A hypothesis test is only as useful as its design and assumptions. Random sampling, random assignment, independence, measurement quality, missing data, distributional assumptions, and preregistered analysis choices all affect whether the test answers the intended question.

Misuse and alternatives

Hypothesis testing is often misused when results are reduced to a yes-or-no threshold. Better reporting includes effect sizes, confidence intervals, uncertainty, sensitivity analyses, and transparent handling of exploratory work. In some settings, estimation, prediction, Bayesian analysis, or decision analysis may answer the practical question more directly.
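As a rough illustration of reporting an effect size and an interval rather than only a reject-or-not decision, the sketch below summarizes a difference between two groups. It assumes a large-sample normal approximation for the interval, which is too narrow for small samples (a t-based interval would be wider), and the name `mean_difference_summary` and the data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean, stdev

def mean_difference_summary(group_a, group_b, confidence=0.95):
    """Difference in means, Cohen's d, and a normal-approximation confidence interval."""
    n_a, n_b = len(group_a), len(group_b)
    diff = fmean(group_a) - fmean(group_b)
    var_a, var_b = stdev(group_a) ** 2, stdev(group_b) ** 2

    # Cohen's d: the difference expressed in units of the pooled standard deviation.
    pooled_sd = sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    cohens_d = diff / pooled_sd

    # Unpooled standard error with a normal critical value (large-sample assumption).
    se = sqrt(var_a / n_a + var_b / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, cohens_d, (diff - z * se, diff + z * se)

# Hypothetical outcome scores for two groups.
control = [12.1, 11.4, 13.0, 12.7, 11.9, 12.3, 12.8, 11.6]
treated = [12.9, 13.4, 12.2, 13.8, 13.1, 12.6, 13.5, 13.0]
diff, d, (low, high) = mean_difference_summary(treated, control)
print(f"difference = {diff:.2f}, Cohen's d = {d:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Reporting the estimate with its interval shows both the size of the effect and how precisely it was measured, which a bare p-value does not.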

Why it matters

Hypothesis tests influence scientific publication, medical claims, product experiments, quality control, and policy analysis. Used carefully, they discipline how evidence is compared with a claim. Used carelessly, they can make fragile or biased results look decisive.

Key concepts

Null hypothesis: the baseline claim tested against the data.
Alternative hypothesis: the departure from the null that the test is designed to detect.
Significance level: the preset threshold for rejecting the null hypothesis.

Common steps

State the null and alternative hypotheses before looking at results.
Choose the test, alpha level, sample size, and analysis rules.
Calculate the test statistic and p-value, then interpret them with effect size and uncertainty.
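The steps above can be sketched end to end with a permutation test, used here only because it needs few distributional assumptions; it is one possible choice of test, not a recommendation for every study. The data, the alpha of 0.05, the number of permutations, and the function name `permutation_test` are all assumptions for illustration.

```python
import random
from statistics import fmean

def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in means.

    Null hypothesis: group labels are exchangeable (no difference between groups).
    Returns the observed difference and an estimated p-value.
    """
    rng = random.Random(seed)
    observed = fmean(group_a) - fmean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = fmean(pooled[:n_a]) - fmean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            at_least_as_extreme += 1
    # Adding one to both counts keeps the estimate from being exactly zero.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value

# Steps 1 and 2: hypotheses, test, and alpha are fixed before looking at results.
alpha = 0.05
# Hypothetical data for two groups.
group_a = [23.1, 25.4, 22.8, 26.0, 24.7, 25.1]
group_b = [21.9, 22.5, 23.0, 21.4, 22.8, 22.1]

# Step 3: compute the statistic and p-value, then report them with the effect itself.
observed, p_value = permutation_test(group_a, group_b)
print(f"observed difference = {observed:.2f}, p = {p_value:.4f}, reject: {p_value < alpha}")
```

Fixing the seed makes the Monte Carlo estimate reproducible; a different seed or permutation count will change the p-value slightly.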

Common misconceptions

Rejecting the null does not prove the alternative is true.
Failing to reject the null does not prove there is no effect.
Statistical significance is not the same as practical importance.
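The gap between statistical significance and practical importance can be illustrated with a small calculation. The sketch below assumes a one-sample z-test with known standard deviation and a hypothetical effect of one hundredth of a standard deviation, then varies only the sample size; `two_sided_z_p_value` is an illustrative helper.

```python
from math import erfc, sqrt

def two_sided_z_p_value(mean_diff, sigma, n):
    """Two-sided p-value for a one-sample z-test with known sigma."""
    z = mean_diff / (sigma / sqrt(n))
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)); erfc stays accurate in the far tails.
    return erfc(abs(z) / sqrt(2))

tiny_effect = 0.01  # hypothetical observed shift, in units where sigma = 1
for n in (100, 10_000, 1_000_000):
    p = two_sided_z_p_value(tiny_effect, sigma=1.0, n=n)
    print(f"n = {n:>9,}: p = {p:.3g}")
```

The observed effect is identical in every row; only the sample size changes. A very small p-value therefore says that the data are hard to reconcile with the null model, not that the effect is large enough to matter.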

Open questions

When should fields replace bright-line testing with estimation or decision-focused methods?
How can software help users check assumptions and report uncertainty instead of only p-values?
How should preregistration handle reasonable flexibility in messy real-world data analysis?