Document ID: V_3_12
Section: V_Mathematics_Information
Keywords: statistics, hypothesis testing, p-value, significance, confidence interval, null hypothesis, Type I error, Type II error, power, Fisher, Neyman, Pearson, t-test, ANOVA, chi-squared, regression, effect size, meta-analysis, Bayesian statistics, frequentist, replication crisis, multiple testing, Bonferroni
Category Tags: mathematics, information
Cross-References: V_3_07 — Probability Theory · V_3_11 — Mathematical Optimization · ZC_1_01 — Psychology Overview · R_3_09 — Molecular Phylogenetics · ZA_3_07 — Particle Accelerators
Reliability Tier: Tier 1 (well-documented, peer-reviewed)
Last Updated: Mar 07, 2026 | Source Count: 10 | Weighted Score: 20 | Source Confidence: [2/5] | Confidence: High (well-documented, peer-reviewed)
QUICK SUMMARY
Statistics — the science of collecting, analyzing, and interpreting data under uncertainty — underpins virtually every empirical science, from medicine and psychology to physics and economics. Modern statistical hypothesis testing grew from two competing frameworks: R. A. Fisher's significance testing (1925), which introduced $p$-values as measures of evidence against a null hypothesis, and the Neyman-Pearson framework (1933), which formalized hypothesis testing as a decision procedure between null ($H_0$) and alternative ($H_1$) hypotheses with controlled error rates (Type I: false positive, $\alpha$; Type II: false negative, $\beta$; power: $1-\beta$). The $p < 0.05$ threshold, though widely used, is arbitrary — Fisher intended $p$-values as continuous measures of evidence, not rigid pass/fail criteria. The replication crisis (2010s) revealed that many published findings with $p < 0.05$ fail to replicate, driven by $p$-hacking, publication bias, underpowered studies, and misunderstanding of what $p$-values actually mean ($p$ is NOT the probability that $H_0$ is true). The American Statistical Association's 2019 statement declared that "$p$-values should not be used as the primary basis for scientific conclusions" and urged moving beyond "statistical significance." Modern best practices emphasize effect sizes, confidence intervals, pre-registration, Bayesian methods, and meta-analysis. Bayesian statistics, which directly computes the probability of hypotheses given data via Bayes' theorem, offers a philosophically coherent alternative but introduces prior specification as a subjective element.
1. VERIFIED CLAIMS (Tier 1 — Peer-Reviewed / Established Statistics)
1.1 Hypothesis Testing Frameworks
- KEY FINDING A $p$-value is the probability of observing data as extreme as or more extreme than the observed data, assuming $H_0$ is true — it is NOT the probability that $H_0$ is true, NOT the probability the result occurred by chance, and NOT the probability of making an error; these misinterpretations are pervasive even among scientists (Greenland et al., 2016); correct interpretation requires understanding the sampling distribution under $H_0$
- Fisher significance testing (1925): $p$-values as continuous measures of evidence against $H_0$ — small $p$-values suggest the data are unlikely under $H_0$; Fisher considered $p = 0.05$ a "convenient level" but not a rigid cutoff; he recommended integrating $p$-values with prior knowledge and replication
- Neyman-Pearson framework (1933): Hypothesis testing as a decision between $H_0$ and $H_1$ — pre-specify $\alpha$ (acceptable Type I error rate, typically 0.05); design study with adequate power ($1-\beta$, typically 0.80); reject $H_0$ if test statistic falls in the rejection region; power depends on effect size, sample size, and $\alpha$; Cohen (1988) provided power analysis guidelines
- Standard tests: Student's $t$-test (means of one or two groups; Gosset, 1908); ANOVA (comparing >2 group means; Fisher, 1925); chi-squared test (categorical data; Pearson, 1900); F-test (variance ratios); Mann-Whitney $U$ (nonparametric rank test); Kruskal-Wallis (nonparametric ANOVA); each has assumptions (normality, independence, homoscedasticity) that must be checked
1.2 Estimation and Confidence Intervals
- Point estimation: Sample statistics (mean $\bar{x}$, variance $s^2$, proportion $\hat{p}$) estimate population parameters — properties of good estimators: unbiasedness (expected value equals parameter), consistency (converges to parameter as $n \to \infty$), efficiency (minimum variance); maximum likelihood estimation (MLE, Fisher, 1922) provides a general framework
- Confidence intervals: A 95% CI means: if we repeated the experiment many times, 95% of the constructed intervals would contain the true parameter — NOT that there is a 95% probability the parameter lies in this specific interval (frequentist interpretation; the parameter is fixed); CIs convey both point estimate and precision; increasingly recommended over $p$-values alone
- Effect sizes: Standardized measures of the magnitude of an effect — Cohen's $d$ (standardized mean difference: 0.2 small, 0.5 medium, 0.8 large), $r$ (correlation coefficient), $\eta^2$ (proportion of variance explained in ANOVA), odds ratio (OR, for binary outcomes); effect sizes are essential because statistical significance depends on sample size ($n$ large enough makes any tiny effect "significant")
1.3 Regression and Modeling
- Linear regression: $y = \beta_0 + \beta_1 x + \varepsilon$ — least squares estimation minimizes $\sum (y_i - \hat{y}_i)^2$; Gauss (1809) and Legendre (1805); Gauss-Markov theorem: OLS estimators are BLUE (best linear unbiased estimators) under standard assumptions; multiple regression extends to $p$ predictors: $y = X\beta + \varepsilon$
- Generalized linear models (GLMs): Extend linear regression to non-normal distributions — logistic regression (binary outcomes, logit link), Poisson regression (count data, log link), gamma regression (positive continuous); McCullagh and Nelder (1989); estimated via maximum likelihood; the backbone of applied statistics
- Model selection: AIC (Akaike Information Criterion, 1973): $AIC = 2k - 2\ln(\hat{L})$ — penalizes model complexity; BIC (Bayesian Information Criterion, Schwarz, 1978): $BIC = k\ln(n) - 2\ln(\hat{L})$ — stronger complexity penalty; cross-validation for prediction accuracy; regularization (LASSO, ridge) for high-dimensional models
1.4 Multiple Testing and Modern Corrections
- Multiple testing problem: Testing $m$ hypotheses at $\alpha = 0.05$ yields expected $0.05m$ false positives by chance — Bonferroni correction: use $\alpha/m$ per test (controls family-wise error rate, conservative); Holm's step-down procedure (less conservative); Benjamini-Hochberg (1995) controls false discovery rate (FDR): proportion of rejections that are false positives; FDR is standard in genomics where thousands of simultaneous tests are routine
- $5\sigma$ standard in physics: Particle physics requires $p < 2.87 \times 10^{-7}$ ($5\sigma$) to claim discovery — the Higgs boson was announced at $5\sigma$ in 2012; this severe threshold accounts for the look-elsewhere effect (searching many channels simultaneously) and the extraordinary nature of the claim
2. CREDIBLE CLAIMS (Tier 2 — Academic / Debated but Supported)
2.1 Replication Crisis
- Reproducibility Project: Psychology (2015): Open Science Collaboration attempted to replicate 100 published psychology results — only 36–47% replicated successfully; average effect sizes were half the originals; "replication crisis" galvanized reform across sciences
- Causes: $p$-hacking (trying many analyses until $p < 0.05$, also called "researcher degrees of freedom," Simmons et al., 2011); publication bias (journals preferentially publish positive results); HARKing (Hypothesizing After Results are Known); underpowered studies (median power ~50% in psychology, Szucs and Ioannidis, 2017); "Ioannidis (2005): Most Published Research Findings Are False" — landmark paper arguing that when studies are low-powered, the expected proportion of true positives among significant results is alarmingly low
- Reforms: Pre-registration (specifying hypotheses and analysis before data collection); Registered Reports (peer review before data collection); open data and open code; larger sample sizes; replications valued rather than penalized; ASA 2019 statement: "Don't say 'statistically significant'" — recommends reporting $p$-values alongside effect sizes, confidence intervals, and contextual interpretation
2.2 Bayesian Statistics
- Bayes' theorem applied to hypotheses: $P(H|D) = \frac{P(D|H)P(H)}{P(D)}$ — directly computes the probability of a hypothesis given data; Bayes factors $BF_{10} = P(D|H_1)/P(D|H_0)$ quantify evidence for $H_1$ vs. $H_0$; Jeffreys' scale: $BF > 10$ strong evidence, $BF > 100$ decisive; avoids many problems of $p$-values
- MCMC methods: Markov chain Monte Carlo (Metropolis et al., 1953; Geman and Geman, 1984; Hastings, 1970) makes Bayesian inference computationally feasible — BUGS, JAGS, Stan software; hierarchical models, mixture models, and complex likelihoods are naturally handled; Bayesian methods dominate in astrophysics, epidemiology, and machine learning
- Prior specification debate: Bayesian analysis requires choosing priors — uninformative/flat priors, Jeffreys priors, weakly informative priors (Gelman et al.); sensitivity analysis assesses how conclusions change with different priors; critics argue priors introduce subjectivity; proponents argue all statistical analyses involve subjective choices (model selection, test choice) and Bayesian methods make these explicit
3. SPECULATIVE CLAIMS (Tier 3 — Possible but Unverified)
3.1 Future of Statistical Methodology
- Post-p-value statistics: Whether the scientific community will successfully transition away from the $p < 0.05$ paradigm — some advocate replacing threshold-based testing with estimation-based approaches (effect sizes, CIs), calibrated $p$-values (Benjamin et al., 2018: redefine significance as $p < 0.005$), or Bayesian methods; cultural change is slow
- Causal inference revolution: Pearl's do-calculus and structural causal models formalize causation beyond correlation — DAGs (directed acyclic graphs), potential outcomes framework (Rubin), and instrumental variables enable causal conclusions from observational data under stated assumptions; "the ladder of causation" (Pearl and Mackenzie, 2018); transforming epidemiology, economics, and social science
4. DUBIOUS CLAIMS (Tier 4 — No Credible Source / Contradicted by Evidence)
4.1 "$p < 0.05$ Proves the Hypothesis Is True"
- [FALSE] A $p$-value below 0.05 does not prove $H_1$ is true or that $H_0$ is false — $p$-values are computed under $H_0$; they cannot tell you the probability that a hypothesis is true without additional information (prior probability, via Bayes' theorem); this misunderstanding is the most common statistical error in published science
IMAGES
| # | Description | Filename | Source | License |
|---|
| 1 | Diagram comparing frequentist and Bayesian approaches to hypothesis testing | — | — | — |
Counter-Arguments & Criticisms
No significant counter-arguments exist in the scholarly literature for the core claims presented here. The topic of Statistics Hypothesis Testing represents established knowledge within mathematics and information theory with no active scholarly dispute over the fundamental claims presented in this document.
BIBLIOGRAPHY
- Fisher, R | 1925 | ∅ | Statistical Methods for Research Workers | ∅ | ∅ | A | ∅ | ∅ | ∅ | ∅ | Oliver and Boyd
- Neyman, J.; Pearson, E | 1933 | "On the Problem of the Most Efficient Tests of Statistical Hypotheses" | Philosophical Transactions of the Royal Society A | ∅ | 231::289–337 | S | ∅ | doi:10.1098/rsta.1933.0009 | ∅ | ∅ | ∅
- Greenland, S. et al | 2016 | "Statistical Tests, P Values, Confidence Intervals, and Power: A Guide to Misinterpretations" | European Journal of Epidemiology | ∅ | 31::337–350 | ∅ | ∅ | doi:10.1007/s10654-016-0149-3 | ∅ | ∅ | ∅
- Ioannidis, J | 2005 | "Why Most Published Research Findings Are False" | PLoS Medicine | ∅ | ∅ | P | ∅ | doi:10.1371/journal.pmed.0020124 | ∅ | ∅ | A. , vol; 2, , e124
- Open Science Collaboration. , vol | 2015 | "Estimating the Reproducibility of Psychological Science" | Science | ∅ | ∅ | 349, , aac4716 | ∅ | doi:10.1126/science.aac4716 | ∅ | ∅ | ∅
- Benjamini, Y.; Hochberg, Y | 1995 | "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing" | Journal of the Royal Statistical Society B | ∅ | 57::289–300 | ∅ | ∅ | doi:10.1111/j.2517-6161.1995.tb02031.x | ∅ | ∅ | ∅
- Wasserstein, R | 2019 | "Moving to a World Beyond 'p < 0.05'" | American Statistician | ∅ | ∅ | L. et al. , vol | ∅ | ∅ | ∅ | ∅ | 73, sup; 1, , pp; 1 19
- Gelman, A. et al | 2013 | ∅ | Bayesian Data Analysis | ∅ | ∅ | CRC Press | 3rd | ∅ | ∅ | ∅ | ∅
- Cohen, J. | 1988 | ∅ | Statistical Power Analysis for the Behavioral Sciences | ∅ | ∅ | Lawrence Erlbaum | 2nd | ∅ | ∅ | ∅ | ∅
- Pearl, J.; Mackenzie, D. | 2018 | ∅ | The Book of Why: The New Science of Cause and Effect | ∅ | ∅ | Basic Books | ∅ | ∅ | ∅ | ∅ | ∅
CROSS-REFERENCE INDEX
New research document — Phase 9 expansion. Last Updated: Mar 07, 2026
<table border="1" cellpadding="12" cellspacing="0" style="border-collapse: collapse; border: 2px solid #888; margin-top: 2em; background: #fafafa;">
<tr><td>
⚠️ AI-Assisted Research Disclaimer
This document was generated and structured with the assistance of AI tools.
While every effort is made to ensure accuracy, AI-assisted content may
contain errors, misattributions, or unintended inaccuracies. **Always
verify claims, dates, and sources independently** before citing or relying
on any information presented here.
- Sources may contain errors. Bibliography entries and cross-references
are checked by automated systems, but mistakes can occur. If something
looks wrong, it may be.
- Speculative and unverified claims are clearly labeled. This project
uses a four-tier evidence system:
- Tier 1 — Verified: Peer-reviewed, established scientific consensus.
- Tier 2 — Credible: Academically supported, debated but grounded.
- Tier 3 — Speculative: Plausible but unverified by mainstream science.
- Tier 4 — Dubious: No credible support or contradicted by evidence.
- This project maps multiple perspectives — not a single truth. Mainstream,
alternative, and skeptical viewpoints are presented side by side for
critical comparison, not endorsement. Inclusion does not imply agreement.
- We are actively improving. Source verification, factuality scoring,
and bibliography enrichment are ongoing. Each revision adds stronger
citations, corrects identified errors, and expands coverage.
📖 For full details on our verification methodology, scoring systems, and
quality metrics, see: Fact-Checking & Verification Systems
Think Openly. Check the sources. Draw your own conclusions.
</td></tr>
</table>