Unpacking p-hacking and publication bias

American Economic Review, 113, November 2023, pp. 2974-3002

by Abel Brodeur, Scott Carrell, David Figlio, Lester Lusher

May 14, 2024, 12:21

Summary

Intro

\(P\)-hacking

Definition
Knowingly or otherwise, authors’ actions that produce “more favorable” \(p\)-values (the “bias” of Ioannidis 2005). Definitions vary somewhat across the literature; this paper treats publication bias as acts of editors and reviewers and \(p\)-hacking as acts of authors. Andrew Gelman dislikes the term “p-hacking”: authors sometimes unintentionally and honestly choose a particular estimation that amounts to cherry-picking, so intentions do not matter, yet “hacking” sounds as if intention is all that counts, and authors may respond that they are being honest, which clouds the problem. He prefers “forking paths”. Gelman also says the problem is in selective reporting, or in \(p\), not in hacking per se. In other words, you can “hack” as long as you show everything you did, which gives rise to multiple testing and an increase in \(p\) values. Problem solved.

Motivations
Presence of, or a belief in, publication bias (i.e., that only papers with low \(p\)-values get published). Another motivation: a weakly founded hypothesis.

Effects
With null results left unpublished, published research yields estimates biased away from zero and misleading confidence sets.

\(P\) curve is flat, some say

\(P\) curve \(:=\) pdf of \(p\) values from all studies.

Why, as argued by Christensen and Miguel (2018):

A \(p\) curve is uniformly distributed under the null of no effect and no \(p\)-hacking.

  • A \(p\) value \(\leqslant\) 4% occurs 4% of the time.
  • A \(p\) value \(\leqslant\) 5% occurs 5% of the time.
  • The difference between the two, a \(p\) value in (4%, 5%], therefore occurs 1% of the time.
  • Any other interval one percentage point wide likewise occurs with probability 1%.
  • So the density is uniform (a small simulation follows).
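A minimal simulation of this point (my own sketch, not from the paper): draw test statistics under a true null, convert them to \(p\) values, and check that the histogram of \(p\) values is roughly flat.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulate many two-sided z tests under a true null effect.
n_tests = 100_000
z = rng.standard_normal(n_tests)            # test statistics under H0
p = 2 * (1 - stats.norm.cdf(np.abs(z)))     # two-sided p values

# Under the null, each one-percentage-point bin should hold about 1% of the tests.
counts, _ = np.histogram(p, bins=np.arange(0, 1.01, 0.01))
print(counts / n_tests)                     # all entries close to 0.01: a flat p curve
```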

But this argument only covers the null of no effect; Elliott, Kudrin, and Wüthrich (2022) derived sufficient conditions for the \(p\) curve to be non-increasing more generally…

Existing literature

Empirical distribution of test statistics

  • \(P\) curves constructed from published manuscripts show bunching at the 1, 5, and 10% thresholds (Brodeur et al. 2016; Gerber, Malhotra, et al. 2008; Gerber and Malhotra 2008; less so for recent RCTs, as found in Vivalt 2019).
  • \(P\) curves constructed from published manuscripts and from working-paper versions show similar bunching at 1, 5, and 10%, so R&R (revise and resubmit) does not mitigate the problem (Brodeur, Cook, and Heyes 2020).

Theoretical properties of test statistics

  • Sufficient conditions for a non-increasing \(p\) curve: a monotone-likelihood-ratio-type condition (Elliott, Kudrin, and Wüthrich 2022): \(f'_{h}(x)f(x)\geqslant f_{h}(x)f'(x)\), where \(f_{h}\) is the density of the test statistic under the alternative and \(f\) is its density under the null \(h=0\). See the derivation after this list.
  • Tests (restrictions) derived from the non-increasingness of the \(p\) curve: non-increasing frequencies and upper bounds on frequencies, shown by Elliott, Kudrin, and Wüthrich (2022), can be tested with the moment inequality tests of Cox and Shi (2022); concavity of the distribution function can be tested with the distance test of Carolan and Tebbs (2005); continuity of the density can be tested with the RDD density test of Cattaneo, Jansson, and Ma (2020).
  • Related work: Andrews and Kasy (2019) estimate the conditional probability of publication to correct for publication bias when estimating effect sizes.
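A quick derivation of why this condition yields a non-increasing \(p\) curve (my own sketch for a one-sided test with null CDF \(F\) and alternative CDF \(F_{h}\); the paper’s setting may differ). The \(p\) value is \(p=1-F(T)\), so under the alternative \[ G_{h}(p)=\Pr\left(1-F(T)\leqslant p\right)=1-F_{h}\left(F^{-1}(1-p)\right), \qquad g_{h}(p)=\frac{f_{h}\left(F^{-1}(1-p)\right)}{f\left(F^{-1}(1-p)\right)}. \] As \(p\) grows, the critical value \(x_{p}=F^{-1}(1-p)\) falls, so \(g_{h}\) is non-increasing in \(p\) exactly when the likelihood ratio \(f_{h}/f\) is non-decreasing in \(x\), which is the condition \(f'_{h}(x)f(x)\geqslant f_{h}(x)f'(x)\) above.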

Contribution of this paper

This paper uses both initially submitted versions (which reflect \(p\)-hacking only) and published versions (which reflect \(p\)-hacking plus publication bias), together with precise information on the peer-review process.

Using only published papers, one cannot distinguish \(p\)-hacking from publication bias, which limits the design of interventions.

Data

Sources

Journal of Human Resources
All submissions in 2013-2018.
Four editorial phases: initial submission (3,607), desk rejection (2,365), reviewer rejection (1,018), publication (223).

Online author and reviewer information
Individual web pages, Google Scholar pages, RePEc, NBER.
Gender, PhD institution, graduation year, tenure status, prior publication history, NBER affiliation. The ranking of the PhD program is based on RePEc department productivity.

Anonymous survey of microeconomists
Population: all 561 authors who published papers using RCT, DID, IV, or RDD in top-25 journals in 2018.

Sampling

Empirical strategy

Design

  1. Tests of \(p\)-hacking: test restrictions on the curvature, level, and continuity of the \(p\) curve to detect \(p\)-hacking.
  2. Tests of publication bias: use each test statistic as an observation and link it to editorial decisions in a regression.
  3. External validity (showing relevance across microeconomics) of the findings: use an anonymous survey of authors to elicit questionable research practices (QRPs) and relate them to the journals submitted to.

The second may be problematic due to omitted variables.

Nonparametric tests using the null of flat \(p\) curve

Fisher’s \(\chi^{2}\) test (omitted in this paper because it always gives \(p\) values of 1)

Used by Simonsohn, Nelson, and Simmons (2014). Fisher showed that for independent \(p\) values \(p_{1}, \dots, p_{m}\) that are uniform on \([0,1]\) under the null, \(s=-2\sum\limits_{k=1}^{m}\ln p_{k}\sim \chi^{2}(2m)\).

How to use Fisher’s test:

Here, each \(p\) value is just data, and we need the \(p\) value of each data point. For continuous tests at the 5% level, under the null that reported \(p\) values are uniformly distributed on \(U[0, .05]\), one can obtain the \(p\) value of each observation by dividing by .05. For example, a reported \(p\) value less than .01 occurs with probability 20%, so divide .01 by .05 to get .2; one less than .02 occurs with probability 40%, so divide .02 by .05. So transform each \(p_{k}\) into \(p_{k}/\alpha\), where \(\alpha\) is the significance level, compute \(s\), and the test’s \(p\) value is \(1-F_{\chi^{2}(2m)}(s)\), where \(F_{\chi^{2}(2m)}\) is the \(\chi^{2}(2m)\) CDF.
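A minimal sketch of this recipe (my own illustration, assuming a vector of reported \(p\) values already restricted to \([0, .05]\); the function name is mine):

```python
import numpy as np
from scipy import stats

def fisher_pcurve_test(p_values, alpha=0.05):
    """Fisher's chi-squared test for p values assumed uniform on [0, alpha] under the null."""
    p = np.asarray(p_values, dtype=float)
    u = p / alpha                       # rescale so the null is U[0, 1]
    s = -2 * np.sum(np.log(u))          # Fisher's statistic, chi2 with 2m dof under the null
    dof = 2 * len(u)
    return 1 - stats.chi2.cdf(s, dof)   # right-tail p value of the combined test

# Example: p values piled up just below .05 yield a small s and hence a combined
# p value near 1, consistent with the remark in the heading above.
print(fisher_pcurve_test([0.049, 0.048, 0.047, 0.046, 0.045]))
```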

Nonparametric tests using the null of non-increasing \(p\) curve

Binomial tests

Compare the number of tests with \(p\in[.040, .045]\) versus \(p\in(.045, .050]\). Under the null of no \(p\)-hacking and a non-increasing \(p\) curve, we should have: \[ \# \mbox{ in }\underbrace{[.040, .045]}_{\mathrm{farther \ from \ cutoff}} \geqslant \# \mbox{ in }\underbrace{(.045, .050].}_{\mathrm{closer \ to \ cutoff}} \] The binomial test examines the null of equal or smaller probability of falling in \((.045, .050]\) (also from Simonsohn, Nelson, and Simmons 2014). Given the data (\(n\) tests in \([.040, .050]\), of which \(k\) fall in the closer bin \((.045, .050]\)), the \(p\) value against the null \(\pi=.5\) is the upper-tail binomial probability \[p=\sum_{i=k}^{n}\Pr(X=i)=\sum_{i=k}^{n}\binom{n}{i}(.5)^{i}(1-.5)^{n-i}=\sum_{i=k}^{n}\binom{n}{i}(.5)^{n}.\]
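A minimal sketch of this binomial test (my own illustration with hypothetical data; scipy.stats.binomtest performs the exact one-sided calculation):

```python
import numpy as np
from scipy.stats import binomtest

# Hypothetical example data: reported p values from a collection of tests.
p_values = np.array([0.041, 0.044, 0.046, 0.047, 0.048, 0.049, 0.049, 0.043])

far   = np.sum((p_values >= 0.040) & (p_values <= 0.045))  # farther from the .05 cutoff
close = np.sum((p_values >  0.045) & (p_values <= 0.050))  # closer to the .05 cutoff
n = far + close

# One-sided test of pi <= .5 for the closer bin; a small p value suggests bunching near .05.
result = binomtest(close, n, p=0.5, alternative="greater")
print(close, n, result.pvalue)
```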

Conditional moment inequality tests (CS1, CS2B histogram based tests)

Divide the \(p\) values into bins and test non-increasingness (\(\Delta\)frequency \(\leqslant 0\) between successive bins) using the conditional \(\chi^{2}\) tests of Cox and Shi (2022). A sketch of the moment construction follows.
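A minimal sketch of building the inputs to such a test (my own illustration with hypothetical data; it constructs only the bin frequencies and the \(\Delta\)frequency moments, not the Cox–Shi conditional \(\chi^{2}\) test itself):

```python
import numpy as np

# Hypothetical example data: reported p values below .15, as in typical p-curve analyses.
p_values = np.array([0.002, 0.011, 0.031, 0.042, 0.044, 0.048, 0.049, 0.049, 0.061, 0.12])

bins = np.arange(0.0, 0.155, 0.005)          # 0.5-percentage-point bins
freq, _ = np.histogram(p_values, bins=bins)

# Moment inequalities implied by a non-increasing p curve: each successive bin
# should not be more populated than the previous one (delta <= 0).
delta = np.diff(freq)
print(freq)
print(delta)          # positive entries flag bins that violate non-increasingness
```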

LCM test based on concavity of distribution function

Compare the empirical distribution of \(p\) values with its least concave majorant (the LCM of \(f\) is the smallest function that is concave and lies above \(f\)) and test their distance: \(T=\sqrt{n}\,||\mathcal M \hat{G}-\hat{G}||_{\infty} \stackrel{d}{\longrightarrow}||\mathcal M B-B||_{\infty}\), where \(\mathcal M\) is the LCM operator, \(\hat{G}\) is the empirical distribution function, and \(B\) is a Brownian bridge on \([0,1]\). A Brownian bridge is a Brownian motion pinned to zero at both endpoints, i.e., shaped like a bridge. Intuitively this makes sense because, under concavity, the points lie on the LCM so their difference is zero; the deeper mathematical intuition is beyond me, though. \(||\mathbf{a}||_{\infty}\) is the sup norm \(\max\{|a_{i}| : 1\leqslant i \leqslant n\}\), i.e., the largest element in absolute value. A sketch of the statistic follows.
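A simplified sketch of the statistic (my own illustration with hypothetical data; it evaluates the LCM only at the observed points via an upper convex hull, which is cruder than the construction used in the paper):

```python
import numpy as np

def lcm_at(x, y):
    """Least concave majorant (upper convex hull) of the points (x, y), evaluated at x."""
    hull = [(x[0], y[0])]
    for xi, yi in zip(x[1:], y[1:]):
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # Pop the middle point if it lies on or below the chord to the new point.
            if (y2 - y1) * (xi - x1) <= (yi - y1) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append((xi, yi))
    hx = np.array([p[0] for p in hull])
    hy = np.array([p[1] for p in hull])
    return np.interp(x, hx, hy)

# Hypothetical example data: p values rescaled to [0, 1] (e.g., divided by a .15 upper bound).
u = np.sort(np.array([0.01, 0.15, 0.25, 0.28, 0.30, 0.31, 0.32, 0.33, 0.60, 0.90]))
n = len(u)

x = np.concatenate(([0.0], u, [1.0]))
ecdf = np.concatenate(([0.0], np.arange(1, n + 1) / n, [1.0]))

T = np.sqrt(n) * np.max(lcm_at(x, ecdf) - ecdf)   # sup-norm distance to the LCM
print(T)   # compare against simulated quantiles of sup|MB - B| for a Brownian bridge B
```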

Nonparametric tests using the null of continuity

Discontinuity test

This is the RDD density test of Cattaneo, Jansson, and Ma (2020). It tests for discontinuities (jumps) in the density of test statistics, applied at the 1, 5, and 10% thresholds. A crude stand-in sketch follows.
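A crude stand-in for the idea (my own sketch with hypothetical data and an ad hoc bandwidth, not the local polynomial density estimator of Cattaneo, Jansson, and Ma): compare simple density estimates just below and just above the \(z=1.96\) cutoff.

```python
import numpy as np

# Hypothetical example data: reported z statistics.
z = np.array([1.80, 1.88, 1.91, 1.94, 1.97, 1.98, 1.99, 2.00, 2.02, 2.05, 2.10, 2.30])

cutoff, h = 1.96, 0.10   # cutoff and bandwidth (both choices are ad hoc here)
left  = np.sum((z >= cutoff - h) & (z < cutoff))
right = np.sum((z >= cutoff) & (z < cutoff + h))

# Naive density estimates on each side; a large jump hints at bunching just above 1.96.
dens_left, dens_right = left / (len(z) * h), right / (len(z) * h)
print(dens_left, dens_right, dens_right - dens_left)
```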

Parametric tests using the null of no “statistical significance” (with covariates)

Caliper tests

A logit of statistical significance on editorial decisions (Tufte 1985): \[ \mathrm{Significant} = a\cdot\mathrm{EditorialDecision}+\mathbf{b}'\mathbf{x}+e. \] Significant = 0 if the test statistic does not reach the threshold and 1 if it surpasses it, run separately at the 1, 5, and 10% levels. Two notes of mine: (i) if Signif at 10% is negatively correlated with Signif at 5%, and Reject is negatively correlated with Signif at 5%, the estimate on Reject will be upwardly biased for Signif at 10%; (ii) why not run the regression at the paper level, where the decision is actually made, i.e., Decision \(= c_{1}\times\)NumOfSignificant\(_{.05}+c_{2}\times\)NumOfSignificant\(_{.10}+\mathbf{c}'\mathbf{x}+u\)? That would take care of the omitted-variable bias arising from treating each test statistic of a paper separately. E.g., for the 5% level, with wide and narrow bandwidths \(h\) (a regression sketch follows the coding rule below):

  • Coded as 0 if \(z\in[1.96-h, 1.96]\).
  • Coded as 1 if \(z\in(1.96, 1.96+h]\).
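A minimal sketch of such a caliper-style logit (my own illustration with made-up variable names and simulated data, not the authors’ specification), with standard errors clustered at the paper level as discussed later:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per test statistic, within a window around z = 1.96.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "z": rng.uniform(1.96 - 0.5, 1.96 + 0.5, n),
    "desk_rejected": rng.integers(0, 2, n),      # editorial decision
    "top_phd": rng.integers(0, 2, n),            # example author covariate
    "paper_id": rng.integers(0, 80, n),          # cluster identifier
})
df["significant"] = (df["z"] > 1.96).astype(int)  # caliper coding at the 5% level

model = smf.logit("significant ~ desk_rejected + top_phd", data=df)
res = model.fit(cov_type="cluster", cov_kwds={"groups": df["paper_id"]}, disp=0)
print(res.summary())
```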

If statistical significance (“SS”) at the P% level is associated with rejection, what does it mean?

  • Given the bunching at P%, it means the peer review rejects \(p\) hacked papers more, on average.
  • If the review blindly/randomly accepted or rejected, there would be no correlation with Significant.
  • The peer review detects or questions the \(p\) hacked papers more.
  • Or, equivalently, the peer review rejects Significant = 1 papers more.

The last statement is true but sounds odd, because the peer review does not accept null results more often. (This is just my gut feeling, but I think I am correct.)

  • Once other test stats in the same paper are taken into account, a positive correlation of SS and rejection can become weaker.

Other issues

Rounding problem

Results do not change after dealing with the rounding problem.

How the rounding problem is dealt with:

There are many cases of \(z=2\) exactly (and many of 3, 4, and 5 as well; see Figure A1 (a)). This is the “rounding errors” problem, which is shown to inflate bunching (Kranz and Pütz 2022a). Presumably the statistic may have been 1.999 or 2.001 and was reported after rounding at the third digit? It is hard to imagine why we should have so many values of 1.999 or 2.001, though.

To deal with rounding errors, the authors incorporate two methods (notes and a small de-rounding sketch follow the list):

  1. Randomly sample statistics from the implied uniform rounding intervals (“de-rounding”).
  2. Follow Kranz and Pütz (2022) and drop coarsely rounded observations.

Notes on 1: one can get a confidence interval for a reported SD but not for an SE because, in a frequentist world, the SE is a fixed population parameter that measures the variability of a statistic (the mean) across samples. Here \(\widehat{SE}=\frac{\widehat{SD}}{\sqrt{N}}\), so the authors may be dividing the CI bounds by \(\sqrt{N}\)? I don’t know… To get a 95% CI for the SD, use the relationship \((n-1)\frac{s^{2}}{\sigma^{2}}\sim \chi^{2}(n-1)\) (the sum of squared standardized errors is \(\chi^{2}\) with the corresponding degrees of freedom) and find \(L, U\) such that \(.95=P\left(L\leqslant (n-1)\frac{s^{2}}{\sigma^{2}}\leqslant U\right)\), where \(L\) and \(U\), respectively, cut probability 0.025 from the lower and upper tails of the (asymmetric) \(\chi^{2}(n-1)\) distribution. Since \(P\left(L\leqslant (n-1)\frac{s^{2}}{\sigma^{2}}\leqslant U\right)=P\left(\frac{(n-1)s^{2}}{U}\leqslant \sigma^{2}\leqslant \frac{(n-1)s^{2}}{L}\right)\), the CI is \(\left(\frac{(n-1)s^{2}}{U}, \frac{(n-1)s^{2}}{L}\right)\); use qchisq(c(.025, .975), \(n-1\)) to get \(L, U\).

Notes on 2: drop observations whose significand of \(s\) (the significant digits written as an integer) is less than 37. E.g., under three significant digits, the significand of \(s=.012\) is 12. The significand depends on the notation: written as \(s=12\times 10^{-3}\) it is 12, while written as \(s=1.2\times 10^{-2}\) it is 1.2. A smaller significand gives a larger and more coarse value of \(z=\frac{\hat{b}}{s}\) (hence of \(p\)), so setting a lower bound amounts to excluding coarsely rounded \(z\) values.
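A minimal sketch of the de-rounding idea in method 1 (my own illustration, not the authors’ procedure): if a coefficient and its standard error are reported rounded to a given number of decimals, draw the underlying values uniformly from the implied rounding intervals and recompute \(z\).

```python
import numpy as np

rng = np.random.default_rng(0)

def deround_z(b_reported, se_reported, decimals, n_draws=1_000):
    """Draw z statistics consistent with coefficient/SE reported rounded to `decimals` digits."""
    half = 0.5 * 10.0 ** (-decimals)                # half-width of the rounding interval
    b  = rng.uniform(b_reported - half, b_reported + half, n_draws)
    se = rng.uniform(se_reported - half, se_reported + half, n_draws)
    return b / se

# Example: b = 0.02, se = 0.01 reported to two decimals gives z = 2 exactly before
# de-rounding, but a wide spread of plausible z values afterwards.
z_draws = deround_z(0.02, 0.01, decimals=2)
print(z_draws.min(), z_draws.max())
```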

Under both methods, the bunching disappears (Tables A3, A4), but \(\hat{a}\) in the LPM does not seem to change (Tables A7, A10, A13, A16, A18). Weird…

SEs

SEs are clustered at the paper level, because estimated results should be correlated within a paper. I would cluster at the co-editor level because the co-editor “treatment” assignment happens a few levels higher.

Power?

Just looking at what these tests do, we can imagine:

  • Underpowered: Fisher, LCM, discontinuity.
  • Can be OK: Binomial, CS1, CS2B, Caliper.
  • The caliper test is forced to be local (by applying a window) and is parametric; the others are local and nonparametric.

Review process

Summary statistics

Main results

Before we go into details

Two difficulties in summarising the results.

  1. The results of the bunching tests (Table 2) and the caliper tests (all other tables) are inconsistent.
  2. The binomial at 5% and CS2B tests almost always reject no bunching at every decision phase (near-zero \(p\) values). Do these tests over-reject?

I will try to point out what the authors write.

Initial submissions

Desk rejections

Stat signif (“SS”) is positively correlated with desk rejection at the 10% level (Table 4).

Reviewer recommendations

SS at the 5% level gets more reviewer acceptance, for both weak and strong recommendations (Table 5), although the \(p\) values are not that small. (There is a typo in the Table 5 header. The paper also states that “The mass of p-values around significance thresholds becomes more pronounced as we move from the first figure (rejections) to the third figure (strong positive recommendations).” (p. 2990), but I cannot see it in Figure 4…) The paper further notes: “(A) weakly positive or strongly positive review were more likely to be marginally significant at the 10% level than negative reviews (though these estimates are not statistically significant).” (p. 2990). And: “(W)e include a vector of reviewer level covariates to account for potential correlation in (a) the assignment of manuscripts with marginally significant results to (b) reviewers with a higher propensity to review manuscripts positively. We find little difference in our estimates between columns (2) and (3), suggesting editors do not choose reviewers based on both the paper’s marginal significance and the reviewers propensity to review papers positively or negatively.” (p. 2990).

Initial vs. final drafts of accepted papers

SS is not correlated with the differences between initial and final drafts (Table 6). The authors say this is unsurprising because reviews mostly comment on robustness or heterogeneity tables, which this paper does not process (footnote 26).

Accepted vs. rejected drafts

SS is not associated with acceptance/rejection (Table 7). This is very, very surprising. Why do we see so few non-SS estimates in the main tables? Is it my bias that my attention goes mostly to SS estimates?

Published elsewhere vs. never published among rejected papers

Among rejected papers, results with SS at the 5% level are more likely never to be published (Table 8).

Robustness checks

Anonymous survey results

A social desirability bias may induce underreporting of these behaviors, so questionable research practices (QRPs) may be more prevalent than reported.

Aside: HARKing (Hypothesizing After the Results are Known, i.e., modifying the original hypothesis) is bad, which is less well known

Impressions