Main results
Before we go into details:
There are two difficulties in summarising the results.
- Results of bunching tests (Table 2) and Caliper tests (all other tables) are inconsistent.
- The binomial test at the 5% level and the CS2B test almost always reject no bunching at any decision phase (near-zero \(p\) values). Are these tests oversized (over-rejecting)?
I will try to point out what the authors write. Note:
- The authors tend to rely on Caliper test results in the main text.
- Caliper tests may suffer from omitted variable bias.
Initial submissions
- Non-increasingness is rejected with \(p<1\%\) by the CS1, CS2B, and LCM tests (Table 2). No tests use the derounding method of Kranz and Pütz (2022b). The authors use the KP method only for regressions (Caliper tests), with no reason given. Given that the \(p\) values under the KP method become large for bunching, this omission is inexplicable. At the same time, the KP method can take away the bulk of \(p\)-hacking: if authors choose to round values coarsely to get \(z=2\), that itself is \(p\)-hacking.
- Binomial and discontinuity tests for bunching (a minimal binomial-test sketch follows this list):
  - 10% level: \(p<1\%\).
  - 5% level: mixed results.
  - 1% level: \(p>50\%\) (Table A2).
- Derounded tests (using resampled data from uniform CIs) give large \(p\) values, except for CS2B:
  - 10% level: \(p<1\%\).
- Caliper tests for statistical significance (“SS”; a minimal regression sketch follows this list):
  - DID, IV, and (less so) RCT are more strongly correlated with SS than RDD (Table 3).
  - Author characteristics are not correlated with SS.
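The binomial bunching tests above compare how many test statistics fall just above versus just below a significance cutoff. Below is a minimal sketch of such a test, assuming only a vector of absolute \(z\)-statistics; the cutoff, window width, and one-sided alternative are illustrative choices, not necessarily the paper’s exact specification.

```python
# Minimal sketch of a binomial bunching test at the 5% threshold (z = 1.96).
# `z` is assumed to be an array of absolute z-statistics; the window width
# of 0.2 is an arbitrary illustrative choice.
import numpy as np
from scipy.stats import binomtest

def binomial_bunching_test(z, threshold=1.96, width=0.2):
    """One-sided test: do z-statistics just above the threshold outnumber those just below?"""
    in_window = z[(z > threshold - width) & (z < threshold + width)]
    n_above = int((in_window > threshold).sum())
    # Under no bunching, at most half of the mass in the window should sit
    # just above the threshold.
    return binomtest(n_above, n=len(in_window), p=0.5, alternative="greater")

# Illustrative use with simulated, bunching-free z-statistics:
rng = np.random.default_rng(0)
z = np.abs(rng.standard_normal(5000) + 1.0)
print(binomial_bunching_test(z).pvalue)
```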
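The Caliper tests are regressions restricted to a window around the cutoff, with an indicator for being above the cutoff as the outcome. A minimal sketch of a Caliper-style regression on method dummies (as in Table 3) follows, assuming a hypothetical data frame with `z` and `method` columns; the caliper width and the linear probability specification are illustrative assumptions, not the paper’s exact setup.

```python
# Minimal sketch of a Caliper test: within a window around z = 1.96, regress a
# significance dummy on method dummies (RDD as the baseline category). The data
# frame, column names, and caliper width are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def caliper_test(df, threshold=1.96, caliper=0.5):
    sub = df[(df["z"] > threshold - caliper) & (df["z"] < threshold + caliper)].copy()
    sub["significant"] = (sub["z"] > threshold).astype(int)
    # Linear probability model with heteroskedasticity-robust errors; clustering
    # by paper would be the natural refinement with real data.
    model = smf.ols("significant ~ C(method, Treatment(reference='RDD'))", data=sub)
    return model.fit(cov_type="HC1")

# Illustrative fake data:
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "z": np.abs(rng.standard_normal(2000) + 1.0),
    "method": rng.choice(["DID", "IV", "RCT", "RDD"], size=2000),
})
print(caliper_test(df).params)
```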
Desk rejections
SS at the 10% level is positively correlated with desk rejection (Table 4).
- Given that desk-rejected papers end up with fewer ex post citations (Card et al. 2019), desk rejection might have filtered out false-positive results at the 10% level. This is a bit of a leap: it does not have to be filtering of false positives. If most editors reject \(p<.1\) results (relative to \(p>.1\)), such a paper will hardly be published and becomes a “bad” paper with fewer ex post citations, or a “false positive.”
- Had desk rejection been a function of \(p>.1\), the estimate \(\hat{a}\) in Panel A of Table 4 would be negative (see the simulation sketch below). The authors suggest the co-editors add some value by picking up valuable information (p. 2989).
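To make the sign argument concrete: the claim is that if editors desk-rejected mainly insignificant results (\(p>.1\)), a regression of desk rejection on a 10%-significance dummy would yield a negative coefficient. Here is a minimal simulation sketch under that hypothetical editor behaviour; the rejection probabilities and the data are made up for illustration.

```python
# Minimal simulation of the sign argument: if desk rejection were driven by
# p > .1 (rejecting insignificant results more often), the coefficient on a
# 10%-significance dummy would be negative. All probabilities are made up.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(2)
z = np.abs(rng.standard_normal(10_000) + 1.0)
p = 2 * (1 - norm.cdf(z))                # two-sided p-values
sig10 = (p < 0.10).astype(float)         # significant at the 10% level

# Hypothetical editor behaviour: desk-reject 60% of insignificant papers but
# only 30% of papers significant at the 10% level.
desk_reject = (rng.random(10_000) < np.where(sig10 == 1, 0.30, 0.60)).astype(float)

fit = sm.OLS(desk_reject, sm.add_constant(sig10)).fit()
print(fit.params)  # the slope (coefficient on sig10) comes out around -0.3
```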
Reviewer recommendations
SS at the 5% level gets more positive reviewer recommendations, both weak and strong (Table 5), although the \(p\) values are not so small. (There is a typo in the Table 5 header.)
- “The mass of p-values around significance thresholds becomes more pronounced as we move from the first figure (rejections) to the third figure (strong positive recommendations).” (p. 2990) I cannot see this in Figure 4, though.
- “(A) weakly positive or strongly positive review were more likely to be marginally significant at the 10% level than negative reviews (though these estimates are not statistically significant).” (p. 2990)
- “(W)e include a vector of reviewer level covariates to account for potential correlation in (a) the assignment of manuscripts with marginally significant results to (b) reviewers with a higher propensity to review manuscripts positively. We find little difference in our estimates between columns (2) and (3), suggesting editors do not choose reviewers based on both the paper’s marginal significance and the reviewers propensity to review papers positively or negatively.” (p. 2990)
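The last quoted passage describes regressing reviewer recommendations (negative, weakly positive, strongly positive) on marginal-significance dummies, with and without reviewer-level covariates. Below is a minimal sketch of one such ordered-outcome specification; the simulated data, column names, and cut points are all hypothetical, and the paper’s exact estimator may differ.

```python
# Minimal sketch of an ordered-outcome regression of reviewer recommendations
# (negative < weakly positive < strongly positive) on a marginal-significance
# dummy and a reviewer-level covariate. Data and column names are hypothetical.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(3)
n = 3000
df = pd.DataFrame({
    "marginal10": rng.integers(0, 2, n),          # 1 if marginally significant at 10%
    "reviewer_positivity": rng.normal(size=n),    # hypothetical reviewer covariate
})
latent = 0.1 * df["marginal10"] + 0.5 * df["reviewer_positivity"] + rng.logistic(size=n)
df["recommendation"] = pd.cut(
    latent,
    bins=[-np.inf, -0.5, 1.0, np.inf],
    labels=["negative", "weakly positive", "strongly positive"],
)  # pd.cut returns an ordered categorical, which is what OrderedModel expects

model = OrderedModel(df["recommendation"],
                     df[["marginal10", "reviewer_positivity"]],
                     distr="logit")
print(model.fit(method="bfgs", disp=False).summary())
```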
Initial vs. final drafts of accepted papers
SS is not correlated with the differences between initial and final drafts (Table 6). The authors say this is unsurprising because reviews mostly comment on robustness or heterogeneity tables, which this paper does not process (the paper’s footnote 26).
- Positive point estimates with large \(p\) values on the Initial Submission indicator: the peer review process does not affect bunching.
Accepted vs. rejected drafts
SS is not associated with acceptance/rejection (Table 7). This is very, very surprising. Why do we see so few non-SS estimates in the main tables? Is it my bias that my attention is placed mostly on SS estimates?
- The peer review process does not exacerbate (or attenuate) \(p\)-hacking.
Published elsewhere vs. never published among rejected papers
If rejected, papers with SS at the 5% level are more likely to never be published (Table 8).
- So there is no graveyard of working papers with null results.
- “‘bad’ papers tend to \(p\)-hack more,” where “bad” = never published.
- If you \(p\)-hack (and the JHR rejects your paper), it indicates a higher likelihood of never being published. Or it indicates the weakness of your estimation, which peer review processes can detect, on average.
Robustness checks
- De-rounding (Tables A7, A10, A13, A16, A18).
  - Kranz and Pütz (2022b)’s method against coarse rounding, applied to the Caliper tests (see the sketch below).
- Inclusion of “ambiguous” test statistics (Tables A20-A24).
- Restriction to the first main table of each paper (Tables A25-A29).
- Wider bandwidth (Tables A30-A34).
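The de-rounding robustness checks follow Kranz and Pütz (2022b): reported coefficients and standard errors are rounded, so the implied \(z\)-statistics are only known up to an interval. Below is a minimal sketch of one way to resample \(z\)-statistics consistent with the rounding, assuming the number of reported decimals is known; it illustrates the idea only and is not the authors’ exact implementation.

```python
# Minimal sketch of de-rounding in the spirit of Kranz and Pütz (2022b): a
# coefficient and standard error reported with `decimals` digits pin down the
# true values only up to +/- half of the last reported digit, so resample
# uniformly within those bounds and recompute |coef|/se. Illustrative only.
import numpy as np

def deround_z(coef, se, decimals, n_draws=100, rng=None):
    """Resample z-statistics consistent with coefficients/SEs rounded to `decimals` digits."""
    rng = np.random.default_rng() if rng is None else rng
    coef = np.asarray(coef, dtype=float)
    se = np.asarray(se, dtype=float)
    half_step = 0.5 * 10.0 ** (-float(decimals))
    draws = []
    for _ in range(n_draws):
        coef_star = rng.uniform(coef - half_step, coef + half_step)
        se_star = rng.uniform(np.maximum(se - half_step, 1e-12), se + half_step)
        draws.append(np.abs(coef_star) / se_star)
    return np.array(draws)  # shape: (n_draws, number of statistics)

# Example: a coefficient of 0.20 with SE 0.10, both reported to 2 decimals,
# is consistent with any z roughly between 0.195/0.105 and 0.205/0.095.
z_draws = deround_z([0.20], [0.10], decimals=2, n_draws=5,
                    rng=np.random.default_rng(0))
print(z_draws.ravel())
```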