【Michael Stuart】How Does Experimental Philosophy Fare Under the P-Curve?

Mike Stuart, Edouard Machery and David Colaço

There are different reasons we might be cautious about the results of experimental philosophy. For example, some have criticized the way that x-phi studies are designed (Cullen 2010; Woolfolk 2011, 2013; Scholl ms), while others have focused on the extent to which certain studies fail to replicate (Seyedsayamdost 2015a, 2015b; Kim and Yuan 2015; Adleberg, Thompson, and Nahmias 2015; Machery et al. 2017).

These criticisms have mostly led to a productive discussion that has improved the methodology of x-phi. But there is one kind of questionable research practice that is very hard to identify, either by inspecting study design or by performing replications. This is “p-hacking.” P-hacking occurs when researchers go looking for significant p-values instead of deciding in advance which sorts of outliers to ignore, how much evidence to collect, which measures to use, which covariates to use, etc. P-hackers tailor these decisions to increase their chances of finding significant p-values, which is very dangerous because “p-hacking can allow researchers to get most studies to reveal significant relationships between truly unrelated variables” (Simonsohn et al. 2014, 535; see also Simmons et al. 2011).

In a paper just published in Analysis, David Colaço, Edouard Machery and I present the results of an exhaustive inquiry into whether and where p-hacking has occurred in experimental philosophy.

We created a corpus of 365 experimental philosophy studies, which includes pretty much all the studies in x-phi from 1997 to 2016. For each study, we identified the main statistic, its p-value, the sample size, and so on, and then analyzed the set of reported p-values by means of a “p-curve.” A p-curve takes the p-value of the main finding from each study in a set and displays the percentage of the statistically significant p-values (those below 0.05) that fall at each value.
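To make the idea concrete, here is a minimal sketch in Python of how a p-curve could be tabulated. It is not the code we used for the paper: the function name `p_curve`, the five bins of width 0.01, and the example p-values are all assumptions made up for illustration.

```python
from bisect import bisect_left

# Upper edges of the bins (0, .01], (.01, .02], ..., (.04, .05).
# Assumption: five bins of width .01, chosen for illustration.
BIN_EDGES = (0.01, 0.02, 0.03, 0.04, 0.05)


def p_curve(p_values, alpha=0.05):
    """Return the percentage of significant p-values falling in each bin."""
    # Only statistically significant results enter a p-curve.
    significant = [p for p in p_values if 0 < p < alpha]
    if not significant:
        raise ValueError("no significant p-values to tabulate")
    counts = [0] * len(BIN_EDGES)
    for p in significant:
        # Index of the first bin whose upper edge is >= p.
        counts[bisect_left(BIN_EDGES, p)] += 1
    return [100.0 * c / len(significant) for c in counts]


# Hypothetical p-values, one per study (not data from our corpus).
example = [0.001, 0.002, 0.003, 0.008, 0.012, 0.021, 0.034, 0.049, 0.20]
curve = p_curve(example)
for edge, share in zip(BIN_EDGES, curve):
    print(f"p <= {edge:.2f}: {share:.1f}%")
```

In this made-up example the non-significant result (0.20) is dropped, and half of the remaining p-values land in the lowest bin.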

Once we have the p-curve, we can use its shape as an indicator of whether p-hacking has taken place. If the curve slopes down as we move to the right (because most of the p-values are very low), then at least some of the hypotheses tested are true. If it is flat (a uniform straight line), then none of the effects tested are real (all the null hypotheses are true). If many p-values are at or near 0.05 (so the curve slopes up as we move to the right), then this indicates that researchers have engaged in p-hacking.
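One simple way to put a number on "slopes down to the right" is a binomial test: if none of the effects were real, significant p-values would be spread evenly between 0 and 0.05, so about half of them should fall below 0.025. The sketch below illustrates that reasoning only. The function name and the example values are hypothetical, and it is a simplification that should not be read as the full analysis reported in our article.

```python
from math import comb


def right_skew_binomial_p(p_values, alpha=0.05):
    """One-sided binomial test for right skew among significant p-values.

    Under the null of no real effects, each significant p-value has a
    50% chance of falling below alpha / 2. A small return value suggests
    more low p-values than a flat curve would produce.
    """
    significant = [p for p in p_values if 0 < p < alpha]
    n = len(significant)
    if n == 0:
        raise ValueError("no significant p-values to test")
    low = sum(1 for p in significant if p < alpha / 2)
    # P(X >= low) for X ~ Binomial(n, 0.5), computed exactly.
    return sum(comb(n, k) for k in range(low, n + 1)) / 2 ** n


# Hypothetical p-values (not data from our corpus): 7 of 8 are below .025.
example = [0.001, 0.002, 0.003, 0.008, 0.012, 0.021, 0.024, 0.049]
skew_p = right_skew_binomial_p(example)
print(f"binomial test for right skew: p = {skew_p:.4f}")
```

Counting the p-values above 0.025 instead would give the mirror-image test for a curve that slopes up to the right, the shape associated with p-hacking.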

So what are the results? Here we describe only some of the findings reported in our article. Overall, the news is definitely good for experimental philosophy, although with a few important caveats!

Taken as a whole, the corpus of experimental philosophy studies has a p-curve that suggests that at least some of the effects investigated by experimental philosophers are real:

P-curve for the entire corpus of experimental philosophy studies

This is not to say that all is rosy: x-phi does have some skeletons in its closet. Specifically, the p-curve for studies performed before 2006 has a shape that indicates some p-hacking. Overall, the situation becomes drastically better after 2006, though we still find some worrying evidence of p-hacking in experimental ethics and experimental epistemology.

Interestingly, there is no significant difference between the so-called negative and positive programs of experimental philosophy: neither shows evidence of p-hacking.

One thing we did not do in the paper, but which we will do now for this blog post, is p-curve the publications of particular researchers (with permission, of course!). We report them here, just for fun. As you can see, none of them show evidence of p-hacking.

Joshua Knobe’s p-curve:

Edouard Machery:

Thomas Nadelhoffer:

Shaun Nichols:

David Rose:

Stephen Stich:

Justin Sytsma:

John Turri:

The app we used to create these p-curves can be accessed here.

(From X-phiblog)

References

Adleberg T., Thompson M., Nahmias E. 2015. Do men and women have different philosophical intuitions? Further data. Philosophical Psychology 28: 615–41.

Cullen S. 2010. Survey-driven romanticism. Review of Philosophy and Psychology 1: 275–96.

Kim M., Yuan Y. 2015. No cross-cultural differences in the Gettier car case intuition: a replication study of Weinberg et al. 2001. Episteme 12: 355–61.

Machery E., Stich S.P., Rose D., Chatterjee A., Karasawa K., Struchiner N., Sirker S., Usui N., Hashimoto T. 2017. Gettier across cultures. Noûs 51: 645–64.

Scholl B. MS. Two kinds of experimental philosophy (and their experimental dangers).

Seyedsayamdost H. 2015a. On normativity and epistemic intuitions: failure of replication. Episteme 12: 95–116.

Seyedsayamdost H. 2015b. On gender and philosophical intuition: failure of replication and other negative results. Philosophical Psychology 28: 642–73.

Simmons J.P., Nelson L.D., Simonsohn U. 2011. False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science 22: 1359–66.

Simonsohn U., Simmons J.P., Nelson L.D. 2014. P-curve: a key to the file-drawer. Journal of Experimental Psychology: General 143: 534–47.

Woolfolk R.L. 2011. Empirical tests of philosophical intuitions. Consciousness and Cognition 20: 415–6.

Woolfolk R.L. 2013. Experimental philosophy: a methodological critique. Metaphilosophy 44: 79–87.