A Post-Publication Peer-Review (3PR) of:
Time, Money, and Morality by Gino, F., & Mogilner, C. (online, 2013). Time, Money, and Morality. Psychological Science. DOI: 10.1177/0956797613506438
The Time, Money, and Morality article has been HIBAR-ed on Twitter and the Blogosphere (e.g., by Rolf Zwaan and Greg Francis) and the discussion seems to revolve around the validity of the inferences based on p-values close to 0.05 (e.g., they raise suspicions of p-hacking).
In short, the article reports 4 experiments testing 2 core postulates:
* Postulate 1: Priming Money activates self-interest and increases unethical behaviour
* Postulate 2: Priming Time activates self-reflection and decreases unethical behaviour
Unethical behaviour is operationalised as taking the opportunity to cheat on a task.
Priming methods vary across experiments, so do the tasks that allow for an opportunity to cheat.
In Experiment 1 the two postulates are tested. Experiments 2-4 assess the role of self-reflection in cheating behaviour, which is operationalised differently across experiments.
Hold on to your P-curves for a moment… Back to the basics!
In this Post-Publication Peer-Review (3PR) I demonstrate that there is indeed some cause for concern about the way these results are presented and interpreted. Was it p-hacking? … I don’t know and maybe I don’t even care. To me this is an example of sloppy science: p-hacked or not, these results were allowed to be published by expert peers. It is more relevant to discuss the broken system of quality control that should have picked up on at least some of the following issues:
- Important information is missing:
  - in general (e.g., number of subjects per condition, sample size determination)
  - selectively across experiments (e.g., participants per cell, reporting of effect sizes)
- The analyses used on frequency data are inappropriate
- Invalid or biased inferences and oddities:
  - No adjustments for multiple comparisons
  - “Marginal significance” shifts ad hoc between 0.1 > p > 0.05
  - Obvious intervening/mediator variable is omitted: Accuracy of performance
  - No explanation of (conflicting) results across experiments (e.g., variation in amount of cheating)
  - No explanation for failing of random assignment to design levels (none of the experiments have equal N samples)
The article under scrutiny is by no means exceptional with respect to such issues. Moreover, the way frequency / proportion data are analysed in psychological science is generally awkward and most of the time completely wrong.
I will 3PR the data based on the information in the article and comment on the results:
- Analysis of Proportion / frequency data
- Analysis of Extent of Cheating data
- HAPPE-ing: __H__ypothesising __A__fter __P__ost __P__ublication __E__valuation
I. Analysis of proportion / frequency data
Some concerns can be raised about the significant differences between various conditions in proportion Cheating reported in the 4 experiments.
First and foremost, no corrections for multiple comparisons were conducted. Should one do so, just 2 significant proportion differences remain: Time in Experiment 1 & 4. In Experiment 3, the sample difference No Mirror: Money - Time was marginally significant in the 2nd significant digit (original: p = 0.015, adjusted \(\alpha\) = 0.013, Bonferroni).
Second, no continuity correction is applied, although these proportions are calculated from discrete numbers (participants). If a continuity correction is applied, 2-3 significant differences remain, depending on the \(\alpha\)-level chosen:
| Exp. | Contrast | Published | Continuity corrected | Bonferroni adjusted |
|---|---|---|---|---|
| 2 | Int: Money-Time | <.01 | 0.1493 | ~ 0.0125 |
| 2 | Per: Money-Time | >.05 | 1 | > 0.0125 |
| 2 | Money: Int-Per | <.03 | 0.0856 | > 0.0125 |
| 2 | Time: Int-Per | >.05 | 1 | > 0.0125 |
| 3 | Mir: Money-Time | >.05 | 0.7996 | > 0.0125 |
| 3 | NoM: Money-Time | <.003 | 0.0293 | ~ 0.0125 |
| 3 | Money: Mir-NoM | >.05 | 0.0537 | > 0.0125 |
| 3 | Time: Mir-NoM | >.05 | 1 | > 0.0125 |
| Number sig. results | | 9 | 3 | Original: 4, Continuity: 2 |
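To make the two corrections concrete, here is a minimal Python sketch (standard library only) of a chi-square test on a 2x2 table with and without Yates' continuity correction, compared against a Bonferroni-adjusted \(\alpha\) for 4 contrasts. The cell counts are hypothetical and chosen only for illustration, the function names are my own, and this is not necessarily the routine that produced the values in the table.

```python
import math


def chi2_p_df1(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def two_proportion_test(cheat_a, n_a, cheat_b, n_b, continuity=True):
    """Pearson chi-square test on a 2x2 table, optionally with Yates' correction."""
    a, b = cheat_a, n_a - cheat_a   # sample A: cheaters, non-cheaters
    c, d = cheat_b, n_b - cheat_b   # sample B: cheaters, non-cheaters
    n = n_a + n_b
    diff = abs(a * d - b * c)
    if continuity:
        diff = max(0.0, diff - n / 2.0)  # Yates' continuity correction
    stat = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, chi2_p_df1(stat)


# Hypothetical counts for illustration: 20 of 30 vs. 11 of 31 participants cheat.
N_TESTS = 4                   # four contrasts per experiment, as in the table
ALPHA_ADJ = 0.05 / N_TESTS    # Bonferroni-adjusted alpha

_, p_uncorrected = two_proportion_test(20, 30, 11, 31, continuity=False)
_, p_corrected = two_proportion_test(20, 30, 11, 31, continuity=True)

for label, p in [("uncorrected", p_uncorrected), ("continuity corrected", p_corrected)]:
    verdict = "significant" if p < ALPHA_ADJ else "not significant"
    print(f"{label}: p = {p:.4f} -> {verdict} at alpha = {ALPHA_ADJ}")
```

The continuity correction always shrinks the test statistic, so the corrected p-value can only be larger than the uncorrected one; combined with the smaller adjusted \(\alpha\), borderline results drop out quickly.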
This calls for a more appropriate analysis of frequency data:
- Log-linear analysis of observed cell frequencies
- Exact odds ratios of 2x2 sub-tables to test hypotheses using Effect Size CIs
(Cheating can be considered a dichotomous response, so logistic regression could also be used, see III. HAPPE-ing.)
Experiments 2 & 3 do not list n per condition, so the most likely values for n are assumed, using three criteria: 1. n * %Cheat is closest to an integer value; 2. the n are as equal as possible; 3. the n add up to the total N:
| Prime | Assessment | Ncond * %Cheat = Ncheat (deviation) |
|---|---|---|
| Money | Personality | 36 |
| Time | Personality | 35 |
| Money | Intelligence | 38 |
| Time | Intelligence | 33 |

| Prime | Assessment | Ncond * %Cheat = Ncheat (deviation) |
|---|---|---|
| Money | Mirror | 31 |
| Time | Mirror | 28 |
| Money | No Mirror | 30 |
| Time | No Mirror | 31 |
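The three criteria for guessing the cell sizes can be sketched as a small brute-force search. This is a hypothetical reconstruction in Python, not the procedure actually used for the tables above: the function name `infer_cell_sizes`, the search window `spread`, and the example proportions are all assumptions made for illustration.

```python
from itertools import product


def infer_cell_sizes(total_n, cheat_props, spread=4):
    """Guess cell sizes from a total N and the reported proportion of cheaters per cell.

    Candidates must add up to total_n; they are ranked first by how close
    n * proportion is to an integer, then by how equal the cells are.
    """
    k = len(cheat_props)
    base = total_n // k
    candidates = range(max(1, base - spread), base + spread + 1)
    best_score, best_cells = None, None
    for sizes in product(candidates, repeat=k - 1):
        last = total_n - sum(sizes)          # forces the cells to add up to total N
        if last < 1:
            continue
        cells = sizes + (last,)
        deviation = sum(abs(n * p - round(n * p)) for n, p in zip(cells, cheat_props))
        imbalance = max(cells) - min(cells)
        score = (round(deviation, 6), imbalance)
        if best_score is None or score < best_score:
            best_score, best_cells = score, cells
    return best_cells


# Hypothetical example: 29 participants over 3 cells with 30%, 25% and 27% cheaters.
print(infer_cell_sizes(29, [0.30, 0.25, 0.27]))
```

Because reported percentages are rounded, several allocations can fit almost equally well, which is why the resulting n should be treated as assumptions rather than recovered data.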
1. Log-linear analysis of observed cell frequencies
Log-linear analysis, or Poisson regression using the generalised linear model, can be used to test whether relationships exist among the variables in a multi-way contingency table. Here I analyse the number of participants in each cell of the design: The observed frequencies take the role of the dependent variable and the levels of the design factors such as Cheating are considered the levels of independent variables (another option would have been a logistic / probit regression with Cheating as the dependent binary / proportion variable).
Two types of result are given for each experiment:
First, a table listing deviance tests for the full (saturated) model. The analysis starts with the NULL model (all frequencies are equal) in the first row. Each subsequent row lists what happens to the deviance (of the model in the previous row) when a factor is added. A significant drop in deviance means adding the factor to the model contributes to predicting the difference between expected and observed frequencies. For hints of corroboration of the hypotheses reported in the paper, significant interactions between a design factor and Cheating are necessary.
Second, a mosaic plot is displayed. This is a graphical representation of the conditional cell frequencies. The mosaic plot also indicates which residual frequencies (observed - expected) are significantly below (red) or above (blue) the expected frequencies (residuals are interpretable as a Z-score). The coloured cells contribute most to a high and possibly significant \(\chi^2\) value.
Note: The significance of the change in deviance can depend on the order in which factors are added to the model and is not the same as a significant beta weight in a regression model.

>  "Experiment 1"

Analysis of Deviance Table. Model: poisson, link: log. Response: Count. Terms added sequentially (first to last).

| Term | Df | Deviance | Resid. Df | Resid. Dev | Pr(>Chi) |
|---|---|---|---|---|---|
| NULL | | | 5 | 24.767 | |
| Cheating | 1 | 9.3328 | 4 | 15.434 | 0.0022509 ** |
| Prime | 2 | 0.0205 | 2 | 15.414 | 0.9898129 |
| Cheating:Prime | 2 | 15.4136 | 0 | 0.000 | 0.0004497 *** |

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

>  "Experiment 2"

Analysis of Deviance Table. Model: poisson, link: log. Response: Count. Terms added sequentially (first to last).

| Term | Df | Deviance | Resid. Df | Resid. Dev | Pr(>Chi) |
|---|---|---|---|---|---|
| NULL | | | 7 | 19.6365 | |
| Cheating | 1 | 13.8608 | 6 | 5.7757 | 0.0001969 *** |
| Prime | 1 | 0.2536 | 5 | 5.5221 | 0.6145539 |
| Test | 1 | 0.0000 | 4 | 5.5221 | 1.0000000 |
| Cheating:Prime | 1 | 1.5057 | 3 | 4.0164 | 0.2197998 |
| Cheating:Test | 1 | 2.5348 | 2 | 1.4816 | 0.1113599 |
| Prime:Test | 1 | 0.0307 | 1 | 1.4509 | 0.8609311 |
| Cheating:Prime:Test | 1 | 1.4509 | 0 | 0.0000 | 0.2283780 |

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

>  "Experiment 3"

Analysis of Deviance Table. Model: poisson, link: log. Response: Count. Terms added sequentially (first to last).

| Term | Df | Deviance | Resid. Df | Resid. Dev | Pr(>Chi) |
|---|---|---|---|---|---|
| NULL | | | 7 | 11.4971 | |
| Cheating | 1 | 2.1397 | 6 | 9.3574 | 0.14353 |
| Prime | 1 | 0.0333 | 5 | 9.3241 | 0.85513 |
| Test | 1 | 0.0333 | 4 | 9.2907 | 0.85513 |
| Cheating:Prime | 1 | 4.2369 | 3 | 5.0538 | 0.03955 * |
| Cheating:Test | 1 | 2.8451 | 2 | 2.2086 | 0.09165 . |
| Prime:Test | 1 | 0.4973 | 1 | 1.7113 | 0.48069 |
| Cheating:Prime:Test | 1 | 1.7113 | 0 | 0.0000 | 0.19081 |

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

>  "Experiment 4"

Analysis of Deviance Table. Model: poisson, link: log. Response: Count. Terms added sequentially (first to last).

| Term | Df | Deviance | Resid. Df | Resid. Dev | Pr(>Chi) |
|---|---|---|---|---|---|
| NULL | | | 5 | 21.269 | |
| Cheating | 1 | 4.2195 | 4 | 17.049 | 0.0399621 * |
| Prime | 2 | 0.2876 | 2 | 16.762 | 0.8660723 |
| Cheating:Prime | 2 | 16.7615 | 0 | 0.000 | 0.0002292 *** |

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion log-linear analysis:
This alternative, and in my opinion more appropriate, analysis is in agreement with the results after correction for multiple comparisons and continuity:
- The mosaic plots show that there may be some unexpected factors driving the “effects” reported in the paper:
  * In Experiment 1 & 4 it is not so much the observed frequency of people that did cheat, but the number of participants that did not cheat that deviates from the expected frequencies based on table margins. Money prime caused fewer people to NOT cheat, whereas the Time prime caused more people to NOT cheat
- If there is a difference in amount of Cheating between samples, it is likely a “main effect” between the Money prime (Cheating:Prime interaction); it is found to cause a significant drop in deviance in Experiments 1, 3 and 4.
- Experiment 2 stands out, because observed differences in Cheating are unlikely due to chance, but none of the other factors contribute to explain differences between expected and observed frequencies.

The point about the mosaic plots is not just semantics or methodologists’ nit-picking. What it tells us is that, e.g. in the mosaic plot Table.1.1, among the observed frequencies of CheatYES, the cell Money does not stand out much from Control from what may be expected by chance; for CheatNO on the other hand, the cell Money does stand out as different.
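Each p-value in the deviance tables above is the upper tail of a \(\chi^2\) distribution evaluated at the drop in deviance, with the listed Df. As a quick sanity check, the following Python sketch (standard library only) recomputes a few of them from the reported deviances. The closed forms for 1 and 2 degrees of freedom are standard results; the function name is my own.

```python
import math


def chi2_upper_tail(stat, df):
    """Upper-tail chi-square probability; closed forms for df = 1 and df = 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("this sketch only handles df = 1 or df = 2")


# (label, drop in deviance, df, p-value as printed in the deviance tables)
reported = [
    ("Exp. 1 Cheating", 9.3328, 1, 0.0022509),
    ("Exp. 1 Cheating:Prime", 15.4136, 2, 0.0004497),
    ("Exp. 3 Cheating:Prime", 4.2369, 1, 0.03955),
    ("Exp. 4 Cheating:Prime", 16.7615, 2, 0.0002292),
]

recomputed = {label: chi2_upper_tail(dev, df) for label, dev, df, _ in reported}
for label, dev, df, p_reported in reported:
    print(f"{label}: deviance {dev} on {df} df -> p = {recomputed[label]:.7f} "
          f"(reported {p_reported})")
```

With a Bonferroni-adjusted \(\alpha\) the same comparison applies: the Cheating:Prime term in Experiments 1 and 4 stays far below any reasonable threshold, whereas the one in Experiment 3 does not.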
2. Exact odds ratios of 2x2 subtables to test hypotheses using Effect Size CIs
Effect Size Confidence Intervals:
To get a clearer idea about the significance of the differences between cells I calculate confidence intervals around the effect size associated with contingency tables. The CIs in Figure 1 below are based on the exact Odds Ratio (using the non-central hypergeometric distribution) for a 2x2 sub-table of the full design obtained from Fisher's Exact Test, testing against \(H_0: OR = 1\).
>  "Figure 1. Exact log Odds Ratio's of 2x2 tables comparing frequency of Cheating between independent samples in each experiment."
Here, the Confidence Levels have been adjusted to account for the fact that 3 (EXP1&4) and 4 (EXP2&3) subtables of the full design were compared (1-(0.05 / #tests)). The exact p-value from Fisher’s exact test reported in the Figure was multiplied by the number of comparisons in each experiment.
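The adjustment described above can be sketched in Python (standard library only). The two-sided p-value follows the usual definition for Fisher's exact test (sum of all table probabilities no larger than that of the observed table); the cell counts are hypothetical. Note two simplifications: the sketch reports the sample odds ratio rather than the conditional maximum-likelihood estimate, and it does not compute the exact CI from the non-central hypergeometric distribution.

```python
import math


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def bonferroni(p, n_tests, alpha=0.05):
    """Return the adjusted p-value and the matching confidence level 1 - alpha / n_tests."""
    return min(1.0, p * n_tests), 1 - alpha / n_tests


# Hypothetical 2x2 sub-table: rows = two primes, columns = cheated yes / no.
a, b, c, d = 20, 10, 11, 20
sample_or = (a * d) / (b * c)
p_exact = fisher_exact_two_sided(a, b, c, d)
p_adjusted, conf_level = bonferroni(p_exact, n_tests=4)
print(f"OR = {sample_or:.2f}, exact p = {p_exact:.4f}, "
      f"adjusted p = {p_adjusted:.4f}, CI level = {conf_level:.4f}")
```

With 3 comparisons the confidence level becomes 1 - 0.05/3 and with 4 comparisons 1 - 0.05/4 = 0.9875, so the intervals are wider than ordinary 95% CIs.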
Conclusion Proportion data
- If there is an effect, it exists as a “main-effect” difference between the Time primed samples in Experiment 1 and 4.
- Experiment 3 No Mirror: Money - Time is a marginal case.
- Experiment 2 did not yield any substantial effects.
- 4-5 out of 7 statistical inferences in the paper that are made based on proportion data should be considered invalid.
II. Analysis of extent of cheating
The extent of Cheating concerns the difference between actual accuracy (which is not provided as a result) and the accuracy reported by a participant.
Experiments 1-3 report analyses of extent of Cheating including means and SDs. Sample size assumptions for Experiments 2 and 3 are the same as above.
Compare Cohen’s d CIs
I created CIs around the effect sizes based on the means and SD reported for Experiment 1-3 using the
>  "Figure 2. Cohen's d with exact CIs comparing extent of Cheating between independent samples in experiment 1-3."
Conclusion Extent of Cheating
The pattern is the same as in the previous analyses:
- Experiment 1 shows a clear effect between
- Experiment 3 No Mirror: Money - Time is again a close call
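For readers who want to redo this kind of comparison from reported means and SDs, here is a minimal Python sketch (standard library only). It is an approximation, not the exact method behind Figure 2: the interval uses the common large-sample standard error of d with a normal quantile instead of the non-central t distribution, and the input values are hypothetical.

```python
import math
from statistics import NormalDist


def cohens_d_ci(m1, sd1, n1, m2, sd2, n2, conf=0.95):
    """Cohen's d for two independent samples with an approximate (large-sample) CI."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
    d = (m1 - m2) / pooled_sd
    # Large-sample standard error of d (normal approximation, not the exact CI).
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return d, d - z * se, d + z * se


# Hypothetical summary statistics for two primed samples (not from the article).
d, lower, upper = cohens_d_ci(m1=2.4, sd1=1.9, n1=30, m2=1.3, sd2=1.6, n2=31,
                              conf=1 - 0.05 / 4)  # Bonferroni-adjusted level
print(f"d = {d:.2f}, CI = [{lower:.2f}, {upper:.2f}]")
```

An interval that includes 0 at the adjusted confidence level corresponds to a contrast that does not survive the correction for multiple comparisons.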
III. HAPPE-ing (Hypothesising After Post-Publication Evaluation)
Should reviewers have noticed these issues with data analysis?
Yes, they should have!
Even without re-analysing the published data as I have done here, the conclusions by the authors can be questioned based on a comparison of very elementary results:

> Across four experiments, using different primes and a variety of measures and tasks, we consistently found that shifting people’s attention to time decreases dishonesty. Priming time makes people reflect on who they are, and this self-reflection reduces their likelihood of behaving dishonestly.

The clue is to compare the results across the 4 experiments and evaluate whether it is valid to infer that the core postulates have been corroborated. The designs and materials are slightly different each time, but if outcomes (e.g., proportion cheating behaviour) vary systematically with one or more of the experimental differences, there may be another variable at work here.
One result that begs explanation is the drop in proportion Cheating in all the samples of Experiment 2 when compared to the other experiments. What is special about the procedure and methods? Regrettably more than 1 potential intervening factor changes with respect to Experiment 1.
A second odd omission in the interpretation of the results is the level of accuracy achieved by participants. In Experiments 1-3, the urge to cheat must have been less when a participant had achieved 90% accuracy. Experiment 4 is somewhat different in that the cheating opportunity concerns one “bottleneck” problem that is difficult to solve, but has to be correct in order to make other more easily solvable problems count in adding to the final reward. Here, accuracy could have an opposite effect in which less accurate participants cheat less. If 0 or only 1 extra item past the “bottleneck” item were solved, a participant might be less inclined to cheat than a participant who solved every problem except for the “bottleneck” item.
What is mediating what?
The figure below shows the interaction between the maximal financial incentive that could be awarded and the proportion cheating for each prime and experimental condition (indicating whether a mediator variable was manipulated in addition to being exposed to a prime). Note that the Intelligence and the No Mirror condition of Experiments 2 and 3 respectively are considered similar to Experiment 1 and 4, that is, they reflect a condition in which Self-reflection was not induced by any other means than priming:
This relationship can be tested in a generalised linear model, of course being fully aware that this is exploratory HAPPE-ing. I assume the samples from each experiment are independent and use the number of cheaters vs. no cheaters as the dependent binomial variable. The model contains only those effects for which data are available (e.g., no interactions with both
A generalised linear mixed model (GLMM) with sample ID as a random effect gives similar results.

Call: glm(formula = cbind(CheatYES, CheatNO) ~ Reward + Prime + Mediator + Reward * Prime + Reward * Mediator, family = binomial, data = reward)

Deviance Residuals:

| Min | 1Q | Median | 3Q | Max |
|---|---|---|---|---|
| -1.1534 | -0.6946 | -0.1216 | 0.2508 | 1.9564 |

Coefficients:

| Term | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -0.44952 | 0.21947 | -2.048 | 0.04054 * |
| Reward | 0.01107 | 0.02186 | 0.506 | 0.61253 |
| PrimeNone | 0.58682 | 0.39730 | 1.477 | 0.13967 |
| PrimeMoney | 0.60398 | 0.28243 | 2.139 | 0.03247 * |
| MediatorSelf-reflection | -0.81281 | 0.31474 | -2.583 | 0.00981 ** |
| Reward:PrimeNone | 0.01672 | 0.03593 | 0.465 | 0.64158 |
| Reward:PrimeMoney | 0.06976 | 0.03270 | 2.133 | 0.03291 * |
| Reward:MediatorSelf-reflection | -0.01894 | 0.04340 | -0.436 | 0.66257 |

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 76.292 on 13 degrees of freedom

Residual deviance: 11.035 on 6 degrees of freedom

AIC: 82.478

Number of Fisher Scoring iterations: 4
>  "Null-model deviance test: p < 1.33525644154704e-11"
In the table above the model Intercept corresponds to the odds of Cheating compared to the Null-model when the predictors have the values: Reward = 0. Compared to the overall probability of observing Cheating behaviour, it thus seems that when the Time prime is presented without an induction of Self-reflection and a financial reward incentive, the odds of
This appears to be a corroboration of the second postulate, but note that in this analysis (just as in the previous analyses), there is no real difference between the Time prime and prime = None. The standard errors around these parameters are quite high. A clearer picture emerges when the Intercept is defined as Reward = 0 and the Odds Ratios are compared (exponentiation of the parameter estimates):

>  "Odds Ratios compared to Prime = None, with profile likelihood CI.95"

| Term | OR | 2.5 % | 97.5 % |
|---|---|---|---|
| (Intercept) | 1.15 | 0.60 | 2.21 |
| Reward | 1.03 | 0.97 | 1.09 |
| PrimeTime | 0.56 | 0.25 | 1.21 |
| PrimeMoney | 1.02 | 0.47 | 2.21 |
| MediatorSelf-reflection | 0.44 | 0.24 | 0.81 |
| Reward:PrimeTime | 0.98 | 0.92 | 1.05 |
| Reward:PrimeMoney | 1.05 | 0.98 | 1.14 |
| Reward:MediatorSelf-reflection | 0.98 | 0.90 | 1.07 |
The odds ratios in the table above are multiplicative changes to the odds that Cheating = 1 when the predictor increases by 1 unit. So an OR < 1 will decrease the odds of observing Cheating behaviour and an OR > 1 will increase it. The 95% CIs are based on the profile likelihood and show that in most cases the effect covers a range below and above 1. The range for the effect of Self-Reflection is always below 1.
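The link between the two tables can be verified by hand. The glm output lists PrimeNone and PrimeMoney but no PrimeTime term, so I assume Time was the reference level there; re-expressing the same coefficients with Prime = None as reference and exponentiating reproduces the odds ratio column. A minimal Python sketch (standard library only; the dictionary names are my own, and the profile likelihood CIs cannot be recovered this way):

```python
import math

# Estimates copied from the glm output (assumed reference level: Prime = Time).
coef = {
    "(Intercept)": -0.44952,
    "Reward": 0.01107,
    "PrimeNone": 0.58682,
    "PrimeMoney": 0.60398,
    "MediatorSelf-reflection": -0.81281,
    "Reward:PrimeNone": 0.01672,
    "Reward:PrimeMoney": 0.06976,
    "Reward:MediatorSelf-reflection": -0.01894,
}

# The same model with Prime = None as the reference level.
releveled = {
    "(Intercept)": coef["(Intercept)"] + coef["PrimeNone"],
    "Reward": coef["Reward"] + coef["Reward:PrimeNone"],
    "PrimeTime": -coef["PrimeNone"],
    "PrimeMoney": coef["PrimeMoney"] - coef["PrimeNone"],
    "MediatorSelf-reflection": coef["MediatorSelf-reflection"],
    "Reward:PrimeTime": -coef["Reward:PrimeNone"],
    "Reward:PrimeMoney": coef["Reward:PrimeMoney"] - coef["Reward:PrimeNone"],
    "Reward:MediatorSelf-reflection": coef["Reward:MediatorSelf-reflection"],
}

# Odds ratio = exp(log-odds coefficient), rounded as in the table.
odds_ratios = {term: round(math.exp(beta), 2) for term, beta in releveled.items()}
for term, value in odds_ratios.items():
    print(f"{term:32s} {value:.2f}")
```

The Self-reflection terms are unaffected by the change of reference level, which is why their odds ratios are simply the exponentiated estimates from the first table.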
One can interpret the modelled relationship between these variables as follows:
* There is a weak positive association between the Maximal Financial Reward and the Probability of Cheating
* The association changes with the value of Prime, becoming stronger when Money is primed, weaker when Time is primed
* The induction of Self-reflection does not cause the association to change; it changes the intercept, the base-line Probability of Cheating at Reward = 0
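These three points can be checked numerically by turning the glm coefficients into predicted probabilities with the inverse logit. The Python sketch below (standard library only) assumes, as the coefficient names suggest, that Prime = Time without Self-reflection is the reference cell; the reward values 0, 10 and 20 are hypothetical illustration points, not the rewards used in the experiments.

```python
import math

# Estimates copied from the glm output (assumed reference: Prime = Time, no Self-reflection).
COEF = {
    "(Intercept)": -0.44952,
    "Reward": 0.01107,
    "PrimeNone": 0.58682,
    "PrimeMoney": 0.60398,
    "MediatorSelf-reflection": -0.81281,
    "Reward:PrimeNone": 0.01672,
    "Reward:PrimeMoney": 0.06976,
    "Reward:MediatorSelf-reflection": -0.01894,
}


def predicted_probability(reward, prime, self_reflection=False):
    """Predicted probability of cheating from the fitted log-odds model."""
    eta = COEF["(Intercept)"] + COEF["Reward"] * reward
    if prime in ("None", "Money"):
        eta += COEF["Prime" + prime] + COEF["Reward:Prime" + prime] * reward
    elif prime != "Time":
        raise ValueError("prime must be 'Time', 'None' or 'Money'")
    if self_reflection:
        eta += (COEF["MediatorSelf-reflection"]
                + COEF["Reward:MediatorSelf-reflection"] * reward)
    return 1.0 / (1.0 + math.exp(-eta))  # inverse logit


for prime in ("Time", "None", "Money"):
    for reflect in (False, True):
        probs = [predicted_probability(r, prime, reflect) for r in (0, 10, 20)]
        row = ", ".join(f"{p:.2f}" for p in probs)
        print(f"Prime={prime:5s} Self-reflection={reflect!s:5s} P(cheat) at 0/10/20: {row}")
```

In such a printout the Money rows rise fastest with Reward, while switching Self-reflection on mainly shifts each row downwards rather than changing its slope.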
A graphical representation of the model predictions more clearly reveals this relationship:
Conclusions, Discussion and further HAPPE-ing
- The significant results between Money in Experiments 1 and 4 probably arise due to the increase in Probability of Cheating when there is a financial reward and
- It is unlikely there are any other “real” differences in these data except for the induction of Self-reflection: Model predictions show it decreases the Probability of Cheating by the same amount for different primes
- Note that there were no actual data points for
- The missing predictors in the Probability of Cheating analysis are the actual and reported accuracy of the performance (amount of correctly solved problems and money received respectively). These values cannot be inferred from the extent of cheating analyses. It seems reasonable to assume in most experiments there was less incentive to engage in Cheating by participants who were more accurate.
- This brings up the question of whether the effects are driven by some sort of Speed-Accuracy instruction: Naturally, Time = Money, but taking the time to solve the problems may lead to higher accuracy and less incentive to cheat, likewise a focus on getting as many answers as possible may introduce errors and promote cheating.
In science there is a moral obligation to do the best one can to be as accurate as possible and usually this means it is wise to be as modest as possible about one’s scientific claims. I am not an expert in this field, but the sheer number of questions that can be raised about the validity of the inferences made in this paper makes one wonder who the peers were that achieved consensus about the credibility of this research and what their area of expertise was.
I am not saying this is irrelevant, or poor research; the two effects that survive the scrutiny of 3PR are certainly interesting. I am just a little worried this paper says more about the morality of contemporary scientific publishing than the scientific study of moral behaviour.
Some notes about this file:
- This file was created using Markdown in RStudio: Unless otherwise indicated in the code blocks (e.g., by require), the basic R packages are used.
- All the analyses are based on results reported in the publication.
- The one true gospel on statistical inference does not exist and more than one approach to analyse these data may be defensible.
- Therefore: Please be aware these comments and suggestions reflect my own preferences and standards in these matters. If you feel I should change some of my preferences and/or standards please let me know, because I review and adjust them on a regular basis.