I think one of the most exciting findings in parapsychology has been the development of an experimental test by Ed May and co-workers (May et al., 1995) which can determine whether psi results have been produced by psychokinesis (PK) or extrasensory perception (ESP).

[...]

The above analysis has been applied to a large number of experiments using random number generators, in which correlations with operator intention demonstrated the presence of psi (May et al., 1995). However, the analysis of May et al. showed that PK was *not* present.

[...] May, E.C., Utts, J.M., and Spottiswoode, S.J.P. (1995). Decision Augmentation Theory. Journal of Scientific Exploration, 9(4), 453-488.

Presumably the strong statement that May et al. "showed that PK was *not* present" is driven primarily by the 8.6-sigma refutation quoted in the abstract. May, responding to a post of mine, wrote:

York suggested that I quote an 8-sigma effect based upon a meta-analysis of all the RNG data. Wrong. In the abstract of our JP paper we attribute an 8.6 sigma result in favor of the influence model to an analysis of a large number of individual button presses of PEAR data.

I apologize for my misinterpretation of the statements made by May et al.; it is very clear from the relevant paragraph (p. 467 of the JSE reference given above) that the 8.6-sigma result comes from examining the data generated by one specific operator at PEAR, at two different sequence lengths.

However, I feel compelled to point out that the figures given in that very same paragraph of the JSE article also show that the 8.6-sigma figure is, to say the least, suspect, and the use of that value without qualification or caveat in the abstract of the article, and in subsequent discussions, is distinctly questionable.

In that paragraph, the authors demonstrate the calculation of the 8.6 figure from the observed Z^2 data in the two subsets used. It is derived by using the observed effect size in the short-sequence data as a prediction for the effect in the long-sequence data. However, they also point out that if one performs the calculation in the other direction, using the long-sequence observation to predict an effect for the short-sequence data, the t-score is only 2.398, rather than 8.643. Since there is no obvious quality that makes one dataset the "prediction" and the other the "test", it is decidedly not clear why one should prefer one value over the other, and the fact that they are in such stark contrast (an overwhelmingly powerful refutation vs a moderately convincing one) suggests that there is something worrisome about such a strong dependence on an essentially arbitrary choice of analysis method.

I will therefore spend a couple of paragraphs showing the application of a standard, symmetrical test to the data used by May et al., restricting myself to the data values actually published in the paper lest I be accused of generating a red herring by throwing some extra data into the pot. Fortunately, the raw data used by May et al. to derive the T-scores above are also given on page 467 of JSE 9/4. The short-sequence data comprise 5918 trials at 200 bits per trial; the Stouffer Z for the presence of an anomalous effect is 3.37, and the observed Z^2 value is 1.063 +/- 0.019. The long-sequence data comprise 597 trials of 10^5 bits each; the Stouffer Z is 2.45, and the observed Z^2 is 1.002 +/- 0.050.

Given these figures, the most natural evaluation for consistency with a hypothesis would seem to be to treat both observations as empirical measurements of a model parameter, and construct a T- or Z-test against the null hypothesis that the parameter has the same value in both datasets. The first line of May, Utts, and Spottiswoode's Table 1 (p. 461) gives the expected value of Z^2 in terms of model parameters. For the PK or "micro-AP" model, E[Z^2] = 1 + epsilon^2 * n; for the DAT model, it is simply the sum mu(z)^2 + sigma(z)^2. The fact that the latter is a sum of two unknown parameters of the model is essentially irrelevant, since the same sum is measured in both cases.

Calculating from the observed figures listed for Z^2 (we're back on p. 467 again), one finds that for the n=200 data, epsilon^2 * n = 0.063 +/- 0.019; since n=200, epsilon^2 = (3.15 +/- 0.95) x 10^(-4). (I am keeping more significant digits than I am entitled to in the intermediate figures, to avoid too much accumulation of roundoff error.) For the long-sequence data, one finds epsilon^2 * n = 0.002 +/- 0.050; since n=10^5, this observation gives epsilon^2 = (2 +/- 50) x 10^(-8). Any hand calculator with a square-root key will allow the reader to verify that this gives a T score of 3.3 against the micro-AP hypothesis. Both observations are treated symmetrically in this calculation, rather than one being used to predict the other.
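For readers without a hand calculator to hand, the same arithmetic can be reproduced in a few lines. This is purely an illustration of the two-sample computation described above (the variable names are mine; the input figures are those published on p. 467):

```python
import math

# Observed Z^2 values and standard errors from p. 467 of JSE 9/4
z2_s, err_s, n_s = 1.063, 0.019, 200    # short sequences: 200 bits/trial
z2_l, err_l, n_l = 1.002, 0.050, 1e5    # long sequences: 10^5 bits/trial

# Micro-AP model: E[Z^2] = 1 + eps^2 * n, so eps^2 = (Z^2 - 1) / n,
# with the standard error scaling by the same factor 1/n.
eps2_s, sig_s = (z2_s - 1) / n_s, err_s / n_s   # (3.15 +/- 0.95) x 10^-4
eps2_l, sig_l = (z2_l - 1) / n_l, err_l / n_l   # (2 +/- 50) x 10^-8

# Symmetric two-sample T against equal eps^2 in both datasets
t = (eps2_s - eps2_l) / math.hypot(sig_s, sig_l)
print(round(t, 1))  # -> 3.3
```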

For the DAT model, the unknown parameter is simply equal to the observed Z^2, and so the T test is simply the comparison of 1.063+/-0.019 against 1.002 +/- 0.050, yielding T=1.14.
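The DAT comparison is even simpler, since no rescaling by n is needed; a minimal sketch, again using only the published figures:

```python
import math

# Observed Z^2 values and standard errors from p. 467 of JSE 9/4
z2_s, err_s = 1.063, 0.019   # short sequences
z2_l, err_l = 1.002, 0.050   # long sequences

# Under DAT, E[Z^2] = mu(z)^2 + sigma(z)^2 in both conditions, so the
# test is a direct two-sample comparison of the observed Z^2 values.
t = (z2_s - z2_l) / math.hypot(err_s, err_l)
print(round(t, 2))  # -> 1.14
```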

I am not presenting degrees of freedom for any of the T scores, as they have so many that they are for all practical purposes equivalent to Zs.

The two-sample T-test is a straightforward technique that I have found in several standard references on statistics. The predict-and-compare process used by May et al. to get T=8.6 (or T=2.4, depending on which direction you choose), on the other hand, is not one that I have ever before seen applied to such a simple evaluation. I therefore contend that the data examined on p. 467 of the JSE reference above amount, by standard and well-understood statistical tests, to a T=3.3 result, not to a T=8.6 result, and that people should stop talking about an eight-sigma refutation of PK models for observed anomalies.

York Dobyns

ydobyns@phoenix.princeton.edu