Declarations of Statistical Significance and Origin of the P = 0.05 "Cutoff"

January 15, 2009 to January 19, 2009. Members discuss the origins of P < .05 being used as the cutpoint for significance, when the actual P value should be expressed and left for readers to interpret (the origins of the P < .05 cutpoint are quite interesting, ultimately relating to copyright). Some point out that the alpha level, not P value, should be stated, and its level will depend on many factors including the number of comparisons (and need for a Bonferroni correction or other adjustment). Confidence intervals are generally more informative. Several expert members discuss the meaning of statistical vs clinical significance, including whether significance should be reserved for its statistical meaning. Useful references are cited. Finally, Frank Davidoff proposes that the editorial community consider taking a formal stand on analysis of quantitative data.--MW

I've been noticing increasingly in a variety of publications a phrase, in the Methods section, to the effect that "P values less than 0.05 are interpreted as significant".
 
I would be interested, sociologically, in the forces behind the use of this cliche-like phrase. In most cases, it seems rather irrelevant, as when the authors proceed to provide us with a variety of P values in the text and figures, in which case the reader can decide for herself whether to consider the finding significant. In other cases, it appears that the declaration might be relevant but ignored, as when the authors report findings as if we should pay attention to them, when they have either not calculated a P value, or have done so but it is not less than their declared cut-off. In still other cases, one is left wondering if that phraseology is meant to imply that whenever a finding is reported, it has some unpublished P value "less than 0.05" that is not overtly declared.
    
All of this confuses the distinction between probabilistic significance (and its many difficulties) and biological significance (and its many difficulties). 
 
Such statements are NEVER given a citation, so one is left with the impression that the authors meant to report their idiosyncratic cut-off when, in practice, a value of "0.05" is widely used (though I have yet to find an "original source" for the use of this cutoff). 
  
The perfunctory and cliched use of the phrase makes me think that it has gained high frequency use in response to the perception that editors require it. Is this true? If so, why? How did this practice arise? How is the requirement communicated to authors?
 
Finally, do any of you know the origin of using "0.05" as a "cut-off" in deciding to declare statistical significance—this is not a WAME question per se, but one of the history of statistics. I've been reading Ian Hacking's histories, but haven't found a reference to it yet.
 
John Rodgers
________________________
My preferred phrase is "Alpha was set at 0.05." However, the more critical issue that is never addressed is whether the article contains a statement confirming that the assumptions underlying the test were met by the data.

Many journals prefer confidence intervals to P values, which avoid many of the above concerns. As a nonstatistician who nevertheless wrote How to Report Statistics in Medicine, I can tell you that many statistical phrases are almost like mantras: the wording is exact enough that even minor deviations change the meaning. And this consistency with stock phrases is important, if irritating. Sometimes, as you note, the phrases are used just because, out of habit, not because they are relevant. As an author's editor, I see this as job security (!), but journal editors need to have a different perspective.

My research into statistical reporting indicates that the most common statistical reporting error is confusing statistical significance with clinical importance.

The original source of the 0.05, as near as I can tell, is RA Fisher, the man who introduced the concept of hypothesis testing. It was his personal habit to use 0.05 and it simply caught on.

Again, my research established that when P values are reported in an article, the alpha level should be reported. That is, authorities publishing on this topic all agree that a P value needs an alpha level.

Tom Lang
________________________
Thank you, John and Tom, for this enlightening exchange. I just wish to add that another reason for stating the alpha level in a paper is that the researcher can theoretically choose whatever alpha level he or she desires. The convention is to use an alpha level of 0.05, but there are instances when the researcher wants a higher level of confidence and will opt for an alpha of 0.01 or 0.001, for example, or even a lower level of confidence, in which case he or she may choose an alpha of 0.10. The choice will depend on the researcher's tolerance for a type I or II error within the context of the particular study. Please correct me if I have said anything misleading, Tom.
 
Maria Luisa Clark
Bulletin of the World Health Organization
________________________
You've got it right. Smaller alpha values are often used to control for multiple comparisons (the Bonferroni correction), as well as to reduce the chance of making a type-1 error. Larger values are used in things like regression analysis, where the first step is "univariate analysis" to determine which variables should be considered in the final model. To "cast a wide net," alphas of 0.1 or 0.2 are used, meaning that the relationship between the individual variable and the outcome has a P value less than 0.1 or 0.2.

Hi to Hooman!

Tom
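
The two uses of alpha described in this message can be sketched in a few lines of Python. This is only an illustration: the function names and the univariate P values are invented for the example, and the screening step simply applies the "P value less than 0.1 or 0.2" rule as stated above.

```python
from typing import Dict, List


def bonferroni_alpha(family_alpha: float, n_comparisons: int) -> float:
    """Per-comparison alpha under the Bonferroni correction."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return family_alpha / n_comparisons


def screen_variables(p_values: Dict[str, float], screening_alpha: float) -> List[str]:
    """Names of variables whose univariate P value is below the screening alpha."""
    return [name for name, p in p_values.items() if p < screening_alpha]


# Five comparisons at an overall alpha of 0.05: each one is tested at 0.01.
per_test_alpha = bonferroni_alpha(0.05, 5)

# Hypothetical univariate P values, invented for illustration only.
univariate_p = {"age": 0.03, "sex": 0.45, "smoking": 0.15, "bmi": 0.08}
candidates = screen_variables(univariate_p, screening_alpha=0.2)

print(per_test_alpha)
print(candidates)
```

With these made-up inputs, the wide net at 0.2 keeps age, smoking and bmi as candidates for the final model, whereas only age would survive a screen at 0.05.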
________________________
Tom Lang is correct. RA Fisher arbitrarily said that 1 in 20 is about right to quantify the meaning of unlikely. You have to realize that, at the time, computing was such that P values were hard to compute. So instead, cutoffs were tabulated at conventional levels, mainly 0.05. Of course, now nobody flips to the back of the book to find the right table and look up the cutoff.  The computer provides the P value. And yet 0.05 still permeates the literature, as noted. And there are 2 things wrong with this, as already alluded to.

First, presumably some decision hinges on the results. Will the drug be approved? Will the study be published? Will the next grant application be funded? Decisions require cutoffs, but also require consideration of the two types of errors (there are, in fact, more than two, but that is another story). What is the harm in one kind of mistake relative to the harm in the other kind of mistake? This consideration should always guide the rational choice of an alpha level, but almost never does. Consider, for example, two claims, first that broccoli controls tooth decay, and second that arsenic does. How convincing would each case need to be before we were prepared to "prescribe" or at least endorse each one? For broccoli, there is no harm in falsely finding in favor of the claim, because it is healthy anyway. Give it a large alpha level. For arsenic, we had better be absolutely certain. Give it a very small alpha level.
 
Second, is alpha even in the purview of the researcher? Presumably, the research is presented to another body, be it a medical journal or a regulatory authority, who then must make a decision. Does a political candidate tell a voter how convinced the voter must be before this voter is required to vote for this candidate? We all have the right to our own individual alpha levels, so it seems a bit silly for the researcher to be the one who specifies it.
 
Vance Berger
________________________
Great analysis!

Tom
________________________
This topic of statistical significance, brought up by John, like the majority of the issues discussed on the WAME list, has been of interest to me, another non-statistician.

Concerning the origin of the (nearly) ubiquitous value of 0.05, I found a web page titled "Why P=0.05?" (http://www.jerrydallal.com/LHSP/p05.htm). It is written by Gerard E. Dallal of Tufts University.

He quotes from page 44 of the 13th edition of Statistical Methods for Research Workers by RA Fisher (originally published in 1925):

"The value for which P=0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation ought to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant. Using this criterion we should be led to follow up a false indication only once in 22 trials, even if the statistics were the only guide available. Small effects will still escape notice if the data are insufficiently numerous to bring them out, but no lowering of the standard of significance would meet this difficulty."

Yet, as Dallal points out, Fisher himself is persistently inconsistent in conforming to this guideline.

I also agree with the views expressed by Tom and Maria Luisa.

I will be looking forward to further exchanges on this subject.

Fernando Alvarez
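
The figures in the Fisher passage quoted in this message can be checked numerically. The sketch below assumes a standard normal distribution and two-sided tail areas, which is how the 1.96 and "once in 22 trials" figures appear to fit together; the variable names are illustrative.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def two_sided_p(z: float) -> float:
    """Two-sided tail area beyond +/- z under the standard normal curve."""
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))


p_at_1_96 = two_sided_p(1.96)  # close to 0.05, ie, 1 in 20
p_at_2 = two_sided_p(2.0)      # close to 0.0455
one_in_n = 1 / p_at_2          # close to 22, matching "once in 22 trials"

# Going the other way: the deviation whose two-sided tail area is 0.05.
z_for_05 = STANDARD_NORMAL.inv_cdf(1 - 0.05 / 2)

print(f"P at 1.96 SD: {p_at_1_96:.4f}")
print(f"P at 2 SD:    {p_at_2:.4f} (about 1 in {one_in_n:.0f})")
print(f"Deviation for P = 0.05: {z_for_05:.3f} SD")
```

On this reading, "1 in 20" belongs to 1.96 standard deviations and "once in 22" to the rounded criterion of twice the standard deviation.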
________________________
I was taught in graduate school not to make claims of significance—but rather to report the P value and let the reader decide how much weight to give the result. Any binary approach to significance is arbitrary.

Robert Weyant
________________________
Potentially arbitrary, but not necessarily. When a binary decision must be made, a basis is required for making this decision. Often such a basis will be a P value (once other factors have been taken into consideration), and then an alpha level is appropriate. What is not appropriate is the notion that it should always be 0.05.
 
Vance
________________________
There has been considerable discussion of this in the epidemiologic literature, most of it focused on fundamental discontent with the underlying concepts. A good place to start is the contributions of Dr Stephen Goodman:

Goodman SN. P values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. Am J Epidemiol 1993;137:485-496.
Goodman SN. Toward evidence-based medical statistics. 1: The P value fallacy. Ann Intern Med 1999;130:995-1004.
Goodman SN. Toward evidence-based medical statistics. 2: The Bayes Factor. Ann Intern Med 1999;130:1005-1013.
Goodman SN. Of p values and Bayes: A modest proposal. Epidemiology 2001;12:295-297.

A number of others have made important contributions to this discussion, but I think you'll find the best summary of the issues in the first listed article.

Rich Rothenberg
________________________
Hence the value of confidence intervals—these give the reader an even better idea than the P value of the significance of the findings. But I think the general consensus on a 1-in-20 chance of error, while arbitrary, is quite reasonable in a "face validity" sense. And the 95% CI, after all, is based on the same 1-in-20 chance of error—just provides a bit more information than does the P value alone.

David C. Cone
Editor-in-Chief, Academic Emergency Medicine
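
The link between a 95% CI and the 0.05 cutoff can be illustrated for the simplest case, a normal-approximation (z) test. This is a sketch under that assumption only; the estimates and standard errors are hypothetical numbers, and the function name is invented for the example.

```python
from statistics import NormalDist


def z_test_summary(estimate, std_error, null_value=0.0, level=0.95):
    """Return (lower, upper, p): a confidence interval and the two-sided P value."""
    normal = NormalDist()
    z_crit = normal.inv_cdf(1 - (1 - level) / 2)   # 1.96 for a 95% interval
    lower = estimate - z_crit * std_error
    upper = estimate + z_crit * std_error
    z = (estimate - null_value) / std_error
    p = 2 * (1 - normal.cdf(abs(z)))
    return lower, upper, p


# Hypothetical mean differences, both with standard error 1.8.
for estimate in (4.0, 3.0):
    lower, upper, p = z_test_summary(estimate, 1.8)
    excludes_null = lower > 0 or upper < 0
    print(f"estimate {estimate}: 95% CI {lower:.2f} to {upper:.2f}, "
          f"P = {p:.3f}, CI excludes 0: {excludes_null}")
```

In this setting the interval excludes the null value exactly when P is below 0.05, but the interval also shows the range of effect sizes compatible with the data, which the P value alone does not.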
________________________
Steve Goodman is the statistical reviewer for the Annals of Internal Medicine and a WAME member, although apparently not logged on to the listserve at the moment. He's also very personable and clear in his explanations; even I can understand them. The articles Rich mentions have been quite useful to me.

Tom Lang
_______________________
Thank you! Since Dr Goodman's name has come up so often, maybe I should mention that he and I, both in Maryland, have argued a few times about the role of P values. I do not mean to misrepresent him here, and clearly he can speak for himself once he is logged on. But my understanding of his view is (was) that P values have no role at all, and should be replaced by likelihoods. In the literature, it is quite common to see P value bashing, and rarely does one see a rebuttal, or even a counter-argument. But there is a logical basis for P values, in fact. I am not sure how much of this is already known, so in an effort not to waste your collective time, I will not elaborate unless asked to. All I will say for now is that the chief argument against P values seems to be that they are misunderstood. That all physicians—indeed, all non-statisticians—understand them to be the probability that the null hypothesis is true. And three responses come to mind. First, I doubt that this is true. Second, if it is true, then the answer is not to throw out the baby with the bath water, but rather to recognize the baby for what it is. Let us educate consumers of statistics about what P values really are. Third, it is not all that clear to me that a whole lot of harm is done when that mistake is made (ie, a P value is taken to be the probability of the null hypothesis). Either way, we have a measure of support for or against the null hypothesis, just calibrated differently. It is a problem to mix and match, as in comparing P values (as the probability of the null hypothesis) to Bayes estimates of such probabilities. But if we are always using P values for the same purpose, then we are simply calling a rose by another name, and if that is good enough for Shakespeare, then, maybe it is good enough for us, too?
 
Vance
________________________
Of possible interest:
 
Cowles M, Davis C. On the origins of the .05 level of statistical significance. American Psychologist. 1982; 37:553-558.
 
Michael Berkwits
________________________
In reality, it is not always .05. For example, the FDA requires 2 independent studies, each at the .05 level, which translates into a rough level of 0.0025. When there is more than one primary hypothesis, one might use a more extreme level (0.025 or 0.01, for example) to reduce the likelihood of a false positive.
 
Simply put, .05 means a one in twenty chance of a random occurrence of a false positive. Any level that you choose will have a false positive rate. The more extreme the level is, the less likely one is to reject the hypothesis of no difference even when there is a difference. So 0.05 is simply a compromise between the two types of errors.
 
Many editors, and I mean by this the whole peer review apparatus, prefer reporting P values rather than using a fixed level of significance. It then lets the reader put their own weight/belief into the result.
 
Sam Sussman…with some help from a world-renowned medical/biological statistician!
________________________
I have to thank Doug Altman, who I was lucky enough to cross paths with today in Baltimore, for alerting me this evening to this cornucopia of P value wisdom in my email WAME folder, which I don't check often enough.

And also to Rich Rothenberg and Tom Lang, who felt I might have something to add. I don't have the volumes of some of the real historians of statistics by me at the moment (ie, David Salsburg, Ted Porter, Gerd Gigerenzer and Stephen Stigler), but I can share what I think I know about this.

First, John Rodgers is right to wonder about the rhetorical tic that we see in research papers about what P value will be called "significant". I have written about that in some of the papers cited. It means virtually nothing, except it is an unfortunate indicator that the author will be letting a computer do the thinking. Every time I read that phrase, and I read it a lot, I feel sharp retrosternal pain. Whether it's dyspepsia or angina, I'm not sure, but none of the conventional remedies work. This should be a signal to the discerning editor that the authors are planning to cease meaningful thought after entering the "Results" section.

Fisher was indeed the "inventor" of the P value per se, although the use of tail-areas preceded him. And Fernando Cervera is right that the suggestion that "0.05" had some inferential importance did appear in Statistical Methods for Research Workers (henceforth SMRW), which, btw, was one of the most influential and popular books on research methods ever written. A great deal more wisdom (and less absolutism) can be found in that book than many that followed (and that are being written today). If one is really interested in how he thought about inference, his subsequent book Statistical Methods and Scientific Inference (1956) has the best, sometimes befuddling and often entertaining, summary of his quite passionate views on the subject.

But it is extremely interesting to know how the 0.05 cutoff came about, and how Fisher said it should be used. In SMRW, he wanted to reprint the tables of the chi-square and other distributions originally published by Elderton and Pearson (with whom he had a lifelong feud). He was not granted the copyright, so Fisher was stuck. What he did instead was literally turn the tables, ie instead of reporting the tail area for a given Z or Chi-square, he arranged them the reverse way—he listed key tail-areas (eg, 0.10, 0.05, 0.02, 0.01, etc), and reported their associated "critical" Z and chi-square scores (the latter for different degrees of freedom). Fisher's tables were far more compact than Pearson's, and they had enormous influence. His tables could not be used to calculate an "exact" P value for various distributions, which Pearson's allowed, but rather find out if the Z- or chi-square score exceeded certain "critical values" associated with the few tail area probabilities in the table. One such "critical value" was 1.96 standard errors, associated with the tail-area of 0.05. This is how he describes the use of the chi-square table:

"In preparing this table we have borne in mind that in practice we do not want to know the exact value of P for any observed χ², but, in the first place, whether or not the observed value is open to suspicion. If P is between .1 and .9 there is certainly no reason to suspect the hypothesis tested. If it is below .02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at .05, and consider that higher values of χ² indicate a real discrepancy."

So the emergence of a threshold was partly an artifact of the way Fisher arranged and reported the statistical tables in the book, which only allowed one to say whether a given test-statistic exceeded the various critical values he reported in the tables. As is indicated in the informal phrase above and in many subsequent mentions of this threshold, the 0.05 cutoff is somewhat arbitrary, and he does not claim that this establishes the falsity of the null hypothesis—it only opens it to "suspicion". In fact, he consistently mentions smaller P values as necessary for proof.
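
To make the "turned" table concrete, here is a minimal sketch. It assumes two-sided tail areas and computes the critical values from the normal distribution rather than copying them from Fisher's tables; the function names and the observed statistic are invented for the example.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

# A few key tail areas (two-sided), each mapped to its critical z score.
TAIL_AREAS = (0.10, 0.05, 0.02, 0.01)
CRITICAL_Z = {area: STANDARD_NORMAL.inv_cdf(1 - area / 2) for area in TAIL_AREAS}


def smallest_tail_area_exceeded(z: float):
    """Table-style lookup: smallest tabulated tail area whose critical value |z| exceeds.

    Returns None when |z| is below every tabulated critical value.
    """
    exceeded = [area for area, crit in CRITICAL_Z.items() if abs(z) > crit]
    return min(exceeded) if exceeded else None


def exact_two_sided_p(z: float) -> float:
    """The tail area itself, which a table arranged this way cannot supply."""
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))


z_observed = 2.1  # hypothetical test statistic
for area, crit in CRITICAL_Z.items():
    print(f"tail area {area:.2f} -> critical z {crit:.3f}")
print("table lookup:", smallest_tail_area_exceeded(z_observed))
print(f"numerical P:  {exact_two_sided_p(z_observed):.4f}")
```

For a statistic of 2.1, the compact table can say only that the 0.05 point is passed and the 0.02 point is not, while the tail area itself is about 0.036.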

Perhaps his most telling quote, from another paper, is one that also explains the reason why he chose "significance" as the word to denote passing some arbitrary boundary:

"If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent point), or one in a hundred (the 1 per cent point). Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance." (Fisher, 1926)

There are two aspects that are revealing about that quote. He explicitly uses the "high" 0.05 threshold mainly to "safely ignore" non-significant deviations (ie, P>0.05), not to regard p<0.05 as proof. As far as what p<0.05 means, the most important and telling aspect of the above quote is that he stresses that a fact is only established if one "rarely fails" to achieve this threshold, ie, if several similar experiments reach it. Not one—several. In a single experiment, it merely says that the observed deviation is, quite literally, "significant", in the sense that it is worth noticing. Not proof, just worth noticing, and the experiment worth repeating. We see him here presaging current practice in meta-analysis.

Now, all of this ignores the importance of the prior probability of the null hypothesis, but that is for another day. The prior statements in this list that p<0.05 or 95% CIs correspond to "1 in 20" chances of error are unfortunately incorrect, or at best, insufficiently precise. If they were right, we wouldn't have most of the problems of inference that we see in medical journals today.

Now, how did the "0.05" come to be (ab)used as it is today? That is more a sociologic than a technical question, and I did write about it in the 1993 paper already cited, and Gigerenzer has as well in his books, as have many others. It is a long, sad story, and unfortunately many medical editors have been, and continue to be, accomplices in this inferential crime of the century. Fisher could be blamed for lighting the match, but not for the conflagration that followed. He in fact was an implacable foe of using the P value as it is today. But we persist. And I will now desist. I hope this was informative.

Steve Goodman
Editor, Clinical Trials
Assoc Editor, Annals of Internal Medicine
________________________
I suggest this 'cliche' (as John describes it) is only useful if it refers to a significance value that was set BEFORE the analysis was done. There have been cases of trials that were designed to show the superiority of one treatment over another which failed to produce (conventionally) statistically significant findings which have then been reported as if they were 'equivalence' or 'non-inferiority' studies (ie, looking to find no difference between the treatments).

You might expect different cut-offs to be set for superiority or non-inferiority studies, so, for journal editors, what's really important is to know which the trial started with.

This is one reason why reviewing a protocol alongside a paper is often a good idea, and why trial registration is vital. A good protocol (or data analysis plan) will specify the cut-offs that have been proposed a priori.

Liz Wager

PS
For anyone interested in reporting equivalence trials, there's a CONSORT statement specifically about this: JAMA. 2006 Mar 8;295(10):1152-60. http://jama.ama-assn.org/cgi/content/full/295/10/1152
________________________
Dear Dr Goodman,
 
Thank you so much for taking the time to provide us with this fascinating summary. Is there one article you could recommend on this history?
 
As well, although you say that you will leave prior probability to another day, could you provide us with something on this intriguing topic? It is my (perhaps somewhat primitive?) understanding that P values only have validity when one is investigating a prior hypothesis. If this is so, does this invalidate or weaken the use of such statistical analysis in retrospective analyses?
 
I'd appreciate your thoughts or those of anyone else from WAME.
 
Thanks in advance.
 
A Mark Clarfield
Section Editor, J Amer Geriatr Soc
________________________
I am privileged to bring our colleague Jan Vandenbroucke into this conversation. He was forwarded some of this discussion and requested that I post the following interesting tidbit for him:

A 1930s introductory medical statistics book that was used at the London School of Hygiene and Tropical Medicine (Woods and Russel, An Introduction to Medical Statistics, 1930) had a quite enlightened view on the use of P values, coming close to Fisher’s view as described by Steve Goodman in a previous mail. These authors wrote: “Many people say that a difference from expectation greater than three times this probable error, or twice the standard deviation, is ‘significant’...This is quite arbitrary. It is much better in any important case to state the arithmetical facts as revealed by the standard deviation”. Preceding this quote, the book gives a witty example of a fictitious treatment with an extreme difference from expectation which is nevertheless improbable. The 1930 authors conclude that one should ‘weigh alternatives’. Apparently the teaching of elementary medical statistics has lost this view over the remainder of the 20th century. The original quote, the example and its context can be seen and downloaded at the James Lind Library:

http://www.jameslindlibrary.org/trial_records/20th_Century/1930s/woods/woods-kp.html

Alternatively, go to the James Lind Library and search for ‘Woods’: http://www.jameslindlibrary.org/
The example was also described in a letter in JAMA: Vandenbroucke JP. Weighing alternatives. JAMA 1988;259:1500

Steve Goodman
________________________
I don't know of a single article per se that covers this history comprehensively, although I am quite sure they exist. Mike Berkwits has mentioned one that I don't know. Most of my knowledge comes from books of the authors I mentioned earlier, Fisher's own writings, as well as those of Richard Royall, Anthony Edwards (a student of Fisher, still alive in the UK) and Fisher's biography by his daughter. A more recent book, with an entertaining title and a somewhat overheated but nevertheless knowledgeable and comprehensively researched text, is The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice and Lives, by Ziliak and McCloskey, U Michigan Press, 2008. It is an interesting amalgam of history, philosophy, technical argument and rhetoric, but its points are valid and it draws from a very wide range of sources, with a 23-page single-spaced bibliography.

On the question of whether the P value is only "valid" when examining a priori specified hypotheses, I would simply say that such hypotheses usually have higher prior probability (ie, more sound external evidence) than those that are data-derived. It is not what makes the P value "valid" but what makes a conclusion reliable. It is the strength and independence of the external evidence, not the "priorness" of the hypothesis, that makes the "prior probability" low or high.

To summarize the issue here from my perspective, it is not whether one uses P values or not, but whether there is either an informal or formal way that the strength of prior evidence, biological explanation and the strength of the current study come into play in the formation of conclusions and recommendations. P values (as they are used, not as Fisher intended) and hypothesis tests make that difficult, if not impossible, although in sophisticated hands they of course can be interpreted sensibly. The methods and philosophy of Bayesian/likelihood approaches provide a formal language and mathematical framework for that to occur. One can vaguely approach the same sensible outcomes with the understanding of that philosophy, meta-analysis, a deep understanding of the problems of P values, more focus on confidence intervals, modeling the possible effects of bias, and an appreciation that there is no mathematical foundation—aside from Bayes theorem—for saying that there is a "X% chance that my claim is true". So it is certainly possible to make sensible conclusions with the current technology, and if the evidence is extremely strong, that usually renders inferential philosophy irrelevant, ie, the truth is apparent to all. But on the road to that truth (which often is incomplete), it really matters what we think is scientifically relevant to the drawing of conclusions. P values themselves are not the enemy, but the manner of their misinterpretation and use together with the underlying frequentist philosophy makes it extraordinarily difficult to proceed in a logically coherent and defensible way from research question to design, analysis, conclusions and recommendations. I am not "opposed" to P values per se, but rather the philosophy that inevitably accompanies their use. If you want to read more on this, my two 1999 Annals of Internal Medicine articles cover my own perspective in some depth, and there are a multitude of others both before and since. 

SG
________________________
But let us keep our perspective and not get the mistaken impression that the true alpha level is 0.0025, because offsetting this is the discretion exerted by the research team in selecting the endpoints, the patient population, the timing of assessments, grading conventions, methods of randomization, and so on. These can be manipulated so as to make it quite likely that an ineffective treatment will be found effective. Hence, the true alpha level may be quite large, even while the nominal one is rather small.

Vance
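
A rough calculation shows how far the nominal and true levels can drift apart. It rests on a deliberately crude assumption that is not in the message itself: that the research team's discretion amounts to picking the most favorable of m independent analyses in each study. The value m = 5 and the function names are hypothetical.

```python
def both_studies_false_positive(per_study_alpha: float) -> float:
    """Chance that two independent studies of an ineffective treatment are both 'significant'."""
    return per_study_alpha ** 2


def best_of_m_alpha(nominal_alpha: float, m_choices: int) -> float:
    """Per-study false positive rate when the best of m independent analyses is reported."""
    return 1 - (1 - nominal_alpha) ** m_choices


nominal = both_studies_false_positive(0.05)                  # the 0.0025 quoted earlier
inflated_per_study = best_of_m_alpha(0.05, m_choices=5)      # about 0.226
effective = both_studies_false_positive(inflated_per_study)  # about 0.051

print(f"nominal two-study level:   {nominal:.4f}")
print(f"per-study level, best of 5: {inflated_per_study:.3f}")
print(f"effective two-study level: {effective:.3f}")
```

Under these assumptions the two-study requirement ends up at roughly 0.05 rather than 0.0025, which is one way to picture the gap between the nominal and the true alpha.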
________________________
P values are measurements of probability due to chance. P values are also measurements of statistical significance, based on an arbitrary reference point of 0.05, or 1 in 20.

But to understand P values, you have to understand fixed level testing, critical region, and "the smallest fixed level at which the null hypothesis can be rejected". I invite Vance to explain this to us.
 
Hnid Karim
________________________
Happy to. In fact, P values represent freedom from fixed level testing, and from the arbitrary reference point of 0.05.  Contrast these:

Is there, or is there not, sufficient evidence to reject H0 at the 0.05 level?
What is the P value (what is the lowest level at which we could reject H0)?

The P value is an improvement over the old system of one star for 0.05 and two stars for 0.01, or similar systems that tell the reader only whether a certain threshold has been met. With the P value, each reader can apply his or her own individual alpha level, and compare that requirement for evidence to the actual level of evidence obtained. In contrast, if all one knows is that 0.05 has been met, but one wishes to instead use the 0.03 alpha level, then one has no way to know if this level of evidence has been met.

Vance
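
The contrast between a reported threshold and a reported P value can be written as a small decision function. This is a sketch only: the function name is invented, and it assumes the convention that a result is significant when P is at or below the reader's alpha.

```python
from typing import Optional


def meets_alpha(reader_alpha: float,
                p_value: Optional[float] = None,
                reported_threshold: Optional[float] = None) -> Optional[bool]:
    """Can the reader tell whether the result is significant at their own alpha?

    Returns True or False when the report settles the question,
    and None when the report is too coarse to decide.
    """
    if p_value is not None:
        return p_value <= reader_alpha
    if reported_threshold is None:
        raise ValueError("supply either p_value or reported_threshold")
    # Only "P < reported_threshold" is known.
    if reported_threshold <= reader_alpha:
        return True   # P < threshold <= alpha, so the reader's bar is met
    return None       # P could lie on either side of the reader's alpha


# A reader who insists on alpha = 0.03:
print(meets_alpha(0.03, p_value=0.041))            # False
print(meets_alpha(0.03, reported_threshold=0.05))  # None: cannot be determined
# A more lenient reader with alpha = 0.10:
print(meets_alpha(0.10, reported_threshold=0.05))  # True
```

The None result is the situation described above: knowing only that 0.05 was met says nothing about whether 0.03 was met.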
________________________
My research indicates that exact P values (eg, P = 0.03) are much preferred to threshold values (eg, P < 0.05). This preference is in keeping with Vance's point that readers can apply whichever alpha level they like, IF they can see the exact P value of a study.

Tom
________________________
One note on terminology. It is clear what you mean by “exact P value”, and it is clear why you would use this term. However, this term is already in use, with a different meaning. It is not my intention to take this discussion too far from where it is, although, on the other hand, the recent suggestion on this thread that manuscripts not be accepted unless they conform to certain standards might make this extension relevant. Without further ado, these precise or numerical P values (shall we call them that instead?) are not exact. They serve as nothing more than approximations to the exact P values that could easily be computed, yet for some reason are not. As an example, how often do we see a protocol state, “We will analyze the data with the chi-square test unless one or more cell counts is less than five, in which case we will instead use Fisher’s exact test”? Likewise, how often do we see the same thing with a t-test and Wilcoxon reserved for those rare occasions when the data are not normally distributed?

But the reality is that the data are never normally distributed. They cannot possibly be normally distributed. The ease with which this phrase rolls off the tongue belies a bewildering set of probabilistic statements, including, for example, a positive probability that a blood pressure exceeds ten billion. Or is less than negative ten billion. We have one analysis that is exact with or without normality, and another that is exact only if the data are normally distributed. But they are not. So, in other words, we have one exact P value, and one inexact one. Somehow, we manage to prefer the inexact one.

How can an approximation be preferred to the very quantity it is trying to approximate? One might argue that the chi-square test is of intrinsic interest; it isn’t. If it were, then we would not bother to replace it with Fisher’s exact test in those cases in which the approximation is likely to be poor.

So let us look again at present practice. Use an approximation unless it is likely to be poor, in which case we grudgingly revert to the exact P value. Would it not be better to upgrade to a more rational system in which we use an approximation unless it is poor (as opposed to being likely to be poor), in which case we grudgingly revert to the exact P value? We can easily do this by simply comparing the two values, and seeing how close they are. Is this not a better measure of the quality of the approximation than the rather vague set of expected cell counts, which are somehow said to inform us about the quality of the approximation? Shall we say instead that if the chi-square P value is within some tolerance, perhaps 0.001, of the exact P value, then we will use it? Clearly, this would be a step in the right direction. But if this is progress, then why not continue with more progress, and tighten that tolerance interval? How small should it be? In fact, the very notion can be rejected on its face. The only reasonable value would be zero. The approximation is not validated by being close to exact; it is validated only if it is actually exact. So now we may use the approximation if it is perfect, and replicates the exact value, or revert to the exact value. Or we can save ourselves some time, and simply use the exact P value, without deviation, every time.
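
A minimal sketch of the comparison described above, for a 2x2 table, might look like the following. Several things here are assumptions made for illustration and are not taken from the message: the function names, the example tables (hypothetical counts, not data from any study), the use of the uncorrected Pearson chi-square, and the "sum the tables no more probable than the observed one" convention for a two-sided Fisher P value (other two-sided conventions exist). Fisher's test is exact conditional on the table margins.

```python
from math import comb, erfc, sqrt


def chi_square_p(a, b, c, d):
    """Pearson chi-square P value (1 df, no continuity correction)
    for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return erfc(sqrt(stat / 2))


def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher exact P value for the same table: the total
    probability of all tables with the observed margins that are no
    more probable than the observed table (an assumed convention)."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # The small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


def approximation_gap(a, b, c, d):
    """Absolute difference between the approximate and exact P values."""
    return abs(chi_square_p(a, b, c, d) - fisher_exact_p(a, b, c, d))


if __name__ == "__main__":
    # Hypothetical tables: one with small counts, one with larger counts.
    for table in [(3, 1, 1, 3), (30, 20, 20, 30)]:
        print(table, chi_square_p(*table), fisher_exact_p(*table),
              approximation_gap(*table))
```

For the small table (3, 1, 1, 3), the chi-square P value is about 0.157 while the exact value is 34/70, about 0.486. Measuring the gap directly shows how poor the approximation is for a given table, which is the quantity the expected-cell-count rule only tries to anticipate.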

Vance
________________________
In addition, I believe one can convert numerical (rather than "exact") P values to chi-square values, and estimate the summed chi-squares in a meta-analysis. Is this correct? With P thresholds, one can only "assume the worst", so that a meta-analysis is more likely to yield a false negative result than it should.
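
One standard way to do something like this is Fisher's combined probability test, sketched below; the message does not name a method, so treating it as Fisher's method is an assumption. Each P value is converted to -2 ln(P), the results are summed, and the sum is referred to a chi-square distribution with 2k degrees of freedom for k independent studies. The function names and the P values in the example are hypothetical, chosen only to show why substituting the threshold (0.05) for each reported value gives a more conservative combined result.

```python
from math import exp, log


def chi2_upper_tail_even_df(x, df):
    """P(chi-square with `df` degrees of freedom > x), using the closed
    form that holds when df is a positive even number."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2
    term = 1.0   # (x/2)^j / j! for j = 0
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return exp(-half) * total


def fisher_combined_p(p_values):
    """Fisher's method: returns (statistic, combined P value), assuming
    independent studies and P values strictly between 0 and 1."""
    stat = -2 * sum(log(p) for p in p_values)
    return stat, chi2_upper_tail_even_df(stat, 2 * len(p_values))


if __name__ == "__main__":
    reported = [0.03, 0.04, 0.01]          # hypothetical numerical P values
    worst_case = [0.05] * len(reported)    # only "P < 0.05" was reported
    print("numerical P values:", fisher_combined_p(reported))
    print("threshold only:    ", fisher_combined_p(worst_case))
```

With the threshold substituted for every study, the summed statistic is smaller and the combined P value larger than with the numerical values, which is the loss of information described above.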
  
On the other hand, labeling graphs with "exact" values can clutter them up.  I wonder if emphasizing P values rather than p-thresholds actually doesn't move us further away from appreciating the limitations of P values that have been discussed recently in this thread. 

John Rodgers
________________________
Probability and Significance. Could We Replace "Significance" with "Discrepancy"?

This has been a fascinating thread—it has been very helpful to me in many ways. I'm going to be so bold as to throw out the following (surely crude) understanding of "probabilities" that I have gleaned from various sources.
 
1) The measured frequency of some event (an event being a certain category of observations, whether "heads" in a single coin toss or “a series of three heads uninterrupted”) in a series of (actual or estimated) observations of like events. ("Like" being profoundly problematic and critical to the identification of an "event".) These are usually called "frequencies" in publications. These contribute to measurements. These often contribute to estimates of Bayesian prior probabilities. Here a theory of errors allows us to estimate the experimental "confidence" in the observed frequency, that is, the errors of measurement.
 
2) The estimated frequency with which such an event would be detected in a series of future like-observations. Problematics here include the previous set, plus the problematics of estimation (models, distributions and prior information). This is more-or-less the sense used in most publications, where the reported P value is supposed to represent the probability of obtaining observations at least as extreme as those seen under some "null hypothesis". The "event" of interest here is the event corresponding to the results of some actual experiment or set of experiments, and the estimate is based on some model of how the universe is assumed to behave, without invoking a modification of our understanding. If the "P value" is sufficiently small, we are encouraged to reject the "null" hypothesis and require some retooling of our understanding. We call that demand for retooling an "effect". The demand for retooling is a decision process, so we also want estimates of type II errors; type I and type II errors are errors of judgment.

3) An estimate of the strength of evidence for or against some hypothesis or model. As such, a component of decision-making, which requires an assessment of strength for and against competing models, proposals. This includes Bayesian posterior probabilities. Critical problematics include all of the above, plus: logical arguments that the competing models are both mutually exclusive and together exhaustive and reasonable estimates of the prior probabilities.
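
The three senses above can be told apart with a coin-toss sketch. Everything specific here is an assumption made for illustration: the 8 heads in 10 tosses, the two competing models (a fair coin and a coin with a 0.8 chance of heads), the equal prior probabilities, and the function names. The posterior calculation is only meaningful if the two models are treated as mutually exclusive and together exhaustive, which is exactly the caveat raised in point 3.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k heads in n tosses with P(heads) = p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def observed_frequency(heads, n):
    """Sense 1: the measured frequency of the event."""
    return heads / n


def two_sided_binomial_p(heads, n, p_null=0.5):
    """Sense 2: exact two-sided P value under the null model, summing
    all outcomes no more probable than the observed one."""
    p_observed = binom_pmf(heads, n, p_null)
    return sum(binom_pmf(k, n, p_null) for k in range(n + 1)
               if binom_pmf(k, n, p_null) <= p_observed * (1 + 1e-9))


def posterior_probability(heads, n, p_a, p_b, prior_a=0.5):
    """Sense 3: posterior probability of model A (P(heads) = p_a) when
    model B (P(heads) = p_b) is the only alternative considered."""
    weight_a = prior_a * binom_pmf(heads, n, p_a)
    weight_b = (1 - prior_a) * binom_pmf(heads, n, p_b)
    return weight_a / (weight_a + weight_b)


if __name__ == "__main__":
    heads, n = 8, 10  # hypothetical observations
    print("observed frequency:", observed_frequency(heads, n))
    print("P value under fair coin:", two_sided_binomial_p(heads, n))
    print("posterior P(fair coin):", posterior_probability(heads, n, 0.5, 0.8))
```

The P value and the posterior probability are computed from the same observations but answer different questions, and the posterior changes if the alternative model or the prior is changed, whereas the P value does not depend on either.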
 
Judgements of significance.

There is a claim that confidence levels are more useful than P values. I wonder if this is so in practice. The problem with P values is that they are confused with strength-of-evidence. Are not reports of confidence levels problematic in the same way?

It seems to me unfortunate that we use "significance" to refer to statistical issues. It is simply too easy to confuse statistical significance in all its interpretations with "biological significance". If we were to ban the uses of "significance" in its statistical context, we might reduce this confusion. We could talk about the confidence we have in some measured value, the "probabilistic discrepancies" (P values) between the observed and predicted ("the results were highly discrepant", rather than "highly significant"), and "strength of evidence". "Significance" could be reserved for judgements of worthiness, importance, etc.

The use of "discrepancy" rather than "significance" might encourage the writer and reader to ask "discrepancy with what?"—the discrepancy is in the comparison between two things. In contrast, "significance" tends to suggest that a thing is significant in and of itself—there is no need for a comparison.
 
Thus, in reporting an ANOVA result, we might be able to write that "the observations were highly discrepant [from the predictions of a null hypothesis] but we did not judge them significant [worthy of our further attention]."

John Rodgers
________________________
John: Your point about reducing the confusion is well taken. However, the convention, as I understand it from several sources, most of which I have forgotten over the years, is that "significance" in biomedical publications is reserved for its statistical meaning. We thus speak of clinical "importance" or "relevance," rather than clinical significance.

Tom

________________________
Tom: I'd be interested to know the history of the convention of which you speak. This is NOT the convention of the NIH, however, which asks proponents to describe the significance of proposed research; here they mean social, scientific, biomedical significance, not statistical significance.

John Rodgers
________________________
John: The Uniform Requirements for Manuscripts Submitted to Biomedical Journals says under Methods: Statistics:

Avoid nontechnical uses of technical terms in statistics, such as "random" (which implies a randomizing device), "normal," "significant," "correlations," and "sample."

In addition to this distinction in the Uniform Requirements, it is also made in Ed Huth's Medical Style and Format (1987 printing of the first edition, page 272): "Reserve significant for its statistical meaning . . ."

The AMA Manual of Style and the CSE Scientific Style and Format do not make the distinction.

In a sense, this may be one of those "editor's edits" things, like commas in restrictive vs nonrestrictive phrases. I think the distinction is useful, however, and I know that it is routinely taught at Medical Writer's workshops. I also make the distinction in the second edition of How To Report Statistics in Medicine (page xviii), which doesn't count because I wrote it, but the book has been well received internationally, so the distinction is out there for those who wish to use it.

Tom

________________________

To those interested in the recent WAME list discussion on statistical significance, P value, and related topics, I would like to suggest the reading of an excellent collection of material: a complete journal issue dedicated to the "Interpretation of Quantitative Research":

Seminars in Hematology

Volume 45, Issue 3, Pages 133-206 (July 2008)
http://www.seminhematol.org/issues/contents?issue_key=S0037-1963(08)X0004-6

  • Interpretation of Research Results: An Indispensable Mission Impossible? John P.A. Ioannidis. Pages 133-134
  • A Dirty Dozen: Twelve P-Value Misconceptions. Steven Goodman. Pages 135-140
  • Bayesian Interpretation and Analysis of Research Results. Sander Greenland. Pages 141-149
  • Decision-Making When Data and Inferences Are Not Conclusive: Risk-Benefit and Acceptable Regret Approach. Iztok Hozo, Michael J. Schell, Benjamin Djulbegovic. Pages 150-159
  • Perfect Study, Poor Evidence: Interpretation of Biases Preceding Study Design. John P.A. Ioannidis. Pages 160-166
  • Misconceptions, Challenges, Uncertainty, and Progress in Guideline Recommendations. Regina Kunz, Benjamin Djulbegovic, Holger J. Schunemann, Martin Stanulla, Paula Muti, Gordon Guyatt. Pages 167-175
  • Interpreting the Results of Systematic Reviews. Mike Clarke. Pages 176-180
  • Interpretation of Associations in Pharmacoepidemiology. David W. Kaufman. Pages 181-188
  • Interpreting Diagnostic Test Accuracy Studies. Patrick M.M. Bossuyt. Pages 189-195
  • Interpretation of Genomic Data: Questions and Answers. Richard Simon. Pages 196-204

Fernando Alvarez-Cervera

________________________

Maybe the time has finally come for the editorial community, including ICMJE and WAME, to take a formal stand on the analysis of quantitative data. Doing so would build on the apparently quite effective positions they've taken on maintaining access to research data, and registration of trials; it could contribute importantly to better studies, better reporting, and better understanding of science.

The new position might go something like this:

Participating journals will not consider for publication any report of quantitative research that:

- fails to document and justify the rationale for the specific procedures used in analyzing quantitative data;

- limits its analysis to the reporting of p-values;

- accepts any particular p-value (but particularly a value of less than 0.05) as representing statistical significance without explaining its choice of that cutoff level;

- fails to report confidence intervals when they are applicable;

- etc.

Frank Davidoff

