Tests of Statistical Significance
Introduction
Kesten Green and Scott Armstrong discuss statistical significance tests in general, and specifically seek experimental studies on how tests of statistical significance affect decision-making.
Kesten Green and I are revising one of the principles for forecasting (on ). It relates to the use of tests of statistical significance.
As those of you who have kept up with the literature in this field will realize, we are building upon the work of many researchers and the principle applies to all sciences. If you have not kept up, we provide references. The Ziliak and McCloskey book provides an informative description of the century-long efforts to acquaint researchers with the harm caused by tests of statistical significance. They report on many non-experimental studies.
We have been unable to find experimental studies on how tests of statistical significance affect decision-making. Are any of you aware of such evidence?
Here is the revised principle:
13.29 Do not use measures of statistical significance to assess a forecasting method or model.
Description: Even when correctly applied, significance tests are dangerous. Statistical significance tests calculate the probability, assuming the analyst’s null hypothesis is true, that relationships apparent in a sample of data are the result of chance variations that arose in selecting the sample. The probability calculated is affected by the size of the sample and by the choice of null hypothesis. With large samples, even small departures from what would be expected in the data if the null hypothesis were true will be “statistically significant.” Choosing a different null hypothesis can change the conclusion. Statistical significance tests provide no useful information on material significance or importance. Moreover, the tests are blind to common problems such as non-response error and response error. The proper approach to analyzing and communicating findings from empirical studies is to (1) calculate and report effect sizes; (2) estimate the range within which the actual effect size is likely to lie, taking account of prior knowledge and all potential sources of error in measuring the effect; and (3) conduct replications, extensions, and meta-analyses.
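The description's point that the calculated probability depends on sample size can be illustrated with a minimal sketch (not part of the original posting; the means, standard deviation, and sample sizes below are hypothetical). The same trivially small effect is "not significant" with a modest sample and "significant" with a large one, while the effect size and a simple interval estimate convey what actually matters:

```python
import math

def z_test_p_value(sample_mean, null_mean, sd, n):
    """Two-sided p-value for a one-sample z-test with known sd."""
    z = (sample_mean - null_mean) / (sd / math.sqrt(n))
    # Standard normal CDF computed from the error function.
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return 2 * (1 - phi)

# Hypothetical data: observed mean 100.2 vs. null-hypothesis mean 100, sd 10.
# The effect size (Cohen's d) is identical in both cases, and trivially small.
d = (100.2 - 100) / 10                               # 0.02

p_small = z_test_p_value(100.2, 100, 10, n=100)      # ~0.84: "not significant"
p_large = z_test_p_value(100.2, 100, 10, n=10_000)   # ~0.046: "significant"

# Steps (1) and (2) of the principle: report the effect with an interval
# instead of a verdict (95% interval for the n = 10,000 case).
se = 10 / math.sqrt(10_000)
ci = (0.2 - 1.96 * se, 0.2 + 1.96 * se)              # roughly (0.004, 0.396)
```

Nothing about the data changes between the two tests except the sample size, yet the "significant / not significant" verdict flips; the effect size and interval, by contrast, make plain that the effect is small in both cases.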
Purpose: To avoid the selection of invalid models or methods, and the rejection of valid ones.
Conditions: There are no empirically demonstrated conditions on this principle. Statistical significance tests should not be used unless it can be shown that they provide a net benefit in the situation under consideration.
Strength of evidence: Strong logical support and non-experimental evidence. There are many examples showing how significance testing has harmed decision-making. Despite repeated appeals for evidence that statistical significance tests can improve decisions, none has been forthcoming. Tests of statistical significance run contrary to the proper purpose of statistics—which is to help users make sense of data. Experimental studies are needed to identify the conditions, if any, under which tests of statistical significance can improve decision-making.
Source of evidence:
Armstrong, J. S. (2007). , International Journal of Forecasting, 23, 321-336, with commentary and a reply.
Hauer, E. (2004). , Accident Analysis and Prevention, 36, 495-500.
Hubbard, R. & Armstrong, J. S. (2006). , Journal of Marketing Education, 28, 114-120.
Hunter, J. E. & Schmidt, F. L. (1996). , Psychology, Public Policy, and Law, 2, 324-347.
Ziliak, S. T. & McCloskey, D. N. (2008). . Ann Arbor, MI: University of Michigan Press.
J. Scott Armstrong
Dept. of Marketing
The Wharton School
U. of Pennsylvania
Phila., PA 19104
armstrong@wharton.upenn.edu