# Where Statistics Went Wrong Modeling Random Variation

To date, thousands of statistical distributions have been published in the Statistics literature. This seems insane. Perhaps this gigantic number of distributions indicates that we are wrong in how we model random variation as observed in nature?

(Related podcast: Where Statistics Went Wrong Modeling Random Variation (Podcast))

A model of random variation, generated by a “random variable”, is presented in Statistics in the form of a statistical distribution (like the normal or the exponential).

For example, the weight of people at a certain age is a random variable, and its observed variation may be modeled by the normal distribution; surgery duration is a random variable, and its observed variation may, under specified circumstances, be modeled by the exponential distribution.
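To make the contrast between these two textbook models concrete, here is a minimal sketch that draws samples from each and compares their skewness. The parameter values (mean weight 70 with standard deviation 10, mean surgery duration 60) and the helper `sample_skewness` are hypothetical, chosen only for illustration; they are not taken from the surgery-time studies cited in this post.

```python
import random
import statistics


def sample_skewness(xs):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)


rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical parameters, for illustration only.
weights = [rng.gauss(70.0, 10.0) for _ in range(50_000)]        # normal: mean 70, sd 10
durations = [rng.expovariate(1 / 60.0) for _ in range(50_000)]  # exponential: mean 60

weight_skew = sample_skewness(weights)
duration_skew = sample_skewness(durations)

print(f"normal sample skewness:      {weight_skew:.3f}")
print(f"exponential sample skewness: {duration_skew:.3f}")
```

The normal sample should come out roughly symmetric (theoretical skewness 0), while the exponential sample should be strongly right-skewed (theoretical skewness 2).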

In the Statistics literature, one may find statistical distributions modeling random variation directly observed in nature (as in the two examples above), or random variation associated with a function of random variables (like a sample average calculated from a sample of n observations).
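The second kind of distribution, that of a function of random variables, can be sketched by simulation. The example below repeatedly draws n observations from an exponential distribution and records each sample average. The values n = 30, the mean of 60, and the number of replications are assumptions made for illustration only.

```python
import math
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is reproducible

n = 30               # observations per sample (hypothetical)
replications = 10_000
true_mean = 60.0     # mean of the underlying exponential (hypothetical)

# Each entry is the average of one sample of n exponential observations.
sample_means = [
    statistics.fmean(rng.expovariate(1 / true_mean) for _ in range(n))
    for _ in range(replications)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
# For an exponential, the standard deviation equals the mean,
# so the sample average has standard deviation true_mean / sqrt(n).
expected_sd = true_mean / math.sqrt(n)

print(f"mean of sample averages: {mean_of_means:.2f}")
print(f"sd of sample averages:   {sd_of_means:.2f} (theory: {expected_sd:.2f})")
```

The sample average is itself a random variable with its own distribution: centered on the same mean as the individual observations, but with a spread that shrinks as n grows.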

To date, one may literally find thousands of statistical distributions in the Statistics literature.

Is this acceptable?

Or perhaps we are wrong in how we model random variation?

Following a large-scale project in which I modeled surgery times (a research effort reported in three recent publications, Shore 2020ab, 2021), I have reached certain conclusions about how random variation should be modeled so as to be more faithful to reality. The new approach seems to reduce the problem of the insanely gigantic number of distributions currently appearing in the Statistics literature.

I have summarized these insights in a new paper, which carries the same title as this post.

The Introduction section of this paper is posted below. Underneath it, one may find a link to the entire article.

Where Statistics Went Wrong Modeling Random Variation

1. Introduction

The development of thousands of statistical distributions to date is puzzling, if not bizarre. An innocent observer may wonder why, while in most other branches of science the historical development shows a clear trend towards unifying the “objects of enquiry” (forces in physics; properties of materials in chemistry; human characteristics in biology), no such unification has taken place in the mathematical modelling of random variation. Why, in Statistics, the branch of science engaged in modeling random variation observed in nature, does the number of “objects of enquiry” (statistical distributions) keep growing?

In other words: Where has Statistics gone wrong modeling observed random variation?

Based on new insights, gained from a recent personal experience with data-based modeling of surgery time (resulting in a trilogy of published papers, Shore 2020ab, 2021), we present in this paper a new paradigm for modeling observed random variation. A fundamental insight is a new perception of how observed random variation is generated, and how this affects the form of the observed distribution. Observed random variation is perceived to be generated not by a single source of variation (as the common concept of “random variable”, r.v., implies), but by two interacting sources of variation. One source is “Identity”, formed by “identity factors”. This source is represented in the distribution by the mode (if one exists), and it may generate identity variation. A detailed example of this source, regarding modeling of surgery times, is presented in Shore (2020a). The other source is an interacting error, formed by “non-identity/error factors”. This source generates error variation (separate from identity variation). Combined, the two interacting sources generate the observed random variation. The random phenomenon generating this variation may be in one of two extreme states: an identity-full state (in which there is only error variation), and an identity-less state (in which identity factors become so unstable as to be indistinguishable from error factors; identity vanishes; no error can be defined). Scenarios residing in between these two extreme states reflect a source of variation with partial lack of identity (LoI).

The new “Random Identity Paradigm”, attributing two contributing sources to observed random variation (rather than a single one, as assumed to date), has far-reaching implications for the true relationships between location, scale and shape moments. These are probed and demonstrated extensively in this paper, with numerous examples from the current Statistics literature (see, in particular, Section 3).

In this paper, we first introduce, in Section 2, basic terms and definitions that form the skeleton of the new random-identity paradigm. Section 3 addresses implications of the new paradigm in the form of six propositions (subsection 3.1) and five predictions (presented as conjectures, subsection 3.2). The latter are empirically supported, in Section 4, with examples from the published Statistics literature. A general model for observed random variation (Shore, 2020a), bridging the gap between current models for the two extreme states (normal, for the identity-full state; exponential, for the identity-less state), is reviewed in Section 5, and its properties and implications are probed. Section 6 delivers some concluding comments.

A link to the complete article: