Stein's Paradox

$\newcommand{\abs}[1]{\left\lvert#1\right\rvert}$ $\newcommand{\norm}[1]{\left\lVert#1\right\rVert}$ $\newcommand{\inner}[1]{\left\langle#1\right\rangle}$ $\DeclareMathOperator*{\argmin}{arg\,min}$ $\DeclareMathOperator*{\argmax}{arg\,max}$ $\DeclareMathOperator*{\E}{\mathbb{E}}$ $\DeclareMathOperator*{\V}{\mathbb{V}}$ $\DeclareMathOperator*{\x}{\mathbb{x}}$

TL;DR

Suppose you have a 2D Normal distribution with unknown mean and variance $1$. You are given a single realization $(x_1, y_1)$ and are asked to give an estimate of the mean. If you say $(x_1, y_1)$, this is indeed the “best” guess for 2D. However, if you said $(x_1, y_1, z_1)$ in 3D, your guess would no longer be the “best”.

Problem Statement

Notation: $\norm{X}^2 = \sum_i^p X_i^2$ for a vector $X\in\mathbb{R}^p$.

Suppose we have a single sample $\x\in\mathbb{R}^p$ from $p$ dimensional random variable $X = (X_1,\ldots,X_p)'$, with $X_i \overset{ind}{\sim} \mathcal{N}(\mu_i, \sigma^2),$ i.e., $p$ independent variables with different (unknown) means but the same variance (assumed to be known). The goal is to get a good estimator $\widehat{\mu} = \widehat{\mu}(x)$ of the mean $\mu = (\mu_1,\ldots,\mu_p)’$ in terms of expected quadratic loss, or risk, $R(\widehat{\mu}, \mu) = \E\left({\norm{\widehat{\mu}-\mu}^2}\right)$.

Estimator $\widehat{\mu}$ strictly dominates $\widetilde{\mu}$ if $R(\widehat{\mu}, \mu) \le R(\tilde{\mu}, \mu), \forall \mu$ and strictly so for some $\mu$.

Estimator $\widehat{\mu}$ is admissible if $\widehat{\mu}$ is not strictly dominated by any other $\widetilde{\mu}$.

James-Stein Estimator

What about $\widehat{\mu}_0:=\x$, i.e., using the sample itself as an estimate for the mean? In fact, this is both least squares and maximum likelihood estimator, and is indeed admissible, but only for $p=1,2$. Surprisingly, it is inadmissible for $p > 2$. James & Stein (1961) showed that

\[\boxed{\widehat{\mu}_{JS} := \left( 1-\frac{(p-2)\sigma^2}{\norm{\x}^2} \right)\x}\]

dominates $\widehat{\mu}_0$ for $p>2$.

Before proving that, notice that the risk of the MLE estimator $\widehat{\mu}_0$ is

\[R(\widehat{\mu}_0, \mu) = \E\left(\norm{X-\mu}^2\right) = \E\left(\sum_i^p(X_i-\mu_i)^2\right) = \sum_i^p \E\left((X_i-\mu_i)^2\right) = p\sigma^2.\]

The proof below shows that the risk of JS can be smaller than $p\sigma^2$.

(Sketch of) Proof: JS estimator has smaller risk than MLE

Risk of James-Stein estimator:

\[{\small \begin{align} &R( \widehat{\mu}_{JS}, \mu) = \E\left(\norm{\left(1-\frac{(p-2)\sigma^2 }{\norm{X}^2}\right)X - \mu}^2 \right) \notag \\ &= \E\left(\norm{(X-\mu) - \frac{(p-2)\sigma^2 X}{\norm{X}^2}}^2 \right) \notag \\ &= \E\left(\norm{X-\mu}^2 \right) - 2 \E\left( \sum_i^p \frac{(X_i-\mu_i)(p-2)\sigma^2 X_i}{\norm{X}^2} \right) + \E\left(\norm{\frac{(p-2)\sigma^2 X}{\norm{X}^2}}^2 \right) \notag \\ &= p\sigma^2 - 2(p-2)\sigma^2 \color{violet}{\sum_i^p \E\left(\frac{(X_i-\mu_i)X_i}{\norm{X}^2} \right)} + (p-2)^2 \sigma^4 \E\left(\frac{1}{\norm{X}^2} \right) \label{eq:midone} \\ &= p\sigma^2 - 2(p-2)\sigma^2 \color{violet}{(p-2)\sigma^2\E\left(\frac{1}{\norm{X}^2} \right)} + (p-2)^2 \sigma^4 \E\left(\frac{1}{\norm{X}^2} \right) \label{eq:midtwo} \\ &= p\sigma^2 - (p-2)^2\sigma^4 \E\left(\frac{1}{\norm{X}^2} \right) \notag \\ &< p\sigma^2 = R( \widehat{\mu}_0, \mu), \notag \end{align} }\]

where the last inequality holds for $p>2$. Having more than $2$ dimensions is an important condition, which also guarantees that $\E(\norm{X}^{-2})$ does not explode (can be seen by writing expectation as integration and switching to spherical polar coordinates).

The tricky part is to prove the equality in Equation $\eqref{eq:midtwo}$, which will require some integration (by parts). Rewriting the expectation in Equation $\eqref{eq:midone}$ as integration,

\[{\small \begin{align} \E\left(\frac{(X_i-\mu_i)X_i}{\norm{X}^2} \right) = \int_{-\infty}^{\infty} \ldots \int_{-\infty}^{\infty} \frac{(x_i - \mu_i)x_i}{\norm{x}^2} \frac{e^{- \frac{1}{2\sigma^2} \norm{x-\mu}^2}}{\sigma^p(2\pi)^{p/2}} dx_1 \ldots dx_p, \label{eq:int} \end{align} }\]

we can choose $u = \frac{x_i}{\norm{x}^2}$ and $dv = \frac{(x_i - \mu_i)e^{- \frac{1}{2\sigma^2} \norm{x-\mu}^2}}{\sigma^{p}(2\pi)^{p/2}} dx_i = d\left(-\frac{e^{- \frac{1}{2\sigma^2} \norm{x-\mu}^2}}{\sigma^{p-2}(2\pi)^{p/2}}\right)$.

Then Equation $\eqref{eq:int}$ becomes

\[{\small \begin{align} \frac{1}{\sigma^{p-2}(2\pi)^{p/2}} \times \int_{-\infty}^{\infty} \ldots \int_{-\infty}^{\infty} \left( \underbrace{\left[-\frac{x_i e^{- \frac{1}{2\sigma^2}\norm{x-\mu}^2}}{\norm{x}^2} \right]_{-\infty}^{\infty}}_{= 0} + \int_{-\infty}^{\infty} e^{- \frac{1}{2\sigma^2} \norm{x-\mu}^2} d \frac{x_i}{\norm{x}^2} \right) \prod_{j\ne i}^p dx_j, \notag \end{align} }\]

and noticing that $\frac{d}{dx_i} \frac{x_i}{\norm{x}^2} = \frac{d}{dx_i} \frac{x_i}{\sum_j^p x_j^2} = \frac{\sum_j^p x_j^2 - 2 x_i^2}{(\sum_j^p x_j^2)^2} = \frac{\norm{x}^2 - 2 x_i^2}{\norm{x}^4},$ we have

\[{\small \begin{align} \int_{-\infty}^{\infty} \ldots \int_{-\infty}^{\infty} \left( \frac{\norm{x}^2 - 2 x_i^2}{\norm{x}^4\sigma^{-2}} \right) \frac{e^{- \frac{1}{2\sigma^2} \norm{x-\mu}^2}}{\sigma^p(2\pi)^{p/2}} dx_1 \ldots dx_p = \E\left(\frac{\norm{X}^2 - 2X_i^2}{\norm{X}^4\sigma^{-2}} \right). \notag \end{align} }\]

Hence, the middle term in Equation $\eqref{eq:midone}$ becomes

\[{\small \begin{align} \color{violet}{\sum_i^p \E\left(\frac{(X_i-\mu_i)X_i}{\norm{X}^2} \right)} &= \E\left(\sum_i^p \frac{\norm{X}^2 - 2X_i^2}{\norm{X}^4\sigma^{-2}} \right) \notag \\ & = \E\left(\frac{p\norm{X}^2 - 2\norm{X}^2}{\norm{X}^4\sigma^{-2}} \right) \notag \\ &= \color{violet}{(p-2)\sigma^2\E\left(\frac{1}{\norm{X}^2} \right)}, \notag \end{align} }\]

which completes the proof.

James-Stein: shrink towards whatever

The above JS estimator $\widehat{\mu}_{JS}$ shrinks the MLE estimator towards $0$ when $(p-2)\sigma^2 < \norm{\x}^2$, which provides an improvement over MLE. Perhaps even more surprisingly, one could shrink the estimate towards any arbitrary finite vector $\nu\in\mathbb{R}^p$, and that would also dominate MLE.

\[\boxed{\widehat{\mu}_{\nu} = \nu + \left(1-\frac{p-2}{\norm{\x}^2}\right)(\x-\nu)}\]

In practice, $\nu = \bar{\mathbb{x}}$ is often used, i.e., shrinking towards the average.

Positive-part James-Stein

It could be shown that JS estimator is strictly dominated by positive-part James-Stein:

\[\boxed{\widehat{\mu}_{\nu+} = \nu + \left(1-\frac{p-2}{\norm{\x}^2}\right)^+(\x-\nu)}\]

where $(\cdot)^+ := \max(\cdot, 0)$. And so turns out the original JS estimator is itself inadmissible. Tragically, the positive-part JS is inadmissible too, as it is dominated by an estimator developed in 1994 by Shao & Strawderman (and yes, this one is inadmissible too). It is not clear if there an admissible estimator when $p>2$.

Why it makes sense, Bayesian explanation

Efron & Morris (1973), provided a Bayesian argument to explain the entire thing. Namely, if one places a prior on $\mu_i$, James-Stein estimator can be derived as a posterior mean estimator. Specifically,

\[{\small \begin{align} X_i \vert \mu_i &\overset{ind.}{\sim} \mathcal{N}(\mu_i, \sigma^2), \quad &i=1, \ldots, p, \notag \\ \mu_i &\sim \mathcal{N}(0, \tau^2), \quad &i=1, \ldots, p, \notag \end{align} }\]

Then the posterior distribution looks like

\[\mu_i \vert X_i \sim \mathcal{N}\left(\frac{\tau^2}{\tau^2 + \sigma^2}X_i, \; \frac{\tau^2 \sigma^2}{\tau^2+\sigma^2} \right),\]

i.e., $\hat{\mu}_{\text{Posterior}} = \E(\mu_i \vert X_i) = \left( 1 - \frac{\sigma^2}{\tau^2+\sigma^2} \right) X_i$, which also effectively shrinks the MLE estimator to $0$. In fact, this shrinkage coincides with the mean of the “shrinkage coefficient” of the JS estimator when $p>2$,

\[\E\left( 1-\frac{(p-2)\sigma^2}{\norm{X_i}^2} \right) = 1 - (p-2)\sigma^2 \E\left(\frac{1}{\norm{X_i}^2} \right) = 1 - \frac{\sigma^2}{(\tau^2+\sigma^2)},\]

as the unconditional distribution of $X_i \sim \mathcal{N}(0, \tau^2 + \sigma^2)$ and so $\frac{1}{\norm{X_i}^2} \sim (\tau^2+\sigma^2)^{-1} \times \text{inv-} \chi_p^2$.

Why it makes sense, Bias-Variance tradeoff

Recall that we have a general decomposition of the risk function (MSE)

\[\E \left( \norm{\hat{\mu} - \mu}^2 \right) = \norm{\E(\hat{\mu}) - \mu}^2 + \E \left(\norm{\hat{\mu} - \E(\hat{\mu}) }^2 \right),\]

into the sum of squared bias and variance. For the MLE estimator $\hat{\mu}_0$, this expression translates to $0^2 + \sigma^2 p$, as the estimator is unbiased. However, a shrinkage estimator of form $\hat{\mu}_\lambda = \lambda \x$, would have

\[\E \left( \norm{\hat{\mu}_\lambda - \mu}^2 \right) = (\lambda-1)^2 \norm{\mu}^2 + \lambda^2 \sigma^2 p,\]

so taking $\lambda < 1$ would inflate the bias but would improve the variance. From the risk minimization standpoint, shrinkage may well turn out to be beneficial, especially when $p$ is large and the variance’s portion in risk function increases. This is what JS estimator ultimately suggests, as the shrinkage amount $\lambda_{JS} = ( 1-(p-2)\sigma^2 / \norm{\x}^2 )$ depends negatively on $p$.

Additionally, bias-variance viewpoint sheds light on why positive-part JS estimator dominates the usual JS estimator: if ever $\lambda_{JS}$ turned to be negative, one would benefit simply setting it to $0$ as both squared bias and variance start to increase once $\lambda$ goes into negative territory.

Other notes

Notably, if $\sigma^2$ is unknown, JS still dominates MLE if we replace $\sigma^2$ with its estimate $\widehat{\sigma}^2=\sum(\x_i - \bar{\mathbb{x}})^2$ when $p>2$.