$\newcommand{\abs}[1]{\left\lvert#1\right\rvert}$ $\newcommand{\norm}[1]{\left\lVert#1\right\rVert}$ $\newcommand{\inner}[1]{\left\langle#1\right\rangle}$ $\newcommand{\as}{\overset{a.s.}{\to}}$ $\newcommand{\d}{\overset{d}{\to}}$ $\DeclareMathOperator*{\argmin}{arg\,min}$ $\DeclareMathOperator*{\argmax}{arg\,max}$ $\DeclareMathOperator*{\E}{\mathbb{E}}$

× lecture notes from many years ago

Add: Bandwidth Estimation, As properties, LC, LL

Nonparametric Density Estimation

For discrete $X$:
$\displaystyle \; \hat{f}(x)=\frac{n_0}{n}=\frac{1}{h}\frac{\text{# of $x_i=x$}}{n}=\frac{1}{n}\sum_{i=1}^{n} \mathbb{1} \{x_i=x \}$
For continuous $X$:
$\displaystyle \hat{f}(x)=\frac{1}{h}\frac{n_0}{n}=\frac{1}{h}\frac{\text{# of $x_i \in $}(x-h/2,x+h/2)}{n}=\frac{1}{hn}\sum_{i=1}^{n} \mathbb{1} \{\frac{x_i-x}{h} \}$
Note: $f(x)=\frac{d}{dx} F(x)={\displaystyle\lim_{h\to0}}\frac{F(x+h/2)-F(x-h/2)}{h}= {\displaystyle\lim_{h \to 0}}\frac{P(x-h/2< X < x+h/2)}{h}$
$\hat{f}(x)$ is not differentiable, use kernel, Rosenblatt (1952): $\hat{f}(x)=\dfrac{1}{nh}\sum_{i=1}^{n} \mathbb{K}(\dfrac{x_i-x}{h})$
(i) standard normal $\mathbb{K}(\psi)=\frac{1}{\sqrt{2\pi}}\exp^{-\frac{1}{2}\psi^2}$
(ii) uniform $\mathbb{K}(\psi)=(2c)^{-1}, \text{ for } -c < \psi < c$ and $0$ o/w.

Bias and Variance of $\hat{f}$

Denote $\hat{f}=\frac{1}{n}\sum_{i=1}^{n}z_i$, where $z_i = \frac{1}{h}\mathbb{K}(\frac{x_i-x}{h})$.

$\displaystyle \E{\hat{f}(x)} = \E{z_1} = \frac{1}{h}\int_{x_1}\mathbb{K}(\underbrace{\frac{x_1-x}{h}}_{\equiv\psi})\underbrace{f(x_1)}_{unknown}dx_1 = \frac{1}{h} \int_{\psi}\mathbb{K}(\psi)f(x+h\psi)h d\psi$
$\approx \int_{\psi}\mathbb{K}(\psi)\big[f(x) + h\psi f^{(1)}(x) + \frac{h^2\psi^2}{2!} f^{(2)}(x) \big]d\psi$
$= f(x) \times 1 + hf^{(1)}(x) \times 0 + \frac{h^2}{2}f^{(2)}(x)\underbrace{\int_{\psi}\psi^2\mathbb{K}(\psi)d\psi}_{\equiv \mu_2}$
$= f(x) + \frac{h^2}{2}f^{(2)}(x)\mu_2$

$\displaystyle \text{BIAS}(\hat{f}(x)) = \frac{h^2}{2}f^{(2)}(x)\mu_2 = O(h^2)$
$\displaystyle \mathbb{V}(\hat{f}(x)) = \frac{1}{n}\mathbb{V}(z_1) = \frac{1}{n}\big[ \E{z_1^2-(\E{z_1})^2} \big] =$

$\displaystyle \E{z_1^2}=\frac{1}{h^2}\int_{x_1}\mathbb{K}^2(\frac{x_1-x}{h})f(x_1)dx_1 =\frac{1}{h^2} \int_{\psi}\mathbb{K}^2(\psi)f(x+h\psi)hd\psi$
$\stackrel{\textit{Taylor}}{\approx} \frac{1}{h} \int_{\psi}\mathbb{K}^2(\psi)\big[ f(x) + h\psi f^{(1)}(x) \big] d\psi\\ =\frac{f(x)}{h}\int_{\psi}\mathbb{K}^2(\psi)d\psi + f^{(1)}(x)\int_{\psi}\psi\mathbb{K}^2(\psi)d\psi$

\[= \displaystyle \underbrace{\frac{f(x)}{nh}\int_{\psi}\mathbb{K}^2(\psi)d\psi} _{O(\frac{1}{nh})} + \underbrace{\frac{f^{(1)}(x)}{n}\int _{\psi}\psi\mathbb{K}^2(\psi)d\psi} _{O(\frac{1}{n})} - \underbrace{\frac{1}{n} (\E{z_1})^2} _{O(\frac{1}{n})+O(\frac{h^2}{n})} \approx \underbrace{\frac{f(x)}{nh}\int _{\psi}\mathbb{K}^2(\psi)d\psi}_{O(\frac{1}{nh})}\]

$\displaystyle \stackrel{Local}{\text{MSE}}(\hat{f}(x))=\text{BIAS}^2(\hat{f}(x)) + \mathbb{V}(\hat{f}(x))$
$= \frac{h^4}{4}(f^{(2)}(x))^2\mu_2^2 + \frac{f(x)}{nh}\int_{\psi}\mathbb{K}^2(\psi)d\psi$
$= h^4\lambda_1(x)+\frac{1}{nh}\lambda_2(x)$
- \[\displaystyle \lambda_1(x) \equiv \frac{1}{4}(f^{(2)}(x))^2\mu_2^2\]
- \[\displaystyle \lambda_2(x) \equiv f(x)\int_{\psi}\mathbb{K}^2(\psi)d\psi,\]
$\displaystyle \stackrel{Global}{\text{IMSE}}(\hat{f}(x))= \int_{x}\stackrel{Local}{\text{MSE}}(\hat{f}(x))dx$
$= h^4\int_{x}\frac{1}{4}(f^{(2)}(x))^2\mu_2^2dx + \frac{1}{nh}\int_{x}f(x)\int_{\psi}\mathbb{K}^2(\psi)d\psi dx$
$= h^4\frac{1}{4}\mu_2^2\int_{x}(f^{(2)}(x))^2dx + \frac{1}{nh}\int_{\psi}\mathbb{K}^2(\psi)d\psi 1$
$= h^4\lambda_1+\frac{1}{nh}\lambda_2$
- \[\displaystyle \lambda_1 \equiv \frac{1}{4}\mu_2^2\int_{x}(f^{(2)}(x))^2dx \leftarrow unknown\]
- \[\displaystyle \lambda_2 \equiv \int_{\psi}\mathbb{K}^2(\psi)d\psi\]
Choose bandwidth $h$ to minimize IMSE
$\displaystyle \frac{\partial \text{IMSE}}{\partial h}= 4h^3 \lambda_1 - \frac{1}{nh^2}\lambda_2 =0 \rightarrow h_{opt}=n^{-1/5}(\frac{\lambda_2}{4\lambda_1})^{1/5} \propto n^{-1/5}$
Substitute $h_{opt}$ into IMSE and minimize wrt to $K(\psi)$ s.t. $\int K{\psi} d\psi =1$ and $\int \psi^2 K{\psi} = 1$ to obtain
$\displaystyle K^{opt}(\psi)=\begin{cases} \frac{3}{4}(1-\psi^2), |\psi|\le 1 \\ 0, \text{ o/w} \end{cases}, \text{ Bartlett's (Epanechnikov's) Kernel}$

Bandwidth Estimation

Note that $\lambda_1$ is unknown because of unknown $f(x)$, hence to obtain $h$ can use one of the following methods.

Ad-hoc Method
Assume $f(x) \sim N(0, \sigma_x^2)$. For $K(\psi)$ normal, it can be shown that $h_{opt}=1.06\sigma_x n^{-1/5}$
Plug-in Method
Repeat the following loop until the difference becomes small \begin{enumerate}
- Start with ad-hoc $h_{opt}$ and estimate $\hat{f}(x)$
- Calculate $\hat{\lambda}_1$ using $\hat{f}^{(2)}(x)$ instead of $f^{(2)}(x)$
- Use $\hat{\lambda}1$ to calculate new $\hat{h}{new}$
- Repeat the loop using the last $\hat{h}_{new}$ obtained \end{enumerate}
ISE (Cross-Validation) Method
Minimize ISE($h$) wrt $h$ (use grid search)

$\displaystyle \text{ISE}(h)= \int_{x} (\hat{f}(x) - f(x))^2 dx$ $\displaystyle = \int_{x} \hat{f}^2(x) dx +\underbrace{\int_{x} f^2(x) dx}_{\text{can be dropped}} - 2\int_{x} \hat{f}(x)f(x) dx \\ = \int_{x} \hat{f}^2(x) dx - 2\E\hat{f}(x) = \int_{x} \hat{f}^2(x) dx - 2 \frac{1}{n}\sum_{i=1}^{n}\hat{f}(x_i)$
- \[\hat{f}(x)=\dfrac{1}{nh}\sum_{i=1}^{n} \mathbb{K}(\dfrac{x_i-x}{h})\]
- \[\hat{f}^2(x)=\dfrac{1}{n^2h^2}\sum_{i=1}^{n}\sum_{j=1}^{n} \mathbb{K}(\dfrac{x_i-x}{h}) \mathbb{K}(\dfrac{x_j-x}{h})\]

$\displaystyle \frac{1}{n^2h^2}\sum_{i=1}^{n}\sum_{j=1}^{n} \int_{x} \mathbb{K}(\frac{x_i-x}{h})\mathbb{K}(\dfrac{x_j-x}{h})dx - \frac{2}{n^2h}\sum_{i=1}^{n}\sum_{j=1}^{n} \mathbb{K}(\frac{x_i-x_j}{h})\\ =\frac{1}{n^2h^2}\sum_{i=1}^{n}\sum_{j=1}^{n} \int_{x} \mathbb{K}(\frac{x_i-x}{h})\mathbb{K}(\dfrac{x_j-x}{h})dx - \frac{2}{n(n-1)h} \underset{i\ne j}{\sum^{n}\sum^{n}} \mathbb{K}(\frac{x_i-x_j}{h})$

Asymptotic Properties of $\hat{f}$

Under assumptions (A1), (A4), (A8) and (A9) we have
$\displaystyle \frac{1}{h} \E K^r \left( \frac{x_1-x}{h} \right) = \int K^r(\psi) f(h\psi +x) d\psi \to f(x) \int K^r (\psi) d\psi \text{ as } n \to \infty$

Asymptotic Mean $\displaystyle \E \hat{f}(x) = \E z_1 = \frac{1}{h} \E K(\frac{x_i-x}{h}) \overset{Lemma}{\to} f(x) \int_{\psi} K(\psi) d \psi = f(x) \text{ as } n\to \infty$
Asymptotic Variance $\displaystyle \mathbb{V} (\hat{f}(x)) = \mathbb{V} (\frac{1}{n} \sum_{i=1}^{n} z_i)$ $= \frac{1}{n} (\E z_1^2 -(\E z_1)^2) = \frac{1}{nh} \left[ \frac{1}{h} \E K^2(\frac{x_i-x}{h})\right] - \frac{1}{n}(\E z_1)^2$ $nh \mathbb{V} (\hat{f}(x)) = \frac{1}{h}\E K^2(\frac{x_i-x}{h}) - h (\E z_1)^2 \overset{Lemma}{\to} f(x) \int K^2 (\psi) d\psi - 0 \cdot f^2(x)$ $= f(x) \int K^2 (\psi) d\psi \text{ as } n\to \infty$
Weak Consistency We have that MSE$(\hat{f}(x)) \to 0$ as $n \to \infty$. Use Chebyshev inequality to show that $\displaystyle P[|\hat{f} - f|\ge \epsilon] \le \frac{\E(\hat{f}-f)^2}{\epsilon^2}$, and hence $\displaystyle P[|\hat{f} - f|\le \epsilon] \to 1$, i.e. $\displaystyle p \lim_{n \to n} \hat{f} = f$
Asymptotic Normality $\displaystyle z \equiv \frac{\hat{f}(x) - \E \hat{f}(x)}{\sqrt{\mathbb{V} (\hat{f}(x))}}$ $= \frac{\frac{1}{n} \sum_{i=1}^{n}(z_i - \E z_i)}{\sqrt{\frac{1}{n} \mathbb{V}(z_1)}}$ $=\sum_{i=1}^{n} L_{n,i}, \text{ where } L_{n,i} \equiv \frac{1}{n} \frac{z_i - \E z_i}{\frac{1}{n} V(z_1)}$

Lyapunov CLT

Let $\{ X_{n,i}\}$ be a sequence of independent (not necessarily identically distributed) RVs, with $\E X_{n,i} = \mu_{n,i}$ and $\mathbb{V} (X_{n,i}) = \sigma^2_n < \infty$. Denote $\displaystyle L_{n,i} \equiv \frac{X_{n,i} - \mu_{n,i}}{\sigma_n}$.
If for some $\delta>0$ the condition $\lim_{n \to \infty} \sum_{i=1}^{n} \E \abs{L_{n,i}}^{2+\delta}=0$ is satisfied, then $\displaystyle \sum_{i=1}^{n} L _{n,i} \overset{d}{\to} \mathcal{N}(0,1)$

Nonparametric estimation

Nadaraya-Watson and other basics

Nonparametric Density Estimation

Bias and Variance of $\hat{f}$

Bandwidth Estimation

Asymptotic Properties of $\hat{f}$

Lyapunov CLT