In this example we will assume that there is an unknown probability distribution of interest, $\mathbb{P}^*$, that measures events of $\mathbb{R}$, and that we want to model it using a dataset $x_{1:n} = (x_1, \dots, x_n)$, which is an instance of $X_{1:n} = (X_1, \dots, X_n)$, a set of i.i.d. random variables with distribution $\mathbb{P}^*$.
As possible statistical models, we adopt two families of distributions, $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ and $\mathcal{Q} = \{Q_\phi : \phi \in \Phi\}$, indexed by parameters $\theta \in \Theta$ and $\phi \in \Phi$. Let us assume we have already learnt $\hat{\theta}$ and $\hat{\phi}$ as candidate values for the model parameters, using a part of the data called the training set.
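To make the training step concrete, here is a minimal sketch of how $\hat{\theta}$ and $\hat{\phi}$ could be obtained by maximum likelihood on a training split; the placeholder data, the variable names and the use of SciPy's fit methods are illustrative assumptions, not something fixed by the setup above.
import numpy as np
from scipy.stats import norm, t
# Illustrative stand-in for the training split; in practice x_train is part of the observed data.
x_train = np.random.normal(size=5000)
# Maximum likelihood estimates for each candidate family.
theta_hat = norm.fit(x_train)   # (loc, scale) of the fitted Normal
phi_hat = t.fit(x_train)        # (df, loc, scale) of the fitted Student's t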
Our goal is to use the rest of the data as a test set $(x_1, \dots, x_m)$ and evaluate which of the two fitted models, $P_{\hat{\theta}}$ or $Q_{\hat{\phi}}$, is the most appropriate to describe the distribution of interest $\mathbb{P}^*$. From now on, we assume that $\mathbb{P}^*$, $P_{\hat{\theta}}$ and $Q_{\hat{\phi}}$ are absolutely continuous with respect to the Lebesgue measure, that is, they have probability density functions $p^*$, $\hat{p}$ and $\hat{q}$. We will then work directly with the pdfs.
In order to compare the two competing models $\hat{p}$ and $\hat{q}$, we first define a loss function $L : \mathcal{F} \times \mathbb{R} \to \mathbb{R}$, where $\mathcal{F}$ is a proper subset of the set of all probability density functions. The loss function tells us how good each model is at explaining each data point of the test set individually. To be clear, if $L(f_1, x) \le L(f_2, x)$, then $f_1$ is better than or as good as $f_2$ at explaining the data point $x$.
Since the loss function can only assess the suitability of the models at individual data points, we also define a risk function $R : \mathcal{F} \to \mathbb{R}$ that tells us how good each model is overall. The risk function is given by:
$$R(f) = \mathbb{E}_{X \sim \mathbb{P}^*}\big[L(f, X)\big] = \int_{\mathbb{R}} L(f, x)\, p^*(x)\, dx.$$
If we knew how to integrate with respect to $\mathbb{P}^*$, we could calculate and compare the risks of the two fitted models $\hat{p}$ and $\hat{q}$ and tell which one is better overall. Since we do not know the probability measure $\mathbb{P}^*$, we have to estimate the risk of each model using the empirical risk. Given a random sample $X_1, \dots, X_m$ of i.i.d. random variables with distribution $\mathbb{P}^*$, we can use the following formulation of the empirical risk to estimate the true risk:
$$\hat{R}(f) = \frac{1}{m} \sum_{i=1}^{m} L(f, X_i).$$
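As a minimal sketch of this estimator (the function and argument names are illustrative assumptions), the empirical risk of any candidate density under any loss function is just a sample average:
import numpy as np
def empirical_risk(loss, f, x_test):
    # Average of the pointwise losses of model f over the test sample.
    return np.mean([loss(f, x) for x in x_test])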
So, to compare the two fitted models $\hat{p}$ and $\hat{q}$ and tell which one is better, it is sufficient to compare $\hat{R}(\hat{p})$ and $\hat{R}(\hat{q})$ and check which of the two models yields the lowest risk estimate. However, even if there is a difference between the estimated risks, we may not be able to say that one model is better than the other, simply because chance can play an important role.
Basically, we can perform a test for a difference of means using asymptotic theory or numerical resampling methods (e.g., the bootstrap). Adequate hypotheses would be:
$$H_0 : R(\hat{p}) = R(\hat{q}) \qquad \text{vs.} \qquad H_1 : R(\hat{p}) \neq R(\hat{q}).$$
In this explanation we will use asymptotic theory as the path to the test. Having a random sample $X_1, \dots, X_m$ with distribution $\mathbb{P}^*$, we define $D_i = L(\hat{p}, X_i) - L(\hat{q}, X_i)$, with mean $\mu_D = R(\hat{p}) - R(\hat{q})$, and assume that $\sigma^2 = \mathrm{Var}(D_i) < \infty$. So, using the Central Limit Theorem:
$$\sqrt{m}\, \frac{\bar{D}_m - \mu_D}{\sigma} \xrightarrow{d} \mathcal{N}(0, 1), \qquad \text{where } \bar{D}_m = \frac{1}{m}\sum_{i=1}^{m} D_i = \hat{R}(\hat{p}) - \hat{R}(\hat{q}).$$
If $H_0$ holds and $m$ is 'large enough', we assume that
$$Z = \sqrt{m}\, \frac{\bar{D}_m}{\hat{\sigma}} \;\approx\; \mathcal{N}(0, 1),$$
given that $\hat{\sigma}$ is a consistent estimator of $\sigma$ (here, the sample standard deviation of the $D_i$). Note that this is a paired-type test, since both losses are evaluated at the same test points.
$Z$ is our test statistic: once we obtain its realization $z$, we can calculate the p-value and decide whether or not to reject the hypothesis that the two models perform equally well and, if we do reject it, decide which model is best based on the estimated risks.
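A minimal sketch of this paired test, assuming the pointwise losses of the two models on the test set are available as NumPy arrays (the function name and signature are illustrative):
import numpy as np
from scipy.stats import norm
def paired_risk_test(losses_p, losses_q):
    # Differences of the pointwise losses of the two models at the same test points.
    d = np.asarray(losses_p) - np.asarray(losses_q)
    z = np.sqrt(d.shape[0]) * np.mean(d) / np.std(d)   # realization of Z
    p_value = 2 * norm.cdf(-np.abs(z))                 # two-sided p-value under H0
    return z, p_value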
In this example we will use the whole theory developed so far to understand how model selection would work in practice. Let's assume that $\hat{p}(x) = \phi(x \mid \mu = 0, \sigma = 1)$ and $\hat{q}(x) = t(x \mid \nu = 5)$, where $\phi$ and $t$ denote the Normal and Student's t pdfs. In this example, we assume that the true density is $p^*(x) = t(x \mid \nu = 30)$ and that the two candidate models above are already fitted, so there is no training step here. Since a Student's t distribution with 30 degrees of freedom is already very close to a standard Normal, it is expected that we will find that $\hat{p}$ is better than $\hat{q}$.
We want to decide which model is better, and all we have is the test dataset $x_{1:m} = (x_1, \dots, x_m)$, an instance of $X_{1:m}$, a set of i.i.d. random variables with distribution $\mathbb{P}^*$. In this example we will use the following loss function:
$$L(f, x) = \log \frac{p^*(x)}{f(x)}, \qquad f \in \mathcal{F},\; x \in \mathbb{R}.$$
Here, $\mathcal{F}$ is the set of all probability density functions whose support equals $\mathbb{R}$. The loss function as defined above has a very interesting interpretation: it gives us, on the logarithmic scale, how many times more likely it is that the data point $x$ was sampled from $p^*$ rather than from $f$. For a Bayesian interpretation, see the first pages of [1]. As a consequence, the risk in this case is given by the Kullback-Leibler divergence between $p^*$ and $f$:
$$R(f) = \mathbb{E}_{X \sim \mathbb{P}^*}\left[\log \frac{p^*(X)}{f(X)}\right] = \int_{\mathbb{R}} p^*(x) \log \frac{p^*(x)}{f(x)}\, dx = \mathrm{KL}(p^* \,\|\, f).$$
Knowing that the risk is given by the KL divergence, it is clear why $L$ is a good loss function: if $R(\hat{p}) \le R(\hat{q})$, then $\hat{p}$ is 'closer' to $p^*$ than $\hat{q}$ is. The problem with using this risk function is that we are unable to estimate it in order to compare the models, because we don't know $p^*$. A solution is to break down the KL divergence as follows:
$$\mathrm{KL}(p^* \,\|\, f) = \int_{\mathbb{R}} p^*(x) \log p^*(x)\, dx \;-\; \int_{\mathbb{R}} p^*(x) \log f(x)\, dx.$$
Since the first term, $\int_{\mathbb{R}} p^*(x) \log p^*(x)\, dx$, is a constant that does not depend on our model $f$, we can adopt the following surrogate (i) risk and respective (ii) empirical risk functions:
$$\text{(i)}\;\; R_s(f) = -\int_{\mathbb{R}} p^*(x) \log f(x)\, dx = -\mathbb{E}_{X \sim \mathbb{P}^*}\big[\log f(X)\big] \qquad\qquad \text{(ii)}\;\; \hat{R}_s(f) = -\frac{1}{m} \sum_{i=1}^{m} \log f(X_i)$$
$R_s(f)$ is known as the cross-entropy of the distribution $f$ relative to the distribution $p^*$. To decide between the two candidate models, let's compare $\hat{R}_s(\hat{p})$ and $\hat{R}_s(\hat{q})$, also calculating the p-value for the hypothesis test explained above.
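Translating (ii) into code, a one-line helper for the empirical surrogate risk could look like this (assuming the candidate density is available as a vectorized function and the test sample as a NumPy array; the names are illustrative):
import numpy as np
def surrogate_empirical_risk(f, x_test):
    # Negative mean log-likelihood of the candidate density f on the test sample.
    return -np.mean(np.log(f(x_test)))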
In the experiment below, we will take $m = 10{,}000$. Let's sample from $p^*$ and define functions for our models:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, t
n_test = 10000
X_test = np.random.standard_t(df=30, size=n_test)      # sample from p*: Student's t with 30 degrees of freedom
models = {'p': lambda x: norm.pdf(x, loc=0, scale=1),   # p-hat: standard Normal pdf
          'q': lambda x: t.pdf(x, df=5)}                # q-hat: Student's t pdf with 5 degrees of freedom
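Since matplotlib is already imported, an optional visual check (not part of the original comparison) can overlay the two candidate densities on a histogram of the test sample:
grid = np.linspace(-5, 5, 500)
plt.hist(X_test, bins=100, density=True, alpha=.3, label='test sample')
plt.plot(grid, models['p'](grid), label="p: Normal(0, 1)")
plt.plot(grid, models['q'](grid), label="q: Student's t (df=5)")
plt.legend()
plt.show()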
Estimating the risk for both fitted models $\hat{p}$ and $\hat{q}$:
# Empirical surrogate risk: the average negative log-likelihood on the test set.
for m in ['p','q']:
    print("Model: {:} --- Estimated Risk = {:.4f}".format(m, -np.mean(np.log(models[m](X_test)))))
It seems that model $\hat{p}$ is better than $\hat{q}$. Let's define an asymptotic confidence interval function with confidence level $\gamma$:
def CI(x, gamma=.95):
    # Asymptotic (CLT-based) confidence interval for the mean of x.
    std_error = np.std(x)/np.sqrt(x.shape[0])
    error = norm.ppf(1-(1-gamma)/2)*std_error   # half-width: z_{(1+gamma)/2} * standard error
    mean = np.mean(x)
    return mean, mean-error, mean+error
Getting the confidence interval ($\gamma = 0.95$) of the mean difference of the pointwise losses:
# Pointwise loss differences D_i between models p and q.
aux = (-np.log(models['p'](X_test))) - (-np.log(models['q'](X_test)))
out = CI(aux, .95)
print("Empirical Risk Difference= {:.4f} --- CI = [{:.4f},{:.4f}]".format(out[0], out[1], out[2]))
Estimating the test statistic and associated p-value:
std_error = np.std(aux)/np.sqrt(n_test)
z = np.mean(aux)/std_error        # realization of the test statistic Z
p_val = 2*norm.cdf(-np.abs(z))    # two-sided p-value under H0
print("z= {:.4f} --- P-value = {:.4f}".format(z, p_val))
It seems the performance of the two tested models really differs, with $\hat{p}$ being better than $\hat{q}$.
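As a cross-check, the bootstrap mentioned earlier offers a resampling alternative to the asymptotic interval; the number of replicates and the percentile method below are illustrative choices:
# Percentile bootstrap for the mean of the pointwise loss differences.
rng = np.random.default_rng(0)
n_boot = 2000
boot_means = np.array([rng.choice(aux, size=aux.shape[0], replace=True).mean()
                       for _ in range(n_boot)])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print("Bootstrap 95% CI for the mean difference = [{:.4f},{:.4f}]".format(lo, hi))
If this interval does not contain zero, the resampling approach leads to the same conclusion as the asymptotic test at the 5% level.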
[1] Kullback, S. (1997). Information theory and statistics. Courier Corporation.