Comparing Fitted Statistical Models

General Framework

In this example we will assume that there is an unknown probability distribution $F$ of interest that measures events on $\mathcal{X} = \mathbb{R}^p$, and that we want to model it using a dataset $\{x_i\}_{i=1}^{n}$, which is an instance of $\{X_i\}_{i=1}^{n}$, a set of i.i.d. random variables with distribution $F$.

As candidate statistical models, we adopt two families of distributions $\{P_\theta : \theta \in \Theta\}$ and $\{Q_\alpha : \alpha \in A\}$, assuming that $\Theta \subseteq \mathbb{R}^d$ and $A \subseteq \mathbb{R}^{d'}$, with $d, d' \in \mathbb{N}$. Let us assume we have already learned $\hat{\theta}$ and $\hat{\alpha}$ as candidate values for the model parameters, using the part of the data called the training set, $\{x_i\}_{i=n_{te}+1}^{n}$.

Our goal is to use the rest of the data, $\{x_i\}_{i=1}^{n_{te}}$, as a test set to evaluate which of the two fitted models, $P_{\hat{\theta}}$ or $Q_{\hat{\alpha}}$, is more appropriate to describe the distribution of interest $F$. From now on, we assume that $F$, $P_{\hat{\theta}}$ and $Q_{\hat{\alpha}}$ are absolutely continuous with respect to the Lebesgue measure, that is, they have probability density functions $f, p_{\hat{\theta}}, q_{\hat{\alpha}} : \mathbb{R}^p \to \mathbb{R}_+$. We will then work directly with the pdfs.
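To make the indexing convention concrete, here is a minimal sketch of the test/training split described above (the array `data` is a hypothetical placeholder for the dataset):

import numpy as np

# Hypothetical placeholder: n i.i.d. draws standing in for the dataset {x_i}_{i=1}^n
n, n_te = 1000, 200
data = np.random.randn(n)

x_test = data[:n_te]     # {x_i}_{i=1}^{n_te}: held out to estimate the risk
x_train = data[n_te:]    # {x_i}_{i=n_te+1}^{n}: used to fit theta_hat and alpha_hat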

In order to compare the two competing models $p_{\hat{\theta}}$ and $q_{\hat{\alpha}}$, we first define a loss function $L_f : \mathcal{G} \times \mathcal{X} \to \mathbb{R}$, where $\mathcal{G}$ is a proper subset of the set of all probability density functions. The loss function tells us how good each model is at explaining each data point in the test set individually: if $L_f(p_{\hat{\theta}}, x_1) \le L_f(q_{\hat{\alpha}}, x_1)$, then $p_{\hat{\theta}}$ is at least as good as $q_{\hat{\alpha}}$ at explaining the data point $x_1$.

Since the loss function $L_f$ can only assess the suitability of the models at individual data points, we have to define a risk function that tells us how good each model is overall. The risk function is given by:


$$
\begin{aligned}
R_f : \ \mathcal{G} &\to \mathbb{R} \\
g &\mapsto \mathbb{E}_{X \sim F}\left[ L_f(g, X) \right]
\end{aligned}
$$

If we knew how to integrate with respect to $F$, we could calculate and compare the risks of the two fitted models $p_{\hat{\theta}}$ and $q_{\hat{\alpha}}$ and tell which one is better overall. Since we do not know the probability measure $F$, we have to estimate the risk of each model using the empirical risk. Given a random sample of i.i.d. random variables with distribution $F$, we can use the following formulation of the empirical risk to estimate the true risk:


$$
\begin{aligned}
\hat{R}_f : \ \mathcal{G} \times \mathcal{X}^m &\to \mathbb{R} \\
\left(g, \{x_i\}_{i=1}^{m}\right) &\mapsto \frac{1}{m} \sum_{i=1}^{m} L_f(g, x_i)
\end{aligned}
$$
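As a minimal sketch (where `loss` and `g` are hypothetical names for a pointwise loss $L_f$ and a density function), the empirical risk is just the sample average of the pointwise losses:

import numpy as np

def empirical_risk(loss, g, xs):
    # Average of the pointwise losses L_f(g, x_i) over the sample xs
    return np.mean([loss(g, x) for x in xs])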

So, to compare the two fitted models $p_{\hat{\theta}}$ and $q_{\hat{\alpha}}$ and tell which one is better, it is sufficient to compare $\hat{R}_f(p_{\hat{\theta}}, \{x_i\}_{i=1}^{n_{te}})$ and $\hat{R}_f(q_{\hat{\alpha}}, \{x_i\}_{i=1}^{n_{te}})$ and check which of the two models returns the lowest risk estimate. However, even if there is a difference between the estimated risks, we may not be able to say that one model is better than the other, simply because chance can play an important role.

Basically, we can test for a difference of means using asymptotic theory or numerical resampling methods (e.g., the bootstrap). If $\Delta R_f := R_f(p_{\hat{\theta}}) - R_f(q_{\hat{\alpha}})$, an adequate pair of hypotheses would be:


$$
H_0: \Delta R_f = 0 \qquad \text{vs.} \qquad H_1: \Delta R_f \neq 0
$$

In this explanation we will use asymptotic theory as the path to the test. Given a random sample $X_1, \dots, X_{n_{te}} \overset{\text{iid}}{\sim} F$, we define $\Delta \hat{R}_f(\{X_i\}_{i=1}^{n_{te}}) := \hat{R}_f(p_{\hat{\theta}}, \{X_i\}_{i=1}^{n_{te}}) - \hat{R}_f(q_{\hat{\alpha}}, \{X_i\}_{i=1}^{n_{te}})$ and assume that $\mathrm{Var}(\Delta \hat{R}_f(X_1)) = \sigma^2 < \infty$. So, using the Central Limit Theorem:


$$
\sqrt{n_{te}} \left[ \Delta \hat{R}_f(\{X_i\}_{i=1}^{n_{te}}) - \Delta R_f \right] \xrightarrow{\;D\;} N(0, \sigma^2)
$$

If $H_0$ holds and $n_{te}$ is 'large enough', we can assume that $Z_{n_{te}} = \sqrt{n_{te}}\; \frac{\Delta \hat{R}_f(\{X_i\}_{i=1}^{n_{te}})}{\hat{\sigma}} \sim N(0, 1)$, where $\hat{\sigma} = \sqrt{\frac{1}{n_{te}} \sum_{i=1}^{n_{te}} \left[ \Delta \hat{R}_f(X_i) - \Delta \hat{R}_f(\{X_i\}_{i=1}^{n_{te}}) \right]^2}$. Note that this is a paired-type test, since both models are evaluated on the same data points.

$Z_{n_{te}}$ is our test statistic: once we get its realization $z_{n_{te}}$, we can calculate the p-value and decide whether or not to reject the hypothesis that the two models perform the same and, if we reject it, decide which is the best model based on their estimated risks.
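In code, this paired test reduces to a one-sample z-test on the pointwise loss differences. A minimal sketch (where `loss_p` and `loss_q` are hypothetical arrays holding each model's pointwise losses on the test set):

import numpy as np
from scipy.stats import norm

def paired_z_test(loss_p, loss_q):
    # Pointwise loss differences on the shared test set
    diff = np.asarray(loss_p) - np.asarray(loss_q)
    # Realization of the test statistic Z_{n_te} under H0
    z = np.mean(diff) / (np.std(diff) / np.sqrt(diff.size))
    # Two-sided p-value from the standard Normal
    p_value = 2 * norm.cdf(-np.abs(z))
    return z, p_value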


Example

In this example we will use the whole theory developed so far to understand how model selection works in practice. Let's assume that $f(x) = t_{30}(x)$, $p_{\hat{\theta}}(x) = N(x \,|\, \hat{\theta}_1, \hat{\theta}_2)$ and $q_{\hat{\alpha}}(x) = t_{\hat{\alpha}}(x)$, where $N(\cdot \,|\, \mu, \sigma^2)$ and $t_{\nu}(\cdot)$ denote the Normal and Student's t pdfs. In this example, we assume $\hat{\theta} = (\hat{\theta}_1, \hat{\theta}_2) = (0, 1)$ and $\hat{\alpha} = 5$. Since a Student's t with 30 degrees of freedom is already very close to a standard Normal, we expect to find that $p_{\hat{\theta}}$ is better than $q_{\hat{\alpha}}$.

We want to decide which model is better, and all we have so far is a dataset $\{x_i\}_{i=1}^{n_{te}}$, which is an instance of $\{X_i\}_{i=1}^{n_{te}}$, a set of i.i.d. random variables with distribution $F$. In this example we will use the following loss function:


$$
\begin{aligned}
L_f : \ \mathcal{G} \times \mathcal{X} &\to \mathbb{R} \\
(g, x) &\mapsto \log \frac{f(x)}{g(x)}
\end{aligned}
$$

Here $\mathcal{G}$ is the set of all probability density functions whose support equals $\mathrm{supp}(f) = \mathcal{X} = \mathbb{R}$. The loss function as defined above has a very interesting interpretation: it gives us, on the logarithmic scale, how many times more likely it is that the data point $x$ was sampled from $f$ rather than from $g$. For a Bayesian interpretation, see the first pages of [1]. As a consequence, the risk in this case is given by the Kullback-Leibler divergence between $f$ and $g$:


$$
\begin{aligned}
R_f : \ \mathcal{G} &\to \mathbb{R} \\
g &\mapsto D_{KL}(f \,\|\, g)
\end{aligned}
$$

Knowing that the risk is given by the KL divergence, it is clear why $L_f$ is a good loss function: if $R_f(p_{\hat{\theta}}) < R_f(q_{\hat{\alpha}})$, then $p_{\hat{\theta}}$ is 'closer' to $f$ than $q_{\hat{\alpha}}$ is. The problem with using this risk function is that we are unable to estimate it directly, because we don't know $f$. A solution is to break down the KL divergence as follows:


$$
D_{KL}(f \,\|\, g) = \mathbb{E}_{X \sim f}\left[ \log \frac{f(X)}{g(X)} \right] = \mathbb{E}_{X \sim f}\left[ \log f(X) \right] - \mathbb{E}_{X \sim f}\left[ \log g(X) \right] = -\mathbb{E}_{X \sim f}\left[ \log g(X) \right] + C
$$

Since $C$ is a constant that does not depend on our model $g$, we can adopt the following surrogate (i) risk and (ii) empirical risk functions:


$$
\text{(i)} \quad S_f(g) := -\mathbb{E}_{X \sim f}\left[ \log g(X) \right] \qquad \text{(ii)} \quad \hat{S}_f(g, \{x_i\}_{i=1}^{m}) := -\frac{1}{m} \sum_{i=1}^{m} \log g(x_i)
$$

$S_f(g)$ is known as the cross-entropy of the distribution $g$ relative to the distribution $f$. To decide between the two possible models, let's compare $\hat{S}_f(p_{\hat{\theta}}, \{x_i\}_{i=1}^{n_{te}})$ and $\hat{S}_f(q_{\hat{\alpha}}, \{x_i\}_{i=1}^{n_{te}})$, also calculating the p-value for the hypothesis test explained above.
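Since $f$ is known in this toy example, we can also approximate the true surrogate risks $S_f(p_{\hat{\theta}})$ and $S_f(q_{\hat{\alpha}})$ by numerical integration, as a sanity check on the empirical estimates computed below. A sketch using scipy.integrate.quad (the integration limits are an assumption, chosen wide enough that the truncated tail mass is negligible):

import numpy as np
from scipy.stats import norm, t
from scipy.integrate import quad

f_pdf = lambda x: t.pdf(x, df=30)   # true density f = t_30

# Cross-entropy S_f(g) = -E_{X~f}[log g(X)], integrated over a wide interval
S_p = quad(lambda x: -f_pdf(x) * norm.logpdf(x, loc=0, scale=1), -30, 30)[0]
S_q = quad(lambda x: -f_pdf(x) * t.logpdf(x, df=5), -30, 30)[0]
print(S_p, S_q)   # the smaller value should correspond to model p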

Experiment

In the experiment below, we will take $n_{te} = 10{,}000$. Let's sample from $f$ and define functions for our models:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, t

n_test = 10000
X_test = np.random.standard_t(df=30, size=n_test)  # test set sampled from f = t_30

# Densities of the two fitted models: p = N(0, 1) and q = t_5
models = {'p': lambda x: norm.pdf(x, loc=0, scale=1),
          'q': lambda x: t.pdf(x, df=5)}

Estimating the risk for both fitted models p and q:

In [2]:
for m in ['p','q']:
    # Empirical surrogate risk: average negative log-density on the test set
    print("Model: {:} --- Estimated Risk = {:.4f}".format(m, -np.mean(np.log(models[m](X_test)))))
Model: p --- Estimated Risk = 1.4608
Model: q --- Estimated Risk = 1.4800

It seems that model p is better than q. Defining an asymptotic confidence interval function with confidence level γ:

In [3]:
def CI(x, gamma=.95):
    # Asymptotic confidence interval for the mean of x, with confidence level gamma
    std_error = np.std(x)/np.sqrt(x.shape[0])  # standard error of the sample mean
    error = norm.ppf(1-(1-gamma)/2)*std_error  # half-width of the interval
    mean = np.mean(x)
    return mean, mean-error, mean+error

Getting the Confidence Interval (γ=.95) of the mean difference:

In [4]:
# Pointwise loss differences: (-log p(x_i)) - (-log q(x_i))
aux = (-np.log(models['p'](X_test))) - (-np.log(models['q'](X_test)))
out = CI(aux, .95)

print("Empirical Risk Difference= {:.4f} --- CI = [{:.4f},{:.4f}]".format(out[0], out[1], out[2]))
Empirical Risk Difference= -0.0192 --- CI = [-0.0240,-0.0143]

Estimating the test statistic and associated p-value:

In [5]:
std_error = np.std(aux)/np.sqrt(n_test)  # estimate of sigma/sqrt(n_te)
z = np.mean(aux)/std_error               # realization of the test statistic Z_{n_te}
p_val = 2*norm.cdf(-np.abs(z))           # two-sided p-value

print("z= {:.4f} --- P-value = {:.4f}".format(z, p_val))
z= -7.7082 --- P-value = 0.0000

It seems that the performance of the two tested models really differs, with $p_{\hat{\theta}}$ being better than $q_{\hat{\alpha}}$.


References:

[1] Kullback, S. (1997). Information theory and statistics. Courier Corporation.