In this example we will assume that there is an unknown probability distribution of interest, $\mathbb{P}^*$, that measures events of $\mathbb{R}$, and that we want to model it using a dataset $x_{1:n} = (x_1, \dots, x_n)$, which is an instance of $X_{1:n} = (X_1, \dots, X_n)$, a set of i.i.d. random variables with distribution $\mathbb{P}^*$.
As possible statistical models, we adopt two families of distributions, $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ and $\mathcal{Q} = \{Q_\phi : \phi \in \Phi\}$, indexed by parameters $\theta \in \Theta$ and $\phi \in \Phi$. Let us assume we have already learnt $\hat{\theta}$ and $\hat{\phi}$ as candidate values for the model parameters, using a part of the data called the training set.
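To make the training step concrete, here is a minimal sketch of how $\hat{\theta}$ and $\hat{\phi}$ could be obtained by maximum likelihood on a training split; the placeholder data, the variable names and the use of SciPy's fit methods are illustrative assumptions, not something fixed by the setup above.
import numpy as np
from scipy.stats import norm, t
# Illustrative stand-in for the training split; in practice x_train is part of the observed data.
x_train = np.random.normal(size=5000)
# Maximum likelihood estimates for each candidate family.
theta_hat = norm.fit(x_train)   # (loc, scale) of the fitted Normal
phi_hat = t.fit(x_train)        # (df, loc, scale) of the fitted Student's t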
Our goal is to use the rest of the data as a test set $(x_1, \dots, x_m)$ and evaluate which of the two fitted models, $P_{\hat{\theta}}$ or $Q_{\hat{\phi}}$, is the most appropriate to describe the distribution of interest $\mathbb{P}^*$. From now on, we assume that $\mathbb{P}^*$, $P_{\hat{\theta}}$ and $Q_{\hat{\phi}}$ are absolutely continuous with respect to the Lebesgue measure, that is, they have probability density functions $p^*$, $\hat{p}$ and $\hat{q}$. We will then work directly with the pdfs.
In order to compare the two competing models $\hat{p}$ and $\hat{q}$, we first define a loss function $L : \mathcal{F} \times \mathbb{R} \to \mathbb{R}$, where $\mathcal{F}$ is a proper subset of the set of all probability density functions. The loss function tells us how good each model is at explaining each data point of the test set individually. To be clear, if $L(f_1, x) \le L(f_2, x)$, then $f_1$ is better than or as good as $f_2$ at explaining the data point $x$.
Since the loss function can only assess the suitability of the models at individual data points, we also define a risk function $R : \mathcal{F} \to \mathbb{R}$ that tells us how good each model is overall. The risk function is given by:
$$R(f) = \mathbb{E}_{X \sim \mathbb{P}^*}\big[L(f, X)\big] = \int_{\mathbb{R}} L(f, x)\, p^*(x)\, dx.$$
If we knew how to integrate with respect to $\mathbb{P}^*$, we could calculate and compare the risks of the two fitted models $\hat{p}$ and $\hat{q}$ and tell which one is better overall. Since we do not know the probability measure $\mathbb{P}^*$, we have to estimate the risk of each model using the empirical risk. Given a random sample $X_1, \dots, X_m$ of i.i.d. random variables with distribution $\mathbb{P}^*$, we can use the following formulation of the empirical risk to estimate the true risk:
$$\hat{R}(f) = \frac{1}{m} \sum_{i=1}^{m} L(f, X_i).$$
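As a minimal sketch of this estimator (the function and argument names are illustrative assumptions), the empirical risk of any candidate density under any loss function is just a sample average:
import numpy as np
def empirical_risk(loss, f, x_test):
    # Average of the pointwise losses of model f over the test sample.
    return np.mean([loss(f, x) for x in x_test])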
So, to compare the two fitted models $\hat{p}$ and $\hat{q}$ and tell which one is better, it is sufficient to compare $\hat{R}(\hat{p})$ and $\hat{R}(\hat{q})$ and check which of the two models yields the lowest risk estimate. However, even if there is a difference between the estimated risks, we may not be able to say that one model is better than the other, simply because chance can play an important role.
Basically, we can perform a test for a difference of means using asymptotic theory or numerical resampling methods (e.g., the bootstrap). Adequate hypotheses would be:
$$H_0 : R(\hat{p}) = R(\hat{q}) \qquad \text{vs.} \qquad H_1 : R(\hat{p}) \neq R(\hat{q}).$$
In this explanation we will use asymptotic theory as the path to the test. Having a random sample $X_1, \dots, X_m$ with distribution $\mathbb{P}^*$, we define $D_i = L(\hat{p}, X_i) - L(\hat{q}, X_i)$, with mean $\mu_D = R(\hat{p}) - R(\hat{q})$, and assume that $\sigma^2 = \mathrm{Var}(D_i) < \infty$. So, using the Central Limit Theorem:
$$\sqrt{m}\, \frac{\bar{D}_m - \mu_D}{\sigma} \xrightarrow{d} \mathcal{N}(0, 1), \qquad \text{where } \bar{D}_m = \frac{1}{m}\sum_{i=1}^{m} D_i = \hat{R}(\hat{p}) - \hat{R}(\hat{q}).$$
If $H_0$ holds and $m$ is 'large enough', we assume that
$$Z = \sqrt{m}\, \frac{\bar{D}_m}{\hat{\sigma}} \;\approx\; \mathcal{N}(0, 1),$$
given that $\hat{\sigma}$ is a consistent estimator of $\sigma$ (here, the sample standard deviation of the $D_i$). Note that this is a paired-type test, since both losses are evaluated at the same test points.
$Z$ is our test statistic: once we obtain its realization $z$, we can calculate the p-value and decide whether or not to reject the hypothesis that the two models perform equally well and, if we do reject it, decide which model is best based on the estimated risks.
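A minimal sketch of this paired test, assuming the pointwise losses of the two models on the test set are available as NumPy arrays (the function name and signature are illustrative):
import numpy as np
from scipy.stats import norm
def paired_risk_test(losses_p, losses_q):
    # Differences of the pointwise losses of the two models at the same test points.
    d = np.asarray(losses_p) - np.asarray(losses_q)
    z = np.sqrt(d.shape[0]) * np.mean(d) / np.std(d)   # realization of Z
    p_value = 2 * norm.cdf(-np.abs(z))                 # two-sided p-value under H0
    return z, p_value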
In this example we will use the whole theory developed so far to understand how model selection would work in practice. Let's assume that $\hat{p}(x) = \phi(x \mid \mu = 0, \sigma = 1)$ and $\hat{q}(x) = t(x \mid \nu = 5)$, where $\phi$ and $t$ denote the Normal and Student's t pdfs. In this example, we assume that the true density is $p^*(x) = t(x \mid \nu = 30)$ and that the two candidate models above are already fitted, so there is no training step here. Since a Student's t distribution with 30 degrees of freedom is already very close to a standard Normal, it is expected that we will find that $\hat{p}$ is better than $\hat{q}$.
We want to decide which model is better, and all we have is the test dataset $x_{1:m} = (x_1, \dots, x_m)$, an instance of $X_{1:m}$, a set of i.i.d. random variables with distribution $\mathbb{P}^*$. In this example we will use the following loss function:
$$L(f, x) = \log \frac{p^*(x)}{f(x)}, \qquad f \in \mathcal{F},\; x \in \mathbb{R}.$$
Here, $\mathcal{F}$ is the set of all probability density functions whose support equals $\mathbb{R}$. The loss function as defined above has a very interesting interpretation: it gives us, on the logarithmic scale, how many times more likely it is that the data point $x$ was sampled from $p^*$ rather than from $f$. For a Bayesian interpretation, see the first pages of [1]. As a consequence, the risk in this case is given by the Kullback-Leibler divergence between $p^*$ and $f$:
$$R(f) = \mathbb{E}_{X \sim \mathbb{P}^*}\left[\log \frac{p^*(X)}{f(X)}\right] = \int_{\mathbb{R}} p^*(x) \log \frac{p^*(x)}{f(x)}\, dx = \mathrm{KL}(p^* \,\|\, f).$$
Knowing that the risk is given by the KL divergence, it is clear why $L$ is a good loss function: if $R(\hat{p}) \le R(\hat{q})$, then $\hat{p}$ is 'closer' to $p^*$ than $\hat{q}$ is. The problem with using this risk function is that we are unable to estimate it in order to compare the models, because we don't know $p^*$. A solution is to break down the KL divergence as follows:
$$\mathrm{KL}(p^* \,\|\, f) = \int_{\mathbb{R}} p^*(x) \log p^*(x)\, dx \;-\; \int_{\mathbb{R}} p^*(x) \log f(x)\, dx.$$
Since the first term, $\int_{\mathbb{R}} p^*(x) \log p^*(x)\, dx$, is a constant that does not depend on our model $f$, we can adopt the following surrogate (i) risk and respective (ii) empirical risk functions:
$$\text{(i)}\;\; R_s(f) = -\int_{\mathbb{R}} p^*(x) \log f(x)\, dx = -\mathbb{E}_{X \sim \mathbb{P}^*}\big[\log f(X)\big] \qquad\qquad \text{(ii)}\;\; \hat{R}_s(f) = -\frac{1}{m} \sum_{i=1}^{m} \log f(X_i)$$
$R_s(f)$ is known as the cross-entropy of the distribution $f$ relative to the distribution $p^*$. To decide between the two candidate models, let's compare $\hat{R}_s(\hat{p})$ and $\hat{R}_s(\hat{q})$, also calculating the p-value for the hypothesis test explained above.
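Translating (ii) into code, a one-line helper for the empirical surrogate risk could look like this (assuming the candidate density is available as a vectorized function and the test sample as a NumPy array; the names are illustrative):
import numpy as np
def surrogate_empirical_risk(f, x_test):
    # Negative mean log-likelihood of the candidate density f on the test sample.
    return -np.mean(np.log(f(x_test)))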
In the experiment below, we will take $m = 10{,}000$. Let's sample from $p^*$ and define functions for our models:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, t
n_test = 10000
X_test = np.random.standard_t(df=30, size=n_test)      # sample from p*: Student's t with 30 degrees of freedom
models = {'p': lambda x: norm.pdf(x, loc=0, scale=1),   # p-hat: standard Normal pdf
          'q': lambda x: t.pdf(x, df=5)}                # q-hat: Student's t pdf with 5 degrees of freedom
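Since matplotlib is already imported, an optional visual check (not part of the original comparison) can overlay the two candidate densities on a histogram of the test sample:
grid = np.linspace(-5, 5, 500)
plt.hist(X_test, bins=100, density=True, alpha=.3, label='test sample')
plt.plot(grid, models['p'](grid), label="p: Normal(0, 1)")
plt.plot(grid, models['q'](grid), label="q: Student's t (df=5)")
plt.legend()
plt.show()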
Estimating the risk for both fitted models $\hat{p}$ and $\hat{q}$:
# Empirical surrogate risk: the average negative log-likelihood on the test set.
for m in ['p','q']:
    print("Model: {:} --- Estimated Risk = {:.4f}".format(m, -np.mean(np.log(models[m](X_test)))))
It seems that model $\hat{p}$ is better than $\hat{q}$. Let's define an asymptotic confidence interval function with confidence level $\gamma$:
def CI(x, gamma=.95):
    # Asymptotic (CLT-based) confidence interval for the mean of x.
    std_error = np.std(x)/np.sqrt(x.shape[0])
    error = norm.ppf(1-(1-gamma)/2)*std_error   # half-width: z_{(1+gamma)/2} * standard error
    mean = np.mean(x)
    return mean, mean-error, mean+error
Getting the confidence interval ($\gamma = 0.95$) of the mean difference of the pointwise losses:
# Pointwise loss differences D_i between models p and q.
aux = (-np.log(models['p'](X_test))) - (-np.log(models['q'](X_test)))
out = CI(aux, .95)
print("Empirical Risk Difference= {:.4f} --- CI = [{:.4f},{:.4f}]".format(out[0], out[1], out[2]))
Estimating the test statistic and associated p-value:
std_error = np.std(aux)/np.sqrt(n_test)
z = np.mean(aux)/std_error        # realization of the test statistic Z
p_val = 2*norm.cdf(-np.abs(z))    # two-sided p-value under H0
print("z= {:.4f} --- P-value = {:.4f}".format(z, p_val))
It seems the performance of the two tested models really differs, with $\hat{p}$ being better than $\hat{q}$.
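As a cross-check, the bootstrap mentioned earlier offers a resampling alternative to the asymptotic interval; the number of replicates and the percentile method below are illustrative choices:
# Percentile bootstrap for the mean of the pointwise loss differences.
rng = np.random.default_rng(0)
n_boot = 2000
boot_means = np.array([rng.choice(aux, size=aux.shape[0], replace=True).mean()
                       for _ in range(n_boot)])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print("Bootstrap 95% CI for the mean difference = [{:.4f},{:.4f}]".format(lo, hi))
If this interval does not contain zero, the resampling approach leads to the same conclusion as the asymptotic test at the 5% level.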
[1] Kullback, S. (1997). Information theory and statistics. Courier Corporation.