19 Dec 2021

Statistical Decision Theory

This blog is inspired by the excellent introduction to statistical decision theory presented in The Elements of Statistical Learning. I aim to provide a more detailed explanation of this theory, with rigorous mathematical derivations supporting the authors' solutions.

We begin by presenting a general framework for making predictions based on a real-valued input vector. Let $X \in \mathbb{R}^p$ be a real-valued random input vector (the covariates), and let $Y \in \mathbb{R}$ be the real-valued random output variable. The joint probability distribution of $X$ and $Y$ is $\Pr(X, Y)$. We seek a function $f(X)$ that uses the covariates $X$ to predict an output that matches $Y$ as closely as possible.

The theory requires a loss function $L(Y, f(X))$ for penalizing errors in prediction. For convenience, we choose the squared error loss, defined as $L(Y, f(X)) = (Y - f(X))^2$. We define the Expected Prediction Error (EPE) as the criterion for choosing $f$.

$$\mathrm{EPE}(f) = \mathbb{E}\left[(Y - f(X))^2\right] = \int (y - f(x))^2 \, \Pr(dx, dy)$$
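
The EPE is just an average of squared errors over draws from $\Pr(X, Y)$, so it can be estimated by Monte Carlo. A minimal sketch, where the data-generating process (uniform $X$, sinusoidal signal, Gaussian noise) is my assumption purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating process (an assumption, not from the post):
# X ~ Uniform(0, 1), Y = sin(2*pi*X) + Gaussian noise with std 0.1.
n = 100_000
x = rng.uniform(0.0, 1.0, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=n)

def f(x):
    """A candidate prediction function; here, the true signal itself."""
    return np.sin(2 * np.pi * x)

# Monte Carlo estimate of EPE(f) = E[(Y - f(X))^2]:
# average the squared error over draws from Pr(X, Y).
epe = np.mean((y - f(x)) ** 2)
print(epe)  # close to the noise variance, 0.01
```

Because this candidate $f$ equals the noiseless signal, its EPE reduces to the irreducible noise variance; any other $f$ would estimate higher.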

We can convert the joint distribution above to a density function according to the property $\Pr\big((X, Y) \in R\big) = \iint_R f_{X,Y}(x, y)\,dx\,dy$, where $f_{X,Y}$ is the joint density function. We now define the EPE as,

$$\mathrm{EPE}(f) = \iint (y - f(x))^2 \, f_{X,Y}(x, y)\,dx\,dy$$

Factorizing the joint density into a conditional and a marginal, $f_{X,Y}(x, y) = f_{Y|X}(y|x)\, f_X(x)$. We plug this into the formulation above,

$$\mathrm{EPE}(f) = \iint (y - f(x))^2 \, f_{Y|X}(y|x)\, f_X(x)\,dy\,dx$$
$$\mathrm{EPE}(f) = \int f_X(x) \left[ \int (y - f(x))^2 \, f_{Y|X}(y|x)\,dy \right] dx$$

We can use the definitions of the expectations $\mathbb{E}_X$ and $\mathbb{E}_{Y|X}$ to simplify the equation above,

$$\mathrm{EPE}(f) = \mathbb{E}_X \int (y - f(x))^2 \, f_{Y|X}(y|x)\,dy$$
$$\mathrm{EPE}(f) = \mathbb{E}_X\, \mathbb{E}_{Y|X}\!\left( [Y - f(X)]^2 \mid X \right)$$
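
The equivalence of the direct and iterated forms of the EPE can be checked numerically. A minimal sketch, assuming a hypothetical discrete $X$ so that conditioning on $X$ amounts to grouping the sample by its value:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical discrete X (an assumption for illustration) so the iterated
# expectation E_X E_{Y|X}([Y - f(X)]^2 | X) can be computed group by group.
n = 300_000
x = rng.integers(0, 3, size=n)           # X takes values 0, 1, 2
y = x.astype(float) + rng.normal(0.0, 1.0, size=n)

def f(x):
    return 0.5 * x                       # an arbitrary candidate predictor

# Direct Monte Carlo estimate of EPE(f) = E[(Y - f(X))^2].
epe_direct = np.mean((y - f(x)) ** 2)

# Iterated form: condition on each value of X, then average over Pr(X).
inner = [np.mean((y[x == v] - f(v)) ** 2) for v in (0, 1, 2)]
probs = [np.mean(x == v) for v in (0, 1, 2)]
epe_iterated = float(np.dot(probs, inner))

print(epe_direct, epe_iterated)  # the two estimates agree
```

On the same sample the two computations are algebraically identical, so they match to floating-point precision.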

Note that for each point $x$, when we condition on $X = x$ we only need to minimize the inner conditional expectation $\mathbb{E}_{Y|X}$; that is, it suffices to minimize the EPE pointwise,

$$\mathbb{E}_{Y|X}\!\left( [Y - f(X)]^2 \mid X = x \right)$$

We also note here that the modelling function $f$ can be any arbitrary function, so its value at the point $x$ is unconstrained. In the equation above we therefore don't require $f(X)$ to be specified and can instead replace it with a constant $c$, whose value minimizes the EPE pointwise.

Later, we will observe that once we obtain the optimal value for the constant $c$, we can select different modelling functions to approximate it. As an example I will include later in this post, in the context of designing a linear regressor as the modelling function, the function takes the form $f(x) = x^T \beta$. The usefulness of substituting the function's predictions with a constant representing the optimum for minimizing the EPE may already be apparent to readers.
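
As a small numerical aside, fitting such a linear model by least squares can be sketched as follows; the data-generating process and its coefficients are my assumptions purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical linear data (an assumption): Y = 2*X + 1 + noise,
# so E(Y | X = x) = 2x + 1 and a linear model can recover it exactly.
n = 10_000
x = rng.uniform(-1.0, 1.0, size=n)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=n)

# Model f(x) = x^T beta, with an intercept column appended to X.
X = np.column_stack([np.ones(n), x])

# Least-squares estimate of beta via numpy's lstsq solver.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # approximately [1.0, 2.0]
```

When the conditional mean really is linear in $x$, the least-squares coefficients converge to it; otherwise the linear fit is only the best linear approximation.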

We now represent the problem as,

$$f(x) = \operatorname*{argmin}_c \, \mathbb{E}_{Y|X}\!\left( [Y - c]^2 \mid X = x \right)$$

Expanding the square and using the linearity of expectation, we can represent this as,

$$f(x) = \operatorname*{argmin}_c \left( \mathbb{E}_{Y|X}(Y^2 \mid X = x) - 2c\,\mathbb{E}_{Y|X}(Y \mid X = x) + c^2 \right)$$

We differentiate w.r.t. $c$ and set the derivative to 0,

$$0 = -2\,\mathbb{E}_{Y|X}(Y \mid X = x) + 2c$$
$$c = \mathbb{E}_{Y|X}(Y \mid X = x)$$
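
This first-order condition is easy to check numerically: sweep the constant $c$ over a grid and observe where the empirical risk bottoms out. A minimal sketch, where the Gaussian conditional distribution and its mean of 3.0 are my assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical conditional distribution (an assumption):
# Y | X = x is Gaussian with mean 3.0, so E(Y | X = x) = 3.0.
y = rng.normal(3.0, 1.0, size=50_000)

# Evaluate the pointwise risk E[(Y - c)^2 | X = x] on a grid of constants c.
grid = np.linspace(0.0, 6.0, 601)
risk = [np.mean((y - c) ** 2) for c in grid]
c_best = grid[np.argmin(risk)]
print(c_best)  # near the conditional mean, 3.0
```

The empirical minimizer lands at the sample mean of the conditional draws, matching the calculus above.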

We obtain the solution of the decision theory as,

$$f(x) = \mathbb{E}_{Y|X}(Y \mid X = x)$$

This tells us that the best prediction of $Y$ at any point $X = x$ is the conditional mean, when best is measured by average squared error.
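
In practice the conditional mean must be estimated from data. One simple estimator averages $Y$ over the neighbourhood of $x$, which is the idea behind nearest-neighbour methods. A minimal sketch, where the sinusoidal data-generating process and the choice of $k$ are my assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup (an assumption): approximate f(x) = E(Y | X = x)
# by averaging Y over the k nearest neighbours of x in the sample.
n, k = 50_000, 500
x = rng.uniform(0.0, 1.0, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=n)

def cond_mean(x0):
    """Average Y over the k samples whose X is closest to x0."""
    idx = np.argsort(np.abs(x - x0))[:k]
    return y[idx].mean()

print(cond_mean(0.25))  # near the true conditional mean sin(pi/2) = 1.0
```

With a dense sample and a modest neighbourhood, the local average is close to the true conditional mean; the bias-variance trade-off in choosing $k$ is a story for another post.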

