This blog post is inspired by the excellent introduction to Statistical Decision Theory presented in The Elements of Statistical Learning. I aim to provide a more detailed explanation of this theory, with rigorous mathematical derivations to support the authors' solutions.
We begin by presenting a general framework for making predictions based on a real-valued input vector. Let $X \in \mathbb{R}^p$ be a real-valued random input vector (the covariates), and let $Y \in \mathbb{R}$ be the real-valued random output variable. The joint probability distribution of $X$ and $Y$ is denoted $\Pr(X, Y)$. We seek a function $f(X)$ that uses the covariates $X$ to predict an output that matches $Y$ as closely as possible.
The theory requires a loss function $L(Y, f(X))$ for penalizing errors in prediction. For convenience, we will choose the Squared Error Loss as the loss function, which is defined as $L(Y, f(X)) = (Y - f(X))^2$. We define the Expected Prediction Error,
$$\mathrm{EPE}(f) = \mathbb{E}\left[(Y - f(X))^2\right],$$
as the criterion for choosing $f$.
We can convert the joint distribution above to a density function according to the property $\Pr(dx, dy) = p(x, y)\,dx\,dy$, where $p(x, y)$ is the joint density function. We can now write $\mathrm{EPE}(f)$ as
$$\mathrm{EPE}(f) = \int (y - f(x))^2\, p(x, y)\, dx\, dy.$$
Using Bayes rule, $p(x, y) = p(y \mid x)\, p(x)$. We plug this into the formulation above to get
$$\mathrm{EPE}(f) = \int \left[ \int (y - f(x))^2\, p(y \mid x)\, dy \right] p(x)\, dx.$$
We can use the definitions of the expectations $\mathbb{E}_X$ and $\mathbb{E}_{Y \mid X}$ to simplify the equation above to
$$\mathrm{EPE}(f) = \mathbb{E}_X\, \mathbb{E}_{Y \mid X}\!\left[(Y - f(X))^2 \mid X\right].$$
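Before moving on, here is a minimal numerical sketch of this identity; it is my own illustration, not from the book, and the joint distribution below (a uniform $X$ with a Gaussian $Y \mid X$, and the candidate predictor $f(x) = 0.5x$) is purely an assumption. It estimates $\mathrm{EPE}(f)$ two ways: directly from joint samples, and by averaging the inner conditional expectation over $X$. The two estimates should agree up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed joint distribution (an illustration, not from the book):
# X ~ Uniform(0, 3), and Y | X = x ~ Normal(sin(x), 0.3^2)
def sample_y_given_x(x, size=None):
    return rng.normal(np.sin(x), 0.3, size=size)

def f(x):
    return 0.5 * x  # an arbitrary candidate predictor, also an assumption

# Direct Monte Carlo estimate of EPE(f) = E[(Y - f(X))^2]
n = 200_000
x = rng.uniform(0.0, 3.0, size=n)
y = sample_y_given_x(x)
epe_direct = np.mean((y - f(x)) ** 2)

# Iterated estimate: E_X E_{Y|X}[(Y - f(X))^2 | X].
# For each sampled x_i, estimate the inner conditional expectation with a
# fresh batch of Y draws, then average those inner estimates over the x_i.
m, k = 1_000, 500
xs = rng.uniform(0.0, 3.0, size=m)
inner = np.array([np.mean((sample_y_given_x(xi, size=k) - f(xi)) ** 2) for xi in xs])
epe_iterated = inner.mean()

print(f"direct EPE   ~= {epe_direct:.4f}")
print(f"iterated EPE ~= {epe_iterated:.4f}")  # should match the direct estimate
```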
Note that for each point $x$, when we condition on $X = x$, we observe that we only need to minimize the inner conditional expectation. This is done pointwise as
$$\min_{f(x)}\; \mathbb{E}_{Y \mid X}\left[(Y - f(x))^2 \mid X = x\right].$$
We also note here that the modelling function $f$ can be an arbitrary function, free to take any value at each point $x$. In the equation above, we therefore do not require $f(x)$ to be specified and can instead replace it with a constant $c$, whose value minimizes the EPE pointwise.
Later, we will observe that whatever optimal value we obtain for the constant $c$, we can select different modelling functions to approximate it. As an example I will include later in this post, in the context of designing a Linear Regressor as the modelling function, the function will look like $f(x) = x^T \beta$. It may already be apparent to readers how useful it is to substitute the predictions of the function with a constant representing the optimal value that minimizes the EPE.
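As a quick preview of that example, here is a minimal sketch of a linear modelling function $f(x) = x^T \beta$ fitted by least squares. This is my own illustration under assumed data: the coefficients, noise level, and data-generating process below are all made up for demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data (assumptions of this sketch): 2 covariates plus intercept
n, p = 500, 2
X = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), X])          # prepend an intercept column
beta_true = np.array([1.0, 2.0, -0.5])        # hypothetical true coefficients
y = X @ beta_true + rng.normal(0.0, 0.5, n)   # Y = X^T beta + Gaussian noise

# Least-squares fit: one concrete choice of modelling function f(x) = x^T beta
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated beta:", np.round(beta_hat, 3))

def f(x_row):
    """Linear modelling function f(x) = x^T beta_hat."""
    return x_row @ beta_hat
```

Note that least squares minimizes the empirical average of the squared errors over the training sample, which is exactly the sample analogue of the EPE criterion defined above.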
We now represent the problem as
$$f(x) = \operatorname*{argmin}_{c}\, \mathbb{E}_{Y \mid X}\left[(Y - c)^2 \mid X = x\right].$$
Using the linearity of expectations, we can expand the objective as
$$\mathbb{E}_{Y \mid X}\left[(Y - c)^2 \mid X = x\right] = \mathbb{E}\left[Y^2 \mid X = x\right] - 2c\, \mathbb{E}\left[Y \mid X = x\right] + c^2.$$
We differentiate the right-hand side w.r.t. $c$ and set the derivative to 0,
$$\frac{d}{dc}\left( \mathbb{E}\left[Y^2 \mid X = x\right] - 2c\, \mathbb{E}\left[Y \mid X = x\right] + c^2 \right) = -2\, \mathbb{E}\left[Y \mid X = x\right] + 2c = 0.$$
We obtain the solution to the Decision Theory problem as
$$f(x) = c = \mathbb{E}\left[Y \mid X = x\right].$$
This tells us that the best prediction of $Y$ at any point $X = x$ is the conditional mean, when "best" is measured by the average squared error.
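To close the loop, here is a small simulation, again my own check rather than the book's, that verifies this result numerically. The conditional distribution below (Gaussian with mean $\sin(x)$ at a fixed point $x = 1.2$) is an assumption for illustration. It scans candidate constants $c$ and confirms that the estimated risk $\mathbb{E}\left[(Y - c)^2 \mid X = x\right]$ is minimized near $c = \mathbb{E}\left[Y \mid X = x\right]$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed conditional distribution at a fixed point x (illustrative):
# Y | X = x ~ Normal(mu_x, 1) with mu_x = sin(x)
x = 1.2
mu_x = np.sin(x)
y = rng.normal(mu_x, 1.0, size=100_000)

# Scan candidate constants c and estimate E[(Y - c)^2 | X = x] for each
cs = np.linspace(mu_x - 2.0, mu_x + 2.0, 401)
risk = np.array([np.mean((y - c) ** 2) for c in cs])

c_best = cs[np.argmin(risk)]
print(f"conditional mean E[Y | X = x] = {mu_x:.3f}")
print(f"argmin_c of estimated risk   ~= {c_best:.3f}")  # should be close to mu_x
```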