Different Perspectives of Linear Regression

narcissuskid · 2024-03-18

1. Linear Regression

1.1 Ordinary Least Squares (OLS) Perspective

1.1.1 Model Representation:

The linear regression model is represented as:
\hat{Y} = X\theta
where:

  • X is the design matrix with dimensions n \times (m+1) (including the intercept term).
  • \theta is the parameter vector with dimensions (m+1) \times 1.

1.1.2 Loss Function:

The loss function J(\theta) is based on the mean squared error (MSE), optionally with a regularization penalty:

  • No regularization
    J(\theta) = \frac{1}{n}(X\theta - Y)^T(X\theta - Y)
  • L1 regularization
    J(\theta) = \frac{1}{n}(X\theta - Y)^T(X\theta - Y) + \lambda \|\mathbf{\theta}\|_1
  • L2 regularization
    J(\theta) = \frac{1}{n}(X\theta - Y)^T(X\theta - Y) + \lambda\theta^T\theta

1.1.3 Objective:

Given a dataset with n samples and m features, we want to find the parameter vector \theta that minimizes the mean squared error (MSE) between predicted values \hat{Y} and actual values Y.
\theta = \arg\min_\theta{J(\theta)}

1.1.4 Solution:

  1. Gradient of \theta:

    • No regularization
      \frac{\partial J}{\partial \theta} = \frac{2}{n}X^T(X\theta - Y)
    • L1 regularization
      \frac{\partial J}{\partial \theta} = \frac{2}{n}X^T(X\theta - Y)+\lambda \text{sign}(\mathbf{\theta})
    • L2 regularization
      \frac{\partial J}{\partial \theta} = \frac{2}{n}X^T(X\theta - Y)+2\lambda\theta
  2. Closed-Form for \theta:

    • No regularization
      \begin{equation} \begin{split} & \frac{2}{n}X^T(X\theta - Y) = 0 \\ & X^TX\theta - X^TY = 0 \\ & \theta = (X^TX)^{-1}X^TY \end{split} \end{equation}
    • L1 regularization

    The OLS objective with an L1 penalty has no closed-form solution because the L1 norm is not differentiable at zero. It is instead minimized with iterative optimization algorithms such as coordinate descent or proximal gradient descent, which update the coefficients until convergence (a proximal-gradient sketch follows this list).

    • L2 regularization
      \begin{equation} \begin{split} & \frac{2}{n}X^T(X\theta - Y)+2\lambda\theta = 0 \\ & (X^TX+\lambda n I)\theta = X^TY \\ & \theta = (X^T X + \lambda n I)^{-1} X^TY \end{split} \end{equation}
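As a concrete illustration, here is a minimal NumPy sketch of these solutions on synthetic data; the variable names and the proximal-gradient (ISTA) loop used for the L1 case are illustrative choices, not part of the original derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 3                                   # n samples, m features
theta_true = np.array([1.0, 2.0, -1.5, 0.5])    # intercept + m weights
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])   # design matrix, n x (m+1)
Y = X @ theta_true + 0.1 * rng.normal(size=n)

lam = 0.1

# No regularization: theta = (X^T X)^{-1} X^T Y
theta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# L2 regularization (ridge): theta = (X^T X + lambda * n * I)^{-1} X^T Y
theta_ridge = np.linalg.solve(X.T @ X + lam * n * np.eye(m + 1), X.T @ Y)

# L1 regularization (lasso): no closed form, use proximal gradient descent (ISTA)
def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

L = 2.0 / n * np.linalg.norm(X.T @ X, 2)        # Lipschitz constant of the MSE gradient
step = 1.0 / L
theta_lasso = np.zeros(m + 1)
for _ in range(5000):
    grad = 2.0 / n * X.T @ (X @ theta_lasso - Y)        # gradient of the smooth MSE part
    theta_lasso = soft_threshold(theta_lasso - step * grad, step * lam)

print(theta_ols, theta_ridge, theta_lasso)
```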

1.2 Maximum Likelihood Estimation Perspective

1.2.1 Model Representation:

The linear regression model is represented as:
y^{(i)} = x^{(i)T}\theta + \epsilon
where:

  • x^{(i)} is the ith example, a vector with dimensions (m+1) \times 1 (including the intercept term).
  • \theta is the parameter vector with dimensions (m+1) \times 1.
  • \epsilon is the error term, assumed to be normally distributed with mean zero and variance \sigma^2, that is \epsilon\sim{N(0,\sigma^2)}.
  • y^{(i)}|x^{(i)},\theta\sim{N(x^{(i)T}\theta,\sigma^2)}.

1.2.2 Likelihood and Posterior Functions:

Under the assumption of Gaussian errors, the likelihood function of the observed data is given by the product of the probability density functions of the individual observations:

  • No regularization
    \begin{equation} \begin{split} P(X,Y|\theta) &= P(Y|X,\theta)*P(X|\theta) \\ &\propto P(Y|X,\theta) \\ & = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y^{(i)} - x^{(i)T}\theta)^2}{2\sigma^2}\right) \\ & = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp{\left(-\frac{\sum_{i=1}^{n} (y^{(i)} - x^{(i)T}\theta)^2}{2\sigma^2}\right)} \\ & = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp{\left(-\frac{(X\theta - Y)^T(X\theta - Y)}{2\sigma^2}\right)} \end{split} \end{equation}
  • L1 regularization

We assume each component of \theta independently follows a Laplace(0,b) distribution, so the prior probability of \theta is
P(\theta) = \frac{1}{(2b)^{m+1}}\exp{\left(-\frac{\|\theta\|_1}{b}\right)}
The posterior probability of \theta is
\begin{equation} \begin{split} P(\theta|X,Y) &= \frac{P(X,Y,\theta)}{P(X,Y)} \\ & \propto P(X,Y,\theta) \\ & = P(Y|X,\theta)*P(X|\theta)*P(\theta) \\ & \propto P(Y|X,\theta)*P(\theta) \\ & \propto \exp{\left(-\frac{(X\theta - Y)^T(X\theta - Y)}{2\sigma^2}\right)} * \exp{\left(-\frac{\|\theta\|_1}{b}\right)} \\ & = \exp{\left(-\frac{(X\theta - Y)^T(X\theta - Y)+\frac{2\sigma^2}{b}\|\theta\|_1}{2\sigma^2}\right)}\\ \end{split} \end{equation}

  • L2 regularization

We assume \theta\sim{N(0,\Sigma)}, so the prior probability of \theta is
P(\theta) = \frac{1}{(2\pi)^{\frac{m+1}{2}}\left|\Sigma\right|^\frac{1}{2}}\exp{\left(-\frac{\theta^T\Sigma^{-1}\theta}{2}\right)}
The posterior probability of \theta is
\begin{equation} \begin{split} P(\theta|X,Y) & \propto P(Y|X,\theta)*P(\theta) \\ & \propto \exp{\left(-\frac{(X\theta - Y)^T(X\theta - Y)}{2\sigma^2}\right)} * \exp{\left(-\frac{\theta^T\Sigma^{-1}\theta}{2}\right)} \\ &= \exp{\left(-\frac{(X\theta - Y)^T(X\theta - Y)+\sigma^2\theta^T\Sigma^{-1}\theta}{2\sigma^2}\right)}\\ \end{split} \end{equation}
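Taking the negative logarithm of these expressions makes the link to the loss functions of section 1.1.2 explicit. For the unregularized likelihood, for example,

-\log P(Y|X,\theta) = \frac{1}{2\sigma^2}(X\theta - Y)^T(X\theta - Y) + \frac{n}{2}\log(2\pi\sigma^2)

which, up to additive constants and a positive scale factor that do not depend on \theta, is the MSE loss; applying the same step to the two posteriors yields the penalty terms \frac{2\sigma^2}{b}\|\theta\|_1 and \sigma^2\theta^T\Sigma^{-1}\theta used in the next subsection.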

1.2.3 Objective:

We seek the parameter vector \theta that maximizes the likelihood (or the posterior) of the observed data:

  • No regularization
    \begin{equation} \begin{split} \theta = \arg\max_\theta{P(X,Y|\theta)}&=\arg\min_\theta{(X\theta - Y)^T(X\theta - Y)}\\ &= \arg\min_\theta{J(\theta)} \end{split} \end{equation}
  • L1 regularization
    \begin{equation} \begin{split} \theta = \arg\max_\theta{P(\theta|X,Y)}&=\arg\min_\theta{(X\theta - Y)^T(X\theta - Y)}+\frac{2\sigma^2}{b}\|\theta\|_1 \\ &=\arg\min_\theta{(X\theta - Y)^T(X\theta - Y)}+\lambda\|\theta\|_1 \\ &= \arg\min_\theta{J(\theta)} \end{split} \end{equation}
    where \lambda = \frac{2\sigma^2}{b}.
  • L2 regularization
    \begin{equation} \begin{split} \theta = \arg\max_\theta{P(\theta|X,Y)} &=\arg\min_\theta{(X\theta - Y)^T(X\theta - Y)}+\sigma^2\theta^T\Sigma^{-1}\theta\\ &=\arg\min_\theta{(X\theta - Y)^T(X\theta - Y)}+\lambda\theta^T\theta\\ &= \arg\min_\theta{J(\theta)} \end{split} \end{equation}
    where the second step assumes an isotropic prior \Sigma = \frac{\sigma^2}{\lambda}I.
    Although the objective functions are derived differently, they take the same form as in the OLS perspective.
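A quick numeric check of this equivalence, assuming a hypothetical isotropic prior \Sigma = \tau^2 I: the MAP estimate coincides with the ridge closed form from section 1.1.4 when \lambda n = \sigma^2/\tau^2. All values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 4
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])   # design matrix with intercept
Y = X @ rng.normal(size=m + 1) + 0.3 * rng.normal(size=n)

sigma2 = 0.3 ** 2            # noise variance (assumed known here)
tau2 = 2.0                   # prior variance, Sigma = tau^2 * I
lam = sigma2 / (tau2 * n)    # matching ridge strength: lambda * n = sigma^2 / tau^2

# MAP estimate: argmin ||X theta - Y||^2 + (sigma^2 / tau^2) theta^T theta
theta_map = np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(m + 1), X.T @ Y)
# Ridge closed form from section 1.1.4
theta_ridge = np.linalg.solve(X.T @ X + lam * n * np.eye(m + 1), X.T @ Y)

print(np.allclose(theta_map, theta_ridge))   # True: MAP with a Gaussian prior == ridge
```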

1.3 Column Space Perspective


If we view X\theta as a vector in the column space of X, then X\theta - Y is the residual vector, and (X\theta - Y)^T(X\theta - Y) is the squared magnitude of that residual. Minimizing the MSE is therefore equivalent to finding the shortest residual vector, which must be orthogonal to the column space of X, that is:

\begin{equation} \begin{split} & X^T(X\theta - Y) = 0 \\ & X^TX\theta =X^TY \\ & \theta = (X^TX)^{-1}X^TY \end{split} \end{equation}
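This orthogonality is easy to verify numerically; a small sketch on synthetic data (variable names illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.hstack([np.ones((30, 1)), rng.normal(size=(30, 2))])
Y = rng.normal(size=30)

theta = np.linalg.solve(X.T @ X, X.T @ Y)   # projects Y onto the column space of X
residual = X @ theta - Y

# The residual is orthogonal to every column of X (up to floating-point error)
print(np.allclose(X.T @ residual, 0.0, atol=1e-8))   # True
```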

2. Bayesian Linear Regression

Bayesian regression is a probabilistic approach to regression analysis that incorporates prior knowledge about the model parameters into the modeling process. Unlike classical regression methods, Bayesian regression provides a framework for modeling uncertainty in the parameters and making probabilistic predictions.

2.1 Model Representation

In Bayesian regression, we model the relationship between the independent variables X and the dependent variable Y using a probabilistic model. The key components of the model are:

  • Likelihood: This represents the conditional distribution of the dependent variable Y given the independent variables X and the model parameters \theta. It is typically assumed to be Gaussian:

p(Y|X, \theta) = \mathcal{N}(X\theta, \sigma^2 I)

  • Prior: This represents our beliefs about the distribution of the model parameters \theta before observing any data. It is specified as a probability distribution, often chosen to be a Gaussian distribution:

p(\theta) = \mathcal{N}(\mu, \Sigma)

  • Posterior: This represents the updated beliefs about the parameters after observing the data X and Y. According to Bayes’ theorem, the posterior is proportional to the likelihood times the prior:

p(\theta|X, Y) \propto p(Y|X, \theta) \cdot p(\theta)

2.2 Model Learning

The goal of learning in Bayesian regression is to estimate the posterior distribution of the model parameters \theta given the observed data X and Y. This involves updating the prior distribution using Bayes’ theorem to obtain the posterior distribution.

Following the model above, assume the prior distribution is a multivariate Gaussian N(\mu, \Sigma), where p = m+1 is the number of parameters; hence:

p(\theta) = \frac{1}{\sqrt{(2\pi)^p \lvert \Sigma \rvert}} \exp\left(-\frac{1}{2}(\theta - \mu)^T \Sigma^{-1} (\theta - \mu)\right)

The likelihood function is given by:

p(Y|X, \theta) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}(Y - X\theta)^T(Y - X\theta)\right)

Therefore, the expression for the posterior probability is:

\begin{equation} \begin{split} p(\theta|X, Y) & \propto \exp\left(-\frac{1}{2}\left[\sigma^{-2}(Y - X\theta)^T(Y - X\theta) + (\theta - \mu)^T \Sigma^{-1} (\theta - \mu)\right]\right) \\ & \propto \exp(-\frac{1}{2}[\theta^T(\frac{X^TX}{\sigma^2}+\Sigma^{-1})\theta-2\theta^T(\frac{X^TY}{\sigma^2}+\Sigma^{-1}\mu)]) \end{split} \end{equation}

As both the prior distribution p(\theta) and the likelihood function p(Y|X, \theta) are Gaussian, the posterior is also a Gaussian distribution, so we can compute its mean and covariance. Writing the posterior as a multivariate Gaussian distribution N(\mu_{\text{post}}, \Sigma_{\text{post}}), we have:

\begin{equation} \begin{split} p(\theta|X, Y) &= \frac{1}{\sqrt{(2\pi)^p \lvert \Sigma_{\text{post}} \rvert}} \exp\left(-\frac{1}{2}(\theta - \mu_{\text{post}})^T \Sigma_{\text{post}}^{-1} (\theta - \mu_{\text{post}})\right) \\ &= \frac{1}{\sqrt{(2\pi)^p \lvert \Sigma_{\text{post}} \rvert}} \exp\left(-\frac{1}{2}[\theta^T\Sigma_{\text{post}}^{-1}\theta-2\theta^T\Sigma_{\text{post}}^{-1}\mu_{\text{post}}+\mu_{\text{post}}^T\Sigma_{\text{post}}^{-1}\mu_{\text{post}}]\right) \\ & \propto \exp(-\frac{1}{2}[\theta^T(\frac{X^TX}{\sigma^2}+\Sigma^{-1})\theta-2\theta^T(\frac{X^TY}{\sigma^2}+\Sigma^{-1}\mu)]) \end{split} \end{equation}
Matching the quadratic and linear terms in \theta, the covariance matrix \Sigma_{\text{post}} of the posterior distribution is:

\Sigma_{\text{post}} = (\frac{X^TX}{\sigma^2}+\Sigma^{-1})^{-1}

The mean vector \mu_{\text{post}} of the posterior distribution is:

\mu_{\text{post}} = \Sigma_{\text{post}}(\frac{X^TY}{\sigma^2}+\Sigma^{-1}\mu)

That is

p(\theta|X, Y) \sim N(\Sigma_{\text{post}}(\frac{X^TY}{\sigma^2}+\Sigma^{-1}\mu),(\frac{X^TX}{\sigma^2}+\Sigma^{-1})^{-1})
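A minimal NumPy sketch of this posterior update, assuming a known noise variance \sigma^2 and illustrative prior values:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 3                                    # p = m + 1 parameters
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p - 1))])
theta_true = np.array([0.5, -1.0, 2.0])
sigma2 = 0.25
Y = X @ theta_true + np.sqrt(sigma2) * rng.normal(size=n)

mu_prior = np.zeros(p)                          # prior mean
Sigma_prior = 10.0 * np.eye(p)                  # broad prior covariance

# Posterior covariance: (X^T X / sigma^2 + Sigma^{-1})^{-1}
Sigma_post = np.linalg.inv(X.T @ X / sigma2 + np.linalg.inv(Sigma_prior))
# Posterior mean: Sigma_post (X^T Y / sigma^2 + Sigma^{-1} mu)
mu_post = Sigma_post @ (X.T @ Y / sigma2 + np.linalg.inv(Sigma_prior) @ mu_prior)

print(mu_post)        # close to theta_true once there is enough data
```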

2.3 Model Inference

Prediction in Bayesian linear regression means obtaining the predictive distribution of the target value y_* for a given input feature vector x_*, which can be derived from the posterior distribution of the parameters. Since the predictive distribution is Gaussian, it can be written as:

p(y_* | x_*, X, Y) = \mathcal{N}(\mu_*, \sigma_*^2)

\mu_* is the mean of the predicted target value y_*, and \sigma_*^2 is the variance of the predicted target value y_*, which can be calculated using the following formulas:

\begin{equation} \begin{split} \mu_* &= \mathbb{E}[y_*] \\ &= \mathbb{E}[x_*^T\theta_{\text{post}} + \epsilon] \\ &=x_*^T \mathbb{E}[\theta_{\text{post}}] \\ &=x_*^T\mu_{\text{post}} \end{split} \end{equation}

\begin{equation} \begin{split} \sigma_*^2 &= \text{Var}[y_*] \\ &= \text{Var}[x_*^T\theta_{\text{post}} + \epsilon] \\ &= \text{Var}[x_*^T\theta_{\text{post}}] + \text{Var}[\epsilon] \\ &= x_*^T\text{Var}[\theta_{\text{post}}]x_* + \sigma^2 \\ &= \sigma^2 + x_*^T \Sigma_{\text{post}} x_* \end{split} \end{equation}

where:

  • x_* is the new input feature vector.
  • \mu_{\text{post}} is the expected value of the posterior probability distribution of the parameter vector.
  • \Sigma_{\text{post}} is the variance-covariance matrix of the posterior probability distribution of the parameter vector.
  • \epsilon is the error term assumed to be normally distributed with mean zero and variance \sigma^2, that is \epsilon\sim{N(0,\sigma^2)}.

Thus, we obtain the mean \mu_* and variance \sigma_*^2 of the predictive distribution, and consequently, the probability density function of the predictive distribution.
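Continuing the sketch above (it reuses mu_post, Sigma_post, and sigma2 from the previous block), the predictive mean and variance for a hypothetical new input x_star follow directly:

```python
# Reuses mu_post, Sigma_post, sigma2 from the posterior sketch in section 2.2.
x_star = np.array([1.0, 0.3, -0.7])               # illustrative new input (intercept term first)

mu_star = x_star @ mu_post                         # predictive mean: x_*^T mu_post
var_star = sigma2 + x_star @ Sigma_post @ x_star   # predictive variance: sigma^2 + x_*^T Sigma_post x_*

print(mu_star, var_star)
```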

3. Gaussian Process Regression

3.1 Weight Space Perspective

In Gaussian Process Regression (GPR), we start from a Bayesian linear regression model whose input \mathbf{x} is mapped into a high-dimensional feature space by a feature mapping function \phi(\mathbf{x}). The model can be expressed as follows:

\mathbf{y} = \phi(\mathbf{x})^T\theta + \epsilon

where:

  • \phi(\mathbf{x}) is the feature mapping function
  • \epsilon is assumed to be Gaussian noise with mean 0 and variance \sigma^2.
  • \theta\sim N(0, \Sigma)

Applying the results of section 2.2 with prior mean \mu = 0, the covariance matrix \Sigma_{\text{post}} of the posterior distribution is:

\Sigma_{\text{post}} = (\frac{\phi(X)^T\phi(X)}{\sigma^2}+\Sigma^{-1})^{-1}

The mean vector \mu_{\text{post}} of the posterior distribution is:

\mu_{\text{post}} = \Sigma_{\text{post}}\frac{\phi(X)^TY}{\sigma^2}

By the Woodbury formula
(A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}
we have

\begin{equation} \begin{split} \Sigma_{\text{post}} &= (\Sigma^{-1}+\phi(X)^T\sigma^{-2}\phi(X))^{-1} \\ &=\Sigma -\Sigma\phi(X)^T(\sigma^2I+\phi(X)\Sigma\phi(X)^T)^{-1}\phi(X)\Sigma \end{split} \end{equation}

\begin{equation} \begin{split} \mu_{\text{post}} &= \Sigma_{\text{post}}\frac{\phi(X)^TY}{\sigma^2} \\ &= \Sigma\frac{\phi(X)^TY}{\sigma^2} - \Sigma\phi(X)^T(\sigma^2I+\phi(X)\Sigma\phi(X)^T)^{-1}\phi(X)\Sigma\frac{\phi(X)^TY}{\sigma^2} \\ \end{split} \end{equation}
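Before continuing, a quick numeric sanity check of this Woodbury expansion of \Sigma_{\text{post}}; a random matrix Phi stands in for \phi(X), and all values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 20, 5                         # n samples, d features after the mapping phi
Phi = rng.normal(size=(n, d))        # plays the role of phi(X)
Sigma = 2.0 * np.eye(d)              # prior covariance of theta
sigma2 = 0.5                         # noise variance

# Direct form: (phi(X)^T phi(X) / sigma^2 + Sigma^{-1})^{-1}
direct = np.linalg.inv(Phi.T @ Phi / sigma2 + np.linalg.inv(Sigma))
# Woodbury-expanded form: Sigma - Sigma phi(X)^T (sigma^2 I + phi(X) Sigma phi(X)^T)^{-1} phi(X) Sigma
woodbury = Sigma - Sigma @ Phi.T @ np.linalg.inv(sigma2 * np.eye(n) + Phi @ Sigma @ Phi.T) @ Phi @ Sigma

print(np.allclose(direct, woodbury))   # True
```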

So the mean and variance of the predicted target value \mathbf{y_*} when given a new input feature vector \mathbf{x_*} can be calculated using the following formulas:

\begin{equation} \begin{split} \mu_* &= \phi(\mathbf{x_*})^T\mu_{\text{post}} \\ &= \phi(\mathbf{x_*})^T\Sigma\frac{\phi(X)^TY}{\sigma^2} -\phi(\mathbf{x_*})^T\Sigma\phi(X)^T(\sigma^2I+\phi(X)\Sigma\phi(X)^T)^{-1}\phi(X)\Sigma\frac{\phi(X)^TY}{\sigma^2} \end{split} \end{equation}

\begin{equation} \begin{split} \sigma_*^2 &= \sigma^2 + \phi(\mathbf{x_*})^T \Sigma_{\text{post}}\phi(\mathbf{x_*}) \\ &= \sigma^2 + \phi(\mathbf{x_*})^T\Sigma\phi(\mathbf{x_*}) - \phi(\mathbf{x_*})^T\Sigma\phi(X)^T(\sigma^2I+\phi(X)\Sigma\phi(X)^T)^{-1}\phi(X)\Sigma\phi(\mathbf{x_*}) \end{split} \end{equation}

Since \Sigma is a symmetric (positive semi-definite) matrix, \phi(\mathbf{x})^T\Sigma\phi(\mathbf{x'}) = \phi(\mathbf{x})^T\Sigma^\frac{1}{2}\Sigma^\frac{1}{2}\phi(\mathbf{x'}), so \phi(\mathbf{x})^T\Sigma\phi(\mathbf{x'}) can be written as

\begin{equation} \begin{split}\phi(\mathbf{x})^T\Sigma\phi(\mathbf{x'}) &=(\Sigma^\frac{1}{2}\phi(\mathbf{x}))^T(\Sigma^\frac{1}{2}\phi(\mathbf{x'})) \\ &= \Psi(\mathbf{x})^T\Psi(\mathbf{x'}) \\ &= K(\mathbf{x},\mathbf{x'}) \end{split} \end{equation}

where

  • \Psi(\mathbf{x}) is the linearly transformed feature map, \Psi(\mathbf{x}) = \Sigma^\frac{1}{2}\phi(\mathbf{x})
  • K(\mathbf{x}, \mathbf{x}') is the covariance function (kernel).

The mean and variance of \mathbf{y_*} can then be simplified to

\begin{equation} \begin{split} \mu_* &= \phi(\mathbf{x_*})^T\Sigma\frac{\phi(X)^TY}{\sigma^2} -\phi(\mathbf{x_*})^T\Sigma\phi(X)^T(\sigma^2I+\phi(X)\Sigma\phi(X)^T)^{-1}\phi(X)\Sigma\frac{\phi(X)^TY}{\sigma^2} \\ &= \sigma^{-2}K(\mathbf{x_*},X)(I-(\sigma^2I+K(X,X))^{-1}K(X,X))Y \end{split} \end{equation}

\begin{equation} \begin{split} \sigma_*^2 &= \sigma^2 + \phi(\mathbf{x_*})^T\Sigma\phi(\mathbf{x_*}) - \phi(\mathbf{x_*})^T\Sigma\phi(X)^T(\sigma^2I+\phi(X)\Sigma\phi(X)^T)^{-1}\phi(X)\Sigma\phi(\mathbf{x_*}) \\ &= \sigma^2 + K(\mathbf{x_*},\mathbf{x_*}) - K(\mathbf{x_*},X)(\sigma^2I+K(X,X))^{-1}K(X,\mathbf{x_*}) \end{split} \end{equation}

An alternative derivation of \mu_* is as follows. Since
\begin{equation} \begin{split} & \Sigma_{\text{post}} = (\frac{\phi(X)^T\phi(X)}{\sigma^2}+\Sigma^{-1})^{-1}\\ & \Sigma_{\text{post}}^{-1} = \frac{\phi(X)^T\phi(X)}{\sigma^2}+\Sigma^{-1} \\ & \Sigma_{\text{post}}^{-1}\Sigma = \frac{\phi(X)^T\phi(X)\Sigma}{\sigma^2}+I \\ & \Sigma_{\text{post}}^{-1}\Sigma\phi(X)^T = \frac{\phi(X)^T\phi(X)\Sigma\phi(X)^T}{\sigma^2}+\phi(X)^T \\ & \Sigma\phi(X)^T = \sigma^{-2}\Sigma_{\text{post}}\phi(X)^T[\phi(X)\Sigma\phi(X)^T+\sigma^2I] \\ & \sigma^{-2}\Sigma_{\text{post}}\phi(X)^T = \Sigma\phi(X)^T[\phi(X)\Sigma\phi(X)^T+\sigma^2I]^{-1} \\ \end{split} \end{equation}

So the mean vector \mu_* is:

\begin{equation} \begin{split} \mu_* &= \phi(\mathbf{x_*})^T\mu_{\text{post}} \\ &= \phi(\mathbf{x_*})^T\Sigma_{\text{post}}\frac{\phi(X)^TY}{\sigma^2} \\ &= \phi(\mathbf{x_*})^T\Sigma\phi(X)^T[\phi(X)\Sigma\phi(X)^T+\sigma^2I]^{-1}Y \\ &= K(\mathbf{x_*},X)[K(X,X)+\sigma^2I]^{-1}Y \end{split} \end{equation}
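The agreement between the two expressions for \mu_* (and the kernel-form variance) can be checked numerically with an explicit feature map; the polynomial map phi below is an arbitrary illustrative choice, not something fixed by the derivation.

```python
import numpy as np

rng = np.random.default_rng(5)

def phi(x):
    """Illustrative feature map: [1, x, x^2]."""
    return np.array([1.0, x, x ** 2])

n = 15
x_train = rng.uniform(-2, 2, size=n)
Phi = np.vstack([phi(x) for x in x_train])           # n x d feature matrix phi(X)
Sigma = np.eye(3)                                    # prior covariance of theta
sigma2 = 0.1
Y = np.sin(x_train) + np.sqrt(sigma2) * rng.normal(size=n)

x_star = 0.5
phi_star = phi(x_star)

K = Phi @ Sigma @ Phi.T                              # K(X, X)
k_star = phi_star @ Sigma @ Phi.T                    # K(x_*, X), a row vector
A = np.linalg.inv(sigma2 * np.eye(n) + K)

# Mean, first form: sigma^{-2} K(x_*, X)(I - (sigma^2 I + K)^{-1} K) Y
mu1 = (k_star @ (np.eye(n) - A @ K) @ Y) / sigma2
# Mean, second form: K(x_*, X)(K + sigma^2 I)^{-1} Y
mu2 = k_star @ A @ Y
# Variance: sigma^2 + K(x_*, x_*) - K(x_*, X)(sigma^2 I + K)^{-1} K(X, x_*)
var = sigma2 + phi_star @ Sigma @ phi_star - k_star @ A @ k_star

print(np.isclose(mu1, mu2), mu2, var)                # the two mean formulas coincide
```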

3.2 Function Space Perspective

In GPR, we are interested in modeling the underlying function f(\mathbf{x}) directly rather than the weights \theta. We assume that the function f(\mathbf{x}) itself follows a Gaussian process:

f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))

where:

  • m(\mathbf{x}) is the mean function,
  • k(\mathbf{x}, \mathbf{x}') is the covariance function (kernel).

We can treat the observed data and the output values of the point to be predicted as a joint Gaussian distribution and utilize the properties of joint probability for inference.

Assume we have a set of observed data \{(\mathbf{x_i}, \mathbf{y_i})\}_{i=1}^n, where \mathbf{x_i} is the input and \mathbf{y_i} is the corresponding output. Our goal is to model the unknown function f and, given a new input \mathbf{x_*}, to predict the corresponding output \mathbf{y_*}.

Firstly, we assume that the output values of the observed data \mathbf{y_i} are composed of the unknown function f(\mathbf{x_i}) and independent identically distributed Gaussian noise terms \epsilon_i:

\mathbf{y_i} = f(\mathbf{x_i}) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2)

We construct a joint Gaussian distribution for the observed data and the output values of the point to be predicted:

Y_\text{joint} = \begin{bmatrix} Y \\ \mathbf{y_*} \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}m(X) \\ m(\mathbf{x_*})\end{bmatrix}, \begin{bmatrix} K+\sigma^2 I & k_* \\ k_*^T & k(\mathbf{x_*}, \mathbf{x_*})+\sigma^2 \end{bmatrix}\right)

Here

  • K is the covariance matrix of the observed data
  • k_* is the covariance vector between the observed data and the point to be predicted,
  • k(\mathbf{x_*}, \mathbf{x_*}) is the self-covariance of the point to be predicted (the additional \sigma^2 in that block accounts for the observation noise on \mathbf{y_*}).
  • \sigma^2 is the variance of the noise.

We can then utilize the properties of joint Gaussian distribution, given the observed data Y, to compute the conditional probability distribution of the point to be predicted \mathbf{y_*}:

p(y_* | Y) = \mathcal{N}(\mu_*, \sigma_*^2)

where \mu_* and \sigma_*^2 are the mean and variance of the point to be predicted

Let A=\begin{bmatrix}p & I\end{bmatrix}; then B = AY_\text{joint} also follows a normal distribution, with variance:

\begin{equation} \begin{split} \text{Var}(B) &= \text{Var}(AY_\text{joint}) \\ &= A\text{Var}(Y_\text{joint})A^T \\ &= \begin{bmatrix}p & I\end{bmatrix}\begin{bmatrix} K+\sigma^2 I & k_* \\ k_*^T & k(\mathbf{x_*}, \mathbf{x_*})+\sigma^2 \end{bmatrix}\begin{bmatrix}p^T \\ I\end{bmatrix} \\ &= \begin{bmatrix}p(K+\sigma^2I)+k_*^T & pk_*+k(\mathbf{x_*}, \mathbf{x_*})+\sigma^2\end{bmatrix}\begin{bmatrix}p^T \\ I\end{bmatrix} \\ \end{split} \end{equation}

Now choose p so that p(K+\sigma^2I)+k_*^T=0, that is p=-k_*^T(K+\sigma^2I)^{-1}; this makes B uncorrelated with Y (and, since they are jointly Gaussian, independent of Y), so

A=\begin{bmatrix}-k_*^T(K+\sigma^2I)^{-1} & I\end{bmatrix}

\begin{equation} \begin{split} \text{Var}(B) &= \text{Var}(AY_\text{joint}) \\ &= A\text{Var}(Y_\text{joint})A^T \\ &= \begin{bmatrix}0 & -k_*^T(K+\sigma^2I)^{-1}k_*+k(\mathbf{x_*}, \mathbf{x_*})+\sigma^2\end{bmatrix}\begin{bmatrix}p^T \\ I\end{bmatrix} \\ &= k(\mathbf{x_*}, \mathbf{x_*})+\sigma^2-k_*^T(K+\sigma^2I)^{-1}k_* \end{split} \end{equation}

\begin{equation} \begin{split} \mathbb{E}[B] &= \mathbb{E}[AY_\text{joint}] \\ &= A\mathbb{E}[Y_\text{joint}] \\ &=\begin{bmatrix}-k_*^T(K+\sigma^2I)^{-1} & I\end{bmatrix} \begin{bmatrix}m(X) \\ m(\mathbf{x_*})\end{bmatrix} \\ &=-k_*^T(K+\sigma^2I)^{-1}m(X) + m(\mathbf{x_*}) \end{split} \end{equation}

\begin{equation} \begin{split} B &=AY_\text{joint} \\ &= \begin{bmatrix}-k_*^T(K+\sigma^2I)^{-1} & I\end{bmatrix} \begin{bmatrix}Y \\ y_*\end{bmatrix} \\ &=-k_*^T(K+\sigma^2I)^{-1}Y + y_* \end{split} \end{equation}

\begin{equation} \begin{split} y_*|Y &= B + k_*^T(K+\sigma^2I)^{-1}Y \\ \end{split} \end{equation}

\begin{equation} \begin{split} \mathbb{E}[y_*|Y] &= \mathbb{E}[B + k_*^T(K+\sigma^2I)^{-1}Y] \\ &= \mathbb{E}[B] + k_*^T(K+\sigma^2I)^{-1}Y \\ &= -k_*^T(K+\sigma^2I)^{-1}m(X) + m(\mathbf{x_*}) + k_*^T(K+\sigma^2I)^{-1}Y \\ &= m(\mathbf{x_*}) + k_*^T(K+\sigma^2I)^{-1}[Y-m(X)] \end{split} \end{equation}

\begin{equation} \begin{split} \text{Var}(y_*|Y) &= \text{Var}(B + k_*^T(K+\sigma^2I)^{-1}Y) \\ &= \text{Var}(B) \\ &= \sigma^2 + k(\mathbf{x_*}, \mathbf{x_*}) - k_*^T(K+\sigma^2I)^{-1}k_* \end{split} \end{equation}

So the mean and variance of the point to be predicted are:

  • \mu_* = m(\mathbf{x_*}) + k_*^T(K+\sigma^2I)^{-1}[Y-m(X)]
  • \sigma_*^2 = \sigma^2 + k(\mathbf{x_*}, \mathbf{x_*}) - k_*^T(K+\sigma^2I)^{-1}k_*

With a zero mean function m(\mathbf{x}) = 0, this is exactly the same result as the weight space perspective!
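A minimal function-space sketch, assuming a zero mean function m(\mathbf{x}) = 0 and an RBF (squared-exponential) kernel chosen purely for illustration:

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel k(a, b) = exp(-(a - b)^2 / (2 l^2)) for 1-D inputs."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * length_scale ** 2))

rng = np.random.default_rng(6)
n = 20
X_train = rng.uniform(-3, 3, size=n)
sigma2 = 0.05
Y = np.sin(X_train) + np.sqrt(sigma2) * rng.normal(size=n)

x_star = np.array([0.7])
K = rbf_kernel(X_train, X_train)                  # K(X, X)
k_star = rbf_kernel(x_star, X_train)              # k_*^T, shape (1, n)
k_ss = rbf_kernel(x_star, x_star)                 # k(x_*, x_*)

A = np.linalg.inv(K + sigma2 * np.eye(n))
mu_star = k_star @ A @ Y                          # mean with m(x) = 0
var_star = sigma2 + k_ss - k_star @ A @ k_star.T  # variance including observation noise

print(mu_star, var_star)
```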
