Different Perspectives of Linear Regression

narcissuskid
Published on 2024-03-18


1. Linear Regression

1.1 Ordinary Least Squares (OLS) Perspective

1.1.1 Model Representation:

The linear regression model is represented as:
\hat{Y} = X\theta
where:

  • X is the design matrix with dimensions n \times (m+1) (including the intercept term).
  • \theta is the parameter vector with dimensions (m+1) \times 1.

1.1.2 Loss Function:

The loss function, based on the mean squared error (MSE), is given by:

  • No regularization
    J(\theta) = \frac{1}{n}(X\theta - Y)^T(X\theta - Y)
  • L1 regularization
    J(\theta) = \frac{1}{n}(X\theta - Y)^T(X\theta - Y) + \lambda \|\mathbf{\theta}\|_1
  • L2 regularization
    J(\theta) = \frac{1}{n}(X\theta - Y)^T(X\theta - Y) + \lambda\theta^T\theta

1.1.3 Objective:

Given a dataset with n samples and m features, we want to find the parameter vector \theta that minimizes the mean squared error (MSE) between the predicted values \hat{Y} and the actual values Y.
\theta = \arg\min_\theta{J(\theta)}

1.1.4 Solution:

  1. Gradient of \theta:

    • No regularization
      \frac{\partial J}{\partial \theta} = \frac{2}{n}X^T(X\theta - Y)
    • L1 regularization
      \frac{\partial J}{\partial \theta} = \frac{2}{n}X^T(X\theta - Y)+\lambda \text{sign}(\mathbf{\theta})
    • L2 regularization
      \frac{\partial J}{\partial \theta} = \frac{2}{n}X^T(X\theta - Y)+2\lambda\theta
  2. Closed-Form for \theta:

    • No regularization
      \begin{equation} \begin{split} & \frac{2}{n}X^T(X\theta - Y) = 0 \\ & X^TX\theta - X^TY = 0 \\ & \theta = (X^TX)^{-1}X^TY \end{split} \end{equation}
    • L1 regularization

    OLS with an L1 penalty (the lasso) has no closed-form solution because the L1 norm is not differentiable at zero. Instead, it is solved with iterative optimization algorithms such as coordinate descent or proximal gradient descent, which update the coefficients until convergence (see the sketch after this list).

    • L2 regularization
      \begin{equation} \begin{split} & \frac{2}{n}X^T(X\theta - Y)+2\lambda\theta = 0 \\ & (X^TX+\lambda n I)\theta = X^TY \\ & \theta = (X^T X + \lambda n I)^{-1} X^TY \end{split} \end{equation}
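
To make these solutions concrete, here is a minimal NumPy sketch (the synthetic data, regularization strength, step size, and iteration count are illustrative assumptions, not part of the derivation above). It evaluates the closed-form OLS and ridge estimators and runs a proximal gradient (ISTA) loop for the L1 case:

```python
import numpy as np

# Minimal sketch: closed-form OLS / ridge and proximal gradient (ISTA) for the lasso.
rng = np.random.default_rng(0)
n, m = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])  # design matrix with intercept column
Y = X @ np.array([1.0, 2.0, -3.0, 0.0]) + 0.1 * rng.normal(size=n)
lam = 0.1  # illustrative regularization strength

# No regularization: theta = (X^T X)^{-1} X^T Y (np.linalg.solve avoids an explicit inverse)
theta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# L2 regularization: theta = (X^T X + lambda * n * I)^{-1} X^T Y
theta_ridge = np.linalg.solve(X.T @ X + lam * n * np.eye(m + 1), X.T @ Y)

# L1 regularization: no closed form; iterate the proximal gradient update
# theta <- soft_threshold(theta - eta * grad_MSE(theta), eta * lambda)
def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

theta_lasso = np.zeros(m + 1)
eta = 1.0 / np.linalg.eigvalsh(2.0 / n * X.T @ X).max()  # step size from the Lipschitz constant
for _ in range(5000):
    grad = 2.0 / n * X.T @ (X @ theta_lasso - Y)
    theta_lasso = soft_threshold(theta_lasso - eta * grad, eta * lam)

print(theta_ols, theta_ridge, theta_lasso, sep="\n")
```

Using np.linalg.solve rather than an explicit matrix inverse is the usual choice here, since it is more stable when X^TX is ill-conditioned.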

1.2 Maximum Likelihood Estimation Perspective

1.2.1 Model Representation:

The linear regression model is represented as:
y^{(i)} = x^{(i)T}\theta + \epsilon
where:

  • x^{(i)} is the i-th example, a vector with dimensions (m+1) \times 1 (including the intercept term).
  • \theta is the parameter vector with dimensions (m+1) \times 1.
  • \epsilon is the error term, assumed to be normally distributed with mean zero and variance \sigma^2, that is \epsilon\sim{N(0,\sigma^2)}.
  • y^{(i)}|x^{(i)},\theta\sim{N(x^{(i)T}\theta,\sigma^2)}.

1.2.2 Likelihood and Posterior Functions:

Under the assumption of Gaussian errors, the likelihood function of the observed data is given by the product of the probability density functions of the individual observations:

  • No regularization
    \begin{equation} \begin{split} P(X,Y|\theta) &= P(Y|X,\theta)*P(X|\theta) \\ &\propto P(Y|X,\theta) \\ & = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y^{(i)} - x^{(i)T}\theta)^2}{2\sigma^2}\right) \\ & = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp{\left(-\frac{\sum_{i=1}^{n} (y^{(i)} - x^{(i)T}\theta)^2}{2\sigma^2}\right)} \\ & = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp{\left(-\frac{(X\theta - Y)^T(X\theta - Y)}{2\sigma^2}\right)} \end{split} \end{equation}
  • L1 regularization

We assume each component of \theta independently follows a Laplace(0,b) distribution, so the prior density of \theta is
P(\theta) = \left(\frac{1}{2b}\right)^{m+1}\exp{\left(-\frac{\|\theta\|_1}{b}\right)}
The posterior probability of \theta is
\begin{equation} \begin{split} P(\theta|X,Y) &= \frac{P(X,Y,\theta)}{P(X,Y)} \\ & \propto P(X,Y,\theta) \\ & = P(Y|X,\theta)*P(X|\theta)*P(\theta) \\ & \propto P(Y|X,\theta)*P(\theta) \\ & = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp{\left(-\frac{(X\theta - Y)^T(X\theta - Y)}{2\sigma^2}\right)} * \left(\frac{1}{2b}\right)^{m+1}\exp{\left(-\frac{\|\theta\|_1}{b}\right)} \\ & \propto \exp{\left(-\frac{(X\theta - Y)^T(X\theta - Y)+\frac{2\sigma^2}{b}\|\theta\|_1}{2\sigma^2}\right)}\\ \end{split} \end{equation}

  • L2 regularization

We assume \theta\sim{N(0,\Sigma)}, so the prior density of \theta is
P(\theta) = \frac{1}{(2\pi)^{\frac{m+1}{2}}\left|\Sigma\right|^\frac{1}{2}}\exp{\left(-\frac{\theta^T\Sigma^{-1}\theta}{2}\right)}
The posterior probability of \theta is
\begin{equation} \begin{split} P(\theta|X,Y) & \propto P(Y|X,\theta)*P(\theta) \\ & = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp{\left(-\frac{(X\theta - Y)^T(X\theta - Y)}{2\sigma^2}\right)} * \frac{1}{(2\pi)^{\frac{m+1}{2}}\left|\Sigma\right|^\frac{1}{2}}\exp{\left(-\frac{\theta^T\Sigma^{-1}\theta}{2}\right)} \\ &\propto \exp{\left(-\frac{(X\theta - Y)^T(X\theta - Y)+\sigma^2\theta^T\Sigma^{-1}\theta}{2\sigma^2}\right)}\\ \end{split} \end{equation}

1.2.3 Objective:

We seek the parameter vector \theta that maximizes the likelihood (no regularization) or the posterior (with regularization) of the observed data:

  • No regularization
    \begin{equation} \begin{split} \theta = \arg\max_\theta{P(X,Y|\theta)}&=\arg\min_\theta{(X\theta - Y)^T(X\theta - Y)}\\ &= \arg\min_\theta{J(\theta)} \end{split} \end{equation}
  • L1 regularization
    \begin{equation} \begin{split} \theta = \arg\max_\theta{P(\theta|X,Y)}&=\arg\min_\theta{(X\theta - Y)^T(X\theta - Y)}+\frac{2\sigma^2}{b}\|\theta\|_1 \\ &=\arg\min_\theta{(X\theta - Y)^T(X\theta - Y)}+\lambda\|\theta\|_1 \\ &= \arg\min_\theta{J(\theta)} \end{split} \end{equation}
  • L2 regularization
    \begin{equation} \begin{split} \theta = \arg\max_\theta{P(\theta|X,Y)} &=\arg\min_\theta{(X\theta - Y)^T(X\theta - Y)}+\sigma^2\theta^T\Sigma^{-1}\theta\\ &=\arg\min_\theta{(X\theta - Y)^T(X\theta - Y)}+\lambda\theta^T\theta\\ &= \arg\min_\theta{J(\theta)} \end{split} \end{equation}
    Although the objective function is derived from a different starting point, it has the same form as the OLS objective, with \lambda determined by \sigma^2 and the prior parameters. The sketch after this list checks the L2 equivalence numerically.
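
As a quick numerical check of the L2 equivalence (assuming, for illustration, an isotropic prior \Sigma = \tau^2 I and synthetic data), the sketch below verifies that the MAP estimate coincides with the ridge closed form of section 1.1.4 when \lambda n = \sigma^2/\tau^2:

```python
import numpy as np

# MAP with a Gaussian prior theta ~ N(0, tau^2 I) versus the ridge closed form;
# sigma^2, tau^2, and the data below are made-up values for illustration.
rng = np.random.default_rng(1)
n, m = 50, 4
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])
Y = X @ rng.normal(size=m + 1) + 0.2 * rng.normal(size=n)
sigma2, tau2 = 0.04, 0.5

# MAP estimate: argmin (X theta - Y)^T (X theta - Y) + (sigma^2 / tau^2) theta^T theta
theta_map = np.linalg.solve(X.T @ X + sigma2 / tau2 * np.eye(m + 1), X.T @ Y)

# Ridge estimate with lambda chosen so that lambda * n = sigma^2 / tau^2
lam = sigma2 / (tau2 * n)
theta_ridge = np.linalg.solve(X.T @ X + lam * n * np.eye(m + 1), X.T @ Y)

print(np.allclose(theta_map, theta_ridge))  # True
```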

1.3 Column Space Perspective


If we view X\theta as a vector in the column space of X, then X\theta - Y is the residual vector, and (X\theta - Y)^T(X\theta - Y) is the squared magnitude of that residual. Minimizing the MSE is therefore equivalent to finding the shortest residual vector, which is the one orthogonal to the column space of X, that is:

\begin{equation} \begin{split} & X^T(X\theta - Y) = 0 \\ & X^TX\theta =X^TY \\ & \theta = (X^TX)^{-1}X^TY \end{split} \end{equation}
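
A small NumPy sketch (with synthetic data as an assumption) makes the orthogonality statement tangible: the least-squares residual has zero inner product with every column of X.

```python
import numpy as np

# The least-squares residual is orthogonal to the column space of X,
# i.e. X^T (X theta - Y) is numerically zero.
rng = np.random.default_rng(2)
n, m = 80, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])
Y = X @ rng.normal(size=m + 1) + 0.3 * rng.normal(size=n)

theta, *_ = np.linalg.lstsq(X, Y, rcond=None)  # numerically stable least-squares solve
residual = X @ theta - Y
print(np.allclose(X.T @ residual, 0.0, atol=1e-8))  # True: residual is orthogonal to the columns of X
```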

2. Bayesian Linear Regression

Bayesian regression is a probabilistic approach to regression analysis that incorporates prior knowledge about the model parameters into the modeling process. Unlike classical regression methods, Bayesian regression provides a framework for modeling uncertainty in the parameters and making probabilistic predictions.

2.1 Model Representation

In Bayesian regression, we model the relationship between the independent variables X and the dependent variable Y using a probabilistic model. The key components of the model are:

  • Likelihood: This represents the conditional distribution of the dependent variable Y given the independent variables X and the model parameters \theta. It is typically assumed to be Gaussian:

p(Y|X, \theta) = \mathcal{N}(X\theta, \sigma^2 I)

  • Prior: This represents our beliefs about the distribution of the model parameters \theta before observing any data. It is specified as a probability distribution, often chosen to be a Gaussian distribution:

p(\theta) = \mathcal{N}(\mu, \Sigma)

  • Posterior: This represents the updated beliefs about the parameters after observing the data X and Y. According to Bayes’ theorem, the posterior is proportional to the likelihood times the prior:

p(\theta|X, Y) \propto p(Y|X, \theta) \cdot p(\theta)

2.2 Model Learning

The goal of learning in Bayesian regression is to estimate the posterior distribution of the model parameters \theta given the observed data X and Y. This involves updating the prior distribution using Bayes’ theorem to obtain the posterior distribution.

Assume the prior distribution is a multivariate Gaussian distribution N(\mu, \Sigma); hence:

p(\theta) = \frac{1}{\sqrt{(2\pi)^p \lvert \Sigma \rvert}} \exp\left(-\frac{1}{2}(\theta - \mu)^T \Sigma^{-1} (\theta - \mu)\right)

The likelihood function is given by:

p(Y|X, \theta) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}(Y - X\theta)^T(Y - X\theta)\right)

Therefore, the expression for the posterior probability is:

\begin{equation} \begin{split} p(\theta|X, Y) & \propto \exp\left(-\frac{1}{2}\left[\sigma^{-2}(Y - X\theta)^T(Y - X\theta) + (\theta - \mu)^T \Sigma^{-1} (\theta - \mu)\right]\right) \\ & \propto \exp(-\frac{1}{2}[\theta^T(\frac{X^TX}{\sigma^2}+\Sigma^{-1})\theta-2\theta^T(\frac{X^TY}{\sigma^2}+\Sigma^{-1}\mu)]) \end{split} \end{equation}

As both the prior distribution p(\theta) and the likelihood function p(Y|X, \theta) are Gaussian, the posterior is also a Gaussian distribution, so we can compute its mean and covariance. Writing the posterior as a multivariate Gaussian distribution N(\mu_{\text{post}}, \Sigma_{\text{post}}), we have:

\begin{equation} \begin{split} p(\theta|X, Y) &= \frac{1}{\sqrt{(2\pi)^p \lvert \Sigma_{\text{post}} \rvert}} \exp\left(-\frac{1}{2}(\theta - \mu_{\text{post}})^T \Sigma_{\text{post}}^{-1} (\theta - \mu_{\text{post}})\right) \\ &= \frac{1}{\sqrt{(2\pi)^p \lvert \Sigma_{\text{post}} \rvert}} \exp\left(-\frac{1}{2}[\theta^T\Sigma_{\text{post}}^{-1}\theta-2\theta^T\Sigma_{\text{post}}^{-1}\mu_{\text{post}}+\mu_{\text{post}}^T\Sigma_{\text{post}}^{-1}\mu_{\text{post}}]\right) \\ & \propto \exp(-\frac{1}{2}[\theta^T(\frac{X^TX}{\sigma^2}+\Sigma^{-1})\theta-2\theta^T(\frac{X^TY}{\sigma^2}+\Sigma^{-1}\mu)]) \end{split} \end{equation}
Matching the quadratic terms, the covariance matrix \Sigma_{\text{post}} of the posterior distribution is:

\Sigma_{\text{post}} = (\frac{X^TX}{\sigma^2}+\Sigma^{-1})^{-1}

Matching the linear terms, the mean vector \mu_{\text{post}} of the posterior distribution is:

\mu_{\text{post}} = \Sigma_{\text{post}}(\frac{X^TY}{\sigma^2}+\Sigma^{-1}\mu)

That is

\theta|X, Y \sim N\left(\Sigma_{\text{post}}\left(\frac{X^TY}{\sigma^2}+\Sigma^{-1}\mu\right),\ \left(\frac{X^TX}{\sigma^2}+\Sigma^{-1}\right)^{-1}\right)
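
The sketch below (with an illustrative prior, noise level, and synthetic data) computes \Sigma_{\text{post}} and \mu_{\text{post}} directly from these formulas:

```python
import numpy as np

# Posterior update for Bayesian linear regression; the prior (mu, Sigma), noise
# variance sigma^2, and data are illustrative assumptions.
rng = np.random.default_rng(3)
n, p = 60, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p - 1))])
sigma2 = 0.25
Y = X @ np.array([0.5, -1.0, 2.0]) + np.sqrt(sigma2) * rng.normal(size=n)

mu = np.zeros(p)       # prior mean
Sigma = np.eye(p)      # prior covariance
Sigma_inv = np.linalg.inv(Sigma)

Sigma_post = np.linalg.inv(X.T @ X / sigma2 + Sigma_inv)    # (X^T X / sigma^2 + Sigma^{-1})^{-1}
mu_post = Sigma_post @ (X.T @ Y / sigma2 + Sigma_inv @ mu)  # Sigma_post (X^T Y / sigma^2 + Sigma^{-1} mu)

print(mu_post)      # shrunk toward the prior mean relative to the OLS estimate
print(Sigma_post)
```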

2.3 Model Inference

Prediction in Bayesian linear regression means obtaining the predictive distribution of the target value y_* for a new input feature vector x_*, which follows from the posterior distribution of \theta. Since the predictive distribution is again Gaussian, it can be written as:

p(y_* | x_*, X, Y) = \mathcal{N}(\mu_*, \sigma_*^2)

\mu_* is the mean of the predicted target value y_*, and \sigma_*^2 is the variance of the predicted target value y_*, which can be calculated using the following formulas:

\begin{equation} \begin{split} \mu_* &= \mathbb{E}[y_*] \\ &= \mathbb{E}[x_*^T\theta_{\text{post}} + \epsilon] \\ &=x_*^T \mathbb{E}[\theta_{\text{post}}] \\ &=x_*^T\mu_{\text{post}} \end{split} \end{equation}

\begin{equation} \begin{split} \sigma_*^2 &= \text{Var}[y_*] \\ &= \text{Var}[x_*^T\theta_{\text{post}} + \epsilon] \\ &= \text{Var}[x_*^T\theta_{\text{post}}] + \text{Var}[\epsilon] \\ &= x_*^T\text{Var}[\theta_{\text{post}}]x_* + \sigma^2 \\ &= \sigma^2 + x_*^T \Sigma_{\text{post}} x_* \end{split} \end{equation}

where:

  • x_* is the new input feature vector.
  • \mu_{\text{post}} is the expected value of the posterior probability distribution of the parameter vector.
  • \Sigma_{\text{post}} is the variance-covariance matrix of the posterior probability distribution of the parameter vector.
  • \epsilon is the error term assumed to be normally distributed with mean zero and variance \sigma^2, that is \epsilon\sim{N(0,\sigma^2)}.

Thus, we obtain the mean \mu_* and variance \sigma_*^2 of the predictive distribution, and consequently, the probability density function of the predictive distribution.
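
Continuing in the same spirit, the following sketch (again with illustrative data and a hypothetical test point x_*) evaluates the predictive mean and variance from the posterior:

```python
import numpy as np

# Predictive distribution for a new input x_*; mu_post and Sigma_post follow the
# formulas of section 2.2, and the data, prior, and x_* are illustrative assumptions.
rng = np.random.default_rng(4)
n, p = 60, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p - 1))])
sigma2 = 0.25
Y = X @ np.array([0.5, -1.0, 2.0]) + np.sqrt(sigma2) * rng.normal(size=n)

mu, Sigma = np.zeros(p), np.eye(p)
Sigma_post = np.linalg.inv(X.T @ X / sigma2 + np.linalg.inv(Sigma))
mu_post = Sigma_post @ (X.T @ Y / sigma2 + np.linalg.inv(Sigma) @ mu)

x_star = np.array([1.0, 0.3, -0.7])               # new input (with intercept term)
mu_star = x_star @ mu_post                        # predictive mean  x_*^T mu_post
var_star = sigma2 + x_star @ Sigma_post @ x_star  # predictive variance  sigma^2 + x_*^T Sigma_post x_*
print(mu_star, var_star)
```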

3. Gaussian Process Regression

3.1 Weight Space Perspective

In Gaussian Process Regression (GPR), we start from a Bayesian linear regression model whose input \mathbf{x} is mapped to a high-dimensional feature space by a feature mapping function \phi(\mathbf{x}); the model can be expressed as follows:

\mathbf{y} = \phi(\mathbf{x})^T\theta + \epsilon

where:

  • \phi(\mathbf{x}) is the feature mapping function
  • \epsilon is assumed to be Gaussian noise with mean 0 and variance \sigma^2.
  • \theta\sim N(0, \Sigma)

Applying the results of Section 2.2 with the design matrix replaced by \phi(X) and prior mean \mu = 0, the covariance matrix \Sigma_{\text{post}} of the posterior distribution is:

\Sigma_{\text{post}} = (\frac{\phi(X)^T\phi(X)}{\sigma^2}+\Sigma^{-1})^{-1}

The mean vector \mu_{\text{post}} of the posterior distribution is:

\mu_{\text{post}} = \Sigma_{\text{post}}\frac{\phi(X)^TY}{\sigma^2}

Recall the Woodbury formula:
(A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}
Applying it with A = \Sigma^{-1}, U = \phi(X)^T, C = \sigma^{-2}I, and V = \phi(X) gives

\begin{equation} \begin{split} \Sigma_{\text{post}} &= (\Sigma^{-1}+\phi(X)^T\sigma^{-2}\phi(X))^{-1} \\ &=\Sigma -\Sigma\phi(X)^T(\sigma^2I+\phi(X)\Sigma\phi(X)^T)^{-1}\phi(X)\Sigma \end{split} \end{equation}

\begin{equation} \begin{split} \mu_{\text{post}} &= \Sigma_{\text{post}}\frac{\phi(X)^TY}{\sigma^2} \\ &= \Sigma\frac{\phi(X)^TY}{\sigma^2} - \Sigma\phi(X)^T(\sigma^2I+\phi(X)\Sigma\phi(X)^T)^{-1}\phi(X)\Sigma\frac{\phi(X)^TY}{\sigma^2} \\ \end{split} \end{equation}

So the mean and variance of the predicted target value \mathbf{y_*} when given a new input feature vector \mathbf{x_*} can be calculated using the following formulas:

\begin{equation} \begin{split} \mu_* &= \phi(\mathbf{x_*})^T\mu_{\text{post}} \\ &= \phi(\mathbf{x_*})^T\Sigma\frac{\phi(X)^TY}{\sigma^2} -\phi(\mathbf{x_*})^T\Sigma\phi(X)^T(\sigma^2I+\phi(X)\Sigma\phi(X)^T)^{-1}\phi(X)\Sigma\frac{\phi(X)^TY}{\sigma^2} \end{split} \end{equation}

\begin{equation} \begin{split} \sigma_*^2 &= \sigma^2 + \phi(\mathbf{x_*})^T \Sigma_{\text{post}}\phi(\mathbf{x_*}) \\ &= \sigma^2 + \phi(\mathbf{x_*})^T\Sigma\phi(\mathbf{x_*}) - \phi(\mathbf{x_*})^T\Sigma\phi(X)^T(\sigma^2I+\phi(X)\Sigma\phi(X)^T)^{-1}\phi(X)\Sigma\phi(\mathbf{x_*}) \end{split} \end{equation}

As \Sigma is a symmetric positive definite matrix, it can be factored as \Sigma = \Sigma^\frac{1}{2}\Sigma^\frac{1}{2} with \Sigma^\frac{1}{2} symmetric, so \phi(\mathbf{x})^T\Sigma\phi(\mathbf{x'}) can be written as

\begin{equation} \begin{split}\phi(\mathbf{x})^T\Sigma\phi(\mathbf{x'}) &=(\Sigma^\frac{1}{2}\phi(\mathbf{x}))^T(\Sigma^\frac{1}{2}\phi(\mathbf{x'})) \\ &= \Psi(\mathbf{x})^T\Psi(\mathbf{x'}) \\ &= K(\mathbf{x},\mathbf{x'}) \end{split} \end{equation}

where

  • \Psi(\mathbf{x}) = \Sigma^\frac{1}{2}\phi(\mathbf{x}) is the linearly transformed feature map,
  • K(\mathbf{x}, \mathbf{x}') is the covariance function (kernel).

The mean and variance of \mathbf{y_*} can be simplified as

\begin{equation} \begin{split} \mu_* &= \phi(\mathbf{x_*})^T\Sigma\frac{\phi(X)^TY}{\sigma^2} -\phi(\mathbf{x_*})^T\Sigma\phi(X)^T(\sigma^2I+\phi(X)\Sigma\phi(X)^T)^{-1}\phi(X)\Sigma\frac{\phi(X)^TY}{\sigma^2} \\ &= \sigma^{-2}K(\mathbf{x_*},X)(I-(\sigma^2I+K(X,X))^{-1}K(X,X))Y \end{split} \end{equation}

\begin{equation} \begin{split} \sigma_*^2 &= \sigma^2 + \phi(\mathbf{x_*})^T\Sigma\phi(\mathbf{x_*}) - \phi(\mathbf{x_*})^T\Sigma\phi(X)^T(\sigma^2I+\phi(X)\Sigma\phi(X)^T)^{-1}\phi(X)\Sigma\phi(\mathbf{x_*}) \\ &= \sigma^2 + K(\mathbf{x_*},\mathbf{x_*}) - K(\mathbf{x_*},X)(\sigma^2I+K(X,X))^{-1}K(X,\mathbf{x_*}) \end{split} \end{equation}

An alternative way to obtain \mu_* is as follows.
Since
\begin{equation} \begin{split} & \Sigma_{\text{post}} = (\frac{\phi(X)^T\phi(X)}{\sigma^2}+\Sigma^{-1})^{-1}\\ & \Sigma_{\text{post}}^{-1} = \frac{\phi(X)^T\phi(X)}{\sigma^2}+\Sigma^{-1} \\ & \Sigma_{\text{post}}^{-1}\Sigma = \frac{\phi(X)^T\phi(X)\Sigma}{\sigma^2}+I \\ & \Sigma_{\text{post}}^{-1}\Sigma\phi(X)^T = \frac{\phi(X)^T\phi(X)\Sigma\phi(X)^T}{\sigma^2}+\phi(X)^T \\ & \Sigma\phi(X)^T = \sigma^{-2}\Sigma_{\text{post}}\phi(X)^T[\phi(X)\Sigma\phi(X)^T+\sigma^2I] \\ & \sigma^{-2}\Sigma_{\text{post}}\phi(X)^T = \Sigma\phi(X)^T[\phi(X)\Sigma\phi(X)^T+\sigma^2I]^{-1} \\ \end{split} \end{equation}

So the mean vector \mu_* is:

\begin{equation} \begin{split} \mu_* &= \phi(\mathbf{x_*})^T\mu_{\text{post}} \\ &= \phi(\mathbf{x_*})^T\Sigma_{\text{post}}\frac{\phi(X)^TY}{\sigma^2} \\ &= \phi(\mathbf{x_*})^T\Sigma\phi(X)^T[\phi(X)\Sigma\phi(X)^T+\sigma^2I]^{-1}Y \\ &= K(\mathbf{x_*},X)[K(X,X)+\sigma^2I]^{-1}Y \end{split} \end{equation}
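
The sketch below illustrates the weight-space view with a hypothetical polynomial feature map \phi(x) = [1, x, x^2] and prior covariance \Sigma = I (so that K(x, x') = \phi(x)^T\phi(x')); it checks numerically that the kernel-form mean and variance match the feature-space posterior computation:

```python
import numpy as np

# Weight-space GPR with an explicit feature map versus the kernel form;
# the feature map, data, and noise level are illustrative assumptions.
def phi(x):  # x: (n,) -> features: (n, 3)
    return np.stack([np.ones_like(x), x, x**2], axis=1)

rng = np.random.default_rng(5)
n, sigma2 = 30, 0.1
x = rng.uniform(-2, 2, size=n)
Y = 1.0 + 0.5 * x - 0.8 * x**2 + np.sqrt(sigma2) * rng.normal(size=n)
x_star = np.array([0.7])

Phi, Phi_star = phi(x), phi(x_star)
Sigma = np.eye(Phi.shape[1])

# Feature-space posterior (section 3.1, prior mean 0)
Sigma_post = np.linalg.inv(Phi.T @ Phi / sigma2 + np.linalg.inv(Sigma))
mu_post = Sigma_post @ Phi.T @ Y / sigma2
mu_w = Phi_star @ mu_post
var_w = sigma2 + Phi_star @ Sigma_post @ Phi_star.T

# Kernel form: mu_* = K(x_*, X)[K(X, X) + sigma^2 I]^{-1} Y
K = Phi @ Sigma @ Phi.T
K_star = Phi_star @ Sigma @ Phi.T
mu_k = K_star @ np.linalg.solve(K + sigma2 * np.eye(n), Y)
var_k = (sigma2 + Phi_star @ Sigma @ Phi_star.T
         - K_star @ np.linalg.solve(K + sigma2 * np.eye(n), K_star.T))

print(np.allclose(mu_w, mu_k), np.allclose(var_w, var_k))  # True True
```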

3.2 Function Space Perspective

In GPR, we are interested in modeling the underlying function f(\mathbf{x}) directly rather than the weights \theta. We assume that the function f(\mathbf{x}) itself follows a Gaussian process:

f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))

where:

  • m(\mathbf{x}) is the mean function,
  • k(\mathbf{x}, \mathbf{x}') is the covariance function (kernel).

We can treat the observed data and the output values of the point to be predicted as a joint Gaussian distribution and utilize the properties of joint probability for inference.

Assume we have a set of observed data \{(\mathbf{x_i}, \mathbf{y_i})\}_{i=1}^n, where \mathbf{x_i} is the input and \mathbf{y_i} is the corresponding output. Our goal is to infer the unknown function f, and, given a new input \mathbf{x_*}, to predict the corresponding output \mathbf{y_*}.

Firstly, we assume that the output values of the observed data \mathbf{y_i} are composed of the unknown function f(\mathbf{x_i}) and independent identically distributed Gaussian noise terms \epsilon_i:

\mathbf{y_i} = f(\mathbf{x_i}) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2)

We construct a joint Gaussian distribution for the observed data and the output values of the point to be predicted:

Y_\text{joint} = \begin{bmatrix} Y \\ \mathbf{y_*} \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}m(X) \\ m(\mathbf{x_*})\end{bmatrix}, \begin{bmatrix} K+\sigma^2 I & k_*^T \\ k_* & k(\mathbf{x_*}, \mathbf{x_*}) \end{bmatrix}\right)

Here

  • K = K(X, X) is the covariance matrix of the observed data,
  • k_* = K(\mathbf{x_*}, X) is the vector of covariances between the point to be predicted and the observed data,
  • k(\mathbf{x_*}, \mathbf{x_*}) is the self-covariance of the point to be predicted.
  • \sigma^2 is the variance of the noise.

We can then use the properties of the joint Gaussian distribution, given the observed data Y, to compute the conditional distribution of the point to be predicted \mathbf{y_*}:

p(y_* | Y) = \mathcal{N}(\mu_*, \sigma_*^2)

where \mu_* and \sigma_*^2 are the mean and variance of the point to be predicted

Let A=\begin{bmatrix}p & I\end{bmatrix}; then B = AY_\text{joint} also follows a normal distribution, with variance given below:

\begin{equation} \begin{split} \text{Var}(B) &= \text{Var}(AY_\text{joint}) \\ &= A\text{Var}(Y_\text{joint})A^T \\ &= \begin{bmatrix}p & I\end{bmatrix}\begin{bmatrix} K+\sigma^2 I & k_*^T \\ k_* & k(\mathbf{x_*}, \mathbf{x_*}) \end{bmatrix}\begin{bmatrix}p^T \\ I\end{bmatrix} \\ &= \begin{bmatrix}p(K+\sigma^2I)+k_* & pk_*^T+k(\mathbf{x_*}, \mathbf{x_*})\end{bmatrix}\begin{bmatrix}p^T \\ I\end{bmatrix} \\ \end{split} \end{equation}

Now choose p such that p(K+\sigma^2I)+k_*=0, which makes B uncorrelated with Y (and hence, by joint Gaussianity, independent of Y); that is, p=-k_*(K+\sigma^2I)^{-1}, so

A=\begin{bmatrix}-k_*(K+\sigma^2I)^{-1} & I\end{bmatrix}

\begin{equation} \begin{split} \text{Var}(B) &= \text{Var}(AY_\text{joint}) \\ &= A\text{Var}(Y_\text{joint})A^T \\ &= \begin{bmatrix}0 & -k_*(K+\sigma^2I)^{-1}k_*^T+k(\mathbf{x_*}, \mathbf{x_*})\end{bmatrix}\begin{bmatrix}p^T \\ I\end{bmatrix} \\ &= k(\mathbf{x_*}, \mathbf{x_*})-k_*(K+\sigma^2I)^{-1}k_*^T \end{split} \end{equation}

\begin{equation} \begin{split} \mathbb{E}[B] &= \mathbb{E}[AY_\text{joint}] \\ &= A\mathbb{E}[Y_\text{joint}] \\ &=\begin{bmatrix}-k_*(K+\sigma^2I)^{-1} & I\end{bmatrix} \begin{bmatrix}m(X) \\ m(\mathbf{x_*})\end{bmatrix} \\ &=-k_*(K+\sigma^2I)^{-1}m(X) + m(\mathbf{x_*}) \end{split} \end{equation}

\begin{equation} \begin{split} B &= AY_\text{joint} \\ &= \begin{bmatrix}-k_*(K+\sigma^2I)^{-1} & I\end{bmatrix} \begin{bmatrix}Y \\ y_*\end{bmatrix} \\ &=-k_*(K+\sigma^2I)^{-1}Y + y_* \end{split} \end{equation}

\begin{equation} \begin{split} y_*|Y &= B + k_*(K+\sigma^2I)^{-1}Y \\ \end{split} \end{equation}

\begin{equation} \begin{split} \mathbb{E}[y_*|Y] &= \mathbb{E}[B|Y] + k_*(K+\sigma^2I)^{-1}Y \\ &= \mathbb{E}[B] + k_*(K+\sigma^2I)^{-1}Y \\ &= -k_*(K+\sigma^2I)^{-1}m(X) + m(\mathbf{x_*}) + k_*(K+\sigma^2I)^{-1}Y \\ &= m(\mathbf{x_*}) + k_*(K+\sigma^2I)^{-1}[Y-m(X)] \end{split} \end{equation}

\begin{equation} \begin{split} \text{Var}(y_*|Y) &= \text{Var}(B + k_*(K+\sigma^2I)^{-1}Y \mid Y) \\ &= \text{Var}(B) \\ &= k(\mathbf{x_*}, \mathbf{x_*})-k_*(K+\sigma^2I)^{-1}k_*^T \end{split} \end{equation}

So the mean and variance of the point to be predicted are:

  • \mu_* = m(\mathbf{x_*}) + k_*(K+\sigma^2I)^{-1}[Y-m(X)]
  • \sigma_*^2 = k(\mathbf{x_*}, \mathbf{x_*})-k_*(K+\sigma^2I)^{-1}k_*^T

The mean is identical to the weight-space result (which used a zero prior mean, i.e. m = 0). The variance here is that of the latent function value f(\mathbf{x_*}); adding the observation noise \sigma^2 for a noisy test output recovers the weight-space predictive variance, so the two perspectives agree.
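
For completeness, here is a minimal function-space sketch using a hypothetical RBF kernel and a zero mean function (kernel choice, length scale, noise level, and data are illustrative assumptions); it evaluates \mu_* and the latent variance from the formulas above:

```python
import numpy as np

# Function-space GPR: condition the joint Gaussian on the observed data.
def rbf_kernel(a, b, length_scale=1.0):
    d2 = (a[:, None] - b[None, :]) ** 2          # pairwise squared distances (1-D inputs)
    return np.exp(-0.5 * d2 / length_scale**2)

rng = np.random.default_rng(6)
n, sigma2 = 40, 0.05
x = np.sort(rng.uniform(-3, 3, size=n))
Y = np.sin(x) + np.sqrt(sigma2) * rng.normal(size=n)
x_star = np.array([0.5, 1.5])

K = rbf_kernel(x, x)                              # K(X, X)
k_star = rbf_kernel(x_star, x)                    # k_* = K(x_*, X)
alpha = np.linalg.solve(K + sigma2 * np.eye(n), Y)

mu_star = k_star @ alpha                          # mean function m is zero here
cov_star = rbf_kernel(x_star, x_star) - k_star @ np.linalg.solve(K + sigma2 * np.eye(n), k_star.T)
var_star = np.diag(cov_star)                      # latent variance; add sigma2 for a noisy observation
print(mu_star, var_star)
```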
