Stochastic Processes


Definition of a Gaussian Process

Definition: if, for any set of times ti (i = 1, 2, …, n), the n random variables Xi = X(ti) (i = 1, 2, …, n) of the random process X(t) jointly follow an n-dimensional Gaussian distribution, then X(t) is called a Gaussian random process, or normal process.
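Written out, with μi = E[X(ti)], Σik = Cov(X(ti), X(tk)), x = (x1, …, xn) and μ = (μ1, …, μn), this means the joint density of (X(t1), …, X(tn)) has the standard multivariate Gaussian form

$$f(x_1, \ldots, x_n) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu) \right)$$

which is why the process is completely specified by its mean and covariance functions (the first property listed below).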

Properties of a Gaussian Process

- A Gaussian random process is completely determined by its mean and covariance functions.
- For a Gaussian random process, being uncorrelated at two different times ti, tk is equivalent to being independent at those times; i.e. if a stationary Gaussian process is uncorrelated at any two distinct times, its values at those times are also independent.
- For a Gaussian process, wide-sense stationarity implies strict stationarity; i.e. a wide-sense stationary Gaussian process is necessarily strictly stationary.
- A Gaussian random process (or variable) passed through a linear system (a linear transformation) remains Gaussian.
- A sum of Gaussian variables is still a Gaussian variable.
- The sum of a stationary Gaussian process and a deterministic time signal is also a Gaussian process, but unless the deterministic signal is constant in time, the sum is no longer stationary.
- If the integral of a Gaussian process exists, it is again a Gaussian random variable or Gaussian random process.
- The one-dimensional probability density of the derivative of a stationary Gaussian process is also Gaussian.
- The two-dimensional probability density of the derivative of a stationary Gaussian process is Gaussian, and the joint probability density of a stationary Gaussian process and its derivative is also Gaussian.
- For an n-dimensional Gaussian distribution, any k-dimensional (k < n) marginal distribution is still Gaussian.

Suppose we have a function f: R → R that we want to model. We have data x = [x1, …, xN]ᵀ, y = [y1, …, yN]ᵀ where yi = f(xi). We want to predict the value of f at some new, unobserved points x∗.

Modeling Functions using Gaussians

The key idea behind GPs is that a function can be modeled using an infinite dimensional multivariate Gaussian distribution. In other words, every point in the input space is associated with a random variable and the joint distribution of these is modeled as a multivariate Gaussian.

OK, so what does that mean and what does it actually look like? Well, let's start with a simpler case: a unit 2D Gaussian. How can we start to view this as a distribution over functions? Here's what we have:

$$\begin{pmatrix} y_0 \\ y_1 \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \right)$$

Normally this is visualised as a 3D bell curve with the probability density represented as height. But what if, instead of visualising the whole distribution, we just sample from it? Then we will have two values which we can plot on a graph. Let's do this 10 times, putting the first value at x = 0 and the second at x = 1 and then drawing a line between them.

import numpy as np
from bokeh.plotting import figure, show
from bokeh.palettes import Category10

def plot_unit_gaussian_samples(D):
    p = figure(plot_width=800, plot_height=500,
               title='Samples from a unit {}D Gaussian'.format(D))
    xs = np.linspace(0, 1, D)
    for color in Category10[10]:
        ys = np.random.multivariate_normal(np.zeros(D), np.eye(D))
        p.line(xs, ys, line_width=1, color=color)
    return p

show(plot_unit_gaussian_samples(2))

Looking at all these lines on a graph, it starts to look like we’ve just sampled 10 linear functions… What if we now use a 20-dimensional Gaussian, joining each of the sampled points in order?

show(plot_unit_gaussian_samples(20))  

These definitely look like functions of sorts but they look far too noisy to be useful for our purposes. Let’s think a bit more about what we want from these samples and how we might change the distribution to get better looking samples…

The multivariate Gaussian has two parameters, its mean and covariance matrix. If we changed the mean then we would only change the overall trend (i.e. if the mean was ascending integers, e.g. np.arange(D), then the samples would have an overall positive linear trend) but there would still be that jagged noisy shape. For this reason we tend to leave the GP mean as zero - GPs actually turn out to be powerful enough to model many functions without changing this.
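For instance, here is a small illustrative sketch of what happens if we sample with an np.arange(D) mean instead of a zero mean: the samples trend upwards but keep the same jagged shape.

D = 20
xs = np.linspace(0, 1, D)
p = figure(plot_width=800, plot_height=500,
           title='Samples from a {}D Gaussian with mean np.arange(D)'.format(D))
for color in Category10[10]:
    # Same identity covariance as before; only the mean has changed.
    ys = np.random.multivariate_normal(np.arange(D), np.eye(D))
    p.line(xs, ys, line_width=1, color=color)
show(p)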

Instead we want some notion of smoothness: i.e. if two input points are close to each other then we expect the value of the function at those points to be similar. In terms of our model: random variables corresponding to nearby points should have similar values when sampled under their joint distribution (i.e. high covariance).

The covariance of these points is defined in the covariance matrix of the Gaussian. Suppose we have an N-dimensional Gaussian modeling y0, …, yN−1, then the covariance matrix Σ is N×N and its (i, j)-th element is Σij = cov(yi, yj). In other words Σ is symmetric and stores the pairwise covariances of all the jointly modeled random variables.

Smoothing with Kernels

So how should we define our covariance function? This is where the vast literature on kernels comes in handy. For our purposes we will choose a squared exponential kernel which (in its simplest form) is defined by:

$$\kappa(x, x') = \exp\!\left( -\frac{(x - x')^2}{2} \right)$$

This function (which we plot in a moment) is 1 when x = x′ and tends to zero as its arguments drift apart.

def k(xs, ys, sigma=1, l=1):
    """Squared Exponential kernel as above but designed to return the whole
    covariance matrix - i.e. the pairwise covariance of the vectors xs & ys.
    Also with two parameters which are discussed at the end."""
    # Pairwise difference matrix.
    dx = np.expand_dims(xs, 1) - np.expand_dims(ys, 0)
    return (sigma ** 2) * np.exp(-((dx / l) ** 2) / 2)

def m(x):
    """The mean function. As discussed, we can let the mean always be zero."""
    return np.zeros_like(x)
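As a quick sanity check of the implementation (values rounded for readability):

# The kernel equals 1 when the two inputs coincide and decays as they separate.
# This prints approximately [[1.0, 0.8825, 0.0111]].
print(k(np.array([0.0]), np.array([0.0, 0.5, 3.0])))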

We can plot this kernel to show how it's maximised when x = x′ and then smoothly falls off as the two inputs start to differ.

from bokeh.models import LinearColorMapper, ColorBar, BasicTicker

N = 100
x = np.linspace(-2, 2, N)
y = np.linspace(-2, 2, N)
d = k(x, y)

color_mapper = LinearColorMapper(palette="Plasma256", low=0, high=1)

p = figure(plot_width=400, plot_height=400, x_range=(-2, 2), y_range=(-2, 2),
           title='Visualisation of k(x, x\')', x_axis_label='x',
           y_axis_label='x\'', toolbar_location=None)
p.image(image=[d], color_mapper=color_mapper, x=-2, y=-2, dw=4, dh=4)

color_bar = ColorBar(color_mapper=color_mapper, ticker=BasicTicker(),
                     label_standoff=12, border_line_color=None, location=(0, 0))
p.add_layout(color_bar, 'right')

show(p)

So, to get the sort of smoothness we want we will consider two random variables yi and yj plotted at xi and xj to have covariance cov(yi, yj) = κ(xi, xj) – the closer they are together the higher their covariance.

Using the kernel function from above we can get this matrix with k(xs, xs). Now let's try plotting another 10 samples from the 20D Gaussian, but this time with the new covariance matrix. When we do this we get:

p = figure(plot_width=800, plot_height=500)

D = 20
xs = np.linspace(0, 1, D)

for color in Category10[10]:
    ys = np.random.multivariate_normal(m(xs), k(xs, xs))
    p.circle(xs, ys, size=3, color=color)
    p.line(xs, ys, line_width=1, color=color)

show(p)

Now we have something that's starting to look like a distribution over (useful) functions! And we can see how, as the number of dimensions tends to infinity, we don't have to connect points any more because we will have a point for every possible choice of input.

Let’s use more dimensions and see what it looks like across a bigger range of inputs:

n = 100
xs = np.linspace(-5, 5, n)

K = k(xs, xs)
mu = m(xs)

p = figure(plot_width=800, plot_height=500)
for color in Category10[5]:
    ys = np.random.multivariate_normal(mu, K)
    p.line(xs, ys, line_width=2, color=color)

show(p)

Making Predictions using the Prior & Observations

Now that we have a distribution over functions, how can we use training data to model a hidden function so that we can make predictions?

First of all we need some training data! And to get that we are going to create our secret function f.

The Target Function f

For this intro we’ll use a 5th order polynomial:

$$f(x) = 0.03x^5 + 0.2x^4 - 0.1x^3 - 2.4x^2 - 2.5x + 6$$

I chose this because it has a nice wiggly graph but we could have chosen anything.

# coefs[i] is the coefficient of x^i
coefs = [6, -2.5, -2.4, -0.1, 0.2, 0.03]

def f(x):
    total = 0
    for exp, coef in enumerate(coefs):
        total += coef * (x ** exp)
    return total

xs = np.linspace(-5.0, 3.5, 100)
ys = f(xs)

p = figure(plot_width=800, plot_height=400, x_axis_label='x',
           y_axis_label='f(x)', title='The hidden function f(x)')
p.line(xs, ys, line_width=2)
show(p)

Getting into the Maths

Now we get to the heart of GPs. There's a bit more maths required, but it only consists of consolidating what we have so far and using one trick to condition our joint distribution on observed data.

So far we have a way to model p(y|x) using a multivariate normal:

$$p(y \mid x) = \mathcal{N}(y \mid m(x), K)$$

where K = κ(x, x) and m(x) = 0.

This is a prior distribution representing the kind of outputs y that we expect to see over some inputs x before we observe any data.

So we have some training data with inputs x, and outputs y = f(x). Now let's say we have some new points x∗ where we want to predict y∗ = f(x∗).

x_obs = np.array([-4, -1.5, 0, 1.5, 2.5, 2.7])
y_obs = f(x_obs)

x_s = np.linspace(-8, 7, 80)

Now, recalling the definition of a GP, we will model the joint distribution of all of y and y∗ as:

$$\begin{pmatrix} y \\ y_* \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} m(x) \\ m(x_*) \end{pmatrix}, \begin{pmatrix} K & K_* \\ K_*^T & K_{**} \end{pmatrix} \right)$$

where K = κ(x, x), K∗ = κ(x, x∗) and K∗∗ = κ(x∗, x∗). As before we are going to stick with a zero mean.
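To make this block structure concrete, the full joint covariance matrix can be assembled directly with np.block; this is purely illustrative and is not needed for the conditioning step that follows:

# Illustration: the joint covariance of (y, y_*) for the 6 observed points
# and 80 new points defined above is an 86 x 86 block matrix.
joint_cov = np.block([[k(x_obs, x_obs), k(x_obs, x_s)],
                      [k(x_obs, x_s).T, k(x_s, x_s)]])
print(joint_cov.shape)  # (86, 86)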

However, this is modeling p(y, y∗ | x, x∗) and we only want a distribution over y∗!

Conditioning Multivariate Gaussians

Rather than deriving it from scratch we can just make use of this standard result. If we have a joint distribution over y and y∗ as above, and we want to condition on the data we have for y, then we have the following:

$$\begin{aligned} p(y_* \mid x_*, x, y) &= \mathcal{N}(y_* \mid \mu_*, \Sigma_*) \\ \mu_* &= m(x_*) + K_*^T K^{-1} (y - m(x)) \\ \Sigma_* &= K_{**} - K_*^T K^{-1} K_* \end{aligned}$$

Now we have a posterior distribution over y∗ using a prior distribution and some observations!

NB: The code below would not be used in practice since K can often be poorly conditioned, so its inverse might be inaccurate. A better approach is covered in part II of this guide!

K = k(x_obs, x_obs)
K_s = k(x_obs, x_s)
K_ss = k(x_s, x_s)

K_sTKinv = np.matmul(K_s.T, np.linalg.pinv(K))

mu_s = m(x_s) + np.matmul(K_sTKinv, y_obs - m(x_obs))
Sigma_s = K_ss - np.matmul(K_sTKinv, K_s)
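As the note above mentions, a better approach is covered in part II; for reference, a common numerically friendlier alternative is to factorise K with a Cholesky decomposition and solve triangular systems rather than forming a pseudo-inverse. The sketch below is only illustrative (the jitter value is an arbitrary small constant) and is not necessarily the exact method used in part II:

# A numerically friendlier alternative (sketch): factorise K instead of
# inverting it. The small "jitter" on the diagonal keeps K positive definite.
jitter = 1e-10
L = np.linalg.cholesky(K + jitter * np.eye(len(x_obs)))

# alpha = K^-1 (y_obs - m(x_obs)), computed via two triangular solves.
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs - m(x_obs)))
mu_s_chol = m(x_s) + K_s.T @ alpha

# V = L^-1 K_s, so that V.T @ V == K_s.T @ K^-1 @ K_s.
V = np.linalg.solve(L, K_s)
Sigma_s_chol = K_ss - V.T @ V

Up to numerical error, mu_s_chol and Sigma_s_chol agree with mu_s and Sigma_s above.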

And that's it! We can now use these two parameters to draw samples from the conditional distribution. Here we plot them against the true function f(x) (the dashed black line). Since we are using a GP we also have uncertainty information in the form of the variance of each random variable. We know the variance of the i-th random variable will be Σ∗ii - in other words the variances are just the diagonal elements of Σ∗. Here we plot the samples with an uncertainty of ±2 standard deviations.

p = figure(plot_width=800, plot_height=600, y_range=(-7, 8))

y_true = f(x_s)
p.line(x_s, y_true, line_width=3, color='black', alpha=0.4,
       line_dash='dashed', legend='True f(x)')
p.cross(x_obs, y_obs, size=20, legend='Training data')

stds = np.sqrt(Sigma_s.diagonal())
err_xs = np.concatenate((x_s, np.flip(x_s, 0)))
err_ys = np.concatenate((mu_s + 2 * stds, np.flip(mu_s - 2 * stds, 0)))
p.patch(err_xs, err_ys, alpha=0.2, line_width=0, color='grey',
        legend='Uncertainty')

for color in Category10[3]:
    y_s = np.random.multivariate_normal(mu_s, Sigma_s)
    p.line(x_s, y_s, line_width=1, color=color)

p.line(x_s, mu_s, line_width=3, color='blue', alpha=0.4, legend='Mean')
show(p)

What Next? - GP Regression and Noisy Data

In practice we need to do a bit more work to get good predictions. You may have noticed that the kernel implementation contained two parameters σ and l. If you try changing those when sampling from the prior then you can see how σ changes the vertical variation and l changes the horizontal scale. So we would need to change these to reflect our prior belief about the hidden function f. For instance, if we expect f to have a much bigger range of outputs (for the domain we are interested in) then we would need to scale up σ accordingly (try scaling the return value of f by 100 to see what happens, then set sigma=100). In fact, as with anything that uses kernels, we might change our kernel entirely if we expect a different kind of function (e.g. a periodic function).
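To see this for yourself, here is a small sketch that draws prior samples for a few illustrative (sigma, l) settings, reusing the k and m functions defined earlier (the particular values are arbitrary):

# Larger sigma stretches the samples vertically; larger l makes them vary
# more slowly in x.
xs = np.linspace(-5, 5, 100)
p = figure(plot_width=800, plot_height=500,
           title='Prior samples for different kernel parameters')
for color, (sigma, l) in zip(Category10[3], [(1, 1), (3, 1), (1, 3)]):
    ys = np.random.multivariate_normal(m(xs), k(xs, xs, sigma=sigma, l=l))
    p.line(xs, ys, line_width=2, color=color,
           legend='sigma={}, l={}'.format(sigma, l))
show(p)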

Picking the kernel is up to a human expert but choosing the parameters can be done automatically by minimising a loss term. This is the realm of Gaussian process regression.
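As a taste of what such a loss can look like, a common choice is the negative log marginal likelihood of the observations. The sketch below assumes the zero-mean prior and squared exponential kernel defined earlier; it is only an illustration and not necessarily the formulation used in part II:

def neg_log_marginal_likelihood(sigma, l, x, y, jitter=1e-8):
    """Negative log marginal likelihood of y under a zero-mean GP prior
    with the squared exponential kernel k defined above."""
    K = k(x, x, sigma=sigma, l=l) + jitter * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # 0.5 * y^T K^-1 y + 0.5 * log|K| + (n/2) * log(2*pi)
    return (0.5 * y @ alpha
            + np.sum(np.log(np.diag(L)))
            + 0.5 * len(x) * np.log(2 * np.pi))

One could then, for example, compare neg_log_marginal_likelihood(1.0, 1.0, x_obs, y_obs) across different parameter settings, or minimise it numerically (e.g. with scipy.optimize.minimize).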

Finally we should consider how to handle noisy data - i.e. when we can't get perfect samples of the hidden function f. In this case we need to factor this uncertainty into the model to get better generalisation.

These two topics will be the focus of Introduction to Gaussian Processes - Part II.

Resources

- Machine Learning: A Probabilistic Perspective, Chapter 15, by Kevin P. Murphy
- Introduction to Gaussian processes on YouTube, by Nando de Freitas

