{"title": "Sparse Greedy Gaussian Process Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 619, "page_last": 625, "abstract": null, "full_text": "Sparse Greedy \n\nGaussian Process Regression \n\nAlex J. Smola\u00b7 \n\nRSISE and Department of Engineering \n\nPeter Bartlett \n\nRSISE \n\nAustralian National University \n\nAustralian National University \n\nCanberra, ACT, 0200 \nAlex.Smola@anu.edu.au \n\nCanberra, ACT, 0200 \n\nPeter.Bartlett@anu.edu.au \n\nAbstract \n\nWe present a simple sparse greedy technique to approximate the \nmaximum a posteriori estimate of Gaussian Processes with much \nimproved scaling behaviour in the sample size m. In particular, \ncomputational requirements are O(n2m), storage is O(nm), the \ncost for prediction is 0 ( n) and the cost to compute confidence \nbounds is O(nm), where n \u00ab: m. We show how to compute a \nstopping criterion, give bounds on the approximation error, and \nshow applications to large scale problems. \n\n1 \n\nIntroduction \n\nGaussian processes have become popular because they allow exact Bayesian analysis \nwith simple matrix manipulations, yet provide good performance. They share with \nSupport Vector machines and Regularization Networks the concept of regularization \nvia Reproducing Kernel Hilbert spaces [3], that is, they allow the direct specification \nof the smoothness properties of the class of functions under consideration. However, \nGaussian processes are not always the method of choice for large datasets, since they \ninvolve evaluations of the covariance function at m points (where m denotes the \nsample size) in order to carry out inference at a single additional point. This may \nbe rather costly to implement -\npractitioners prefer to use only a small number of \nbasis functions (Le. covariance function evaluations). 
\n\nFurthermore, the Maximum a Posteriori (MAP) estimate requires computation, \nstorage, and inversion of the full m x m covariance matrix Kii = k( Xi, Xi) where \nXl! ... ,xm are training patterns. While there exist techniques [2, 8] to reduce the \ncomputational cost of finding an estimate to O(km2 ) rather than O(m3 ) when \nthe covariance matrix contains a significant number of small eigenvalues, all these \nmethods still require computation and storage of the full covariance matrix. None \nof these methods addresses the problem of speeding up the prediction stage (except \nfor the rare case when the integral operator corresponding to the kernel can be \ndiagonalized analytically [8]). \n\nWe devise a sparse greedy method, similar to those proposed in the context of \nwavelets [4], solutions of linear systems [5] or matrix approximation [7] that finds \n\n\u00b7Supported by the DFG (Sm 62-1) and the Australian Research Council. \n\n\fan approximation of the MAP estimate by expanding it in terms of a small subset \nof kernel functions k (Xi, . ). Briefly, the technique works as follows: given a set of \n(already chosen) kernel functions, we seek the additional function that increases \nthe posterior probability most. We add it to the set of basis functions and repeat \nuntil the maximum is approximated sufficiently well. A similar approach for a tight \nupper bound on the posterior probability gives a stopping criterion. \n\n2 Gaussian Process Regression \n\nConsider a finite set X = {Xl.'\" xm} of inputs. In Gaussian Process Regression, \nwe assume that for any such set there is a covariance matrix K with elements \nKij = k( Xi, Xj). We assume that for each input X there is a corresponding output \ny(x), and that these outputs are generated by \n(1) \nwhere t(x) and e are both normal random variables, with e rv N(O, ( 2 ) and \nt = (t(Xl), ... , t(xm))T rv N(O, K). We can use Bayes theorem to determine the \ndistribution of the output y(x) at a (new) input x. 
Conditioned on the data (X, y), the output y(x) is normally distributed. It follows that the mean of this distribution is the maximum a posteriori probability (MAP) estimate of y. We are interested in estimating this mean, and also the variance.

It is possible to give an equivalent parametric representation of y that is more convenient for our purposes. We may assume that the vector y = (y(x_1), ..., y(x_m))^T of outputs is generated by

y = K alpha + e,   (2)

where alpha ~ N(0, K^{-1}) and e ~ N(0, sigma^2 I). Consequently the posterior probability p(alpha | y, X) over the latent variables alpha is proportional to

exp(-1/(2 sigma^2) ||y - K alpha||^2) exp(-1/2 alpha^T K alpha),   (3)

and the conditional expectation of y(x) for a (new) location x is E[y(x) | y, X] = k^T alpha_opt, where k denotes the vector (k(x_1, x), ..., k(x_m, x))^T and alpha_opt is the value of alpha that maximizes (3). Thus, it suffices to compute alpha_opt before any predictions are required. The problem of choosing the MAP estimate of alpha is equivalent to the problem of minimizing the negative log-posterior,

minimize_{alpha in R^m} [-y^T K alpha + 1/2 alpha^T (sigma^2 K + K^T K) alpha]   (4)

(ignoring constant terms and rescaling by sigma^2). It is easy to show that the mean of the conditional distribution of y(x) is k^T (K + sigma^2 I)^{-1} y, and its variance is k(x, x) + sigma^2 - k^T (K + sigma^2 I)^{-1} k (see, for example, [2]).

3 Approximate Minimization of Quadratic Forms

For Gaussian process regression, searching for an approximate solution to (4) relies on the assumption that a set of variables whose posterior probability is close to that of the mode of the distribution will be a good approximation for the MAP estimate. The following theorem suggests a simple approach to estimating the accuracy of an approximate solution to (4). It uses an idea from [2] in a modified, slightly more general form.
\n\nTheorem 1 (Approximation Bounds for Quadratic Forms) Denote by K E \nlRmxm a positive semidefinite matrix, y, a E lRm and define the two quadratic forms \n\nQ(a) := -y T Ka + _aT (a 2 K + KT K)a, \n\n1 \n2 \n\n(5) \n\n\f(6) \nSuppose Q and Q* have minima Qmin and Q:nn. Then for all a, a* E IRffl we have \n\nQ*(a) := -y T a + _aT (a2 1 + K)a. \n\n1 \n2 \n\nQ(a)::::: Qmin \n\n::::: _~IIYI12 - a 2Q*(a*), \n\nQ*(a*)::::: Q;',.in :::::a-2(_~IIYI12_Q(a)), \nwith equalities throughout when Q(a) = Qmin and Q*(a*) = Q;',.in. \n\n(7) \n\n(8) \n\nHence, by minimizing Q* in addition to Q we can bound Q's closeness to the \noptimum and vice versa. \nProof The minimum of Q(a) is obtained for aopt = (K + a 21)-1y (which also \nminimizes Q*), hence \n1 T \n\nQmin = -2\"Y K(K +a 1) y and Qmin = -2\"Y (K +a 1) y. \n\n(9) \nThis allows us to combine Qmin and Q;',.in to Qmin + a 2Q;',.in = _~llyI12. Since by \ndefinition Q (a) ::::: Qmin for all a (and likewise Q* (a*) ::::: Q;',.in for all a*) we may \nsolve Qmin + a 2Q;',.in for either Q or Q* to obtain lower bounds for each of the two \nquantities. This proves (7) and (8). \n\u2022 \n\n2 -1 \n\n1 T \n\n-1 \n\n* \n\n2 \n\nEquation (7) is useful for computing an approximation to the MAP solution, \nwhereas (8) can be used to obtain error bars on the estimate. To see this, note that \nin calculating the variance, the expensive quantity to compute is -k T (K +a21)-1k. \nHowever, this can be found by solving \n\nminimize [-k T a + ~a T (a 2 1 + K) a] , \n\naEIRm \n\n(10) \n\nand the expression inside the parentheses is Q*(a) with y = k (see (6)). Hence, an \napproximate minimizer of (10) gives an upper bound on the error bars, and lower \nbounds can be obtained from (8) . \nh \nI \n. \nn practice we W1 use t e quantly gap a, a \n.- -Q(a)+u2Q * (a*)+~liYli2 ,I.e. t e \nrelative size of the difference between upper and lower bound as stopping criterion. \n\n*) .- 2(Q(a)+u Q*(a*)+2liYli) \n\n\u00b711 \n\nh \n\n( \n\n. 
\n\n\u2022 \n\n2 \n\n1 \n\n2 \n\n4 A Sparse Greedy Algorithm \n\nThe central idea is that in order to obtain a faster algorithm, one has to reduce \nthe number of free variables. Denote by P E IRffl xn with m ::::: nand m,n E N an \nextension matrix (Le. p T is a projection) with p T P = 1. We will make the ansatz \n(11) \n\nap := P[3 where [3 E IRn \n\nand find solutions [3 such that Q(ap) (or Q*(ap)) is minimized. The solution is \n\n(12) \nClearly if Pis ofrank m, this will also be the solution of (4) (the minimum negative \nlog posterior for all a E IRffl ). In all other cases, however, it is an approximation. \n\n[3opt = (pT (a2 K + K T K) p) -1 p T K T y. \n\nComputational Cost of Greedy Decompositions \n\nFor a given P E IRffl xn let us analyze the computational cost involved in the esti(cid:173)\nmation procedures. To compute (12) we need to evaluate pT Ky which is O(nm), \n(KP)T(KP) which is O(n2m) and invert an n x n matrix, which is O(n3 ). Hence the \ntotal cost is O(n2m). Predictions then cost only k T a which is O(n). Using P also \nto minimize Q*(P[3*) costs no more than O(n3 ), which is needed to upper-bound \nthe log posterior. \n\n\fFor error bars, we have to approximately minimize (10) which can done for a = P(3 \nat O(n3 ) cost. If we compute (PKpT)-l beforehand, this can be done by at O(n2 ) \nand likewise for upper bounds. We have to minimize -k T K P(3 + !(3T pT ((72 K + \nKT K)P(3 which costs O(n2m) (once the inverse matrices have been computed, \none may, however, use them to compute error bars at different locations, too, thus \ncosting only O(n2 )). The lower bounds on the error bars may not be so crucial, \nsince a bad estimate will only lead to overly conservative confidence intervals and \nnot have any other negative effect. Finally note that all we ever have to compute \nand store is K P, i.e. the m x n submatrix of K rather than K itself. Table 1 \nsummarizes the scaling behaviour of several optimization algorithms. 
\n\nConjugate \n\nExact \nOptimal Sparse \nSolution Gradient [2] Decomposition \nMemory \nO(m~) \nInitialization O(m;j) \nPred. Mean \ng~:~) \nError Bars \n\nO(m~) \nO(nm:l) \ng~~~2) \n\nSparse Greedy \nApproximation \nO(nm) \no (K.n:lm) \nO(n) \n\nO(nm) \nO(n:lm) \nO(n2 \nO(n2m) or O(n2 ) O(K.n2m) or O(n2 ) \n\nTable 1: Computational Cost of Optimization Methods. Note that n 0, find X E lRn with minimal number of nonzero entries \nsuch that IIAx - bl1 2 ::; E. \n\n\fIf we define A := (a2K + KTK)~ and b := A-1Ky, then we may write Q(o) = \n!llb - Aol1 2 + c where c is a constant independent of o. Thus the problem of \nsparse approximate minimization of Q(o) is a special case of Natarajan's problem \n(where the matrix A is square, symmetric, and positive definite). In addition, the \nalgorithm considered by Natarajan in [5J involves sequentially choosing columns of \nA to maximally decrease IIAx - bll. This is clearly equivalent to the sparse greedy \nalgorithm described above. Hence, it is straightforward to obtain the following \nresult from Theorem 2 in [5J. \n\nTheorem 2 (Approximation Rate) Algorithm 1 achieves Q(o) ::; Q(oopt) + E \nwhen a has \n\nn::; I8n*~E/4)ln(IIA-1KYII) \n\n).1 \n\nE \n\nnon-zero components, where n*(E/4) is the minimal number of nonzero components \nin vectors a for which Q(o) ::; Q(oopt) + E/4, A = (a2K + KTK)1/2, and).l is \nthe minimum of the magnitudes of the singular values of A, the matrix obtained by \nnormalizing the columns of A. \n\nRandomized Algorithms for Subset Selection \n\nUnfortunately, the approximation algorithm considered above is still too expensive \nfor large m since each search operation involves O(m) indices. Yet, if we are satisfied \nwith finding a relatively good index rather than the best, we may resort to selecting \na random subset of size K \u00ab: m. In Algorithm 1, this corresponds to choosing \nM ~ I, M* ~ 1* as random subsets of size K. In fact, a constant value of K will \ntypically suffice. 
To see why, we recall a simple lemma from [7]: the cumulative distribution function of the maximum of m i.i.d. random variables xi_1, ..., xi_m is F(.)^m, where F(.) is the cdf of xi_i. Thus, in order to find a column to add to P that is with probability 0.95 among the best 0.05 of all such columns, a random subsample of size ceil(log 0.05 / log 0.95) = 59 will suffice.

Algorithm 1 Sparse Greedy Quadratic Minimization.
Require: Training data X = {x_1, ..., x_m}, targets y, noise sigma^2, precision epsilon
  Initialize index sets I = I* = {1, ..., m}; S = S* = {}.
  repeat
    Choose random subsets M of I and M* of I*.
    Find arg min_{i in M} Q([P, e_i] beta_opt) and arg min_{i* in M*} Q*([P*, e_{i*}] beta*_opt).
    Move i from I to S, and i* from I* to S*.
    Set P := [P, e_i] and P* := [P*, e_{i*}].
  until Q(P beta_opt) + sigma^2 Q*(P* beta*_opt) + 1/2 ||y||^2 <= (epsilon/2) (|Q(P beta_opt)| + |sigma^2 Q*(P* beta*_opt) + 1/2 ||y||^2|)
  Output: Set of indices S, beta_opt, (P^T K P)^{-1}, and (P^T (K^T K + sigma^2 K) P)^{-1}.

Numerical Considerations

The crucial part is to obtain the values of Q(P beta_opt) cheaply (with P = [P_old, e_i]), provided we have solved the problem for P_old. From (12) one can see that all that needs to be done is a rank-one update on the inverse. In the following we show that this can be obtained in O(nm) operations, provided the inverse of the smaller subsystem is known. Expressing the relevant terms using P_old and k_i, the i-th column of K, we obtain

P^T K^T y = [P_old, e_i]^T K^T y = (P_old^T K^T y, k_i^T y)

P^T (K^T K + sigma^2 K) P = [ P_old^T (K^T K + sigma^2 K) P_old    P_old^T (K^T + sigma^2 I) k_i ]
                            [ k_i^T (K + sigma^2 I) P_old          k_i^T k_i + sigma^2 K_ii      ]

Thus computation of the new terms costs only O(nm), given the values for P_old. Furthermore, it is easy to verify that we can write the inverse of a symmetric positive semidefinite matrix as

[ A    B ]^{-1}  =  [ A^{-1} + A^{-1} B gamma B^T A^{-1}    -A^{-1} B gamma ]
[ B^T  C ]          [ -gamma B^T A^{-1}                      gamma          ]   (13)

where gamma := (C - B^T A^{-1} B)^{-1}. Hence, inversion of P^T (K^T K + sigma^2 K) P costs only O(n^2). Thus, to find P of size m x n takes O(kappa n^2 m) time.
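The overall procedure can be sketched compactly. The code below is not the authors' implementation: for clarity it shares a single index set between Q and Q* and refits beta by a dense solve over the selected columns, instead of the O(nm) rank-one updates described above; function and variable names are our own.

```python
import numpy as np

def sparse_greedy_gp(K, y, sigma2, eps=0.025, kappa=59, rng=None):
    """Simplified sketch of Algorithm 1: greedily add basis functions
    (columns of K) until the relative gap falls below eps."""
    rng = rng or np.random.default_rng(0)
    m = len(y)
    I = list(range(m))            # remaining candidate indices
    S = []                        # selected basis-function indices
    const = 0.5 * y @ y

    def q_min(S_):                # min over beta of Q(P beta), cf. eq. (12)
        KP = K[:, S_]
        M = sigma2 * K[np.ix_(S_, S_)] + KP.T @ KP
        b = KP.T @ y
        return -0.5 * b @ np.linalg.solve(M, b)

    def qstar_min(S_):            # min over beta of Q*(P beta)
        M = sigma2 * np.eye(len(S_)) + K[np.ix_(S_, S_)]
        return -0.5 * y[S_] @ np.linalg.solve(M, y[S_])

    gap = np.inf
    while I and gap >= eps:
        # draw a random subset of size kappa, pick the greedy best within it
        pos = rng.choice(len(I), size=min(kappa, len(I)), replace=False)
        cand = [I[p] for p in pos]
        best = min(cand, key=lambda i: q_min(S + [i]))
        S.append(best)
        I.remove(best)
        q, qs = q_min(S), qstar_min(S)
        gap = 2 * (q + sigma2 * qs + const) / (abs(q) + abs(sigma2 * qs + const))
    return S, gap

# Toy usage: targets that genuinely live in the span of a few kernel columns.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
K = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / 2.0)
y = K[:, :20] @ rng.normal(size=20) + 0.3 * rng.normal(size=200)
S, gap = sparse_greedy_gp(K, y, sigma2=0.1, eps=0.05)
print(len(S), gap)
```

Because the full index set recovers the exact solution (gap 0), the loop always terminates; in practice it stops far earlier.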
For the error bars, \n(p T KP)-1 will generally be a good starting value for the minimization of (10), \nso the typical cost for (10) will be O(Tmn) for some T < n, rather than O(mn2 ). \nFinally, for added numerical stability one may want to use an incremental Cholesky \nfactorization in (13) instead of the inverse of a matrix. \n\n5 Experiments and Discussion \n\nWe used the Abalone dataset from the VCI Repository to investigate the properties \nof the algorithm. The dataset is of size 4177, split into 4000 training and 177 testing \nsplit to analyze the numerical performance, and a (3000,1177) split to assess the \ngeneralization error (the latter was needed in order to be able to invert (and keep in \nmemory) the full matrix K + a 2 1 for a comparison). The data was rescaled to zero \nmean and unit variance coordinate-wise. Finally, the gender encoding in Abalone \n(male/female/infant) was mapped into {(I, 0, 0), (0, 1, 0), (0,0, I)}. \n\nIn all our experiments we used Gaussian kernels k(x, x') = exp( -~) as co-\nvariance kernels. Figure 1 analyzes the speed of convergence for different It. \n\nIIx-x'1I2 \n\nFigure 1: Speed of Convergence. \nWe plot the size of the gap be(cid:173)\ntween upper and lower bound of the \nlog posterior (gap( a, a*)) for the \nfirst 4000 samples from the Abalone \ndataset (a 2 = 0.1 and 2w 2 = 10). \nFrom top to bottom: subsets of size \n1, 2, 5, 10, 20, 50, 100, 200. The \nresults were averaged over 10 runs. \nThe relative variance of the gap size \nwas less than 10%. \nOne can see that that subsets of size \n1O-' O'--------,L----,L------,o'---------\"c---,L------'-,-------,-L---,L------'-,------,-! 50 and above ensure rapid conver-\n\n20 \n\n40 \n\n60 \n\n80 \n\n100 \n\n120 \n\n140 \n\n160 \n\n180 \n\n200 \n\nNumber of Ilerahons \n\ngence. 
\n\nFor the optimal parameters (2a 2 = 0.1 and 2w2 = 10, chosen after [7]) the average \ntest error of the sparse greedy approximation trained until gap(a, a*) < 0.025 on a \n(3000,1177) split (the results were averaged over ten independent choices of training \nsets.) was 1.785 \u00b1 0.32, slightly worse than for the GP estimate (1.782 \u00b1 0.33). \nThe log posterior was -1.572.105 (1 \u00b1 0.005), the optimal value -1.571 . 105 (1 \u00b1 \n0.005). Hence for all practical purposes full inversion of the covariance matrix and \nthe sparse greedy approximation have statistically indistinguishable generalization \nperformance. \n\nIn a third experiment (Table 2) we analyzed the number of basis functions needed \nto minimize the log posterior to gap(a, a*) < 0.025, depending on different choices \nof the kernel width a. In all cases, less than 10% of the kernel functions suffice to \n\n\ffind a good minimizer of the log posterior, for the error bars, even less than 2% are \nsufficient. This is a dramatic improvement over previous techniques. \n\nKernel width 2w:& \nKernels for log-posterior 373 \nKernels for error bars \n\n50 \n270 \n79\u00b161 49\u00b143 26\u00b127 17\u00b116 12\u00b19 8\u00b15 \n\n2 \n287 \n\n1 \n\n5 \n255 \n\n10 \n257 \n\n20 \n251 \n\nTable 2: Number of basis functions needed to minimize the log posterior on the \nAbalone dataset (4000 training samples), depending on the width of the kernel w. \nAlso, number of basis functions required to approximate k T (K + 0-21)- l k which is \nneeded to compute the error bars. We averaged over the remaining 177 test samples. \n\nTo ensure that our results were not dataset specific and that the algorithm scales well \nwe tested it on a larger synthetic dataset of size 10000 in 20 dimensions distributed \naccording to N(O, 1). 
The data was generated by adding normal noise with variance sigma^2 = 0.1 to a function consisting of 200 randomly chosen Gaussians of width 2w^2 = 40 and normally distributed coefficients and centers.

We purposely chose an inadequate Gaussian process prior (but correct noise level) of Gaussians with width 2w^2 = 10 in order to avoid trivial sparse expansions. After 500 iterations (i.e. after using 5% of all basis functions) the size of gap(alpha, alpha*) was less than 0.023 (note that this problem is too large to be solved exactly).

We believe that sparse greedy approximation methods are a key technique to scale up Gaussian Process regression to sample sizes of 10,000 and beyond. The techniques presented in the paper, however, are by no means limited to regression. Work on the solutions of dense quadratic programs and classification problems is in progress. The authors thank Bob Williamson and Bernhard Schölkopf.

References

[1] S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representation. Technical report, IBM Watson Research Center, New York, 2000.

[2] M. Gibbs and D. J. C. MacKay. Efficient implementation of Gaussian processes. Technical report, Cavendish Laboratory, Cambridge, UK, 1997.

[3] F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6):1455-1480, 1998.

[4] S. Mallat and Z. Zhang. Matching pursuit in a time-frequency dictionary. IEEE Transactions on Signal Processing, 41:3397-3415, 1993.

[5] B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 25(2):227-234, 1995.

[6] B. Schölkopf, S. Mika, C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A. Smola. Input space vs. feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5):1000-1017, 1999.

[7] A. J. Smola and B. Schölkopf.
Sparse greedy matrix approximation for machine learning. In P. Langley, editor, Proceedings of the 17th International Conference on Machine Learning, pages 911-918, San Francisco, 2000. Morgan Kaufmann.

[8] C. K. I. Williams and M. Seeger. The effect of the input density distribution on kernel-based classifiers. In P. Langley, editor, Proceedings of the 17th International Conference on Machine Learning, pages 1159-1166, San Francisco, California, 2000. Morgan Kaufmann.
", "award": [], "sourceid": 1880, "authors": [{"given_name": "Alex", "family_name": "Smola", "institution": null}, {"given_name": "Peter", "family_name": "Bartlett", "institution": null}]}