|
Linear least squares is an important computational problem, that arises primarily in applications when it is desired to fit a linear mathematical model to measurements obtained from experiments. The goals of linear least squares are to extract predictions from the measurements and to reduce the effect of measurement errors. Mathematically, it can be stated as the problem of finding an approximate solution to an overdetermined system of linear equations. Linear least square problems admit a closed-form solution, in contrast to non-linear least squares problems, which often have to be solved by an iterative procedure. [edit] Motivational exampleAs a result of an experiment, four (x,y) data points were obtained, (1,6), (2,5), (3,7), and (4,10) (shown in red in the picture on the right). It is desired to find a line y = α + βx that fits "best" these four points. In other words, we would like to find the numbers α and β that approximately solve the overdetermined linear system of four equations in two unknowns in some "best" sense. The least squares approach to solving this problem is to try to make as small as possible the sum of squares of "errors" between the right- and left-hand sides of these equations, that is, to find the minimum of the function The minimum is determined by calculating the partial derivatives of S(α,β) in respect to α and β and setting them to zero. This results in a system of two equations in two unknowns, which, when solved, gives the solution
and the equation y = 3.5 + 1.4x of the line of best fit. The residuals, that is, the discrepancies between the y values from the experiment and the y values calculated using the line of best fit are then found to be 1.1, − 1.3, − 0.7, and 0.9 (see the picture on the right). The minimum value of the sum of squares is S(3.5,1.4) = 1.12 + ( − 1.3)2 + ( − 0.7)2 + 0.92 = 4.2. [edit] The general problemConsider an overdetermined system of m linear equations in n unknowns, Such a system usually has no solution, and the goal is then to find the numbers βj which fit the equations "best", in the sense of minimizing the sum of squares of differences between the right- and left-hand sides of the equations. The justification for choosing this criterion is given in properties, below. The linear least squares problem has a unique solution, provided that the n columns of the matrix X are linearly independent. The solution is obtained by solving the normal equations [edit] Uses in data fittingThe primary application of linear least squares is in data fitting. Given a set of m data points Here, the functions φj may be nonlinear in the variable x. Ideally, the model function fits the data exactly, so for all so to minimize the function The problem then reduces to the overdetermined linear system mentioned earlier, with Xij = φj(xi). [edit] Derivation of the normal equationsS is minimized when its gradient with respect to each parameter is equal to zero. The elements of the gradient vector are the partial derivatives of S with respect to the parameters: Since Substitution of the expressions for the residuals and the derivatives into the gradient equations gives Upon rearrangement, the normal equations are obtained. The normal equations are written in matrix notation as The solution of the normal equations yields the vector [edit] Inverting the normal equationsAlthough the algebraic solution of the normal equations can be written as it is not good practice to invert the normal equations matrix. An exception occurs in numerical smoothing and differentiation where an analytical expression is required. If the matrix The solution is obtained in two stages, a forward substitution, See example of linear regression for a worked-out numerical example with three parameters. [edit] Orthogonal decomposition methodsOrthogonal decomposition methods of solving the least squares problem are slower than the normal equations method but are more numerically stable. The extra stability results from not having to form the product The matrix X is subjected to an orthogonal decomposition; the QR decomposition will serve to illustrate the process. where Q is an orthogonal The residual vector is left-multiplied by The sum of squares of the transformed residuals, The minimum value of S is attained when the upper block, U, is zero. Therefore the parameters are found by solving These equations are easily solved as An alternative decomposition of X is the singular value decomposition (SVD)[1] This is effectively another kind of orthogonal decomposition as both U and V are orthogonal. This method is the most computationally intensive, but is particularly useful if the normal equations matrix, [edit] Properties of the least-squares estimators
The residual vector,
, which corresponds to the solution of a least squares system, , is orthogonal to the column space of the matrix X.The gradient equations at the minimum can be written as A geometrical interpretation of these equations is that the vector of residuals, If the experimental errors, For example, it is easy to show that the arithmetic mean of a set of measurements of a quantity is the least-squares estimator of the value of that quantity. If the conditions of the Gauss-Markov theorem apply, the arithmetic mean is optimal, whatever the distribution of errors of the measurements might be. However, in the case that the experimental errors do belong to a Normal distribuition, the least-squares estimator is also a maximum likelihood estimator.[2] These properties underpin the use of the method of least squares for all types of data fitting, even when the assumptions are not strictly valid. [edit] LimitationsAn assumption underlying the treatment given above is that the independent variable, x, is free of error. In practice, the errors on the measurements of the independent variable are usually much smaller than the errors on the dependent variable and can therefore be ignored. When this is not the case, total least squares also known as Errors-in-variables model, or Rigorous least squares, should be used. This can be done by adjusting the weighting scheme to take into account errors on both the dependent and independent variables and then following the standard procedure.[3][4] In some cases the (weighted) normal equations matrix Another drawback of the least squares estimator is the fact that the norm of the residuals, [edit] Weighted linear least squaresWhen the observations are not equally reliable, a weighted sum of squares may be minimized. Each element of the diagonal weight matrix, W should,ideally, be equal to the reciprocal of the variance of the measurement.[6] The normal equations are then [edit] Parameter errors, correlation and confidence limitsThe parameter values are linear combinations of the observed values Therefore an expression for the errors on the parameter can be obtained by error propagation from the errors on the observations. Let the variance-covariance matrix for the observations be denoted by M and that of the parameters by Mβ. Then, When When unit weights are used ( In all cases, the variance of the parameter βi is given by It is often assumed, for want of any concrete evidence, that the error on a parameter belongs to a Normal distribution with a mean of zero and standard deviation σ. Under that assumption the following confidence limits can be derived.
The assumption is not unreasonable when m>>n. If the experimental errors are normally distributed the parameters will belong to a Student's t-distribution with m-n degrees of freedom. When m>>n Student's t-distribution approximates to a Normal distribution. Note, however, that these confidence limits cannot take systematic error into account. Also, parameter errors should be quoted to one significant figure only, as they are subject to sampling error.[7] When the number of observations is relatively small, Chebychev's inequality can be used for an upper bound on probabilities, regardless of any assumptions about the distribution of experimental errors: the maximum probabilities that a parameter will be more than 1, 2 or 3 standard deviations away from its expectation value are 100%, 25% and 11% respectively. [edit] Residual values and correlationThe residuals are related to the observations by The symmetric, idempotent matrix where I is an identity matrix. The variance-covariance matrice of the residuals, Mr is given by This shows that even though the observations may be uncorrelated, the residuals are always correlated. The sum of residual values is equal to zero whenever the model function contains a constant term. Left-multiply the expression for the residuals by Say, for example, that the first term of the model is a constant, so that Xi1 = 1 for all i. In that case it follows that Thus, in the motivational example, above, the fact that the sum of residual values is equal to zero it is not accidental but is a consequence of the presence of the constant term, α, in the model. If experimental error follows a normal distribution, then, because of the linear relationship between residuals and observations, so should residuals, [8] but since the observations are only a sample of the population of all possible observations, the residuals should belong to a Student's t-distribution. Studentized residuals are useful in making a statistical test for an outlier when a particular residual appears to be excessively large. [edit] Objective functionThe objective function can be written as since If it is assumed that the residuals belong to a Normal distribution, the objective function, being a sum of weighted squared residuals, will belong to a Chi-square (χ2) distribution with m-n degrees of freedom. Some illustrative percentile values of χ2 are given in the following table.[10]
These values can be used for a statistical criterion as to the goodness-of-fit. When unit weights are used, the numbers should be divided by the variance of an observation. [edit] Typical uses and applications
[edit] Notes
[edit] References
[edit] External links
Directorio de Enlaces Directorio dmoz Directorio espejo dmoz Pedro Bernardo | |||||||||||||||||||||||||||||||