WebCab Probability and Statistics for .NET v3.6

GeneralLinear Class

Fits a linear regression model using the least squares approach and measures its goodness of fit.

For a list of all members of this type, see GeneralLinear Members.

System.Object
   GeneralLinear

public class GeneralLinear

Remarks

The regression model is constructed from the linear combination of a collection of basis elements (i.e. functions), where the basis elements are set using the method SetFunctionBasis. This approach allows complete flexibility over the functions which are used as basis elements within the linear regression model.

Details concerning the Linear Regression Model

The Regression model is a collection of functions from which the function of best-fit will be selected. The collection of functions within the Regression model is constructed from a linear combination of basis functions f1(x), f2(x),..., fn(x), where the basis functions are set using SetFunctionBasis. Therefore, the function of best fit which is selected from the regression model in accordance with the least squares approach can be any function which takes the form:

a1 * f1(x) + a2 * f2(x) + ... + an * fn(x),

where a1,...,an are any real numbers. The expression given above is often referred to as the regression model.
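As a minimal sketch of the form above (written in Python purely for illustration; it does not use this library), the following evaluates a linear combination of basis functions at a point. The three basis functions and the helper name evaluate_model are assumptions chosen for the example and are not members of this class.

```python
import math

# Hypothetical basis functions f1, f2, f3 (chosen only for illustration).
basis = [
    lambda x: 1.0,          # constant term
    lambda x: x,            # linear term
    lambda x: math.sin(x),  # cyclical term
]

def evaluate_model(coefficients, basis, x):
    """Return a1*f1(x) + ... + an*fn(x) for the given coefficients."""
    if len(coefficients) != len(basis):
        raise ValueError("need exactly one coefficient per basis function")
    return sum(a * f(x) for a, f in zip(coefficients, basis))

# Evaluate 2*1 + 0.5*x - 1*sin(x) at x = 0.
value = evaluate_model([2.0, 0.5, -1.0], basis, 0.0)
print(value)
```

The model is linear in the coefficients a1,...,an even though the individual basis functions, such as the sine term here, need not be linear in x.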

The Least Squares Approach

Strictly speaking, the function of best fit is the linear combination of basis functions whose coefficients are selected as follows. The coefficients are selected so that, given the measurement errors of the y-coordinates of the known data points, the resulting function is the maximum likelihood fit in the least squares sense.

Within this implementation we offer the ability to incorporate the measurement errors of the experimental data to which the function is fitted. In particular, we assume that the measurement errors of the yi are independent and normally (Gaussian) distributed about a true value. It follows from these assumptions (though not rigorously) that the most likely coefficients are obtained by finding the coefficients a1, ..., an such that the sum of the squares of the terms:

( yi - (a1 * f1(xi) + ... + an * fn(xi)) ) / sigmai,

is minimized, where the sum is taken over i for 1 <= i <= m; m is the number of data points (xi, yi); f1,...,fn are the basis functions; and sigmai is the standard deviation of the measurement error of yi.

Remarks:

  1. Unknown Measurement Error: If you do not wish to incorporate measurement errors when selecting the maximum likelihood coefficients, set the standard deviation of every error to 1.0.
  2. Weighted Best Fit: You can also apply this class to find the best weighted fit, where sigmai determines the weight applied to the i-th data point. Since each term is divided by sigmai, a smaller sigmai gives that data point more influence on the fit.
  3. Chi-Squared Measure: The value of the sum of the squares of the above terms is referred to as the Chi-Squared measure. The lower the value of this measure, the better the selected curve fits the given data overall across the range of values. Once the regression model has been fitted, the corresponding value of the Chi-Squared measure can be evaluated using GetChiSquare.
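The minimization described above can be sketched as follows. This is a standalone Python illustration under stated assumptions, not this library's implementation: it solves the weighted normal equations by Gaussian elimination, which is one common way to find the minimizing coefficients, and the numerical method this class actually uses is not stated here. The function names weighted_fit, chi_square and solve_linear_system, and the sample data, are hypothetical.

```python
def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so that row operations apply to both sides.
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: basis may be linearly dependent")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def weighted_fit(basis, xs, ys, sigmas):
    """Return a1..an minimizing sum(((yi - sum_j aj*fj(xi)) / sigmai)^2)."""
    n = len(basis)
    # Scale each design-matrix row and each target by 1/sigmai.
    rows = [[f(x) / s for f in basis] for x, s in zip(xs, sigmas)]
    targets = [y / s for y, s in zip(ys, sigmas)]
    # Normal equations: (R^T R) a = R^T t.
    rtr = [[sum(r[j] * r[k] for r in rows) for k in range(n)] for j in range(n)]
    rtt = [sum(r[j] * t for r, t in zip(rows, targets)) for j in range(n)]
    return solve_linear_system(rtr, rtt)

def chi_square(coeffs, basis, xs, ys, sigmas):
    """Sum of squared residuals, each divided by its sigmai before squaring."""
    total = 0.0
    for x, y, s in zip(xs, ys, sigmas):
        fitted = sum(a * f(x) for a, f in zip(coeffs, basis))
        total += ((y - fitted) / s) ** 2
    return total

# Made-up data lying exactly on y = 1 + 2x, with all sigmas set to 1.0
# (the "unknown measurement error" case from remark 1).
basis = [lambda x: 1.0, lambda x: x]
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
sigmas = [1.0, 1.0, 1.0, 1.0]
coeffs = weighted_fit(basis, xs, ys, sigmas)
chi2 = chi_square(coeffs, basis, xs, ys, sigmas)
print(coeffs, chi2)
```

Because the sample data lie exactly on a straight line, the recovered coefficients are close to 1 and 2 and the Chi-Squared value is close to zero. Normal equations are simple to write down but can be numerically fragile for nearly dependent basis functions, so a production implementation may well prefer a decomposition-based solver.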

Using this class

In order to apply this class you must perform the following steps:

  1. Set the Basis Functions: using SetFunctionBasis
  2. Fit the Function set to the Data: using SetGeneralFit

Once the function has been fit you are able to 'read' the following quantitative information about the fitted function:

  1. Return the value of Chi-Squared using GetChiSquare
  2. Return the coefficients in the function of best fit using GetCoefficients
  3. Evaluate the value of the function of best fit at a given point using BestFitValue

Analysis of Variance (ANOVA) of the Regression Model

ANOVA is an abbreviation for Analysis of Variance. This analysis provides information on the structure of the variability of the regression model, which forms the basis of many tests of the significance of a model. Note that in order to perform this statistical testing we make certain assumptions about the errors associated with each of the data points within the regression model (Y = F(X,a[]) + error), namely:

  1. The Errors associated with each of the data points are independent.
  2. The Errors are normally distributed.
  3. The Errors have a mean of zero.
  4. The Errors have a constant variance.

Though from a theoretical point of view it is necessary to make these assumptions, in many practical instances they will not hold. The user should therefore take this into account whenever applying ANOVA type analysis to real world situations. In order to see the degree to which these assumptions are met, a scatter plot of the residuals (evaluated using GetResiduals) of the regression model should be considered. The scatter plot also provides a quick means of discovering whether the particular regression model considered was a good choice. If the model is deemed not to have been a good choice, then further visual inspection of the scatter plot will in most cases suggest possible refinements of the original model.

Using Scatter Plots

Within a scatter plot of the residuals of a regression model you should observe the following:

  1. Testing the ANOVA Assumptions
    1. If the residuals are independent and normally distributed then there will be no bunching and the density of points will increase as you get nearer to the zero value on the axis corresponding to the residuals values.
    2. If the residuals have a mean of zero then the scatter plot will be equally dispersed above and below zero on the residuals value axis.
    3. If the variance of the errors is constant then there will be no discernible change in the level of the dispersion for differing values of the model's (space) variable (i.e. x). In practice, you will often find that as the model's (space) variable increases the level of dispersion increases, that is, the variance increases.
  2. Implications concerning the Regression Model itself
    1. If the residuals within the scatter plot are linearly increasing or decreasing with the model's (space) variable then it would imply that the regression model itself should be modified by including another linear term which would rectify this linear trend.
    2. If the residuals within the scatter plot are trending in a discernible yet non-linear fashion with the model's (space) variable then a corresponding term rectifying this trend should be added to the regression model. For example, if the residuals are cyclical then you will need to add a sine or cosine type term (or another term with a cyclical nature); similarly, if the residuals first increase and then decrease (or vice versa) then you may need to add a quadratic term, and so on. Note that although the fitting algorithm within this class does not allow for the fitting of functions which are non-linear in the coefficients, it may be possible to perform a non-linear transformation on the source data and then apply a linear regression model. If this is not possible then you should use a non-linear regression model in conjunction with the class NonLinearModel.
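The second implication can be illustrated with a small standalone Python sketch (not using this library). It fits a straight line to made-up data that actually follow y = x^2, using the closed-form formulas for simple linear regression, and lists the residuals. Residuals are taken here as observed minus fitted values, which is an assumption; the sign convention of GetResiduals is not stated on this page.

```python
# Made-up data generated from y = x^2, deliberately fitted with the
# too-simple model a1 + a2*x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [x * x for x in xs]

# Closed-form unweighted least squares for a straight line.
m = len(xs)
mean_x = sum(xs) / m
mean_y = sum(ys) / m
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residuals as observed minus fitted (assumed convention).
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals fall and then rise as x increases instead of scattering randomly about zero, which is the kind of pattern that suggests adding a quadratic basis function to the model.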

ANOVA Functionality Offered

Within this class once the regression model has been fitted you are able to evaluate the following ANOVA type measures:

  1. Measures of the Regression Function's 'distance from'
    1. GetResiduals - The residuals for the (multi-linear) regression considered.
    2. SumSquaresError - The sum of the squares due to error (SSE).
    3. MeanSquaresError - The mean square due to error (MSE).
    4. StandardError - The Standard Error of the Estimate.
    5. TotalSumSquares - The Total Sum of Squares (SST).
    6. SumSquaresRegression - The Sum of Squares Due to Regression (SSR).
    7. MeanSquaresRegression - The mean square due to regression (MSR).
  2. Measures of the Significance of the Model
    1. RSquared - The multiple R^2 (or multiple coefficient of determination).
    2. RSquaredAdjusted - The Adjusted R^2 (or Adjusted Multiple Coefficient of Determination).
    3. FTest - The F-Test Statistic.
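For orientation, the measures listed above are commonly defined as in the following standalone Python sketch. It uses the usual textbook definitions for an unweighted model that includes a constant term; whether this class uses exactly these formulas and degrees of freedom is an assumption. The function name anova_measures and the sample values are hypothetical.

```python
import math

def anova_measures(ys, fitted, num_coefficients):
    """Textbook ANOVA quantities for m observations and n fitted coefficients.

    Assumes an unweighted least squares fit whose model includes a constant
    term, so that SST = SSR + SSE holds.
    """
    m = len(ys)
    n = num_coefficients
    if n < 2 or m <= n:
        raise ValueError("need at least 2 coefficients and more points than coefficients")
    mean_y = sum(ys) / m
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # error sum of squares
    sst = sum((y - mean_y) ** 2 for y in ys)             # total sum of squares
    if sst == 0.0:
        raise ValueError("observations are constant; R^2 is undefined")
    ssr = sst - sse                                      # regression sum of squares
    mse = sse / (m - n)                                  # m - n error degrees of freedom
    msr = ssr / (n - 1)                                  # n - 1 regression degrees of freedom
    r_squared = 1.0 - sse / sst
    return {
        "SSE": sse,
        "MSE": mse,
        "StandardError": math.sqrt(mse),
        "SST": sst,
        "SSR": ssr,
        "MSR": msr,
        "RSquared": r_squared,
        "RSquaredAdjusted": 1.0 - (1.0 - r_squared) * (m - 1) / (m - n),
        # F is undefined for a perfect fit (MSE == 0); report infinity there.
        "F": msr / mse if mse > 0.0 else float("inf"),
    }

# Made-up observations and the values of a fitted straight line (2 coefficients).
ys = [0.0, 1.0, 4.0, 9.0, 16.0]
fitted = [-2.0, 2.0, 6.0, 10.0, 14.0]
measures = anova_measures(ys, fitted, 2)
for name, value in measures.items():
    print(name, value)
```

A large F statistic together with an R^2 near 1 points to a model that explains most of the variability, but neither measure replaces the residual scatter plot when checking the underlying assumptions.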

Requirements

Namespace: WebCab.Libraries.Statistics.CurveFitting

Assembly: WebCab.Libraries.Statistics (in WebCab.Libraries.Statistics.dll)

See Also

GeneralLinear Members | WebCab.Libraries.Statistics.CurveFitting Namespace