WebCab Probability and Statistics
v3.5
(J2SE Edition)

webcab.lib.statistics.correlation
Class CorrelationStateful

java.lang.Object
  |
  +--webcab.lib.statistics.correlation.CorrelationStateful
All Implemented Interfaces:
Serializable

public class CorrelationStateful
extends Object
implements Serializable

This is the stateful implementation of the Correlation and Regression class, which allows linear relationships between two variables to be investigated using the techniques of correlation and linear regression. In this version, the data set of pairs under study is set once, and various properties of that set of pairs can then be evaluated. This approach is particularly appropriate when the Correlation properties will be evaluated repeatedly on the same data set, because the data set does not have to be supplied again for each evaluation.

Use of State

This stateful version implements the functionality of the Correlation and Regression class using the object-oriented notion of state. Holding the data set in the object allows for more efficient execution when the data is "sent over the wire" (for example, when the data set is retrieved from a remote DBMS).
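
The stateful pattern described above can be illustrated with a minimal sketch. The class PairStoreSketch and its members are hypothetical stand-ins written for this illustration; they are not the WebCab implementation, and only the idea of "set the data once, query it many times" is taken from the text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that illustrates the stateful pattern; not the WebCab class.
final class PairStoreSketch {
    private final List<double[]> pairs = new ArrayList<>();

    // Store one (x, y) pair; the data stays in the object between calls.
    void addValue(double x, double y) {
        pairs.add(new double[] {x, y});
    }

    // Store many pairs at once; the ith pair is (xs[i], ys[i]).
    void addValues(double[] xs, double[] ys) {
        if (xs.length != ys.length) {
            throw new IllegalArgumentException("xs and ys must have equal length");
        }
        for (int i = 0; i < xs.length; i++) {
            addValue(xs[i], ys[i]);
        }
    }

    int size() {
        return pairs.size();
    }

    // Each query reuses the stored pairs, so the caller never resends the data.
    double meanX() {
        return mean(0);
    }

    double meanY() {
        return mean(1);
    }

    private double mean(int component) {
        if (pairs.isEmpty()) {
            throw new IllegalStateException("data set is empty");
        }
        double sum = 0.0;
        for (double[] p : pairs) {
            sum += p[component];
        }
        return sum / pairs.size();
    }
}
```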

Overview of this Correlation class

We study the relationship between two variables by considering a data set of pairs of values which correspond to particular instances of values taken simultaneously by the two underlying variables. We then study the correlation and linear regression properties of this data set in order to deduce information concerning the relationship between the two variables.

In particular, the linear regression line can be constructed, which allows one variable to be predicted from given values of the other variable with a degree of confidence that depends on the `linearity' of the data set. We also cover linear (Pearson's, t-test, z-transform) and rank (Spearman's, Kendall's) correlation.

That is, by using this class for a given data set you can decide to what degree two variables are correlated, and determine the confidence interval and the level of significance of the correlation tests performed. You can also construct the regression line for the data set. Similarly, for two data samples with corresponding regression lines, you can determine the confidence interval for the conditional mean between these two regression lines.

Possible Data Sets, questions addressed and effectiveness

Possible Data Sets

Such data sets appear in a number of contexts. Examples of pairs for which such data sets could be constructed include:

  1. The grade and the number of students within a class who obtained that grade.
  2. The number of commercials shown and the sales achieved in a given week.

Possible Questions addressed

By tabulating a given set of student or sales data, respectively, against the above criteria, this class can address the following types of questions:

  1. (Grades, Students): What is the average grade obtained, and how dispersed are the grades? Does the number of students obtaining a grade generally increase as the grade increases? To what degree (using linear methods) can we predict the number of students who will obtain a given grade?
  2. (No. Commercials, Sales): What are the average sales or number of commercials in a given week, and what is the dispersion (or variance) of these values from week to week? To what degree does an increase in the number of commercials increase the sales figures?

Effectiveness

The effectiveness of the functionality in predicting values depends on the nature of the data set considered. Because we have implemented a linear regression model, predictions can be made with confidence only when there exists a strong linear relationship between the two variables considered (see note below for more details).

The correlation functionality implemented consists of a number of coefficients which are designed to measure the correlation (i.e. the degree to which one variable moves with the other) for differing types of sets (see notes below).

Detailed Overview of the Functionality Available

Set the Data set and number of significant digits returned

  1. addValue - Add pairs of values to the data set one at a time.
  2. addValues - Add pairs of values to the data set many at a time.

Correlation Coefficients and Statistics

  1. pearsonCorrelationCoefficient() - Evaluates Pearson's Correlation Coefficient.
  2. spearmanRankTest() - Spearman's Rank Correlation Coefficient.
  3. kendallCorrelationCoefficient() - Evaluates Kendall's Correlation Coefficient.
  4. significance - Calculates the significance test for a given correlation coefficient.
  5. meanX - Mean of the values of the first elements of the pairs from which the current data set is constructed.
  6. meanY - Mean of the values of the second elements of the pairs from which the current data set is constructed.
  7. sampleVarianceX - The variance of the first elements from the pairs from which the current data set is constructed.
  8. sampleVarianceY - The variance of the second elements from the pairs from which the current data set is constructed.

Linear Regression methods

  1. leastSquaresRegressionLineY - Constructs the regression line of Y on X using the method of least squares.
  2. leastSquaresRegressionLineX - Constructs the regression line of X on Y using the method of least squares.
  3. coefficientOfDetermination - Calculates the coefficient of determination for the current set of data.
  4. residuals - Determines the residual for a given pair of points.
  5. residualsAverage - Determines the arithmetic average of all the residuals.

See Also:
Serialized Form

Constructor Summary
CorrelationStateful()
          Creates a new instance of the Correlation class with an empty initial data set.
CorrelationStateful(double[] xValues, double[] yValues)
          Constructs a new Correlation instance using the specified value pairs for its initial data set.
 
Method Summary
 void addValue(double xValue, double yValue)
          Adds a new pair of (ordered) numbers (xValue, yValue) to the data set.
 void addValues(double[] xValues, double[] yValues)
          Adds a new set of (ordered) pairs of data to the data set.
 double coefficientOfDetermination()
          Calculates the coefficient of determination for the current set of data.
 double estimateX(double yValue)
          Estimates the value of the X variable when the Y variable is known using the regression line of X on Y, which can be evaluated using leastSquaresRegressionLineX().
 double estimateY(double xValue)
          Estimates the value of the Y variable when the X variable is known using the regression line of Y on X, which can be evaluated using leastSquaresRegressionLineY().
 double kendallCorrelationCoefficient()
          Calculates Kendall's correlation coefficient for the current data set.
 double[] leastSquaresRegressionLineX()
          Constructs the regression line of X on Y using the method of least squares.
 double[] leastSquaresRegressionLineY()
          Constructs the regression line of Y on X using the method of least squares.
 double meanX()
          Calculates the arithmetic mean of the first elements (i.e. X) of the pairs of values from which the current data set is constructed.
 double meanY()
          Calculates the arithmetic mean of the second elements (i.e. Y) of the pairs of values from which the current data set is constructed.
 double pearsonCorrelationCoefficient()
          Calculates Pearson's correlation coefficient for the current data set.
 double residuals(int index)
          Determines the residual for a given pair of points within the current data set in accordance with the regression line constructed using leastSquaresRegressionLineX().
 double residualsAverage()
          Determines the arithmetic average of the residuals for all pairs of points within the current data set in accordance with the regression line constructed using leastSquaresRegressionLineX().
 double sampleVarianceX()
          Calculates the sample variance of the first elements (i.e. X) of the pairs of values from which the current data set is constructed.
 double sampleVarianceY()
          Calculates the sample variance of the second elements (i.e. Y) of the pairs of values from which the current data set is constructed.
 double significance(double correlationCoefficient)
          Calculates the significance test for a given correlation coefficient.
 double spearmanRankTest()
          Calculates Spearman's Rank correlation coefficient for the current data set.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

CorrelationStateful

public CorrelationStateful()
Creates a new instance of the Correlation class with an empty initial data set. In order to specify your own value pairs, you may use the addValue and the addValues methods.


CorrelationStateful

public CorrelationStateful(double[] xValues,
                           double[] yValues)
Constructs a new Correlation instance using the specified value pairs for its initial data set.

Method Detail

addValue

public void addValue(double xValue,
                     double yValue)
Adds a new pair of (ordered) numbers (xValue, yValue) to the data set.

Parameters:
xValue - the value in the first variable of the (ordered) pair which is added to the data set
yValue - the value in the second variable of the (ordered) pair which is added to the data set.

addValues

public void addValues(double[] xValues,
                      double[] yValues)
Adds a new set of (ordered) pairs of data to the data set.

Note:

  1. The ith pair of numbers which will be added to the data set will be (xValues[i],yValues[i]).
  2. This function will add, one by one, pairs of numbers formed with one `X' value and one `Y' value.

Parameters:
xValues - an array of the ordered elements which make up the first variable of the set of (ordered) pairs to be added to the data set.
yValues - an array of the ordered elements which make up the second variable of the set of (ordered) pairs to be added to the data set.

pearsonCorrelationCoefficient

public double pearsonCorrelationCoefficient()
Calculates Pearson's correlation coefficient for the current data set.

Returns:
Pearson's correlation coefficient of the data set which has been set using addValue and/or addValues.
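
For orientation, the textbook definition of Pearson's coefficient can be sketched as below. This is a standalone illustration under the usual definition r = Sxy / sqrt(Sxx * Syy); the class PearsonSketch and its method are hypothetical names, and nothing here is taken from the WebCab source.

```java
// Hypothetical standalone sketch of Pearson's r; not the WebCab implementation.
final class PearsonSketch {
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        int n = x.length;
        double mx = 0.0, my = 0.0;
        for (int i = 0; i < n; i++) {
            mx += x[i];
            my += y[i];
        }
        mx /= n;
        my /= n;

        // Sums of products and squares of the deviations from the means.
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - mx;
            double dy = y[i] - my;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        // The result is NaN when either variable is constant.
        return sxy / Math.sqrt(sxx * syy);
    }
}
```

For the pairs (1,2), (2,1), (3,4), (4,3), (5,5) the deviations give Sxy = 8 and Sxx = Syy = 10, so r = 0.8.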

spearmanRankTest

public double spearmanRankTest()
Calculates Spearman's Rank correlation coefficient for the current data set.
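
As a rough illustration of a rank correlation, the sketch below uses the common formula 1 - 6*sum(d^2) / (n*(n^2 - 1)), where d is the difference between the ranks of the two elements of a pair. It assumes all X values are distinct and all Y values are distinct; how the library treats tied values is not stated in this documentation. SpearmanSketch is a hypothetical name.

```java
import java.util.Arrays;

// Hypothetical standalone sketch of Spearman's rank coefficient (no ties assumed).
final class SpearmanSketch {
    // Rank of each value, where 1 is the smallest; assumes distinct values.
    static int[] ranks(double[] v) {
        double[] sorted = v.clone();
        Arrays.sort(sorted);
        int[] r = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            r[i] = Arrays.binarySearch(sorted, v[i]) + 1;
        }
        return r;
    }

    static double spearman(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        int n = x.length;
        int[] rx = ranks(x);
        int[] ry = ranks(y);
        double sumD2 = 0.0;
        for (int i = 0; i < n; i++) {
            double d = rx[i] - ry[i];
            sumD2 += d * d;
        }
        return 1.0 - 6.0 * sumD2 / (n * ((double) n * n - 1.0));
    }
}
```

Because only ranks are used, any strictly increasing relationship (for example Y = X^3) yields a coefficient of 1 even when it is not linear.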


kendallCorrelationCoefficient

public double kendallCorrelationCoefficient()
Calculates Kendall's correlation coefficient for the current data set.
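
One common form of Kendall's coefficient (often called tau-a) counts concordant and discordant pairs of observations. The sketch below is a standalone illustration of that form only; which variant the library computes, and how it treats ties, is an open assumption. KendallSketch is a hypothetical name.

```java
// Hypothetical standalone sketch of Kendall's tau-a; not the WebCab implementation.
final class KendallSketch {
    static double kendallTauA(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        int n = x.length;
        long concordant = 0;
        long discordant = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                // Same sign: both variables move together; opposite sign: they diverge.
                double s = Math.signum(x[j] - x[i]) * Math.signum(y[j] - y[i]);
                if (s > 0) {
                    concordant++;
                } else if (s < 0) {
                    discordant++;
                }
                // Tied pairs (s == 0) count as neither in tau-a.
            }
        }
        return (concordant - discordant) / (n * (n - 1) / 2.0);
    }
}
```

For the pairs (1,2), (2,1), (3,4), (4,3), (5,5) there are 8 concordant and 2 discordant pairs out of 10, giving 0.6.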


significance

public double significance(double correlationCoefficient)
Calculates the significance test for a given correlation coefficient. The test statistic is calculated under the null hypothesis that no correlation is present. This statistic follows Student's t-distribution with N-2 degrees of freedom, where N is the total number of pairs.

Parameters:
correlationCoefficient - the value of either Pearson's correlation coefficient or Spearman's rank correlation coefficient for the current data set.
Returns:
the value of the significance test for the given correlation coefficient.
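
A statistic with the stated distribution (Student's t with N-2 degrees of freedom under the null hypothesis of no correlation) is the standard t = r * sqrt((N - 2) / (1 - r^2)). The sketch below computes that value. Whether significance returns exactly this statistic or a quantity derived from it is not stated here, so this is an assumption; SignificanceSketch and its parameter n are hypothetical.

```java
// Hypothetical standalone sketch of the t statistic for a correlation coefficient.
final class SignificanceSketch {
    // r: correlation coefficient; n: number of pairs in the data set.
    static double tStatistic(double r, int n) {
        if (n < 3) {
            throw new IllegalArgumentException("need at least three pairs");
        }
        if (Math.abs(r) >= 1.0) {
            throw new IllegalArgumentException("|r| must be below 1");
        }
        return r * Math.sqrt((n - 2) / (1.0 - r * r));
    }
}
```

For r = 0.5 and N = 11 this gives 0.5 * sqrt(12), approximately 1.732, which would then be compared against the t-distribution with 9 degrees of freedom.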

meanX

public double meanX()
Calculates the arithmetic mean of the first elements (i.e. X) of the pairs of values from which the current data set is constructed.


meanY

public double meanY()
Calculates the arithmetic mean of the second elements (i.e. Y) of the pairs of values from which the current data set is constructed.


sampleVarianceX

public double sampleVarianceX()
Calculates the sample variance of the first elements (i.e. X) of the pairs of values from which the current data set is constructed.


sampleVarianceY

public double sampleVarianceY()
Calculates the sample variance of the second elements (i.e. Y) of the pairs of values from which the current data set is constructed.
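
The mean and sample variance of one coordinate can be sketched as follows. The sketch assumes the usual unbiased estimator with divisor n - 1; the documentation does not say which divisor the library uses. SampleStatsSketch is a hypothetical name.

```java
// Hypothetical standalone sketch of the mean and the (n - 1) sample variance.
final class SampleStatsSketch {
    static double mean(double[] v) {
        if (v.length == 0) {
            throw new IllegalArgumentException("need at least one value");
        }
        double sum = 0.0;
        for (double d : v) {
            sum += d;
        }
        return sum / v.length;
    }

    // Sum of squared deviations from the mean, divided by (n - 1).
    static double sampleVariance(double[] v) {
        if (v.length < 2) {
            throw new IllegalArgumentException("need at least two values");
        }
        double m = mean(v);
        double ss = 0.0;
        for (double d : v) {
            ss += (d - m) * (d - m);
        }
        return ss / (v.length - 1);
    }
}
```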


leastSquaresRegressionLineY

public double[] leastSquaresRegressionLineY()
Constructs the regression line of Y on X using the method of least squares. That is, the least squares regression line is constructed with the second elements of the pairs from which the data set is constructed plotted against the first elements of the pairs.

Returns:
an array d with two elements, where the regression line is given in functional form by the following formula: Y(x)=d[0]*X+d[1]
See Also:
leastSquaresRegressionLineX() - Calculates the regression line of X on Y, using the method of least squares.

estimateY

public double estimateY(double xValue)
Estimates the value of the Y variable when the X variable is known using the regression line of Y on X, which can be evaluated using leastSquaresRegressionLineY().
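
The regression line of Y on X and the estimate derived from it can be sketched with the standard least squares formulas: slope = Sxy / Sxx and intercept = meanY - slope * meanX. The returned array follows the documented layout Y = d[0]*X + d[1]. RegressionYOnXSketch and its method names are hypothetical and the code is not taken from the library.

```java
// Hypothetical standalone sketch of the least squares line of Y on X.
final class RegressionYOnXSketch {
    // Returns {slope, intercept}, so that Y = d[0] * X + d[1].
    static double[] lineYOnX(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        int n = x.length;
        double mx = 0.0, my = 0.0;
        for (int i = 0; i < n; i++) {
            mx += x[i];
            my += y[i];
        }
        mx /= n;
        my /= n;
        double sxy = 0.0, sxx = 0.0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
        }
        if (sxx == 0.0) {
            throw new IllegalArgumentException("x values must not all be equal");
        }
        double slope = sxy / sxx;
        return new double[] {slope, my - slope * mx};
    }

    // Evaluates the line at a known X value.
    static double estimateY(double[] d, double xValue) {
        return d[0] * xValue + d[1];
    }
}
```

For the pairs (1,2), (2,1), (3,4), (4,3), (5,5) the line is Y = 0.8*X + 0.6, so the estimate at X = 6 is 5.4.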


leastSquaresRegressionLineX

public double[] leastSquaresRegressionLineX()
Constructs the regression line of X on Y using the method of least squares. That is, the least squares regression line is constructed with the first elements of the pairs from which the data set is constructed plotted against the second elements of the pairs.

Returns:
an array d = {d[0], d[1]} with two elements, where the regression line is given in functional form by the following formula: X(y)=d[0]*Y+d[1]
See Also:
leastSquaresRegressionLineY() - Calculates the regression line of Y on X, using the method of least squares.

estimateX

public double estimateX(double yValue)
Estimates the value of the X variable when the Y variable is known using the regression line of X on Y, which can be evaluated using leastSquaresRegressionLineX().
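
The regression line of X on Y is obtained by swapping the roles of the two variables, which in general is not the same as algebraically inverting the line of Y on X. The sketch below uses slope = Sxy / Syy and intercept = meanX - slope * meanY, with the documented layout X = d[0]*Y + d[1]. RegressionXOnYSketch and its method names are hypothetical.

```java
// Hypothetical standalone sketch of the least squares line of X on Y.
final class RegressionXOnYSketch {
    // Returns {slope, intercept}, so that X = d[0] * Y + d[1].
    static double[] lineXOnY(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        int n = x.length;
        double mx = 0.0, my = 0.0;
        for (int i = 0; i < n; i++) {
            mx += x[i];
            my += y[i];
        }
        mx /= n;
        my /= n;
        double sxy = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            syy += (y[i] - my) * (y[i] - my);
        }
        if (syy == 0.0) {
            throw new IllegalArgumentException("y values must not all be equal");
        }
        double slope = sxy / syy;
        return new double[] {slope, mx - slope * my};
    }

    // Evaluates the line at a known Y value.
    static double estimateX(double[] d, double yValue) {
        return d[0] * yValue + d[1];
    }
}
```

For the pairs (1,1), (2,3), (3,2), (4,5) the slope of X on Y is 5.5 / 8.75, while the slope of Y on X is 1.1; inverting the latter would give about 0.909 rather than about 0.629. The two lines coincide only when the data are perfectly linear.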


coefficientOfDetermination

public double coefficientOfDetermination()
Calculates the coefficient of determination for the current set of data.

Explanation of the Coefficient of Determination

The coefficient of determination is the amount of variation in the second variable (i.e. Y) which is explained by the regression line of the second variable Y on the first variable X (see leastSquaresRegressionLineY()), divided by the total amount of variation of the second variable (i.e. Y).

Returns:
the coefficient of determination for the current set of data
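
The ratio described above (explained variation of Y divided by total variation of Y) can be sketched directly. The code fits the line of Y on X by least squares and then forms the ratio of the two sums of squares. DeterminationSketch is a hypothetical name, and the code is an illustration of the definition rather than the library's implementation.

```java
// Hypothetical standalone sketch of the coefficient of determination.
final class DeterminationSketch {
    static double coefficientOfDetermination(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        int n = x.length;
        double mx = 0.0, my = 0.0;
        for (int i = 0; i < n; i++) {
            mx += x[i];
            my += y[i];
        }
        mx /= n;
        my /= n;
        double sxy = 0.0, sxx = 0.0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
        }
        double slope = sxy / sxx;
        double intercept = my - slope * mx;

        double explained = 0.0; // variation of the fitted values around meanY
        double total = 0.0;     // variation of the observed values around meanY
        for (int i = 0; i < n; i++) {
            double fitted = slope * x[i] + intercept;
            explained += (fitted - my) * (fitted - my);
            total += (y[i] - my) * (y[i] - my);
        }
        return explained / total;
    }
}
```

For a simple linear regression this ratio equals the square of Pearson's correlation coefficient; the pairs (1,2), (2,1), (3,4), (4,3), (5,5) give 6.4 / 10 = 0.64, the square of 0.8.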

residuals

public double residuals(int index)
Determines the residual for a given pair of points within the current data set in accordance with the regression line constructed using leastSquaresRegressionLineX(). Recall that the residual is the variation of the second variable (i.e. Y) around the regression line.

Parameters:
index - the index of the pair of points within the current data set. The indexing of the pairs of points within the data set starts from 0; hence 0 corresponds to the first pair of points, 1 to the second pair of points, and so on.
Returns:
the residual for a given pair of points within the current data set.
See Also:
residualsAverage() - Evaluates the arithmetic average of the residuals.
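
A residual in the sense recalled above (the deviation of an observed Y value from the regression line) can be sketched as y[index] - (slope * x[index] + intercept). The sketch assumes the line of Y on X; which regression line the library actually uses for residuals is not verified here. ResidualSketch is a hypothetical name.

```java
// Hypothetical standalone sketch of a single residual around the line of Y on X.
final class ResidualSketch {
    // index is 0-based, matching the documented indexing of the pairs.
    static double residual(double[] x, double[] y, int index) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        int n = x.length;
        double mx = 0.0, my = 0.0;
        for (int i = 0; i < n; i++) {
            mx += x[i];
            my += y[i];
        }
        mx /= n;
        my /= n;
        double sxy = 0.0, sxx = 0.0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
        }
        double slope = sxy / sxx;
        double intercept = my - slope * mx;
        // Observed value minus the value predicted by the fitted line.
        return y[index] - (slope * x[index] + intercept);
    }
}
```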

residualsAverage

public double residualsAverage()
Determines the arithmetic average of the residuals for all pairs of points within the current data set in accordance with the regression line constructed using leastSquaresRegressionLineX(). This method simply determines the arithmetic mean of all values determined by residuals.

Returns:
the residuals average for all pairs of points within the current data set.
See Also:
residuals(int) - Evaluates the residuals for a given pair of points.
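
When interpreting an average of residuals it helps to know that, for a least squares line with an intercept, the signed residuals around that line sum to zero up to rounding, whereas the mean of their absolute values does not. The sketch below computes both for the line of Y on X. Whether the library averages signed or absolute residuals, and around which line, is not stated in this documentation; ResidualAverageSketch is a hypothetical name.

```java
// Hypothetical standalone sketch comparing signed and absolute residual averages.
final class ResidualAverageSketch {
    // Returns {mean of signed residuals, mean of absolute residuals}.
    static double[] averages(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        int n = x.length;
        double mx = 0.0, my = 0.0;
        for (int i = 0; i < n; i++) {
            mx += x[i];
            my += y[i];
        }
        mx /= n;
        my /= n;
        double sxy = 0.0, sxx = 0.0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
        }
        double slope = sxy / sxx;
        double intercept = my - slope * mx;

        double signedSum = 0.0;
        double absSum = 0.0;
        for (int i = 0; i < n; i++) {
            double residual = y[i] - (slope * x[i] + intercept);
            signedSum += residual;
            absSum += Math.abs(residual);
        }
        return new double[] {signedSum / n, absSum / n};
    }
}
```

For the pairs (1,2), (2,1), (3,4), (4,3), (5,5) the residuals are 0.6, -1.2, 1.0, -0.8 and 0.4, so the signed mean is 0 and the mean absolute residual is 0.8.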
