MADlib  1.0
User Documentation
Linear Regression
About:

Ordinary least-squares (OLS) linear regression refers to a stochastic model in which the conditional mean of the dependent variable (usually denoted \( Y \)) is an affine function of the vector of independent variables (usually denoted \( \boldsymbol x \)). That is,

\[ E[Y \mid \boldsymbol x] = \boldsymbol c^T \boldsymbol x \]

for some unknown vector of coefficients \( \boldsymbol c \). The residuals are assumed to be independent and identically distributed (i.i.d.) Gaussian. That is, the (conditional) probability density of \( Y \) is given by

\[ f(y \mid \boldsymbol x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \cdot \exp\left(-\frac{1}{2 \sigma^2} \cdot (y - \boldsymbol x^T \boldsymbol c)^2 \right) \,. \]

OLS linear regression finds the vector of coefficients \( \boldsymbol c \) that maximizes the likelihood of the observations.

Let \( X = (\boldsymbol x_1, \dotsc, \boldsymbol x_n)^T \) denote the design matrix whose rows are the \( n \) observed vectors of independent variables, and let \( \boldsymbol y = (y_1, \dotsc, y_n)^T \) denote the vector of observed values of the dependent variable.

Maximizing the likelihood is equivalent to maximizing the log-likelihood \( \sum_{i=1}^n \log f(y_i \mid \boldsymbol x_i) \), which simplifies to minimizing the residual sum of squares \( RSS \) (also called sum of squared residuals or sum of squared errors of prediction),

\[ RSS = \sum_{i=1}^n ( y_i - \boldsymbol c^T \boldsymbol x_i )^2 = (\boldsymbol y - X \boldsymbol c)^T (\boldsymbol y - X \boldsymbol c) \,. \]

The first-order conditions yield that the \( RSS \) is minimized at

\[ \boldsymbol c = (X^T X)^+ X^T \boldsymbol y \,. \]
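
As a quick sanity check of this closed form, the following pure-Python sketch fits a two-coefficient model on a hypothetical toy data set by solving the normal equations directly (MADlib performs the equivalent computation in-database):

```python
# Illustrative pure-Python OLS fit via the normal equations c = (X^T X)^{-1} X^T y.
# Hypothetical toy data; not MADlib's implementation.

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def solve2(A, b):
    # Solve a 2x2 linear system by Cramer's rule.
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - b[1] * A[0][1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

# Design matrix with an intercept column; observations satisfy y = 2 + 3x exactly.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [5.0, 8.0, 11.0, 14.0]

Xt = transpose(X)
XtX = matmul(Xt, X)
Xty = [sum(r * yi for r, yi in zip(row, y)) for row in Xt]
c = solve2(XtX, Xty)
print(c)  # -> [2.0, 3.0]
```

Since the data lie exactly on a line, the fit recovers the generating coefficients; with noisy data the same formula yields the least-squares fit.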

Computing the total sum of squares \( TSS \), the explained sum of squares \( ESS \) (also called the regression sum of squares), and the coefficient of determination \( R^2 \) is done according to the following formulas:

\begin{align*} ESS & = \boldsymbol y^T X \boldsymbol c - \frac{ \left( \sum_{i=1}^n y_i \right)^2 }{n} \\ TSS & = \sum_{i=1}^n y_i^2 - \frac{ \left( \sum_{i=1}^n y_i \right)^2 }{n} \\ R^2 & = \frac{ESS}{TSS} \end{align*}

Note: The last equality follows from the definition \( R^2 = 1 - \frac{RSS}{TSS} \) and the fact that for linear regression \( TSS = RSS + ESS \). A proof of the latter can be found, e.g., at: http://en.wikipedia.org/wiki/Sum_of_squares
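
The identity \( TSS = RSS + ESS \) and the resulting \( R^2 \) can be verified numerically. The sketch below uses a hypothetical four-point data set; the coefficients \( (2.15, 2.94) \) are its exact least-squares fit:

```python
# Numerical check of the identity TSS = RSS + ESS and of R^2 = ESS/TSS for an
# OLS fit. Hypothetical toy data; (2.15, 2.94) is its exact least-squares fit.
x = [1.0, 2.0, 3.0, 4.0]
y = [5.1, 7.9, 11.2, 13.8]
c0, c1 = 2.15, 2.94

yhat = [c0 + c1 * xi for xi in x]   # fitted values
ybar = sum(y) / len(y)              # mean of the observations

RSS = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))
ESS = sum((fi - ybar) ** 2 for fi in yhat)
TSS = sum((yi - ybar) ** 2 for yi in y)

print(abs(TSS - (RSS + ESS)) < 1e-9)  # -> True
print(round(ESS / TSS, 4))            # -> 0.9981
```

The identity holds exactly only at the least-squares minimizer, where the residuals are orthogonal to the column space of \( X \).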

We estimate the variance \( Var[Y - \boldsymbol c^T \boldsymbol x \mid \boldsymbol x] \) as

\[ \sigma^2 = \frac{RSS}{n - k} \]

and compute the t-statistic for coefficient \( i \) as

\[ t_i = \frac{c_i}{\sqrt{\sigma^2 \cdot \left( (X^T X)^{-1} \right)_{ii} }} \,. \]

The \( p \)-value for coefficient \( i \) gives the probability of seeing a value at least as extreme as the one observed, provided that the null hypothesis ( \( c_i = 0 \) ) is true. Letting \( F_\nu \) denote the cumulative distribution function of the Student's t-distribution with \( \nu \) degrees of freedom, the \( p \)-value for coefficient \( i \) is therefore

\[ p_i = \Pr(|T| \geq |t_i|) = 2 \cdot (1 - F_{n - k}( |t_i| )) \]

where \( T \) is a Student's t-distributed random variable with \( n - k \) degrees of freedom.
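
The variance and t-statistic formulas can be exercised on the same hypothetical toy fit used above; the diagonal of \( (X^T X)^{-1} \) below was computed by hand for the \( 4 \times 2 \) design matrix:

```python
# Variance estimate and t-statistics for a hypothetical toy fit with
# n = 4 observations and k = 2 coefficients. The diagonal of (X^T X)^{-1}
# was precomputed for X = [[1,1],[1,2],[1,3],[1,4]].
import math

n, k = 4, 2
RSS = 0.082                # residual sum of squares of the fit
XtX_inv_diag = [1.5, 0.2]  # diagonal of (X^T X)^{-1}
c = [2.15, 2.94]           # fitted coefficients

sigma2 = RSS / (n - k)     # variance estimate sigma^2 = RSS / (n - k)
t = [ci / math.sqrt(sigma2 * d) for ci, d in zip(c, XtX_inv_diag)]
print([round(ti, 2) for ti in t])  # -> [8.67, 32.47]
```

Converting these t-statistics into p-values requires the Student-t CDF, which the Python standard library does not provide; `2 * scipy.stats.t.sf(abs(t_i), n - k)` is one common way to obtain the two-sided p-value.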

The condition number [2] \( \kappa(X) = \|X\|_2\cdot\|X^{-1}\|_2 \) is computed as the product of two spectral norms [3]. The spectral norm of a matrix \( X \) is its largest singular value, i.e., the square root of the largest eigenvalue of the positive-semidefinite matrix \( X^{*}X \):

\[ \|X\|_2 = \sqrt{\lambda_{\max}\left(X^{*}X\right)}\ , \]

where \(X^{*}\) is the conjugate transpose of \(X\). The condition number of a linear regression problem is a worst-case measure of how sensitive the result is to small perturbations of the input. A large condition number (say, more than 1000) indicates the presence of significant multicollinearity.
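
For small matrices the condition number can be obtained directly from the eigenvalues of \( X^T X \), since \( \|X\|_2 = \sqrt{\lambda_{\max}(X^T X)} \) and the smallest singular value is likewise \( \sqrt{\lambda_{\min}(X^T X)} \). A sketch on a hypothetical \( 4 \times 2 \) design matrix:

```python
# Condition number of the toy design matrix X = [[1,1],[1,2],[1,3],[1,4]],
# computed as sqrt(lambda_max / lambda_min) of X^T X. For a symmetric 2x2
# matrix the eigenvalues follow from the trace and determinant.
import math

XtX = [[4.0, 10.0], [10.0, 30.0]]  # X^T X for the toy design matrix
tr = XtX[0][0] + XtX[1][1]
det = XtX[0][0] * XtX[1][1] - XtX[0][1] * XtX[1][0]
disc = math.sqrt(tr * tr - 4 * det)
lam_max, lam_min = (tr + disc) / 2, (tr - disc) / 2

kappa = math.sqrt(lam_max / lam_min)  # = sigma_max(X) / sigma_min(X)
print(round(kappa, 2))  # -> 7.47
```

A value this small indicates a well-conditioned problem; values above roughly 1000, as in the examples below, signal multicollinearity.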

Input:

The training data is expected to be of the following form:

{TABLE|VIEW} sourceName (
    ...
    dependentVariable FLOAT8,
    independentVariables FLOAT8[],
    ...
)
Usage:

(1) The Simple Interface

(2) The Full Interface

The full interface also supports analysis of heteroskedasticity in the linear fit.

SELECT madlib.linregr_train (
    'source_table',                 -- name of input table, VARCHAR
    'out_table',                    -- name of output table, VARCHAR
    'dependent_varname',            -- dependent variable, VARCHAR
    'independent_varname',          -- independent variable, VARCHAR
    [ group_cols,                   -- names of columns to group by, VARCHAR[].
                                    --   Default value: NULL
    [ heteroskedasticity_option ]]  -- whether to analyze heteroskedasticity,
                                    --   BOOLEAN. Default value: FALSE
);

Here 'independent_varname' can be the name of a column that contains an array of numeric values, or a string expression of the form 'array[1, x1, x2, x3]', where x1, x2, and x3 are column names.

Output is stored in the out_table:

[ group_col_1 | group_col_2 | ... | ] coef | r2 | std_err | t_stats | p_values | condition_no [ | bp_stats | bp_p_value ]

The leading part

[ group_col_1 | group_col_2 | ... | ]

is present only when group_cols is not NULL. The trailing part

[ bp_stats | bp_p_value ]

is present only when heteroskedasticity_option is TRUE.

When group_cols is given, the data are grouped by those columns and a separate linear model is fit to each group. The output then also contains the grouping columns, with one row of regression results per combination of group_cols values.

When heteroskedasticity_option is TRUE, the output has two additional columns: the Breusch–Pagan test [4] statistic and its corresponding \(p\)-value.
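
For intuition, the (Koenker) \( n \cdot R^2 \) form of the Breusch–Pagan statistic can be sketched in a few lines: regress the squared OLS residuals on the regressors and take \( n \) times the \( R^2 \) of that auxiliary fit. The residuals below come from a hypothetical toy fit; this is not MADlib's implementation:

```python
# Sketch of the (Koenker) Breusch-Pagan statistic n * R^2: regress the squared
# OLS residuals on the regressor and take n times the R^2 of that auxiliary
# fit. Hypothetical toy residuals; not MADlib's implementation.
x = [1.0, 2.0, 3.0, 4.0]
r = [0.01, -0.13, 0.23, -0.11]  # OLS residuals of the toy fit
u = [ri ** 2 for ri in r]       # squared residuals

n = len(x)
xbar, ubar = sum(x) / n, sum(u) / n

# Simple-regression fit of u on x (with intercept).
slope = (sum((xi - xbar) * (ui - ubar) for xi, ui in zip(x, u))
         / sum((xi - xbar) ** 2 for xi in x))
intercept = ubar - slope * xbar

uhat = [intercept + slope * xi for xi in x]
ESS = sum((f - ubar) ** 2 for f in uhat)
TSS = sum((ui - ubar) ** 2 for ui in u)

bp_stat = n * ESS / TSS  # compare against chi^2 with (k - 1) degrees of freedom
print(round(bp_stat, 3))  # -> 0.669
```

The p-value then follows from the chi-squared distribution with \( k - 1 \) degrees of freedom, one for each non-intercept regressor.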

Examples:

The following example is taken from http://www.stat.columbia.edu/~martin/W2110/SAS_7.pdf.

  1. Create the sample data set:
    sql> CREATE TABLE houses (id INT, tax INT, bedroom INT, bath FLOAT, price INT,
                size INT, lot INT);
    sql> COPY houses FROM STDIN WITH DELIMITER '|';
      1 |  590 |       2 |    1 |  50000 |  770 | 22100
      2 | 1050 |       3 |    2 |  85000 | 1410 | 12000
      3 |   20 |       3 |    1 |  22500 | 1060 |  3500
      4 |  870 |       2 |    2 |  90000 | 1300 | 17500
      5 | 1320 |       3 |    2 | 133000 | 1500 | 30000
      6 | 1350 |       2 |    1 |  90500 |  820 | 25700
      7 | 2790 |       3 |  2.5 | 260000 | 2130 | 25000
      8 |  680 |       2 |    1 | 142500 | 1170 | 22000
      9 | 1840 |       3 |    2 | 160000 | 1500 | 19000
     10 | 3680 |       4 |    2 | 240000 | 2790 | 20000
     11 | 1660 |       3 |    1 |  87000 | 1030 | 17500
     12 | 1620 |       3 |    2 | 118600 | 1250 | 20000
     13 | 3100 |       3 |    2 | 140000 | 1760 | 38000
     14 | 2070 |       2 |    3 | 148000 | 1550 | 14000
     15 |  650 |       3 |  1.5 |  65000 | 1450 | 12000
    \.
    
  2. You can call the linregr() function for an individual metric:
    sql> SELECT (linregr(price, array[1, bedroom, bath, size])).coef FROM houses;
                                      coef
    ------------------------------------------------------------------------
     {27923.4334170641,-35524.7753390234,2269.34393735323,130.793920208133}
    (1 row)
    
    sql> SELECT (linregr(price, array[1, bedroom, bath, size])).r2 FROM houses;
            r2
    -------------------
     0.745374010140315
    (1 row)
    
    sql> SELECT (linregr(price, array[1, bedroom, bath, size])).std_err FROM houses;
                                   std_err
    ----------------------------------------------------------------------
     {56306.4821787474,25036.6537279169,22208.6687270562,36.208642285651}
    (1 row)
    
    sql> SELECT (linregr(price, array[1, bedroom, bath, size])).t_stats FROM houses;
                                    t_stats
    ------------------------------------------------------------------------
     {0.495918628487924,-1.41891067892239,0.10218279921428,3.6122293450358}
    (1 row)
    
    sql> SELECT (linregr(price, array[1, bedroom, bath, size])).p_values FROM houses;
                                      p_values
    -----------------------------------------------------------------------------
     {0.629711069315512,0.183633155781461,0.920450514073051,0.00408159079312354}
    (1 row)
    
  3. Alternatively, you can call the linregr() function for the full record:
    sql> \x on
    Expanded display is on.
    sql> SELECT (r).* FROM (SELECT linregr(price, array[1, bedroom, bath, size])
                AS r FROM houses) q;
    -[ RECORD 1 ]+-----------------------------------------------------------------
    coef         | {27923.4334170641,-35524.7753390234,2269.34393735323,130.793920208133}
    r2           | 0.745374010140315
    std_err      | {56306.4821787474,25036.6537279169,22208.6687270562,36.208642285651}
    t_stats      | {0.495918628487924,-1.41891067892239,0.10218279921428,3.6122293450358}
    p_values     | {0.629711069315512,0.183633155781461,0.920450514073051,0.00408159079312354}
    condition_no | 9783.018
  4. For the full interface, including grouping and heteroskedasticity analysis, call the linregr_train() function:
    sql> SELECT madlib.linregr_train('houses', 'result', 'price',
                                     'array[1, tax, bath, size]',
                                     '{bedroom}'::varchar[], True);
    
    sql> SELECT * from result;
    -[ RECORD 1]---------+-------------------------------------------------------
    bedroom             | 2
    coef                | {-84242.0345, 55.4430, -78966.9754, 225.6119}
    r2                  | 0.9688
    std_err             | {35019.00, 19.57, 23036.81, 49.04}
    t_stats             | {-2.406, 2.833, -3.428, 4.600}
    p_values            | {0.251, 0.216, 0.181, 0.136}
    condition_no        | 10086.1
    bp_stats            | 2.5451
    bp_p_value          | 0.4672
    
    -[ RECORD 2]---------+------------------------------------------------------
    bedroom             | 3
    coef                | {-88155.8292502747,27.1966436293179,41404.0293389239,62.6375210724027}
    r2                  | 0.841699901312963
    std_err             | {57867.9999699512,17.82723091538,43643.1321521931,70.8506824870639}
    t_stats             | {-1.52339512850022,1.52556747362568,0.948695185179172,0.884077878626493}
    p_values            | {0.18816143289241,0.187636685729725,0.38634003235866,0.417132778730133}
    condition_no        | 11722.62
    bp_stats            | 6.7538
    bp_p_value          | 0.08017
    
    -[ RECORD 3]---------+-------------------------------------------------------
    bedroom             | 4
    coef                | {0.0112536020318378,41.4132554771633,0.0225072040636757,31.3975496688276}
    r2                  | 1
    std_err             | {0,0,0,0}
    t_stats             | {Infinity,Infinity,Infinity,Infinity}
    p_values            | Null
    condition_no        | Null
    bp_stats            | Null
    bp_p_value          | Null
Literature:

[1] Cosma Shalizi: Statistics 36-350: Data Mining, Lecture Notes, 21 October 2009, http://www.stat.cmu.edu/~cshalizi/350/lectures/17/lecture-17.pdf

[2] Wikipedia: Condition Number, http://en.wikipedia.org/wiki/Condition_number.

[3] Wikipedia: Spectral Norm, http://en.wikipedia.org/wiki/Spectral_norm#Spectral_norm

[4] Wikipedia: Breusch–Pagan test, http://en.wikipedia.org/wiki/Breusch%E2%80%93Pagan_test

[5] Wikipedia: Heteroscedasticity-consistent standard errors, http://en.wikipedia.org/wiki/Heteroscedasticity-consistent_standard_errors

See Also
File linear.sql_in documenting the SQL functions.