Statsmodels#

  • A mathematical equation explains the relationship between the dependent variable (Y) and the independent variable (X).

    Y = f(X)
    
  • Due to uncertainty in the result and noise, the equation becomes

    Y = f(X) + e
    

Linear Models#

\begin{align*} Y &= \theta_0 + \theta_1 X_1 + \theta_2 X_2 + \dots + \theta_n X_n\\ Y &= \theta_0 + \theta_1 X + \theta_2 X^2 + \dots + \theta_n X^n\\ Y &= \theta_0 + \theta_1 \sin(X_1) + \theta_2 \cos(X_2) \end{align*}

Linear Regression#

\begin{align*} Y &= \theta_0 + \theta_1 X + e \\ & \theta_0, \theta_1 = \text{coefficients} \\ & e = \text{normally distributed residual error} \end{align*}

  • The linear regression model assumes that the residuals are independent and normally distributed (both assumptions can be checked after fitting; see the sketch after this list)

  • The model is fitted to the data using the ordinary least squares (OLS) approach
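
A minimal sketch of checking those residual assumptions after a fit; the data below is synthetic and purely illustrative:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson, jarque_bera

# Synthetic data: y depends linearly on x with Gaussian noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 0.5 * x + rng.normal(size=200)

results = sm.OLS(y, sm.add_constant(x)).fit()

# Jarque-Bera tests normality of the residuals;
# a Durbin-Watson statistic near 2 suggests independence
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(results.resid)
print(jb_pvalue, durbin_watson(results.resid))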

Non-Linear Models#

  • In most cases, non-linear models can be generalized to linear models

  • Examples include Binomial Regression and Poisson Regression (see the GLM sketch below)
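
As a sketch of this idea, a binomial model can be expressed as a generalized linear model with a logit link; the data here is synthetic:

import numpy as np
import statsmodels.api as sm

# Synthetic binary outcomes whose success probability follows a logistic curve
rng = np.random.default_rng(1)
x = rng.normal(size=100)
X = sm.add_constant(x)
p = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x)))
y = rng.binomial(1, p)

# Binomial regression as a GLM (the logit link is the Binomial default)
glm_binom = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(glm_binom.params)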

Design Matrices#

  • Once the model is chosen, design matrices are constructed: Y = XB + e (a least-squares sketch follows the table below)

Variable    Description
Y           vector/matrix of the dependent variable
X           vector/matrix of the independent variables
B           vector/matrix of coefficients
e           residual error
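
Given Y and X, the coefficient vector B can be estimated by least squares; a minimal NumPy sketch with made-up numbers:

import numpy as np

# Toy design matrices: an intercept column plus one predictor
Y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones(5), np.array([6.0, 7.0, 8.0, 9.0, 10.0])])

# Least-squares solution of Y = XB + e
B, residuals, rank, sv = np.linalg.lstsq(X, Y, rcond=None)
print(B)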

Creating a Model#

Using the statsmodels library#

  • OLS (ordinary least squares)

  • GLM (generalized linear model)

  • WLS (weighted least squares)

  • ols

  • glm

  • wls

    Uppercase names take design matrices as arguments;
    lowercase names take Patsy formulas and DataFrames as arguments, as shown in the sketch below.
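
A minimal sketch contrasting the two interfaces on synthetic data:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 2.0 + 3.0 * x + rng.normal(size=50)

# Uppercase API: pass design matrices explicitly (add the intercept column yourself)
X = sm.add_constant(x)
result1 = sm.OLS(y, X).fit()

# Lowercase API: pass a Patsy formula and a DataFrame (intercept added automatically)
df = pd.DataFrame({'x': x, 'y': y})
result2 = smf.ols('y ~ x', data=df).fit()

print(result1.params)
print(result2.params)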
    

Fitting a Model#

  • The fit() method returns a results object providing further methods, attributes, and the coefficient matrix for analysis

View Model Summary#

  • The summary() method describes the fit of the model in text form, as in the sketch below.
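
A minimal sketch of fitting a model and inspecting the results object; the data is synthetic, and the attributes shown are from the statsmodels results API:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=100)
X = sm.add_constant(x)
y = 1.0 + 2.0 * x + rng.normal(size=100)

results = sm.OLS(y, X).fit()   # fit() returns a results object

print(results.params)          # estimated coefficients
print(results.bse)             # standard errors of the coefficients
print(results.rsquared)        # goodness of fit
print(results.summary())       # full text report of the fit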


Construct Design Matrices#

\(Y = \theta_0 + \theta_1 X_1 + \theta_2 X_2 + \theta_3 X_1 X_2\)

Design Matrix with NumPy#

[1]:
import numpy as np

# Dependent variable as a column vector
Y = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)

x1 = np.array([6, 7, 8, 9, 10])
x2 = np.array([11, 12, 13, 14, 15])

# Design matrix: intercept column, x1, x2, and the interaction term x1*x2
X = np.vstack([np.ones(5), x1, x2, x1*x2]).T

print(Y)
print(X)
[[1]
 [2]
 [3]
 [4]
 [5]]
[[  1.   6.  11.  66.]
 [  1.   7.  12.  84.]
 [  1.   8.  13. 104.]
 [  1.   9.  14. 126.]
 [  1.  10.  15. 150.]]

Design Matrix with patsy#

  • allows defining a model with a concise formula syntax

  • constructs relevant design matrices (patsy.dmatrices)

  • takes a formula as a string and a dictionary-like object with data arrays for the variables

Parse tree of the formula 'Y ~ 1 + x1 + x2 + x1*x2':

    ~
 /    \
Y     +
    /   \
   1     +
       /   \
      x1    +
          /   \
        x2     *
             /   \
           x1    x2
  • 'y ~ np.log(x1)': NumPy functions can often be used to transform terms in the expression.

  • 'y ~ I(x1 + x2)': I() is the identity function, used to escape arithmetic expressions so they are evaluated numerically rather than interpreted as formula operators.

  • 'y ~ C(x1)': Treats the variable x1 as a categorical variable (see the sketch below).
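
A small sketch of these transforms; the arrays and the categorical column g are made up for illustration:

import numpy as np
import patsy

data = {
    'y': np.array([1.0, 2.0, 3.0]),
    'x1': np.array([1.0, 2.0, 3.0]),
    'x2': np.array([4.0, 5.0, 6.0]),
    'g': np.array(['a', 'b', 'a']),
}

# I() evaluates x1 + x2 arithmetically instead of adding two separate terms
y, X = patsy.dmatrices('y ~ I(x1 + x2)', data)
print(X)

# C() expands g into dummy-coded categorical columns
y, X = patsy.dmatrices('y ~ C(g)', data)
print(X)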

[2]:
import patsy

Y = np.array([1, 2, 3, 4, 5])
x1 = np.array([6, 7, 8, 9, 10])
x2 = np.array([11, 12, 13, 14, 15])
data = {
    'Y': Y,
    'x1': x1,
    'x2': x2,
}

# The formula mirrors the parse tree shown above
equation = 'Y ~ 1 + x1 + x2 + x1*x2'

# dmatrices builds the response and design matrices from `data`
Y, X = patsy.dmatrices(equation, data)

print(Y)
print(X)
[[1.]
 [2.]
 [3.]
 [4.]
 [5.]]
[[  1.   6.  11.  66.]
 [  1.   7.  12.  84.]
 [  1.   8.  13. 104.]
 [  1.   9.  14. 126.]
 [  1.  10.  15. 150.]]

Linear Model Creation using statsmodels#

Using the Icecream dataset from the Rdatasets repository#

[4]:
import statsmodels.api as sm

icecream = sm.datasets.get_rdataset("Icecream","Ecdat")

dataset = icecream.data
dataset.head()
[4]:
    cons  income  price  temp
0  0.386      78  0.270    41
1  0.374      79  0.282    56
2  0.393      81  0.277    63
3  0.425      80  0.280    68
4  0.406      76  0.272    69
[5]:
import statsmodels.formula.api as smf

linearModel1 = smf.ols('cons ~ price + temp', data=dataset)

fitModel1 = linearModel1.fit()

print(fitModel1.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                   cons   R-squared:                       0.633
Model:                            OLS   Adj. R-squared:                  0.606
Method:                 Least Squares   F-statistic:                     23.27
Date:                Thu, 15 Oct 2020   Prob (F-statistic):           1.34e-06
Time:                        21:53:43   Log-Likelihood:                 54.607
No. Observations:                  30   AIC:                            -103.2
Df Residuals:                      27   BIC:                            -99.01
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.5966      0.258      2.309      0.029       0.067       1.127
price         -1.4018      0.925     -1.515      0.141      -3.300       0.496
temp           0.0030      0.000      6.448      0.000       0.002       0.004
==============================================================================
Omnibus:                        0.991   Durbin-Watson:                   0.656
Prob(Omnibus):                  0.609   Jarque-Bera (JB):                0.220
Skew:                          -0.107   Prob(JB):                        0.896
Kurtosis:                       3.361   Cond. No.                     6.58e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.58e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
[6]:
linearModel2 = smf.ols('cons ~ income + temp', data=dataset)

fitModel2 = linearModel2.fit()

print(fitModel2.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                   cons   R-squared:                       0.702
Model:                            OLS   Adj. R-squared:                  0.680
Method:                 Least Squares   F-statistic:                     31.81
Date:                Thu, 15 Oct 2020   Prob (F-statistic):           7.96e-08
Time:                        21:53:43   Log-Likelihood:                 57.742
No. Observations:                  30   AIC:                            -109.5
Df Residuals:                      27   BIC:                            -105.3
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.1132      0.108     -1.045      0.305      -0.335       0.109
income         0.0035      0.001      3.017      0.006       0.001       0.006
temp           0.0035      0.000      7.963      0.000       0.003       0.004
==============================================================================
Omnibus:                        2.264   Durbin-Watson:                   1.003
Prob(Omnibus):                  0.322   Jarque-Bera (JB):                1.094
Skew:                           0.386   Prob(JB):                        0.579
Kurtosis:                       3.528   Cond. No.                     1.56e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.56e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
[7]:
## '-1' in the formula drops the intercept term
linearModel3 = smf.ols('cons ~ -1 + income + temp', data=dataset)

fitModel3 = linearModel3.fit()

print(fitModel3.summary())
                                 OLS Regression Results
=======================================================================================
Dep. Variable:                   cons   R-squared (uncentered):                   0.990
Model:                            OLS   Adj. R-squared (uncentered):              0.990
Method:                 Least Squares   F-statistic:                              1426.
Date:                Thu, 15 Oct 2020   Prob (F-statistic):                    6.77e-29
Time:                        21:53:43   Log-Likelihood:                          57.146
No. Observations:                  30   AIC:                                     -110.3
Df Residuals:                      28   BIC:                                     -107.5
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
income         0.0023      0.000      9.906      0.000       0.002       0.003
temp           0.0033      0.000      8.571      0.000       0.003       0.004
==============================================================================
Omnibus:                        3.584   Durbin-Watson:                   0.887
Prob(Omnibus):                  0.167   Jarque-Bera (JB):                2.089
Skew:                           0.508   Prob(JB):                        0.352
Kurtosis:                       3.798   Cond. No.                         6.45
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[8]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
df = sm.datasets.get_rdataset("mtcars").data


## log-log model: NumPy functions applied inside the formula
model = smf.ols('np.log(wt) ~ np.log(mpg)', data=df)
trainedModel = model.fit()

print(trainedModel.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:             np.log(wt)   R-squared:                       0.806
Model:                            OLS   Adj. R-squared:                  0.799
Method:                 Least Squares   F-statistic:                     124.4
Date:                Thu, 15 Oct 2020   Prob (F-statistic):           3.41e-12
Time:                        21:53:48   Log-Likelihood:                 18.024
No. Observations:                  32   AIC:                            -32.05
Df Residuals:                      30   BIC:                            -29.12
Df Model:                           1
Covariance Type:            nonrobust
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       3.9522      0.255     15.495      0.000       3.431       4.473
np.log(mpg)    -0.9570      0.086    -11.152      0.000      -1.132      -0.782
==============================================================================
Omnibus:                        1.199   Durbin-Watson:                   1.625
Prob(Omnibus):                  0.549   Jarque-Bera (JB):                1.159
Skew:                           0.349   Prob(JB):                        0.560
Kurtosis:                       2.381   Cond. No.                         33.5
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Logistic Regression#

  • Logit : Logistic Regression

  • MNLogit : Multinomial Logistic Regression

  • Poisson : Poisson Regression

\begin{align*} h_\theta(x) &= g(\theta^T x)\\ z &= \theta^T x\\ g(z) &= \frac{1}{1 + e^{-z}} \end{align*}

[9]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
import pandas as pd


df = sm.datasets.get_rdataset('iris').data

## Logit requires a binary target, so keep only two of the three species
dfSubset = df[(df['Species'] == "versicolor") | (df['Species'] == "virginica")].copy()


## preprocessing
## label encoding manually

dfSubset["Species"] = dfSubset['Species'].map({
    "versicolor" : 1,
    "virginica" : 0
})

## rename columns: '.' in names would clash with Patsy formula syntax
dfSubset.columns = [column.replace(".", "_") for column in dfSubset.columns]

## Creating a model
model = smf.logit('Species ~ Petal_Length + Petal_Width', data=dfSubset)
trainedModel = model.fit()

print(trainedModel.summary())



## Make Predictions
dfTest = pd.DataFrame({
    "Petal_Length" : np.random.randn(20) * 0.7 + 6,
    "Petal_Width" : np.random.randn(20) * 0.7 + 1
})


## predict() returns probabilities; threshold at 0.5 for class labels
dfTest['rawSpecies'] = trainedModel.predict(dfTest)
dfTest['Species'] = dfTest.rawSpecies.apply(lambda x: 1 if x > 0.5 else 0)

print("*---- Test data and Predictions ----*")
dfTest.head()
Optimization terminated successfully.
         Current function value: 0.102818
         Iterations 10
                           Logit Regression Results
==============================================================================
Dep. Variable:                Species   No. Observations:                  100
Model:                          Logit   Df Residuals:                       97
Method:                           MLE   Df Model:                            2
Date:                Thu, 15 Oct 2020   Pseudo R-squ.:                  0.8517
Time:                        21:53:53   Log-Likelihood:                -10.282
converged:                       True   LL-Null:                       -69.315
Covariance Type:            nonrobust   LLR p-value:                 2.303e-26
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       45.2723     13.612      3.326      0.001      18.594      71.951
Petal_Length    -5.7545      2.306     -2.496      0.013     -10.274      -1.235
Petal_Width    -10.4467      3.756     -2.782      0.005     -17.808      -3.086
================================================================================

Possibly complete quasi-separation: A fraction 0.34 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
*---- Test data and Predictions ----*
[9]:
   Petal_Length  Petal_Width  rawSpecies  Species
0      6.437427     0.075640    0.999412        1
1      6.710820     1.414629    0.000296        0
2      5.009607     1.536436    0.597176        1
3      5.885385     1.335874    0.072375        0
4      6.339593     0.180516    0.998998        1

Poisson Regression Model#

  • Poisson regression is a generalized linear model form of regression analysis used to model count data and contingency tables. It assumes the response variable Y has a Poisson distribution and that the logarithm of its expected value can be modeled by a linear combination of unknown parameters. A Poisson regression model is sometimes known as a log-linear model, especially when used to model contingency tables.

  • describes a process where the dependent variable is the count of successes over many attempts, each attempt having a very low probability of success.

[10]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf


df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/poisson_sim.csv")


## C(prog) treats the program code as a categorical variable
model = smf.poisson('num_awards ~ math + C(prog)', data=df)
trainedModel = model.fit()


print(trainedModel.summary())
Optimization terminated successfully.
         Current function value: 0.913761
         Iterations 6
                          Poisson Regression Results
==============================================================================
Dep. Variable:             num_awards   No. Observations:                  200
Model:                        Poisson   Df Residuals:                      196
Method:                           MLE   Df Model:                            3
Date:                Thu, 15 Oct 2020   Pseudo R-squ.:                  0.2118
Time:                        21:53:55   Log-Likelihood:                -182.75
converged:                       True   LL-Null:                       -231.86
Covariance Type:            nonrobust   LLR p-value:                 3.747e-21
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       -5.2471      0.658     -7.969      0.000      -6.538      -3.957
C(prog)[T.2]     1.0839      0.358      3.025      0.002       0.382       1.786
C(prog)[T.3]     0.3698      0.441      0.838      0.402      -0.495       1.234
math             0.0702      0.011      6.619      0.000       0.049       0.091
================================================================================
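
Because the Poisson model is log-linear, exponentiating the coefficients gives incidence-rate ratios. A short follow-up sketch, assuming trainedModel from the cell above:

import numpy as np

## exp(coef) is the multiplicative effect of each term on the expected count
print(np.exp(trainedModel.params))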