Multiple Regression for SP500 Prediction Using Sector ETFs

By Maxim Shen … Jul 12th 2016

In this example, we illustrate how to predict the SP500 index from historical daily market data for different sector ETFs.

We then try to reduce the number of independent dimensions to determine the best multiple regression model: fewer dimensions means less multicollinearity among the independents, less risk of overfitting, and a model that is more efficient to update.

You can download the original Jupyter notebook and data files from my GitHub repo.

In [1]:
'''
In this example, we pick sector ETFs with relatively small tracking errors:

IYK  - consumer goods
VIS  - industrials
VPU  - utilities
XLK  - technology
PGF  - finance
FENY - energy
FREL - real estate
IHI  - health care

'''

import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
import statsmodels.api as sm

path = 'csv/'
price = 'Adj Close'
fields = ['Date','Adj Close']

# load the 8 independent variables and the dependent variable
colTitles = ['cust', 'idst', 'util', 'tech', 'fina', 'engy', 'rest', 'care']

cust = pd.read_csv(path + 'consumer_IYK.csv', usecols=fields, index_col='Date')
idst = pd.read_csv(path + 'industries_VIS.csv', usecols=fields, index_col='Date')
util = pd.read_csv(path + 'utilities_VPU.csv', usecols=fields, index_col='Date')
tech = pd.read_csv(path + 'tech_XLK.csv', usecols=fields, index_col='Date')
fina = pd.read_csv(path + 'finance_PGF.csv', usecols=fields, index_col='Date')
engy = pd.read_csv(path + 'energy_FENY.csv', usecols=fields, index_col='Date')
rest = pd.read_csv(path + 'realestate_FREL.csv', usecols=fields, index_col='Date')
care = pd.read_csv(path + 'healthcare_IHI.csv', usecols=fields, index_col='Date')

sp500 = pd.read_csv(path + 'sp500.csv', usecols=fields, low_memory=False, index_col='Date')  

# data includes 131 trading days between Jan 4th 2016 and Jul 11th 2016
print(cust.shape,idst.shape,util.shape,tech.shape,fina.shape,engy.shape,rest.shape,care.shape,sp500.shape)
(131, 1) (131, 1) (131, 1) (131, 1) (131, 1) (131, 1) (131, 1) (131, 1) (131, 1)
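Since we will later concatenate these frames on their Date index, it is worth a quick sanity check that every frame lines up with sp500 (a small, hedged check using the frames loaded above):

# every ETF frame should share sp500's trading-day index
for name, df in zip(colTitles + ['sp500'],
                    [cust, idst, util, tech, fina, engy, rest, care, sp500]):
    assert df.index.equals(sp500.index), name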
In [2]:
# Make scatter plots of each independent variable against the dependent (SP500)
%matplotlib inline

frames = [cust, idst, util, tech, fina, engy, rest, care]
titles = ['consumer', 'industrials', 'utilities', 'technology',
          'finance', 'energy', 'real estate', 'health care']

fig = plt.figure(figsize=(20, 20))
for i, (frame, title) in enumerate(zip(frames, titles), start=1):
    sub = fig.add_subplot(4, 2, i)
    sub.scatter(frame, sp500)
    sub.set_title(title, fontsize=30)
    sub.grid(True)

fig.tight_layout()

Observation:

  1. All independent variables appear to have a roughly linear relationship with SP500, except for the lower range of utilities
  2. Let’s keep ‘utilities’ in the model for now
  3. We will revisit which variables to drop once we have fitted the full model
In [3]:
# Next, let's check the adjusted R-squared for each single-variable regression on SP500
# (note: sm.OLS without an added constant fits through the origin)
frames = [cust, idst, util, tech, fina, engy, rest, care]
for name, frame in zip(colTitles, frames):
    print(name, sm.OLS(sp500, frame).fit().rsquared_adj)
cust 0.999878271424
idst 0.999672708306
util 0.998866896615
tech 0.999823931866
fina 0.999471444364
engy 0.997757890206
rest 0.999478007237
care 0.998983909575

Observation:

  1. All pairs have very high adjusted R-squared values
  2. Keep in mind these fits have no intercept, which tends to inflate R-squared for price-level data
  3. On this evidence alone it is hard for us to take out any dimension at this stage
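As a quick check of that caveat (a hedged one-liner, reusing the frames above), refitting with an explicit constant via sm.add_constant reports the centered R-squared instead:

# with an intercept, R-squared is centered and usually noticeably lower
print('util (with intercept)', sm.OLS(sp500, sm.add_constant(util)).fit().rsquared_adj)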
In [4]:
# Let's concatenate all data columns to make a whole dataframe
data = pd.concat([cust, idst, util, tech, fina, engy, rest, care], axis=1)
data.columns = colTitles
data.head()
Out[4]:
                  cust        idst        util       tech   fina       engy       rest        care
Date
2016-07-11  117.529999  109.309998  114.349998  44.389999  19.25  19.270000  25.320000  140.039993
2016-07-08  117.279999  108.570000  114.379997  44.130001  19.25  19.260000  25.150000  139.949997
2016-07-07  115.680000  106.459999  113.330002  43.450001  19.17  19.010000  24.719999  138.100006
2016-07-06  115.489998  106.180000  115.419998  43.450001  19.15  19.219999  25.000000  137.520004
2016-07-05  115.239998  105.599998  115.059998  43.150002  19.10  19.120001  25.100000  135.820007
In [5]:
# Correlation matrix for all independent variables
print(round(data.corr(),4))
        cust    idst    util    tech    fina    engy    rest    care
cust  1.0000  0.9701  0.9183  0.9178  0.8378  0.9270  0.9375  0.8985
idst  0.9701  1.0000  0.8540  0.9275  0.8125  0.9452  0.8920  0.8813
util  0.9183  0.8540  1.0000  0.7680  0.7917  0.8646  0.8711  0.8449
tech  0.9178  0.9275  0.7680  1.0000  0.7231  0.8231  0.8379  0.7827
fina  0.8378  0.8125  0.7917  0.7231  1.0000  0.8964  0.9318  0.9498
engy  0.9270  0.9452  0.8646  0.8231  0.8964  1.0000  0.9082  0.9431
rest  0.9375  0.8920  0.8711  0.8379  0.9318  0.9082  1.0000  0.9425
care  0.8985  0.8813  0.8449  0.7827  0.9498  0.9431  0.9425  1.0000
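Another standard multicollinearity diagnostic alongside the correlation matrix is the variance inflation factor; a short sketch using statsmodels on the `data` frame above (here computed without an added constant, to match the no-intercept models in this post):

from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF for each column of the design matrix (large values signal collinearity)
for i, col in enumerate(data.columns):
    print(col, variance_inflation_factor(data.values, i))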
In [6]:
# Correlation matrix Heatmap
fig = plt.figure(figsize=(10, 10))

ax = fig.add_subplot(111)
ax.set_title('Independent Values Correlation Matrix\n', fontsize=25)

cax = ax.matshow(data.corr(),cmap=plt.cm.Blues)
cbar = fig.colorbar(cax)
cbar.ax.tick_params(labelsize=15)

xText = ax.set_xticklabels(['']+colTitles, fontsize=15)
yText = ax.set_yticklabels(['']+colTitles, fontsize=15)
In [7]:
# plot the scatter matrix for all independent variable pairs
%matplotlib inline
from pandas.tools.plotting import scatter_matrix  # in newer pandas: from pandas.plotting import scatter_matrix
Axes = scatter_matrix(data, figsize=[15, 15], grid=True)

#y labels
ylabels = [plt.setp(item.yaxis.get_label(), 'size', 20) for item in Axes.ravel()]
#x labels
xlabels = [plt.setp(item.xaxis.get_label(), 'size', 20) for item in Axes.ravel()]
In [8]:
# Run the multiple regression,
# including all independent variables
results = sm.OLS(sp500, data).fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              Adj Close   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 2.156e+06
Date:                Tue, 12 Jul 2016   Prob (F-statistic):          1.27e-312
Time:                        17:38:17   Log-Likelihood:                -406.71
No. Observations:                 131   AIC:                             829.4
Df Residuals:                     123   BIC:                             852.4
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
cust           4.9468      0.800      6.184      0.000         3.363     6.530
idst           5.0153      0.614      8.165      0.000         3.799     6.231
util          -3.4200      0.254    -13.490      0.000        -3.922    -2.918
tech          13.1090      0.926     14.149      0.000        11.275    14.943
fina          28.8440      1.597     18.060      0.000        25.683    32.005
engy           6.5646      1.678      3.912      0.000         3.243     9.886
rest           1.1731      1.559      0.753      0.453        -1.913     4.259
care           0.7703      0.262      2.941      0.004         0.252     1.289
==============================================================================
Omnibus:                        2.033   Durbin-Watson:                   0.652
Prob(Omnibus):                  0.362   Jarque-Bera (JB):                2.030
Skew:                          -0.294   Prob(JB):                        0.362
Kurtosis:                       2.839   Cond. No.                         894.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Observation:

  1. All independents appear statistically significant (p-value < 0.05), except for ‘rest’ (real estate)
  2. R-squared and Adj R-squared are 1.000, probably due to multicollinearity
  3. Only the utilities sector has a negative coefficient, which might make sense
  4. Finance and Technology have the biggest impact, with coefficients of 28.8 and 13.1 respectively
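The same conclusion can be read programmatically from the results object, which is handy when automating the elimination step (a small sketch using the `results` fit above):

# predictors whose p-value exceeds 0.05; this should flag only 'rest'
print(results.pvalues[results.pvalues > 0.05])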
In [9]:
'''
The next step is dimension reduction.

First, we exclude the real estate ETF from the model.
Second, we take care of multicollinearity by removing redundant features.
'''
m_data = data.copy(True)
if 'rest' in m_data.columns:
    del m_data['rest']
m_results = sm.OLS(sp500, m_data).fit()
print(m_results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              Adj Close   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 2.472e+06
Date:                Tue, 12 Jul 2016   Prob (F-statistic):          1.03e-315
Time:                        17:38:17   Log-Likelihood:                -407.01
No. Observations:                 131   AIC:                             828.0
Df Residuals:                     124   BIC:                             848.1
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
cust           5.1518      0.751      6.862      0.000         3.666     6.638
idst           4.9074      0.596      8.231      0.000         3.727     6.087
util          -3.4248      0.253    -13.537      0.000        -3.926    -2.924
tech          13.1870      0.919     14.348      0.000        11.368    15.006
fina          28.6756      1.579     18.165      0.000        25.551    31.800
engy           6.6923      1.667      4.016      0.000         3.394     9.991
care           0.8774      0.220      3.996      0.000         0.443     1.312
==============================================================================
Omnibus:                        2.251   Durbin-Watson:                   0.647
Prob(Omnibus):                  0.325   Jarque-Bera (JB):                2.233
Skew:                          -0.311   Prob(JB):                        0.327
Kurtosis:                       2.848   Cond. No.                         839.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Observation:

  1. All independents now appear statistically significant (p-value < 0.05)
  2. R-squared and Adj R-squared are still 1.000, probably due to multicollinearity
  3. The condition number (Cond. No.) is 839; anything much greater than 30 suggests the model may suffer from multicollinearity
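As a cross-check (hedged; this assumes statsmodels reports the usual 2-norm condition number of the design matrix), numpy should agree with the summary table:

# 2-norm condition number of the design matrix; expected to be roughly 839
print(np.linalg.cond(m_data.values))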
In [10]:
print('Parameters: ', m_results.params)
print('Standard errors: ', m_results.bse)
Parameters:  cust     5.151822
idst     4.907350
util    -3.424835
tech    13.186987
fina    28.675587
engy     6.692319
care     0.877377
dtype: float64
Standard errors:  cust    0.750777
idst    0.596233
util    0.252992
tech    0.919058
fina    1.578639
engy    1.666554
care    0.219547
dtype: float64
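These coefficients define the fitted linear combination directly. As a sanity check (a hedged sketch using the objects above), reconstructing the prediction by hand should match m_results.predict():

# weight each column by its coefficient and sum across columns
manual = (m_data * m_results.params).sum(axis=1)
print(np.allclose(manual.values, m_results.predict()))  # expect True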
In [11]:
# Draw a plot to compare the actual series to the OLS predictions

from statsmodels.sandbox.regression.predstd import wls_prediction_std
# 95% prediction-interval bands (computed here but not drawn in this cell)
prstd, iv_l, iv_u = wls_prediction_std(m_results)

fig, ax = plt.subplots(figsize=(10, 8))
n = len(m_results.predict())
x = np.linspace(1, n, num=n, endpoint=True)
y_pred = m_results.predict()
y_true = sp500['Adj Close']

ax.grid(True)
ax.plot(x, y_pred, '--o', color='blue', label="prediction")
ax.plot(x, y_true, '--o', color='red', label="actual")
ax.legend(loc=3)
Out[11]:
<matplotlib.legend.Legend at 0x7f3e58e21ef0>
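The interval bands from wls_prediction_std are computed but never drawn; a small hedged addition that would overlay them on the same axes:

# overlay the 95% prediction-interval bands on the plot above
ax.plot(x, iv_l, 'g:', label="lower 95% band")
ax.plot(x, iv_u, 'g:', label="upper 95% band")
ax.legend(loc=3)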
In [12]:
# plot differences between predicted SP500 and actual SP500
%matplotlib inline

diff = abs(y_pred - y_true)

fig = plt.figure(figsize=(10, 10))
plt.hist(diff, bins=100, color='blue')
plt.grid(True)
plt.title("Prediction & Actual Abs Difference Histogram", fontsize=15)
plt.xlabel("Prediction & Actual Abs Difference", fontsize=15)
plt.ylabel("Frequency", fontsize=15)
plt.show()

# 99%, 95%, 90% difference
print("99 Percentile : ", np.percentile(diff, 99))
print("95 Percentile : ", np.percentile(diff, 95))
print("90 Percentile : ", np.percentile(diff, 90))
99 Percentile :  13.3476341025
95 Percentile :  10.5842526967
90 Percentile :  8.93604908931

Observation:

  1. The 99th percentile of the absolute difference between predicted and actual SP500 is about 13.35
  2. The 95th percentile is about 10.58
  3. The 90th percentile is about 8.94
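To put these numbers in context (a hedged extra step using `diff` and `y_true` from above), the same percentiles can be expressed relative to the index level:

# absolute difference as a percentage of the actual index level
rel = diff / y_true * 100
print("95th percentile relative error: %.2f%%" % np.percentile(rel, 95))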
In [13]:
'''
To check multicollinearity, we examine the condition numbers from OLSResults.

The condition number measures the sensitivity of a function's output to its input.
When two predictor variables are highly correlated (multicollinearity), the
coefficients of those predictors can fluctuate erratically for small changes in
the data or the model. Ideally, similar models should be similar, i.e., have
approximately equal coefficients. Multicollinearity can also make the numerical
matrix inversion fail or produce inaccurate results. One approach to this problem
in regression is ridge regression, which is available in the sklearn Python module.

We want the condition number to be as small as possible, ideally < 35.

Further improvement:

Here, we use the idea of 'Best Subsets' to decide which dimensions to include,
choosing the model that minimizes multicollinearity.

After taking out the real estate sector, we have 7 independents, giving a total
of 2^7 = 128 subsets of independent combinations.

'''
from itertools import chain, combinations
m_colTitles = set(['cust', 'idst', 'util', 'tech', 'fina', 'engy', 'care'])

# power set is the set of all possible subsets of a set
def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))

# print condition number of OLS regression 
def printCondNum(x, y, x_name):
    print(x_name,': ',sm.OLS(y, x).fit().condition_number)    

for e in powerset(m_colTitles):
    if e: 
        printCondNum(m_data[list(e)], sp500, e)
('fina',) :  1.0
('tech',) :  1.0
('util',) :  1.0
('engy',) :  1.0
('idst',) :  1.0
('cust',) :  1.0
('care',) :  1.0
('fina', 'tech') :  94.7555089627
('fina', 'util') :  139.737696307
('fina', 'engy') :  31.8238949641
('fina', 'idst') :  146.296090409
('fina', 'cust') :  245.830435015
('fina', 'care') :  162.428078262
('tech', 'util') :  78.157229643
('tech', 'engy') :  51.891405788
('tech', 'idst') :  122.496964645
('tech', 'cust') :  181.487725712
('tech', 'care') :  81.3057628439
('util', 'engy') :  137.934241213
('util', 'idst') :  66.0978081326
('util', 'cust') :  78.7625508834
('util', 'care') :  59.830720208
('engy', 'idst') :  167.805482535
('engy', 'cust') :  134.372353739
('engy', 'care') :  231.971298366
('idst', 'cust') :  108.429051083
('idst', 'care') :  68.2879080212
('cust', 'care') :  62.5679873413
('fina', 'tech', 'util') :  233.960408026
('fina', 'tech', 'engy') :  102.956613729
('fina', 'tech', 'idst') :  243.695883536
('fina', 'tech', 'cust') :  263.231563574
('fina', 'tech', 'care') :  277.136826823
('fina', 'util', 'engy') :  149.925899367
('fina', 'util', 'idst') :  215.674044742
('fina', 'util', 'cust') :  364.778308737
('fina', 'util', 'care') :  232.066664928
('fina', 'engy', 'idst') :  217.252295841
('fina', 'engy', 'cust') :  295.037449561
('fina', 'engy', 'care') :  286.328015244
('fina', 'idst', 'cust') :  435.32490785
('fina', 'idst', 'care') :  235.375037481
('fina', 'cust', 'care') :  371.114049686
('tech', 'util', 'engy') :  149.108160649
('tech', 'util', 'idst') :  168.278686141
('tech', 'util', 'cust') :  298.038255966
('tech', 'util', 'care') :  120.225457908
('tech', 'engy', 'idst') :  300.336102416
('tech', 'engy', 'cust') :  203.736249956
('tech', 'engy', 'care') :  248.092236642
('tech', 'idst', 'cust') :  240.141334378
('tech', 'idst', 'care') :  192.092690071
('tech', 'cust', 'care') :  284.688921636
('util', 'engy', 'idst') :  238.592318329
('util', 'engy', 'cust') :  200.293926513
('util', 'engy', 'care') :  301.542664504
('util', 'idst', 'cust') :  138.785808522
('util', 'idst', 'care') :  90.3352098256
('util', 'cust', 'care') :  102.327469061
('engy', 'idst', 'cust') :  299.534755816
('engy', 'idst', 'care') :  314.294642406
('engy', 'cust', 'care') :  311.142830825
('idst', 'cust', 'care') :  143.278671068
('fina', 'tech', 'util', 'engy') :  249.623344105
('fina', 'tech', 'util', 'idst') :  351.819489309
('fina', 'tech', 'util', 'cust') :  400.036183056
('fina', 'tech', 'util', 'care') :  351.644528085
('fina', 'tech', 'engy', 'idst') :  324.407884916
('fina', 'tech', 'engy', 'cust') :  321.216226938
('fina', 'tech', 'engy', 'care') :  393.941908302
('fina', 'tech', 'idst', 'cust') :  458.658692657
('fina', 'tech', 'idst', 'care') :  431.222808803
('fina', 'tech', 'cust', 'care') :  382.87643747
('fina', 'util', 'engy', 'idst') :  327.093290563
('fina', 'util', 'engy', 'cust') :  407.480613823
('fina', 'util', 'engy', 'care') :  413.544417108
('fina', 'util', 'idst', 'cust') :  581.740172375
('fina', 'util', 'idst', 'care') :  286.50644884
('fina', 'util', 'cust', 'care') :  471.47800278
('fina', 'engy', 'idst', 'cust') :  439.250616026
('fina', 'engy', 'idst', 'care') :  544.605561956
('fina', 'engy', 'cust', 'care') :  559.887304597
('fina', 'idst', 'cust', 'care') :  579.151767256
('tech', 'util', 'engy', 'idst') :  419.973705744
('tech', 'util', 'engy', 'cust') :  301.974000808
('tech', 'util', 'engy', 'care') :  321.579515743
('tech', 'util', 'idst', 'cust') :  356.954220258
('tech', 'util', 'idst', 'care') :  226.871006503
('tech', 'util', 'cust', 'care') :  389.050755816
('tech', 'engy', 'idst', 'cust') :  426.298277383
('tech', 'engy', 'idst', 'care') :  562.889070991
('tech', 'engy', 'cust', 'care') :  322.767016834
('tech', 'idst', 'cust', 'care') :  339.664204152
('util', 'engy', 'idst', 'cust') :  438.148239042
('util', 'engy', 'idst', 'care') :  376.32955597
('util', 'engy', 'cust', 'care') :  379.71442485
('util', 'idst', 'cust', 'care') :  174.067700202
('engy', 'idst', 'cust', 'care') :  587.354167337
('fina', 'tech', 'util', 'engy', 'idst') :  440.435601798
('fina', 'tech', 'util', 'engy', 'cust') :  450.723756712
('fina', 'tech', 'util', 'engy', 'care') :  511.704881587
('fina', 'tech', 'util', 'idst', 'cust') :  601.562900722
('fina', 'tech', 'util', 'idst', 'care') :  514.119460387
('fina', 'tech', 'util', 'cust', 'care') :  510.729052271
('fina', 'tech', 'engy', 'idst', 'cust') :  495.581836102
('fina', 'tech', 'engy', 'idst', 'care') :  596.288909794
('fina', 'tech', 'engy', 'cust', 'care') :  577.913273311
('fina', 'tech', 'idst', 'cust', 'care') :  625.181347673
('fina', 'util', 'engy', 'idst', 'cust') :  650.129996313
('fina', 'util', 'engy', 'idst', 'care') :  651.284809417
('fina', 'util', 'engy', 'cust', 'care') :  667.484659647
('fina', 'util', 'idst', 'cust', 'care') :  770.417804407
('fina', 'engy', 'idst', 'cust', 'care') :  661.860853125
('tech', 'util', 'engy', 'idst', 'cust') :  517.641789365
('tech', 'util', 'engy', 'idst', 'care') :  662.32306089
('tech', 'util', 'engy', 'cust', 'care') :  409.715286583
('tech', 'util', 'idst', 'cust', 'care') :  446.595902903
('tech', 'engy', 'idst', 'cust', 'care') :  700.382509326
('util', 'engy', 'idst', 'cust', 'care') :  764.462727712
('fina', 'tech', 'util', 'engy', 'idst', 'cust') :  680.98407618
('fina', 'tech', 'util', 'engy', 'idst', 'care') :  703.21689301
('fina', 'tech', 'util', 'engy', 'cust', 'care') :  691.826769327
('fina', 'tech', 'util', 'idst', 'cust', 'care') :  785.081121907
('fina', 'tech', 'engy', 'idst', 'cust', 'care') :  711.892237773
('fina', 'util', 'engy', 'idst', 'cust', 'care') :  788.174516849
('tech', 'util', 'engy', 'idst', 'cust', 'care') :  825.014208966
('fina', 'tech', 'util', 'engy', 'idst', 'cust', 'care') :  838.755084673

Observation:

  1. If we pick two dimensions, the best combination of independents is (‘engy’, ‘fina’), cond = 31.82
  2. If we pick three, the best combination is (‘util’, ‘care’, ‘idst’), cond = 90.34
  3. If we pick four, the best combination is (‘cust’, ‘util’, ‘care’, ‘idst’), cond = 174.07
  4. If we pick five, the best combination is (‘engy’, ‘tech’, ‘cust’, ‘util’, ‘care’), cond = 409.72
  5. Let’s try multiple regressions for (‘engy’, ‘fina’) and (‘util’, ‘care’, ‘idst’), the two subsets with cond < 100
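Those picks were read off the listing by eye; a short sketch that automates the scan, reusing the powerset helper and objects defined above:

from collections import defaultdict

# track the lowest condition number seen for each subset size
best = defaultdict(lambda: (None, float('inf')))
for e in powerset(m_colTitles):
    if e:
        c = sm.OLS(sp500, m_data[list(e)]).fit().condition_number
        if c < best[len(e)][1]:
            best[len(e)] = (e, c)

for size in sorted(best):
    print(size, best[size])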
In [14]:
# Multiple regression using ('engy', 'fina')
x = m_data[['engy', 'fina']]
y = sp500
print(sm.OLS(y, x).fit().summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              Adj Close   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 3.168e+05
Date:                Tue, 12 Jul 2016   Prob (F-statistic):          8.11e-239
Time:                        17:38:19   Log-Likelihood:                -626.21
No. Observations:                 131   AIC:                             1256.
Df Residuals:                     129   BIC:                             1262.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
engy          32.7003      2.293     14.263      0.000        28.164    37.237
fina          78.1883      2.184     35.793      0.000        73.866    82.510
==============================================================================
Omnibus:                        0.803   Durbin-Watson:                   0.172
Prob(Omnibus):                  0.669   Jarque-Bera (JB):                0.881
Skew:                           0.179   Prob(JB):                        0.644
Kurtosis:                       2.817   Cond. No.                         31.8
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [15]:
# Multiple regression using ('util', 'care', 'idst')
x = m_data[['util', 'care', 'idst']]
y = sp500
print(sm.OLS(y, x).fit().summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              Adj Close   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 1.367e+05
Date:                Tue, 12 Jul 2016   Prob (F-statistic):          3.91e-224
Time:                        17:38:20   Log-Likelihood:                -654.21
No. Observations:                 131   AIC:                             1314.
Df Residuals:                     128   BIC:                             1323.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
util           0.4804      1.081      0.444      0.658        -1.659     2.620
care           1.7575      0.912      1.927      0.056        -0.048     3.563
idst          17.2597      1.253     13.779      0.000        14.781    19.738
==============================================================================
Omnibus:                       15.319   Durbin-Watson:                   0.035
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               17.866
Skew:                           0.886   Prob(JB):                     0.000132
Kurtosis:                       2.638   Cond. No.                         90.3
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Observation:

  1. (‘engy’, ‘fina’) performs better than (‘util’, ‘care’, ‘idst’) on p-values: both of its coefficients are significant, while ‘util’ and ‘care’ are not
  2. (‘engy’, ‘fina’) also performs better on condition number (31.8 vs 90.3)
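As mentioned earlier, ridge regression is another standard remedy for multicollinearity; a minimal hedged sketch using the sklearn linear_model module already imported above (alpha is an illustrative value, not tuned):

# ridge shrinks correlated coefficients instead of dropping predictors
ridge = linear_model.Ridge(alpha=1.0, fit_intercept=False)
ridge.fit(m_data, sp500['Adj Close'])
print(dict(zip(m_data.columns, ridge.coef_)))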

Conclusion:

  1. A lot of preparatory work is needed before we actually run a multiple regression

  2. That pre-work includes:
     - checking the correlation matrix of the independents
     - scatter plots of the independents
     - regressing each independent against the dependent
     - residual analysis

  3. In this example, we used “Best Subsets” to find the OLS model that minimizes multicollinearity while keeping the most statistically significant predictors

