
Principal Component Analysis#

References

  • Eigensteve: Principal Component Analysis

[1]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
import seaborn as sns

Statistical interpretation of SVD#

  • Principal component analysis (PCA) is a popular approach for deriving a low-dimensional set of features from a large set of variables.

  • PCA is a technique for reducing the dimension of an n × p data matrix X. The first principal component direction of the data is the one along which the observations vary the most.

  • When faced with a large set of correlated variables, principal components allow us to summarize this set with a smaller number of representative variables that collectively explain most of the variability in the original set.

  • PCA builds a hierarchical coordinate system for the data.

Each row vector \(x_j\) is a measurement from a single experiment:

\(X = \begin{bmatrix} - & x_1 & -\\ - & x_2 & -\\ & \vdots & \end{bmatrix}\)

  • Compute the mean row \(\bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j\) and stack it into the average matrix \(\bar{X}\):

\(\bar{X} = \begin{bmatrix} 1\\ 1\\ \vdots\\ 1 \end{bmatrix} \begin{bmatrix} - & \bar{x} & - \end{bmatrix}\)

  • Subtract the mean from the data matrix: \(B = X - \bar{X}\)

    • mean-centered data

    • zero-mean Gaussian

  • Covariance matrix \(C = \frac{1}{n-1}{B^T B}\)

C is a Hermitian matrix: a square matrix where \(a_{ij}=\overline{a_{ji}}\). For real data it is simply symmetric.

[2]:
X = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
    [16, 17, 18, 19, 20],
    [21, 22, 23, 24, 25],
])
[3]:
X.T @ X
[3]:
array([[ 855,  910,  965, 1020, 1075],
       [ 910,  970, 1030, 1090, 1150],
       [ 965, 1030, 1095, 1160, 1225],
       [1020, 1090, 1160, 1230, 1300],
       [1075, 1150, 1225, 1300, 1375]])
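As a quick check of the recipe above on this toy matrix, a minimal sketch of the centering and covariance steps (np.cov, with rows treated as observations, computes the same thing):

X_avg_toy = X.mean(axis=0)                  # mean row
B_toy = X - X_avg_toy                       # mean-centered data
C_toy = B_toy.T @ B_toy / (X.shape[0] - 1)  # covariance matrix

np.allclose(C_toy, np.cov(X, rowvar=False)) # True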
  • First principal component: \(u_1 = {\operatorname{argmax}\atop{\|u_1\| = 1}}\; u_1^T B^T B u_1\); it is the eigenvector of \(B^T B\) with the largest eigenvalue

  • \(C V = V D\), where the columns of \(V\) are the eigenvectors and the diagonal of \(D\) holds the eigenvalues

  • \(T = BV\) projects the data onto those eigenvectors, giving the principal components

Decompose \(B = U\Sigma V^T\). Then the principal components are

\(T = BV = U\Sigma V^T V = U\Sigma\)

\(B^*\) is the conjugate transpose of \(B\); when \(B\) is complex, \(B^*\) is used in place of \(B^T\).
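A minimal numerical check of these identities on synthetic data (the underscored names are local to this sketch and not used elsewhere in the notebook):

rng = np.random.default_rng(0)
A_ = rng.normal(size=(100, 3))            # n x p, rows are measurements
B_ = A_ - A_.mean(axis=0)                 # mean-centered data
C_ = B_.T @ B_ / (A_.shape[0] - 1)        # covariance matrix

D_, V_ = np.linalg.eigh(C_)               # C V = V D (eigenvalues ascending)
U_, S_, VT_ = np.linalg.svd(B_, full_matrices=False)

np.allclose(D_[::-1], S_**2 / (A_.shape[0] - 1))  # eigenvalues of C are sigma^2/(n-1): True
np.allclose(B_ @ VT_.T, U_ * S_)                  # T = B V = U Sigma: True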

PCA on Gaussian Distribution#

[4]:
dim = 2 # dimension of data
mu = np.array([2,1]) # mean for both dims
sigma = np.array([2, 0.5]) # standard deviation for both dims
n_points = 10000 # number of samples

# rotation angle
theta = np.pi/3

# rotation matrix
rotation_matrix = np.array([
    [ np.cos(theta), -np.sin(theta) ],
    [ np.sin(theta), np.cos(theta) ]
])

Generating Gaussian data#

[5]:
rndm_data = np.random.randn(dim,n_points) #(2,10000)
[6]:
rndm_data.mean(axis=1),rndm_data.std(axis=1)
[6]:
(array([0.01398408, 0.01466984]), array([0.99288324, 0.99781115]))
[7]:
sns.distplot(rndm_data[0])
sns.distplot(rndm_data[1])
[7]:
<AxesSubplot:>
../_images/MathExploration_PrincipalComponentAnalysis_20_1.png

\(G(\mu,\sigma)\) = Gaussian data with \(\mu\) mean and \(\sigma\) standard deviation

Changing parameters of the Gaussian distribution

  • for 1-D data \(G(\mu,\sigma) = \sigma \times G(0,1) + \mu\)

  • for 2-D data \(G(\mu_{1 \times 2},\sigma_{1 \times 2}) = \begin{bmatrix} \sigma_1 & 0\\ 0 & \sigma_2 \end{bmatrix} \times G(0,1)_{2 \times n} + \begin{bmatrix} \mu_1 & 0\\ 0 & \mu_2 \end{bmatrix} \times \begin{bmatrix} 1 & \dots & 1\\ 1 & \dots & 1 \end{bmatrix}_{2 \times n}\)

Left-multiplying by a diagonal matrix scales each row (dimension) of the other matrix by its own diagonal entry, which is why \(\operatorname{diag}(\sigma)\) and \(\operatorname{diag}(\mu)\) act per-dimension here.

[8]:
g_data = (np.diag(sigma) @ rndm_data) + (np.diag(mu) @ np.ones((dim,n_points)))
[9]:
sns.distplot(g_data[0])
sns.distplot(g_data[1])
[9]:
<AxesSubplot:>
../_images/MathExploration_PrincipalComponentAnalysis_24_1.png
[10]:
plt.plot(g_data[0],g_data[1],'.')
[10]:
[<matplotlib.lines.Line2D at 0x7f03187a04f0>]
../_images/MathExploration_PrincipalComponentAnalysis_25_1.png

Rotate data#

[11]:
# as it is a 2D rotation

r_data = rotation_matrix @ g_data
[12]:
sns.distplot(r_data[0])
sns.distplot(r_data[1])
[12]:
<AxesSubplot:>
../_images/MathExploration_PrincipalComponentAnalysis_28_1.png
[13]:
plt.plot(r_data[0,...],r_data[1,...],'.')
[13]:
[<matplotlib.lines.Line2D at 0x7f031858b220>]
../_images/MathExploration_PrincipalComponentAnalysis_29_1.png
[14]:
plt.plot(g_data[0,...],g_data[1,...],'.')
plt.plot(r_data[0,...],r_data[1,...],'.')
[14]:
[<matplotlib.lines.Line2D at 0x7f03185643a0>]
../_images/MathExploration_PrincipalComponentAnalysis_30_1.png

Calculation#

[15]:
X = r_data
[16]:
X.shape
[16]:
(2, 10000)
[17]:
X_avg = X.mean(axis=1)
[18]:
# broadcast average of X to be subtracted from X
np.tile(X_avg,(n_points,1)).T
[18]:
array([[0.14160645, 0.14160645, 0.14160645, ..., 0.14160645, 0.14160645,
        0.14160645],
       [2.2599394 , 2.2599394 , 2.2599394 , ..., 2.2599394 , 2.2599394 ,
        2.2599394 ]])
[19]:
# or use this
np.diag(X_avg) @ np.ones((2,n_points))
[19]:
array([[0.14160645, 0.14160645, 0.14160645, ..., 0.14160645, 0.14160645,
        0.14160645],
       [2.2599394 , 2.2599394 , 2.2599394 , ..., 2.2599394 , 2.2599394 ,
        2.2599394 ]])
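Broadcasting does the same centering without materializing either matrix; a small equivalent sketch:

# equivalent mean-centering via broadcasting
np.allclose(X - X_avg[:, None], X - np.diag(X_avg) @ np.ones((2, n_points)))  # True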

mean-centered data matrix#

[20]:
B = X - np.diag(X_avg) @ np.ones((2,n_points))
[21]:
B.shape
[21]:
(2, 10000)
[22]:
plt.plot(B[0],B[1],'.')
[22]:
[<matplotlib.lines.Line2D at 0x7f03184c97c0>]
../_images/MathExploration_PrincipalComponentAnalysis_40_1.png
[23]:
sns.distplot(B[0])
sns.distplot(B[1])
[23]:
<AxesSubplot:>
../_images/MathExploration_PrincipalComponentAnalysis_41_1.png

Singular Value Decomposition#

decompose \(B = U \Sigma V^T\)

[24]:
U,S,VT = np.linalg.svd(B/np.sqrt(n_points),full_matrices=False)

U.shape, S.shape, VT.shape
[24]:
((2, 2), (2,), (2, 10000))

calculate \(T = U\Sigma\)

[25]:
principal_comp = U @ np.diag(S)

principal_comp
[25]:
array([[-0.99797602, -0.43129616],
       [-1.71678326,  0.25071495]])
[26]:
# a list of angles used to trace out the unit circle
range_theta = 2 * np.pi * np.arange(0,1,0.01)

range_theta.shape
[26]:
(100,)
[27]:
circle = np.array([np.cos(range_theta), np.sin(range_theta)])

plt.plot(circle[0],circle[1])
plt.plot(2*circle[0],2*circle[1])
plt.plot(3*circle[0],3*circle[1])

plt.show()
../_images/MathExploration_PrincipalComponentAnalysis_48_0.png

Initial principal component range#

[28]:
initial_pc_range = principal_comp @  circle

plt.plot(initial_pc_range[0],initial_pc_range[1],'r.')
[28]:
[<matplotlib.lines.Line2D at 0x7f03182d5be0>]
../_images/MathExploration_PrincipalComponentAnalysis_50_1.png

principal components range with deviation#

[29]:
plt.scatter(X_avg[0],X_avg[1],c='k',marker='o') # center

# plt.plot(X[0],X[1],'.',alpha=0.4)

plt.plot(X_avg[0] + initial_pc_range[0], X_avg[1] + initial_pc_range[1],'--')
plt.plot(X_avg[0] + 2*initial_pc_range[0], X_avg[1] + 2*initial_pc_range[1],'--')
plt.plot(X_avg[0] + 3*initial_pc_range[0], X_avg[1] + 3*initial_pc_range[1],'--')
[29]:
[<matplotlib.lines.Line2D at 0x7f03182440d0>]
../_images/MathExploration_PrincipalComponentAnalysis_52_1.png
[30]:
# rows of U @ diag(S); note the scaled principal directions are its columns (see the next plot)
principal_comp[0,...],principal_comp[1,...]
[30]:
(array([-0.99797602, -0.43129616]), array([-1.71678326,  0.25071495]))

Principal Components#

[31]:
plt.plot(X[0],X[1],'c.',alpha=0.4)

plt.plot(X_avg[0] + initial_pc_range[0], X_avg[1] + initial_pc_range[1],'.-')
plt.plot(X_avg[0] + 2*initial_pc_range[0], X_avg[1] + 2*initial_pc_range[1],'.-')
plt.plot(X_avg[0] + 3*initial_pc_range[0], X_avg[1] + 3*initial_pc_range[1],'.-')


plt.plot([X_avg[0], X_avg[0] + principal_comp[0,0]],[X_avg[1],X_avg[1] + principal_comp[1,0]],'k')
plt.plot([X_avg[0], X_avg[0] + principal_comp[0,1]],[X_avg[1],X_avg[1] + principal_comp[1,1]],'k')
[31]:
[<matplotlib.lines.Line2D at 0x7f031822de50>]
../_images/MathExploration_PrincipalComponentAnalysis_55_1.png

The tips of the principal axes sit at the mean plus the scaled principal directions, i.e. the columns of \(U\Sigma\):

\(p_1 = \bar{x} + \sigma_1 U_1\)

\(p_2 = \bar{x} + \sigma_2 U_2\)

[32]:
# add the mean to each column of U @ diag(S); [:, None] broadcasts it correctly
X_avg[:, None] + (U @ np.diag(S))
[32]:
array([[-0.85636957, -0.28968971],
       [ 0.54315614,  2.51065435]])

Breast Cancer Data Analysis with PCA#

[33]:
from sklearn.datasets import load_breast_cancer
[34]:
dataset = load_breast_cancer()
[35]:
X = dataset.data
[36]:
y = dataset.target
[37]:
X.shape,y.shape
[37]:
((569, 30), (569,))

Naive observations:

  • 569 samples and 30 features

Let's take 3 principal components, just to make it easier to plot a 3D diagram.

[38]:
n_components = 3
[39]:
U, S, VT = np.linalg.svd(X, full_matrices=False)
print(U.shape,S.shape,VT.shape)
(569, 30) (30,) (30, 30)
[40]:
fig,ax = plt.subplots(1,2)

ax[0].axvline(n_components,c='k')
ax[0].plot(np.log(S))

ax[1].axvline(n_components,c='k')
ax[1].plot(np.cumsum(S)/S.sum())

plt.tight_layout()
plt.show()
../_images/MathExploration_PrincipalComponentAnalysis_67_0.png

Looks like the first 3 principal components cover more than 98% of the cumulative sum of singular values. Feels good when things work out automatically.
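One caveat worth noting: the right panel accumulates the raw singular values, while the variance explained by each component is proportional to the squared singular value. A quick check of the stricter measure:

# fraction of total variance captured by the leading components
explained_variance = np.cumsum(S**2) / np.sum(S**2)
explained_variance[:n_components]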

[41]:
T = U[...,:n_components] @ np.diag(S[...,:n_components])
[42]:
%matplotlib inline
fig = plt.figure(figsize=(10,10))

ax = plt.axes(projection='3d')

ax.scatter(T[...,0],T[...,1],T[...,2],c=y)

plt.show()
../_images/MathExploration_PrincipalComponentAnalysis_70_0.png

Visible clusters of 1’s and 0’s with 3 principal components.

Let's try with the mean-centered matrix, just to check whether it gives any different results.

[43]:
Xavg = X.mean(axis=0)

B = X - Xavg
[44]:
U, S, VT = np.linalg.svd(B / np.sqrt(X.shape[0]), full_matrices=False)
print(U.shape,S.shape,VT.shape)
(569, 30) (30,) (30, 30)
[45]:
T = U[...,:n_components] @ np.diag(S[...,:n_components])
[46]:
fig,ax = plt.subplots(1,2)

ax[0].axvline(n_components,c='k')
ax[0].plot(np.log(S))

ax[1].axvline(n_components,c='k')
ax[1].plot(np.cumsum(S)/S.sum())

plt.tight_layout()
plt.show()
../_images/MathExploration_PrincipalComponentAnalysis_76_0.png
[47]:
%matplotlib inline
fig = plt.figure(figsize=(10,10))
ax = plt.axes(projection='3d')

ax.scatter(T[...,0],T[...,1],T[...,2],c=y)

plt.show()
../_images/MathExploration_PrincipalComponentAnalysis_77_0.png

Almost the same; just the rotation changed a little and the data is scaled around the center.
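As a final sanity check (not part of the run above), sklearn's PCA should reproduce the same scores up to per-column sign and the \(\sqrt{n}\) factor we divided out before the SVD; a minimal sketch:

from sklearn.decomposition import PCA

pca = PCA(n_components=n_components, svd_solver='full')
T_sklearn = pca.fit_transform(X)  # sklearn mean-centers X internally

# our T came from the SVD of B / sqrt(n), so rescale; columns may flip sign
np.allclose(np.abs(T_sklearn), np.abs(T * np.sqrt(X.shape[0])))  # True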