Workshop: PCA
There are two commands for performing principal component analysis (PCA) in R: princomp() and prcomp() (see the lecture notes for details). The main goal of PCA in today's tasks is dimensionality reduction: we wish to produce a smaller set of uncorrelated variables from a larger set of correlated variables. The resulting components can then be used as input to other statistical procedures, e.g. principal component regression (PCR).
Iris data
Let us consider the popular Iris data set (available on Blackboard as the csv file iris.csv). You can also load the data in R by typing data(iris) in the console. The data consists of 5 variables:
- Species (with three levels: Setosa, Versicolor, Virginica)
- Petal length (in cm)
- Petal width (in cm)
- Sepal length (in cm)
- Sepal width (in cm)
Now we perform spectral decomposition (eigendecomposition) to obtain the principal component scores and loadings. This is the method implemented in the princomp() function, but here we show the entire process step by step using basic R functions.
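For reference, here is a minimal sketch of the two built-in commands applied directly to the four measurement columns of the iris data (the object names pca_princomp and pca_prcomp are just illustrative); their output can be compared with the manual results we derive below:

pca_princomp <- princomp(iris[, -5], cor = TRUE)  # eigendecomposition of the correlation matrix
pca_prcomp <- prcomp(iris[, -5], scale. = TRUE)   # SVD of the centred and scaled data
summary(pca_princomp)
summary(pca_prcomp)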
Read the data into R and print out the first six rows of the data.
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Look at the summary statistics using:
summary(iris[,-5])  # fifth column (Species) in the data is removed
What can you say about the variables in the data set? We can plot a scatter plot matrix using the pairs.panels function in the psych package:
library(psych)
pairs.panels(iris[,-5],
             method = "pearson",  # correlation method
             hist.col = "#00AFBB",
             density = TRUE,      # show density plots
             ellipses = TRUE      # show correlation ellipses
)
What can you say about the plot? Do the variables share enough information to suggest redundancy?
You can see that the Petal variables are highly correlated (correlation = 0.96). They are also highly correlated with the Sepal length variable. This implies that the three variables share redundant information, and PCA can remove this redundancy, thereby reducing the dimension of the data.
Q. Do we need to use the covariance matrix (\(\Sigma\)) or the correlation matrix (i.e. scale the data)? We can check this by assessing the standard deviations of the variables.
sd <- apply(iris[,-5], 2, sd)
sd
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0.8280661 0.4358663 1.7652982 0.7622377
The standard deviations differ noticeably, so the variables should be scaled (equivalently, the correlation matrix should be used).
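As a small sketch of what scaling means in practice, note that after standardising each column the standard deviations are all 1, and the covariance matrix of the scaled data coincides with the correlation matrix of the original data (the name iris_scaled is just illustrative):

iris_scaled <- scale(iris[, -5])              # centre and scale each column
apply(iris_scaled, 2, sd)                     # all standard deviations are now 1
all.equal(cov(iris_scaled), cor(iris[, -5]))  # TRUE: same matrix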
The spectral decomposition can be obtained using the eigen() function:
S <- cor(iris[,-5])
eigdec <- eigen(S)
eigdec
## eigen() decomposition
## $values
## [1] 2.91849782 0.91403047 0.14675688 0.02071484
##
## $vectors
## [,1] [,2] [,3] [,4]
## [1,] 0.5210659 -0.37741762 0.7195664 0.2612863
## [2,] -0.2693474 -0.92329566 -0.2443818 -0.1235096
## [3,] 0.5804131 -0.02449161 -0.1421264 -0.8014492
## [4,] 0.5648565 -0.06694199 -0.6342727 0.5235971
The values and vectors from the decomposition are, respectively, the variances of the PCs (eigenvalues) and the principal component loadings (eigenvectors).
eig <- eigdec$values
eig
## [1] 2.91849782 0.91403047 0.14675688 0.02071484
sum(eig)
## [1] 4
Q. Why is the sum of these eigenvalues 4?
We have four variables which are now scaled. The variance of each of them is 1. The total variance of all of them is therefore 4.
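A quick numerical check of this statement (a sketch, not part of the workshop output):

sum(apply(scale(iris[, -5]), 2, var))   # total variance of the scaled variables = 4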
Q. What does eigenvalue > 1 mean?
An eigenvalue greater than 1 indicates that the corresponding PC explains more variation than is accounted for by one of the original (standardized) variables.
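As a quick check (sketch), we can list the components whose eigenvalue exceeds 1:

which(eig > 1)   # only the first component has eigenvalue > 1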
The eigenvectors can be obtained as
pc_loading <- eigdec$vectors
rownames(pc_loading) <- colnames(iris[,-5])
pc_loading
## [,1] [,2] [,3] [,4]
## Sepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863
## Sepal.Width -0.2693474 -0.92329566 -0.2443818 -0.1235096
## Petal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492
## Petal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971
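As a cross-check (sketch), the rotation matrix returned by prcomp() with scale. = TRUE should match these eigenvectors, possibly up to an arbitrary sign flip of each column:

prcomp(iris[, -5], scale. = TRUE)$rotation  # compare with pc_loading (signs may differ)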
We can obtain the proportion of variance explained as follows:
# Variances in percentage
eig <- eigdec$values
variance <- eig*100/sum(eig)
# Cumulative variances
cumvar <- cumsum(variance)
eig2 <- data.frame(eig = eig, variance = variance,
                   cumvariance = cumvar)
eig2
## eig variance cumvariance
## 1 2.91849782 72.9624454 72.96245
## 2 0.91403047 22.8507618 95.81321
## 3 0.14675688 3.6689219 99.48213
## 4 0.02071484 0.5178709 100.00000
We can supplement this result with a scree plot to decide the number of dimensions to keep.
barplot(eig2[, 2], names.arg = 1:nrow(eig2),
        main = "Scree plot",
        xlab = "Dimensions",
        ylab = "Percentage of variances",
        col = "steelblue")
# Add connected line segments to the plot
lines(x = 1:nrow(eig2), eig2[, 2],
      type = "b", pch = 19, col = "red")
Based on both the proportion of variance explained and the scree plot, how many dimensions should be kept for the Iris data set? The answer is two: the first two components cumulatively explain about 96% of the variability in the data, which is well beyond the rule of thumb of retaining components that explain 80-90% of the variability in the original data set.
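A small sketch of this rule of thumb in code, using the cumulative percentages computed above and an assumed 90% cut-off:

which(cumvar >= 90)[1]   # smallest number of components reaching 90% of the variance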
Recall that we learned about PC “loadings” and PC “scores” in the lecture. We can derive the PC scores using the following code:
pc_score <- as.matrix(scale(iris[,-5])) %*% pc_loading
colnames(pc_score) <- paste0("PC", 1:4)
pc_score[1:4,]
## PC1 PC2 PC3 PC4
## [1,] -2.257141 -0.4784238 0.1272796 0.02408751
## [2,] -2.074013 0.6718827 0.2338255 0.10266284
## [3,] -2.356335 0.3407664 -0.0440539 0.02828231
## [4,] -2.291707 0.5953999 -0.0909853 -0.06573534
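As a cross-check (sketch), the scores returned by prcomp() with scale. = TRUE should agree with these values, again possibly up to a sign flip of each column, since both use the same (n - 1)-divisor standardisation:

head(prcomp(iris[, -5], scale. = TRUE)$x, 4)  # compare with pc_score[1:4,] (signs may differ)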
We can now make a scatter plot matrix as before using pc_score as input.
Here’s the plot:
library(psych)
pairs.panels(pc_score,
             method = "pearson",  # correlation method
             hist.col = "#00AFBB",
             density = TRUE,      # show density plots
             ellipses = TRUE      # show correlation ellipses
)
The correlations between the PCs (which are the new coordinates) are essentially zero, as required.
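A quick numerical check (sketch):

round(cor(pc_score), 10)   # off-diagonal entries are zero up to rounding error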
From now on, we will keep the first two components. We can use the following code to plot the scores:
pc_score2 <- as.data.frame(pc_score[,1:2])
pc_score2$Species <- iris$Species
rownames(pc_score2) <- seq(1:150)

xlim <- range(pc_score2[,1])*1.1
ylim <- range(pc_score2[,2])*1.1

plot(pc_score2[,1:2], pch = 19, xlim = xlim, ylim = ylim, main = "Plot of PC scores",
     col = c("blue","red","green")[pc_score2$Species])
text(pc_score2[,1:2], labels = rownames(pc_score2[,1:2]),
     pos = 3, col = c("blue","red","green")[pc_score2$Species])
abline(v = 0, h = 0, lty = 2)
Q. Can you deduce the relationship between the scores and the PCs?
Clearly the two species on the right look more similar to each other than to the one on the left. The data look linearly separable (we can draw straight lines that separate the three species). In a biplot, similar observations tend to be grouped together.
We can also plot the PC loadings:
colnames(pc_loading) <- paste0("PC", 1:4)
pc_loading2 <- pc_loading[,1:2]

par(bty = 'n')
xlim <- range(pc_loading2[,1])*1.1
ylim <- range(pc_loading2[,2])*1.1

plot(pc_loading2[,1:2], pch = 19, xlim = xlim, ylim = ylim, main = "Plot of PC loadings")
abline(v = 0, h = 0, lty = 2)
arrows(x0 = 0, y0 = 0, x1 = pc_loading2[1,1],
       y1 = pc_loading2[1,2], col = "red", lwd = 2)
arrows(x0 = 0, y0 = 0, x1 = pc_loading2[2,1],
       y1 = pc_loading2[2,2], col = "blue", lwd = 2)
arrows(x0 = 0, y0 = 0, x1 = pc_loading2[3,1],
       y1 = pc_loading2[3,2], col = "green", lwd = 2)
arrows(x0 = 0, y0 = 0, x1 = pc_loading2[4,1],
       y1 = pc_loading2[4,2], col = "grey", lwd = 2)
text(0.541, -0.477, "Sepal.Length", cex = .6, col = "red")
text(-0.240, -0.99, "Sepal.Width", cex = .6, col = "blue")
text(0.580, -0.05, "Petal.Length", cex = .6, col = "green")
text(0.564, -0.10, "Petal.Width", cex = .6, col = "grey")
As you probably observed, it is not straightforward to interpret the two plots separately. A biplot displays both the principal component scores and the principal component loadings (see the linked reading on biplots for more detail). We can now make this plot and explain our observations.
pc_loading2 <- pc_loading[,1:2]
biplot(pc_score2[,1:2], pc_loading2, xlab = "PC1 (73%)", ylab = "PC2 (23%)")
abline(v = 0, h = 0, lty = 2)
From the plot, as well as from the PC loadings, the first loading vector places approximately equal weight on Sepal length, Petal length and Petal width, with much less weight on Sepal width. The second loading vector places most of its weight on Sepal width. Overall, we see that Sepal length, Petal length and Petal width are located close to each other.
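As an aside (sketch), a comparable display can be produced in one call with R's built-in biplot method for prcomp objects; the signs of the components, and hence the orientation of the plot, may differ:

biplot(prcomp(iris[, -5], scale. = TRUE), xlab = "PC1 (73%)", ylab = "PC2 (23%)")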
The correlations between the original variables and the selected PCs are given by:
cor(iris[,-5], pc_score[,1:2])
## PC1 PC2
## Sepal.Length 0.8901688 -0.36082989
## Sepal.Width -0.4601427 -0.88271627
## Petal.Length 0.9915552 -0.02341519
## Petal.Width 0.9649790 -0.06399985
The output indicates that Sepal length, Petal length and Petal width are highly correlated with PC1 while Sepal width is highly correlated with PC2.
Compare the correlation matrix above with the one below:
t(t(pc_loading)*sqrt(eig))
## PC1 PC2 PC3 PC4
## Sepal.Length 0.8901688 -0.36082989 0.27565767 0.03760602
## Sepal.Width -0.4601427 -0.88271627 -0.09361987 -0.01777631
## Petal.Length 0.9915552 -0.02341519 -0.05444699 -0.11534978
## Petal.Width 0.9649790 -0.06399985 -0.24298265 0.07535950
These are the estimated factor loadings, which will be revisited later in this module under the name of factor analysis. To give you an intuition, recall that \({\Sigma}\) can be rewritten in terms of its \(p\) eigenvalues and eigenvectors, and that this can be approximated using only the first \(q \ll p\) terms: \[\begin{align*} \Sigma & = \sum_{j=1}^p \lambda_j \phi_j \phi_j^\top \\ & \approx \sum_{j=1}^q \lambda_j \phi_j \phi_j^\top = \begin{pmatrix} \sqrt{\lambda_1} \phi_1 & \sqrt{\lambda_2} \phi_2 & \ldots & \sqrt{\lambda_q} \phi_q \end{pmatrix} \begin{pmatrix} \sqrt{\lambda_1} \phi_1^\top \\ \sqrt{\lambda_2} \phi_2^\top \\ \vdots \\ \sqrt{\lambda_q} \phi_q^\top \end{pmatrix} \\ & = {L} {L}^\top \end{align*}\] The corresponding factor model is \[\begin{align*} X_{p \times n} = \mu_{p \times 1} + L_{p \times q} F_{q \times n}+ \epsilon_{p \times n}, \end{align*}\] where \(\epsilon\) is called the idiosyncratic error component and \(LF\) is the common component.
This explains why the \(\sqrt{\lambda_j} \phi_j\) obtained here coincide with the correlations between the variables and the PCs computed above.
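A short sketch of this approximation with q = 2 (the name L is just illustrative): building L from the first two scaled eigenvectors already reproduces the correlation matrix fairly closely.

L <- t(t(pc_loading[, 1:2]) * sqrt(eig[1:2]))  # columns sqrt(lambda_j) * phi_j, j = 1, 2
round(L %*% t(L), 2)        # rank-2 approximation L L^T
round(cor(iris[,-5]), 2)    # compare with the full correlation matrix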
Quality of representation
The quality of representation of the variables is measured by the squared cosine (cos2), i.e. the squared correlations. A high squared correlation indicates a good representation of the variable on the principal component, e.g. Sepal.Length, Petal.Length and Petal.Width are well represented in PC1 while Sepal.Width is well represented in PC2.
cos2 <- (cor(iris[,-5], pc_score[,1:2]))^2
cos2
## PC1 PC2
## Sepal.Length 0.7924004 0.130198208
## Sepal.Width 0.2117313 0.779188012
## Petal.Length 0.9831817 0.000548271
## Petal.Width 0.9311844 0.004095980
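As a sketch of why cos2 is a natural measure, note that across all four components the squared correlations for each variable sum to 1, so the cos2 values above give the share of each (standardised) variable's variance reproduced by the first two components:

rowSums(cor(iris[,-5], pc_score)^2)   # equals 1 for every variable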
Contribution of variables to PCs
Now we look at the contribution of variables to PCs. The contribution of a variable to a given principal component is, in percentage: (cos2 * 100) / (total cos2 of the component). A larger value indicates that the variable contributes more to that PC.
comp.cos2 <- apply(cos2, 2, sum)
comp.cos2  # same as the corresponding eigenvalues
## PC1 PC2
## 2.9184978 0.9140305
contrib2 <- function(cos2, comp.cos2){cos2*100/comp.cos2}
contrib <- t(apply(cos2, 1, contrib2, comp.cos2))
contrib
## PC1 PC2
## Sepal.Length 27.150969 14.24440565
## Sepal.Width 7.254804 85.24748749
## Petal.Length 33.687936 0.05998389
## Petal.Width 31.906291 0.44812296
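A quick sanity check (sketch): by construction, the contributions within each component sum to 100 percent.

colSums(contrib)   # each column sums to 100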
The contribution can be displayed with a barplot:
= c("Petal.Length", "Petal.Width","Sepal.Length", "Sepal.Width")
names1 barplot(contrib[order(contrib[, 1],decreasing = T), 1], names.arg=names1,
main = "Contribution of variables to PC1",
xlab = " ",
ylab = "Percentage of variances",
col ="steelblue",las=2,cex.names=0.7)
abline(h=25, col="red",lty=3, lwd =1)
Notice that the horizontal red line is placed at 25. This is because we have 4 variables and 100/4 = 25. Any variable whose bar reaches 25 or above is considered to contribute substantially to the component. Observe that the interpretations of cor, cos2 and contrib are similar.