Workshop: CA
As we saw in the last workshop, PCA is used for numerical features, but we may also want to perform dimensionality reduction on categorical features. For categorical data we can use correspondence analysis (CA). Here, we will focus on bivariate CA with a two-way contingency table. As in PCA, the idea in CA is to reduce the dimensionality of a data matrix and visualise it in a subspace of low dimensionality, commonly two or three dimensions. The data of interest in simple CA are usually a two-way contingency table or any other table of nonnegative ratio-scale data for which relative values are of primary interest.
This workshop will take you through the mathematics behind CA and how to use existing R packages for CA.
Toothpaste usage data
Recall the mathematical theory behind correspondence analysis (CA) with the formulas given in the lecture slides for CA. We will now follow the step-by-step procedure using the data on toothpaste usage to produce analyses similar to the ones produced with an existing R package for CA in the lecture notes.
The first step is to read the data into R:
Region1 <- c(5, 5, 15, 15)
Region2 <- c(5, 25, 5, 5)
Region3 <- c(30, 5, 5, 0)
dat <- data.frame(Region1, Region2, Region3)
rownames(dat) <- c("BrandA", "BrandB", "BrandC", "BrandD") # row names are needed for row points
dat
## Region1 Region2 Region3
## BrandA 5 5 30
## BrandB 5 25 5
## BrandC 15 5 5
## BrandD 15 5 0
Next, convert the data frame dat to a table with margins as follows:
cdat <- addmargins(as.table(as.matrix(dat)))
cdat
## Region1 Region2 Region3 Sum
## BrandA 5 5 30 40
## BrandB 5 25 5 35
## BrandC 15 5 5 25
## BrandD 15 5 0 20
## Sum 40 40 40 120
The table cdat is the required contingency table.
The next step is to test whether there is a significant association between the rows and the columns. This can be accomplished by running a chi-square test on the original data.
chisq <- chisq.test(dat) # Notice that this is based on "dat" and not "cdat"
chisq
##
## Pearson's Chi-squared test
##
## data: dat
## X-squared = 79.607, df = 6, p-value = 4.307e-15
Why do you think this test is important? The answer is that we need to establish an association between rows and columns before proceeding to CA. Here the p-value is less than the usual 5% level of significance, so there appears to be a significant association between the row and the column categories.
When the contingency table is not very large (as in cdat), it is easy to inspect and interpret the row and column profiles visually. For example, Brand D is not used at all in Region 3, whereas Brand A is used most often in Region 3.
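If you prefer a numerical check of the profiles, a quick sketch with base R's prop.table function divides each row of counts by its row total:
# Row profiles: each row of counts divided by its row total
round(prop.table(as.matrix(dat), margin = 1), 3)
Each row of the result sums to 1, and the zero in the BrandD/Region3 cell matches the observation above.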
Now we follow the steps one-by-one to perform CA.
- The first step is to find the correspondence matrix \(Z\):
Z <- cdat/sum(dat) # Z = P/n, where n = sum(dat) is the grand total
Z
## Region1 Region2 Region3 Sum
## BrandA 0.04166667 0.04166667 0.25000000 0.33333333
## BrandB 0.04166667 0.20833333 0.04166667 0.29166667
## BrandC 0.12500000 0.04166667 0.04166667 0.20833333
## BrandD 0.12500000 0.04166667 0.00000000 0.16666667
## Sum 0.33333333 0.33333333 0.33333333 1.00000000
Now select the row mass \((r)\) (sum over the columns) and the column mass \((c)\) (sum over the rows) from \(Z\) as follows:
row.mass = Z[1:4, 4] # [1:4,4] means select the 4th column and the first 4 rows corresponding to it
row.mass
## BrandA BrandB BrandC BrandD
## 0.3333333 0.2916667 0.2083333 0.1666667
col.mass = Z[5, 1:3] # [5,1:3] means select the 5th row and the first 3 columns corresponding to it
col.mass
## Region1 Region2 Region3
## 0.3333333 0.3333333 0.3333333
The maximum number of dimensions expected is \(k = \min(I-1, J-1)\), which is \(k = 2\) in this case, as computed below.
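As a small sketch, we can compute \(k\) directly from the dimensions of dat:
# k = min(I - 1, J - 1), with I = 4 brands and J = 3 regions
k <- min(nrow(dat) - 1, ncol(dat) - 1)
k # 2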
- Calculate the matrix of standardized residuals given as \(S= D_{r}^{-1/2}(Z-rc^{T})D_{c}^{-1/2}\), where \(D_{r}\) and \(D_{c}\) are the diagonal matrices obtained from row mass (\(r\)) and column mass (\(c\)). \(S\) can be seen as a standardised \(X\) in the context of PCA.
Dc = diag(col.mass) # turns "c", the column mass vector, into a 3x3 diagonal matrix
Dc
Dr = diag(row.mass) # turns "r", the row mass vector, into a 4x4 diagonal matrix
Dr
Notice how we have used the diag function in R to create the diagonal matrices. Type Dc and Dr in the console to see the output clearly.
The sqrtm() function in the expm package computes the square root of positive definite square matrices:
library(expm)
pr = sqrtm(solve(Dr)) # We first take the inverse of the D_r matrix and then the square root
pc = sqrtm(solve(Dc)) # We first take the inverse of the D_c matrix and then the square root
zz = Z[1:4, 1:3] # remove row/column masses to get a 4 by 3 matrix. That is,
                 # select the first 4 rows and the corresponding first 3 columns
rowtcol = row.mass %*% t(col.mass) # this is r*c^{T} in the formula
S = pr %*% (zz - rowtcol) %*% pc # Weighted distance or matrix of standardised residuals
S
## [,1] [,2] [,3]
## [1,] -0.2083333 -0.20833333 0.4166667
## [2,] -0.1781742 0.35634832 -0.1781742
## [3,] 0.2108185 -0.10540926 -0.1054093
## [4,] 0.2946278 -0.05892557 -0.2357023
The dimension of S is \(4 \times 3\).
- The svd function in R can be used to compute \(S = UD_{\alpha}V^{T}\), where \(U^{T}U = V^{T}V = I\), as follows.
svd_res = svd(S) # "u" from the created object is used to compute the row points
                 # while "v" is used for the column points.
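As a sanity check (an extra step, not from the lecture slides), we can verify numerically that the factors returned by svd reproduce \(S\) and satisfy the orthonormality condition:
round(svd_res$u %*% diag(svd_res$d) %*% t(svd_res$v) - S, 10) # reconstruction error: all (near) zeros
round(t(svd_res$u) %*% svd_res$u, 10) # U^T U = I (and similarly for V)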
The eigenvalues can be used to determine the inertia. Recall that the eigenvalues are the squares of the singular values.
options(scipen=999) # to prevent scientific notation such as 1e+5 instead of 100000.
eig <- round(svd_res$d^2, digits=5) # "round" with "digits=5" rounds to 5 decimal places.
eig
## [1] 0.41026 0.25313 0.00000
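A useful check from CA theory: the total inertia (the sum of the eigenvalues) equals the chi-square statistic divided by the grand total \(n\):
sum(svd_res$d^2) # total inertia: 0.41026 + 0.25313
chisq$statistic/sum(dat) # X-squared/n = 79.607/120, the same value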
What does the zero eigenvalue imply? Simply that the third dimension is redundant and can be removed.
eig <- round(svd_res$d^2, digits=5)[1:2]
variance <- eig*100/sum(eig) # Variances in percentage
cumvar <- cumsum(variance) # Cumulative variances
eig2 <- data.frame(eig = eig, variance = variance,
                   cumvariance = cumvar)
eig2
## eig variance cumvariance
## 1 0.41026 61.84296 61.84296
## 2 0.25313 38.15704 100.00000
We can write R code to make a scree plot from these values (see the barplot function from Workshop: PCA).
#Making screeplot
barplot(eig2[, 2], names.arg=1:nrow(eig2),
main = "Variances",
xlab = "Dimensions",
ylab = "Percentage of variances",
col = "steelblue")
# Add connected line segments to the plot
lines(x = 1:nrow(eig2), eig2[, 2],
type="b", pch=19, col = "red")
- The principal coordinates of the rows can be obtained as \(F = D_{r}^{-1/2}U D_{\alpha}\):
F <- pr%*%(svd_res$u)%*%diag(svd_res$d)
row.coord <- F[1:4, 1:2]
rownames(row.coord) <- rownames(dat)
colnames(row.coord) <- paste0("Dim", 1:2)
row.coord
## Dim1 Dim2
## BrandA -0.8776079 0.1051400
## BrandB 0.3179429 -0.7429492
## BrandC 0.3391090 0.4527749
## BrandD 0.7749294 0.5239126
Check that the row coordinates above agree with the ones we obtained in the lecture slides using the FactoMineR package.
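One way to do this check in the same session is to call CA() directly; this sketch assumes the FactoMineR package is installed. Note that an axis may come out with its sign flipped, which does not change the interpretation.
library(FactoMineR)
res.ca <- CA(dat, graph = FALSE)
res.ca$row$coord # compare with row.coord above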
We can use the code below to plot the corresponding row points (profiles):
xlim <- range(row.coord[,1])*1.2
ylim <- range(row.coord[,2])*1.2
plot(row.coord, pch=19, col = "blue", xlim = xlim, ylim = ylim, main = "Row profile")
text(row.coord+0.025, labels=rownames(row.coord), pos=3,col ="blue")
abline(v=0, h=0, lty = 2)
- We can modify the R code above to obtain the principal coordinates of the columns using \(G = D_{c}^{-1/2}V D_{\alpha}\). We need to remove the redundant dimension before plotting the column points.
# principal coordinates of columns
G <- pc%*%(svd_res$v)%*%diag(svd_res$d)
col.coord <- G[1:3, 1:2]
rownames(col.coord) <- colnames(dat)
colnames(col.coord) <- paste0("Dim", 1:2)
col.coord
## Dim1 Dim2
## Region1 0.5430107 0.56950490
## Region2 0.3563824 -0.65414241
## Region3 -0.8993931 0.08463752
xlim <- range(col.coord[,1])*1.2
ylim <- range(col.coord[,2])*1.2
plot(col.coord, pch=17, col = "red", xlim = xlim, ylim = ylim, main="Column profile")
text(col.coord, labels =rownames(col.coord), pos=3, col ="red")
abline(v=0, h=0, lty = 2)
Compare these results to the ones in the lecture slides obtained directly using the CA() function in R.
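Continuing the cross-check sketch above, the column coordinates from FactoMineR can be compared in the same way (again up to a possible sign flip of an axis):
res.ca$col$coord # compare with col.coord above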
We can now make a biplot to represent both the row and column points. We will use red for the column points and text, and blue for the row points.
#Symmetric biplot of rows and columns to view the association
xlim <- range(c(row.coord[,1], col.coord[,1]))*1.2
ylim <- range(c(row.coord[,2], col.coord[,2]))*1.2
# Plot of rows
plot(row.coord, pch=19, col = "blue", xlim = xlim, ylim = ylim, main="Symmetric biplot")
text(row.coord, labels =rownames(row.coord), pos = 3, col ="blue")
# plot of columns
points(col.coord, pch=17, col = "red")
text(col.coord, labels =rownames(col.coord), pos = 3, col ="red")
abline(v=0, h=0, lty = 2)
- The standard coordinates of the rows are given as \(X = D_{r}^{-1/2}U\):
X <- (pr%*%svd_res$u)[, 1:2]
rownames(X) <- rownames(dat)
X
- The standard coordinates of the columns are given as \(Y = D_{c}^{-1/2}V\):
Y <- (pc%*%svd_res$v)[, 1:2]
rownames(Y) <- colnames(dat)
Y
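Unlike the principal coordinates, the standard coordinates are scaled to be orthonormal with respect to the masses, that is, \(X^{T}D_{r}X = I\) and \(Y^{T}D_{c}Y = I\). A quick numerical check:
round(t(X) %*% Dr %*% X, 5) # 2 x 2 identity matrix
round(t(Y) %*% Dc %*% Y, 5) # 2 x 2 identity matrix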
We mentioned in the lecture that an asymmetric biplot is needed to interpret the distances between column and row profiles. For example, in an asymmetric row biplot the rows are plotted in principal coordinates and the columns in standard coordinates. The code below plots the asymmetric row biplot for the toothpaste usage data:
#Asymmetric biplot
xlim <- range(c(row.coord[,1], Y[,1]))*1.2
ylim <- range(c(row.coord[,2], Y[,2]))*1.2
# Plot of rows
plot(row.coord, pch=19, col = "blue", xlim = xlim, ylim = ylim, main="Asymmetric biplot")
text(row.coord, labels =rownames(row.coord), pos = 3, col ="blue")
# plot of columns
points(Y, pch=17, col = "red")
text(Y, labels =rownames(Y), pos = 3, col ="red")
abline(v=0, h=0, lty = 2)
# add arrows
for(i in 1:dim(row.coord)[1]){
arrows(x0=0, y0=0, x1=row.coord[i, 1], y1=row.coord[i, 2], col=4, length=0.1)
}
for(i in 1:dim(Y)[1]){
arrows(x0=0, y0=0, x1=Y[i, 1], y1=Y[i, 2], col=2, length=0.1)
}
The interpretation of the plot is the same as what was given in the lecture slides (the strength of association is inversely proportional to the angle between the arrows).
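To make the angle interpretation concrete, here is a small sketch; the cosang helper below is our own, not from a package. A cosine near 1 means a small angle and hence a strong association, while a negative cosine indicates a negative association:
cosang <- function(a, b) sum(a*b)/(sqrt(sum(a^2))*sqrt(sum(b^2)))
cosang(row.coord["BrandD", ], Y["Region1", ]) # close to 1: BrandD is associated with Region1
cosang(row.coord["BrandD", ], Y["Region3", ]) # negative: BrandD is not used in Region3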
- The contribution of the rows to the dimensions can be calculated as \[\mathrm{contrib} = \frac{\mathrm{row.mass} \times \mathrm{row.coord}^{2}}{\mathrm{eigenvalue}},\] expressed as a percentage in the code below.
cc <- apply(row.coord^2, 2, "*", row.mass)
row.contrib <- t(apply(cc, 1, "/", eig))*100
row.contrib
## Dim1 Dim2
## BrandA 62.577844 1.455696
## BrandB 7.186641 63.600632
## BrandC 5.839534 16.872513
## BrandD 24.395732 18.072691
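The contributions of the columns follow from the same formula, swapping in the column masses and column coordinates; a sketch mirroring the code above:
cc.col <- apply(col.coord^2, 2, "*", col.mass)
col.contrib <- t(apply(cc.col, 1, "/", eig))*100
col.contrib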
We have now seen how to work through simple correspondence analysis step by step. In the computer practical, we will see how to perform CA using a ready-made package in R. The two prominent options are CA() [FactoMineR package] and ca() [ca package].
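As a preview, a minimal sketch with the ca package (assuming it is installed) reproduces the whole analysis in a few lines:
library(ca)
res <- ca(dat)
summary(res) # the principal inertias should match "eig" above
plot(res) # symmetric map, comparable to the symmetric biplot above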