AGRON INFO TECH

Creating and Customizing a PCA Biplot Using the ggplot2 and ggrepel Packages

Introduction:

In this tutorial, we will create a PCA biplot using the R packages ggplot2 and ggrepel. Principal Component Analysis (PCA) is a statistical technique for reducing dimensionality and visualizing patterns in high-dimensional data. By the end of this guide, you will be able to load the iris dataset, perform PCA, extract scores and loadings, and generate a biplot with customizable labels and axes.

During the implementation, we ran into one notable challenge: showing the vector loadings on their own secondary X and Y axes. The loadings are much smaller in magnitude than the scores, and changing the axis scales to suit them also changed the primary X and Y scales, which pushed points beyond the plot margin. To keep a separate scale for the loadings, we multiply xend and yend by a factor of 3 and divide by the same factor on the secondary axes.

Follow the steps below to create and customize a PCA biplot and to interpret the patterns it reveals. Let's get started!

Loading the Required Libraries

To begin, we need to load the necessary R packages. We will be using ggplot2 for plotting and ggrepel for labels that avoid overlapping each other. The arrow() and unit() helpers used later are re-exported by ggplot2, so no other package is required.

library(ggplot2)
library(ggrepel)

Loading the Iris Dataset

The iris dataset is widely used in data analysis and machine learning examples. It contains four measurements (sepal length, sepal width, petal length, and petal width) for 150 iris flowers from three species: setosa, versicolor, and virginica. The dataset ships with R, so we can load it directly into our environment.

# Load iris dataset
data(iris)
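
Before running PCA, it can help to confirm what the data look like. The following is a minimal sketch using only base R; it is not part of the original workflow.

```r
# Load the built-in iris dataset
data(iris)

# Dimensions: rows are flowers, columns are four measurements plus Species
dim(iris)

# Column names and types; Species is a factor, the other columns are numeric
str(iris)

# Number of flowers per species
table(iris$Species)
```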

Running PCA and Extracting Scores and Loadings

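Next, we will perform PCA on the four numeric columns of the iris dataset (iris[-5] drops the Species column) and extract the scores and loadings. Setting scale. = TRUE standardizes each variable to unit variance before the analysis. The scores give the coordinates of the observations on the principal components, and the loadings give the directions of the original variables in that space, so together they let us draw both the points and the arrows of the biplot.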

# Run PCA on the four numeric columns, standardizing each variable
iris_pca <- prcomp(iris[-5], scale. = TRUE)

# Proportion of variance explained by each principal component
prop_var <- iris_pca$sdev^2 / sum(iris_pca$sdev^2)

# Scores: coordinates of the observations on the principal components
scores <- as.data.frame(iris_pca$x)
scores$Species <- iris$Species

# Loadings: one row per original variable
loadings <- as.data.frame(iris_pca$rotation)
loadings$Variable <- rownames(loadings)
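
As a quick check on how much information the first two components retain, the proportions can be printed as percentages and accumulated. This sketch recomputes the PCA so that it runs on its own; the names pct and cum_pct are illustrative and do not appear in the original code.

```r
# Recompute the PCA so this block is self-contained
iris_pca <- prcomp(iris[-5], scale. = TRUE)
prop_var <- iris_pca$sdev^2 / sum(iris_pca$sdev^2)

# Percentage of variance explained by each component
pct <- round(prop_var * 100, 2)
names(pct) <- colnames(iris_pca$x)
pct

# Cumulative percentage across components
cum_pct <- round(cumsum(prop_var) * 100, 2)
names(cum_pct) <- colnames(iris_pca$x)
cum_pct

# summary() reports the same quantities in one table
summary(iris_pca)
```

For the standardized iris data, PC1 and PC2 account for roughly 73% and 23% of the variance, so a two-dimensional biplot captures most of the structure.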

Creating the Biplot

Now, we will use ggplot2 to create the biplot. The biplot combines a scatter plot of the data points with arrows indicating the directions and magnitudes of the loadings. This provides a visual representation of the relationships between the variables and the observations.

# Create biplot: scores as points, loadings as arrows from the origin
biplot <- ggplot(data = scores, aes(x = PC1, y = PC2)) +
  geom_point(aes(color = Species), size = 2, shape = 19) +
  geom_segment(
    data = loadings,
    aes(x = 0, y = 0, xend = PC1, yend = PC2),
    arrow = arrow(length = unit(0.3, "cm"), type = "open", angle = 25),
    # linewidth replaces size for lines in ggplot2 3.4.0 and later
    linewidth = 1, color = "darkblue"
  )
biplot

[Figure: PCA biplot]

Creating a Different Secondary Scale for the Vector Loadings

At this point the arrows are short compared with the spread of the points, because each loading lies between -1 and 1 while the scores span several units. Showing the loadings on their own secondary X and Y axes was the notable challenge mentioned in the introduction: changing the axis scales to suit the loadings also changed the primary X and Y scales, so points extended beyond the plot margin. Our solution is to keep the primary scales as they are and multiply xend and yend by a factor of 3. The secondary axes added in the next step divide by the same factor, so they still read as the original loadings.

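# Factor used to stretch the loading arrows onto the score scale
scale <- 3

# Create biplot with the loading arrows multiplied by the scale factor
biplot <- ggplot(data = scores, aes(x = PC1, y = PC2)) +
  geom_point(aes(color = Species), size = 2, shape = 19) +
  geom_segment(
    data = loadings,
    aes(x = 0, y = 0, xend = PC1 * scale, yend = PC2 * scale),
    arrow = arrow(length = unit(0.3, "cm"), type = "open", angle = 25),
    # linewidth replaces size for lines in ggplot2 3.4.0 and later
    linewidth = 1, color = "darkblue"
  )
biplot

[Figure: PCA biplot]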
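
The factor of 3 was chosen by hand. As an alternative, and only as a sketch, a factor could be derived from the data by comparing the largest score with the largest loading on each axis and taking the smaller ratio, so that the longest scaled arrow stays within the range of the points. The names ratio_pc1, ratio_pc2, and auto_scale are hypothetical and not part of the original code.

```r
# Recompute scores and loadings so this block is self-contained
iris_pca <- prcomp(iris[-5], scale. = TRUE)
scores <- as.data.frame(iris_pca$x)
loadings <- as.data.frame(iris_pca$rotation)

# Ratio of the largest absolute score to the largest absolute loading, per axis
ratio_pc1 <- max(abs(scores$PC1)) / max(abs(loadings$PC1))
ratio_pc2 <- max(abs(scores$PC2)) / max(abs(loadings$PC2))

# Use the smaller ratio so scaled arrows stay inside the score range on both axes
auto_scale <- min(ratio_pc1, ratio_pc2)
auto_scale
```

If a value like this is used instead of 3, the same factor has to be applied to the arrows, the labels, and the secondary axes so that all three stay consistent.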

Customizing the Biplot

In this step, we will customize the biplot to improve its readability and appearance. We will set the axis limits and titles, add secondary axes for the loadings, label the loading arrows with their variable names, and adjust the theme.

biplot +
  # Primary axes show the scores; secondary axes undo the scaling of the loadings.
  # Note: limits drop any observation outside the range (ggplot2 warns when this happens).
  scale_x_continuous(
    limits = c(-3, 3),
    name = paste0("PC1 (", round(prop_var[1] * 100, digits = 2), " %)"),
    sec.axis = sec_axis(~ . / scale, name = "Loadings on PC1")
  ) +
  scale_y_continuous(
    limits = c(-3, 3),
    name = paste0("PC2 (", round(prop_var[2] * 100, digits = 2), " %)"),
    sec.axis = sec_axis(~ . / scale, name = "Loadings on PC2")
  ) +
  # Label each loading arrow with its variable name
  geom_label_repel(
    data = loadings,
    aes(label = rownames(loadings), x = PC1 * scale, y = PC2 * scale),
    box.padding = 0.2,
    point.padding = 0.3,
    size = 3,        # Change the font size of the text here
    color = "black", # Change the color of the text here
    arrow = arrow(length = unit(0.3, "cm"), type = "closed", angle = 25),
    force = 4
  ) +
  # Black-and-white theme with the secondary axes colored to match the arrows
  theme_bw() +
  theme(
    axis.title.x.top = element_text(color = "darkblue"),
    axis.title.y.right = element_text(color = "darkblue"),
    axis.text.x.top = element_text(color = "darkblue"),
    axis.ticks.x.top = element_line(color = "darkblue"),
    axis.text.y.right = element_text(color = "darkblue"),
    axis.ticks.y.right = element_line(color = "darkblue"),
    legend.position = "top"
  ) +
  labs(title = "Biplot - PCA")

[Figure: PCA biplot]
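
One caveat of setting limits inside scale_x_continuous() and scale_y_continuous() is that ggplot2 removes observations falling outside the range and issues a warning. A quick way to see whether that affects this plot is to count the scores beyond the chosen limits. The sketch below uses base R only; lim, outside, and n_outside are illustrative names, not part of the original code.

```r
# Recompute the scores so this block is self-contained
iris_pca <- prcomp(iris[-5], scale. = TRUE)
scores <- as.data.frame(iris_pca$x)

# Same limits as in the plot: c(-3, 3) on both axes
lim <- 3

# Flag observations that fall outside the limits on either axis
outside <- abs(scores$PC1) > lim | abs(scores$PC2) > lim
n_outside <- sum(outside)
n_outside

# Range of the scores, useful for choosing wider limits
range(scores$PC1)
range(scores$PC2)
```

If the count is greater than zero, the limits can be widened. Another option is to drop the limits argument from the scale calls and zoom with coord_cartesian(xlim = , ylim = ), which changes the visible region without discarding data.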

Conclusion:

In this tutorial, we have learned how to create a PCA biplot using the ggplot2 and ggrepel packages in R, including how to show the loadings on secondary axes with their own scale. By following these steps, you can visualize and analyze patterns in your data using principal component analysis. Experiment with different datasets and explore the insights provided by biplots. Happy analyzing!

