Contents
Introduction
In this blog post, we will go through the steps of quickly creating multiple bar plots with standard error and lettering using R. We will perform an Analysis of Variance (ANOVA) and a Least Significant Difference (LSD) test to understand the effects of different nitrogen (N) rates on rice. Finally, we will create visualizations to present the results. Let’s dive in!
Load Necessary Libraries
We begin by loading the required libraries. These libraries will help us with data manipulation, statistical analysis, and visualization.
library(ggplot2)
library(dplyr)
library(tidyr)
library(agricolae)
library(reshape2)
library(readxl)
Import the Data Set
Next, we import the dataset from an Excel file named “data_rice.xlsx”. We display the first few rows and the column names to understand the structure of the data.
df <- read_excel(path = "data_rice.xlsx", col_names = TRUE)
head(df)
## # A tibble: 6 × 5 ## rep Nrate Heading Maturity Yield ## <dbl> <chr> <dbl> <dbl> <dbl> ## 1 1 Control (No Urea) 54.2 26.5 2.88 ## 2 2 Control (No Urea) 78.3 27.0 3.03 ## 3 3 Control (No Urea) 72.7 26.7 2.76 ## 4 1 Urea 25 kg/ha 72.5 36.6 3.48 ## 5 2 Urea 25 kg/ha 84.3 32.1 3.17 ## 6 3 Urea 25 kg/ha 75.3 34.8 2.93
names(df)
## [1] "rep" "Nrate" "Heading" "Maturity" "Yield"
Convert Variables to Factors
We convert the columns rep and Nrate to factors because they represent categorical variables in our analysis.
df$rep <- as.factor(df$rep)
df$Nrate <- as.factor(df$Nrate)
str(df)
## tibble [15 × 5] (S3: tbl_df/tbl/data.frame) ## $ rep : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 3 1 2 3 1 ... ## $ Nrate : Factor w/ 5 levels "Control (No Urea)",..: 1 1 1 3 3 3 4 4 4 5 ... ## $ Heading : num [1:15] 54.2 78.3 72.7 72.5 84.3 ... ## $ Maturity: num [1:15] 26.5 27 26.7 36.6 32.1 ... ## $ Yield : num [1:15] 2.88 3.03 2.76 3.48 3.17 ...
Define Response Variables
We define response_vars as all columns from the third to the last column in the dataset. These columns represent different growth and yield characteristics.
response_vars <- colnames(df)[3:ncol(df)]
Analysis of Variance (ANOVA)
We are performing ANOVA (Analysis of Variance) for multiple response variables stored in a list called response_vars
. We first initialize an empty list called anova_result
to store the ANOVA results for each response variable. The code then iterates over each response variable in response_vars
. If a response variable contains a space in its name, it is wrapped in backticks to ensure it is correctly interpreted within the formula.
A formula is then dynamically constructed for each response variable, where the response variable is modeled as a function of two factors: “rep” and “Nrate”. This formula is passed to the aov
function, which performs the ANOVA using the data frame df
. The resulting ANOVA object is stored in the anova_result
list, with the response variable name as the key. For each response variable, the code prints a message indicating that ANOVA is being performed, followed by the summary of the ANOVA results. This process allows us to systematically conduct and review ANOVA for multiple response variables in a dataset.
anova_result <- list()
for (i in response_vars) {
if(grepl(" ", i)) {
i <- paste0("`", i, "`")
}
formula <- as.formula(paste(i, "~", "rep", "+", "Nrate"))
anova_result[[i]] <- aov(formula, data = df)
print(paste("ANOVA for", i))
print(summary(anova_result[[i]]))
}
## [1] "ANOVA for Heading" ## Df Sum Sq Mean Sq F value Pr(>F) ## rep 2 345.8 172.88 3.147 0.0981 . ## Nrate 4 1211.6 302.90 5.514 0.0198 * ## Residuals 8 439.5 54.94 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## [1] "ANOVA for Maturity" ## Df Sum Sq Mean Sq F value Pr(>F) ## rep 2 26.2 13.08 1.456 0.28879 ## Nrate 4 507.3 126.82 14.126 0.00107 ** ## Residuals 8 71.8 8.98 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## [1] "ANOVA for Yield" ## Df Sum Sq Mean Sq F value Pr(>F) ## rep 2 0.160 0.0802 0.377 0.69737 ## Nrate 4 7.487 1.8716 8.801 0.00501 ** ## Residuals 8 1.701 0.2127 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Least Significant Difference (LSD) Test
Then we conducted post-hoc analysis using the Least Significant Difference (LSD) test for multiple response variables stored in response_vars
. We first initialize an empty list called lsd_result
to store the LSD test results for each response variable. The code iterates over each response variable in response_vars
, and if a variable’s name contains a space, it is enclosed in backticks to ensure it is correctly processed. For each response variable, the LSD test is performed using the corresponding ANOVA result from the previously stored anova_result
list, focusing on the treatment factor “Nrate” with a significance level (alpha) of 0.05. The test results, including means and groups, are then merged into a single data frame.
The standard error (SE) for each response variable is calculated by dividing the standard deviation by the square root of the number of replicates. The resulting data frame is stored in the lsd_result
list, with each data frame’s column containing the response variable values, means, groups, and standard errors. The column name for the response variable is standardized to “response”. Finally, the results for each response variable are printed. This process ensures a detailed post-hoc analysis following ANOVA, allowing for a clear comparison of treatment means.
lsd_result <- list()
for (i in response_vars) {
if(grepl(" ", i)) {
i <- paste0("`", i, "`")
}
lsd_result[[i]] <- LSD.test(anova_result[[i]], trt = "Nrate", alpha = 0.05)
lsd_result[[i]] <- merge(x = lsd_result[[i]]$means[1:3], y = lsd_result[[i]]$groups[2], by.x = "row.names", by.y = "row.names", all.x = FALSE)
lsd_result[[i]]$SE <- lsd_result[[i]]$std / sqrt(lsd_result[[i]]$r)
lsd_result[[i]] <- data.frame(lsd_result[[i]])
colnames(lsd_result[[i]])[2] <- "response"
print(lsd_result[[i]])
}
## Row.names response std r groups SE ## 1 Control (No Urea) 68.44509 12.609047 3 c 7.279837 ## 2 Urea 100 kg/ha 87.97074 5.763612 3 ab 3.327623 ## 3 Urea 25 kg/ha 77.39595 6.182959 3 bc 3.569733 ## 4 Urea 50 kg/ha 80.05120 4.757606 3 bc 2.746805 ## 5 Urea 75 kg/ha 94.64645 11.813072 3 a 6.820281 ## Row.names response std r groups SE ## 1 Control (No Urea) 26.74720 0.2452797 3 d 0.1416123 ## 2 Urea 100 kg/ha 39.30555 3.5689391 3 ab 2.0605280 ## 3 Urea 25 kg/ha 34.49546 2.2648864 3 bc 1.3076328 ## 4 Urea 50 kg/ha 33.48755 5.2644154 3 c 3.0394116 ## 5 Urea 75 kg/ha 44.04475 1.8292321 3 a 1.0561076 ## Row.names response std r groups SE ## 1 Control (No Urea) 2.890565 0.1376631 3 b 0.07947983 ## 2 Urea 100 kg/ha 2.710516 0.3322514 3 b 0.19182546 ## 3 Urea 25 kg/ha 3.191620 0.2721221 3 b 0.15710978 ## 4 Urea 50 kg/ha 2.861858 0.7981537 3 b 0.46081427 ## 5 Urea 75 kg/ha 4.636202 0.3007376 3 a 0.17363091
Prepare Data for Plotting
Next, we are combining the results of the LSD tests for multiple response variables into a single data frame for further analysis or visualization. First, we use do.call()
to merge all individual data frames stored in lsd_result
into one large data frame called combined_df
. We then add a new column, variables
, to combined_df
to store the row names, which represent the response variable names. The original row names are subsequently removed using rownames(combined_df) <- NULL
. The first column of combined_df
is renamed to “Nrate” to reflect the treatment levels.
Next, we clean up the variables
column by removing any characters following a period (.) using sub()
function, which simplifies the variable names. Additionally, backticks are removed from the variable names using gsub()
function.
Finally, the Nrate
column is converted to a factor with specific levels to ensure that the treatment levels are ordered meaningfully: “Control (No Urea)”, “Urea 25 kg/ha”, “Urea 50 kg/ha”, “Urea 75 kg/ha”, and “Urea 100 kg/ha”. This transformation facilitates easier plotting and interpretation of the results. The resulting combined_df
contains the LSD test results for all response variables, with standardized treatment levels and cleaned variable names, ready for further analysis.
combined_df <- do.call(rbind, lsd_result)
combined_df$variables <- rownames(combined_df)
rownames(combined_df) <- NULL
colnames(combined_df)[1] <- "Nrate"
combined_df$variables <- sub("\\..*", "", combined_df$variables)
combined_df$variables <- gsub("`", "", combined_df$variables)
combined_df$Nrate <- factor(combined_df$Nrate, levels = c(
"Control (No Urea)",
"Urea 25 kg/ha",
"Urea 50 kg/ha",
"Urea 75 kg/ha",
"Urea 100 kg/ha"
))
Create the Bar Plot
Next, we created a detailed and customized bar plot using the ggplot2
package to visualize the results of the LSD tests for different response variables. The code plots the combined_df
data frame, specifically the rows where variables
match the names in response_vars
.
- Setting up the plot: The
ggplot
function initializes the plot with the data filtered to include only the relevant response variables. Theaes
function sets up the aesthetic mappings, specifyingNrate
on the x-axis,response
on the y-axis, and filling the bars based onNrate
. - Adding bars: The
geom_bar
function creates bar plots with the specified aesthetics, drawing bars with black borders and setting their position to dodge each other for clarity. The width of the bars is set to 0.9. - Adding error bars: The
geom_errorbar
function adds error bars to each bar, representing the standard error (SE
). The error bars are positioned to align with the bars and have a width of 0.10. - Adding text labels: The
geom_text
function adds text labels above each bar, displaying the group labels. The labels are positioned slightly above the error bars for better visibility. - Customizing labels and theme: The
labs
function sets empty titles and labels for the x-axis and fill legend, while the y-axis label is set to “Growth and yield characteristics”. Thetheme
function customizes the axis text to rotate the x-axis labels by 45 degrees for better readability and sets other aesthetic elements like legend position, strip placement, and panel spacing. - Faceting: The
facet_wrap
function splits the plot into multiple panels, one for each response variable. Each panel has its own y-axis scale and is arranged in a single column. - Further customization: The plot’s theme is further customized to remove major and minor grid lines, set the panel background to blank, and ensure the axis lines are clearly visible. The
scale_y_continuous
function ensures the y-axis scales appropriately with a slight expansion for better spacing. - Customizing fill colors: The
scale_fill_manual
function applies a specific color palette from theggsci
package, using the “Journal of Clinical Oncology” (JCO) color palette for the bars.
This comprehensive code creates a clear, well-labeled, and aesthetically pleasing visualization of the LSD test results, making it easy to compare the effects of different Nrate
treatments on various response variables.
plot <- ggplot(combined_df[combined_df$variables %in% response_vars, ],
aes(x = factor(Nrate),
y = response,
fill = factor(Nrate))) +
geom_bar(stat = "identity",
color = "black",
position = position_dodge(width = 0.5),
width = 0.9) +
geom_errorbar(aes(ymin = response - SE,
ymax = response + SE),
position = position_dodge(width = 0.9),
width = 0.10) +
geom_text(aes(x = Nrate,
y = response + SE,
label = as.matrix(groups)),
position = position_dodge(width = 0.9),
vjust = -0.5,
hjust = 0.5) +
labs(title = "",
x = "",
y = "Growth and yield characteristics",
fill = "") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
facet_wrap(facets = ~variables,
ncol = 1,
scales = "free_y",
switch = "y") +
theme(legend.position = "right",
strip.placement = "outside",
strip.background = element_blank(),
panel.spacing = unit(1, "lines")) +
scale_y_continuous(expand = expansion(mult = c(0.2, 0.2))) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line.y = element_line(color = "black", size = 0.5),
axis.line.x = element_line(color = "black", size = 0.5)) +
scale_fill_manual(values = ggsci::pal_jco()(5))
plot
Customize Labels and Update Plot
Then we defined a custom labeller using the as_labeller
function from the ggplot2
package to create more descriptive and formatted facet labels for our plot. The custom labeller is created as a named vector that maps the original variable names to more detailed and readable labels. Specifically, the custom_labeller
object associates:
- “Heading” with the label “Heading~(days)”
- “Maturity” with the label “Maturity~(days)”
- “Yield” with the label “Yield~(t~ha^{-1})”
The label_parsed
argument is used to ensure that the labels are parsed as expressions, allowing the inclusion of mathematical notation and formatting in the labels. For example, t~ha^{-1}
represents tons per hectare with appropriate superscripting. These custom labels enhance the readability and presentation quality of the facet labels in the ggplot, making the plot more informative and visually appealing.
custom_labeller <- as_labeller(c(
Heading = "Heading~(days)",
Maturity = "Maturity~(days)",
Yield = "Yield~(t~ha^{-1})"
), label_parsed)
Finally, we customize the facet labels for better readability and update the plot accordingly. We updated the labels in facet_wrap() function to use the above modified labels. We display the updated plot using the below code.
plot + facet_wrap(facets = ~variables,
ncol = 1,
scales = "free_y",
switch = "y",
labeller = custom_labeller)
👉 For more details and informative videos 📺, you can also subscribe to our YouTube Channel AGRON Info Tech.
Download R program and R studio —
Click_here
Download data file —
Click_here