AGRON INFO TECH

Different ways to compute summary statistics in R

Summary statistics describe the main characteristics of a dataset. In R, several built-in functions and add-on packages can be used to calculate them. In this post, we will discuss some of the most commonly used functions for calculating summary statistics in R.

Loading iris data

In R, you can load the iris dataset by calling the data() function with the name of the dataset as its argument. This makes the iris dataset available in your R session. Once the dataset is loaded, you can view its first few rows with the head() function, which prints the first 6 rows by default. The dataset contains the variables “Sepal.Length”, “Sepal.Width”, “Petal.Length”, “Petal.Width”, and “Species”.

data("iris")
head(iris)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa

Descriptive statistics

Using base package

In R, summary() is a built-in generic function that summarizes a data object. Its main argument is the object to summarize, typically a data frame or a matrix, although it also works with other types of objects such as vectors, factors, and fitted models.

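The summary() function provides a quick way to obtain the main descriptive statistics for a dataset: the minimum and maximum values, the first and third quartiles, the median, and the mean. For categorical variables, it provides the frequency distribution of the levels. In the code below, str() first shows the structure of the data, and iris[-5] drops the fifth column (Species) so that only the numeric variables are summarized.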

str(iris)
# 'data.frame': 150 obs. of  5 variables:
#  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris[-5])
#   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
#  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
#  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
#  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
#  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
#  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
#  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
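
The call above drops Species, so the frequency counts that summary() produces for categorical variables do not appear in its output. A minimal sketch of how summary() treats the factor column on its own (the name species_counts is only for illustration):

```r
# summary() on a factor returns the number of observations per level
species_counts <- summary(iris$Species)
species_counts

# table() reports the same counts
table(iris$Species)
```

Calling summary(iris) on the full data frame combines both behaviours: the numeric columns get the six-number summary and Species gets its level counts.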

Using psych package

The describe() function is part of the psych package in R, which provides a variety of functions for psychological and psychometric research. The describe() function is used to obtain descriptive statistics for a data frame or matrix, including measures of central tendency, variability, skewness, and kurtosis.

library(psych)
describe(iris)
#              vars   n mean   sd median trimmed  mad min max range  skew
# Sepal.Length    1 150 5.84 0.83   5.80    5.81 1.04 4.3 7.9   3.6  0.31
# Sepal.Width     2 150 3.06 0.44   3.00    3.04 0.44 2.0 4.4   2.4  0.31
# Petal.Length    3 150 3.76 1.77   4.35    3.76 1.85 1.0 6.9   5.9 -0.27
# Petal.Width     4 150 1.20 0.76   1.30    1.18 1.04 0.1 2.5   2.4 -0.10
# Species*        5 150 2.00 0.82   2.00    2.00 1.48 1.0 3.0   2.0  0.00
#              kurtosis   se
# Sepal.Length    -0.61 0.07
# Sepal.Width      0.14 0.04
# Petal.Length    -1.42 0.14
# Petal.Width     -1.36 0.06
# Species*        -1.52 0.07
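
Several columns in this output can be reproduced with base R, which may help in reading the table. The sketch below assumes that trimmed is a 10% trimmed mean (the psych default), that mad is the scaled median absolute deviation returned by mad(), and that se is the standard deviation divided by the square root of n. The object names are illustrative.

```r
x <- iris$Sepal.Length

trimmed_mean <- mean(x, trim = 0.1)      # mean after dropping 10% from each tail
mad_value    <- mad(x)                   # median absolute deviation, scaled by 1.4826
std_error    <- sd(x) / sqrt(length(x))  # standard error of the mean

round(c(trimmed = trimmed_mean, mad = mad_value, se = std_error), 2)
```

If these assumptions hold, the rounded values should agree with the Sepal.Length row above (5.81, 1.04, and 0.07).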

Note that the describe() function also provides options for customizing the output, such as adding quantiles with quant, dropping the range statistics with ranges = FALSE, or computing only the basic statistics with fast = TRUE.

describe(
          iris,
          fast = TRUE,
          quant = c(0.25, 0.50, 0.75),
          ranges = FALSE
)
#              vars   n mean   sd   se Q0.25 Q0.5 Q0.75
# Sepal.Length    1 150 5.84 0.83 0.07   5.1 5.80   6.4
# Sepal.Width     2 150 3.06 0.44 0.04   2.8 3.00   3.3
# Petal.Length    3 150 3.76 1.77 0.14   1.6 4.35   5.1
# Petal.Width     4 150 1.20 0.76 0.06   0.3 1.30   1.8
# Species         5 150  NaN   NA   NA    NA   NA    NA

You can also get descriptive statistics grouped by a factor or other categorical variable by passing a formula as the first argument.

describe(
          iris ~ Species,
          fast = TRUE,
          quant = c(0.25, 0.50, 0.75),
          ranges = FALSE,
          omit = TRUE
)
# 
#  Descriptive statistics by group 
# group: setosa
#              vars  n mean   sd   se Q0.25 Q0.5 Q0.75
# Sepal.Length    1 50 5.01 0.35 0.05   4.8  5.0  5.20
# Sepal.Width     2 50 3.43 0.38 0.05   3.2  3.4  3.68
# Petal.Length    3 50 1.46 0.17 0.02   1.4  1.5  1.58
# Petal.Width     4 50 0.25 0.11 0.01   0.2  0.2  0.30
# Species         5 50  NaN   NA   NA    NA   NA    NA
# ------------------------------------------------------------ 
# group: versicolor
#              vars  n mean   sd   se Q0.25 Q0.5 Q0.75
# Sepal.Length    1 50 5.94 0.52 0.07  5.60 5.90   6.3
# Sepal.Width     2 50 2.77 0.31 0.04  2.52 2.80   3.0
# Petal.Length    3 50 4.26 0.47 0.07  4.00 4.35   4.6
# Petal.Width     4 50 1.33 0.20 0.03  1.20 1.30   1.5
# Species         5 50  NaN   NA   NA    NA   NA    NA
# ------------------------------------------------------------ 
# group: virginica
#              vars  n mean   sd   se Q0.25 Q0.5 Q0.75
# Sepal.Length    1 50 6.59 0.64 0.09  6.23 6.50  6.90
# Sepal.Width     2 50 2.97 0.32 0.05  2.80 3.00  3.18
# Petal.Length    3 50 5.55 0.55 0.08  5.10 5.55  5.88
# Petal.Width     4 50 2.03 0.27 0.04  1.80 2.00  2.30
# Species         5 50  NaN   NA   NA    NA   NA    NA

Using sapply function

In R, sapply() applies a given function to each element of a vector or list (here, each column of the data frame) and returns the results as a vector or matrix. Below, each descriptive statistic is computed with sapply() and the results are combined into a data frame with data.frame().

data.frame(
          mean = sapply(iris[-5], mean),
          median = sapply(iris[-5], median),
          sd = sapply(iris[-5], sd),
          # quantile() returns 0%, 25%, 50%, 75%, 100%; row 2 is Q1 and row 4 is Q3
          Q1 = sapply(iris[-5], quantile)[2, ],
          Q3 = sapply(iris[-5], quantile)[4, ],
          min = sapply(iris[-5], min),
          max = sapply(iris[-5], max)
)
#                  mean median        sd  Q1  Q3 min max
# Sepal.Length 5.843333   5.80 0.8280661 5.1 6.4 4.3 7.9
# Sepal.Width  3.057333   3.00 0.4358663 2.8 3.3 2.0 4.4
# Petal.Length 3.758000   4.35 1.7652982 1.6 5.1 1.0 6.9
# Petal.Width  1.199333   1.30 0.7622377 0.3 1.8 0.1 2.5
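
The same table can also be built with a single sapply() call by passing a function that returns all the statistics at once. This is a sketch of one possible alternative, not the approach used above; summarise_column and stats_table are hypothetical names introduced for illustration.

```r
# Helper that returns every statistic for one numeric vector
summarise_column <- function(x) {
  c(
    mean   = mean(x),
    median = median(x),
    sd     = sd(x),
    Q1     = unname(quantile(x, 0.25)),  # unname() drops the "25%" label
    Q3     = unname(quantile(x, 0.75)),
    min    = min(x),
    max    = max(x)
  )
}

# sapply() returns a matrix with one column per variable; t() puts variables in rows
stats_table <- as.data.frame(t(sapply(iris[-5], summarise_column)))
stats_table
```

Compared with the version above, this calls quantile() once per quartile instead of indexing rows of the full quantile matrix, and adding a statistic only requires one more entry in the helper.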

Using aggregate function

The aggregate() function in R applies a function to subsets of a data frame. It splits the data into groups defined by one or more grouping variables (here, Species) and then summarizes each group with a function such as mean, sum, or max.

# Mean
as.data.frame(aggregate(iris[-5], by = iris[5], FUN = 'mean')) 
#      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1     setosa        5.006       3.428        1.462       0.246
# 2 versicolor        5.936       2.770        4.260       1.326
# 3  virginica        6.588       2.974        5.552       2.026
# Median
as.data.frame(aggregate(iris[-5], by = iris[5], FUN = 'median'))
#      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1     setosa          5.0         3.4         1.50         0.2
# 2 versicolor          5.9         2.8         4.35         1.3
# 3  virginica          6.5         3.0         5.55         2.0
# Minimum values
as.data.frame(aggregate(iris[-5], by = iris[5], FUN = 'min'))
#      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1     setosa          4.3         2.3          1.0         0.1
# 2 versicolor          4.9         2.0          3.0         1.0
# 3  virginica          4.9         2.2          4.5         1.4
# Maximum values
as.data.frame(aggregate(iris[-5], by = iris[5], FUN = 'max'))
#      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1     setosa          5.8         4.4          1.9         0.6
# 2 versicolor          7.0         3.4          5.1         1.8
# 3  virginica          7.9         3.8          6.9         2.5
# First quartile (Q1, 25%)
as.data.frame(aggregate(iris[-5], by = iris[5], FUN = quantile, probs = 0.25))
#      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1     setosa        4.800       3.200          1.4         0.2
# 2 versicolor        5.600       2.525          4.0         1.2
# 3  virginica        6.225       2.800          5.1         1.8
# Third quartile (Q3, 75%)
as.data.frame(aggregate(iris[-5], by = iris[5], FUN = quantile, probs = 0.75))
#      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1     setosa          5.2       3.675        1.575         0.3
# 2 versicolor          6.3       3.000        4.600         1.5
# 3  virginica          6.9       3.175        5.875         2.3
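
aggregate() also has a formula interface, which can be more compact than the by = form used above. The following is a minimal sketch; group_means and sepal_stats are illustrative names.

```r
# Formula interface: "." stands for all remaining columns, grouped by Species
group_means <- aggregate(. ~ Species, data = iris, FUN = mean)
group_means

# FUN may return several values; the Sepal.Length column of the result
# is then a matrix with one column per statistic
sepal_stats <- aggregate(
  Sepal.Length ~ Species,
  data = iris,
  FUN = function(x) c(mean = mean(x), sd = sd(x))
)
sepal_stats
```

The first call gives the same group means as the by = version, while the second shows one way to get several statistics per group without repeating the call for each function.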
