Summary statistics are used to summarize the main characteristics of a dataset. In R, there are several built-in functions that can be used to calculate summary statistics for a given dataset. In this post, we will discuss some of the most commonly used functions for calculating summary statistics in R.
Loading iris data
In R, you can load the iris dataset by simply calling the data() function and passing in the name of the dataset as a parameter. This will load the iris dataset into R and make it available for use in your R session. Once the dataset is loaded, you can view the first few rows of the dataset using the head() function. This will print out the first 6 rows of the iris dataset, which includes the variables “Sepal.Length”, “Sepal.Width”, “Petal.Length”, “Petal.Width”, and “Species”.
data("iris")
head(iris)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa
Descriptive statistics
Using base package
In R, summary() is a built-in function that reports summary statistics for a data object. Its main argument is the object to summarize, typically a data frame or a matrix, although it also works with other types of objects such as vectors and factors.
For numeric variables, summary() provides a quick way to obtain the minimum, first quartile, median, mean, third quartile, and maximum. For categorical variables, it provides the frequency distribution of the levels. It helps to check the type of each column with str() first; the summary() call that follows drops the fifth column (Species) so that only the numeric variables are summarized.
str(iris)
# 'data.frame': 150 obs. of 5 variables:
#  $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#  $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#  $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#  $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris[-5])
#   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
#  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
#  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
#  Median :5.800   Median :3.000   Median :4.350   Median :1.300
#  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
#  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
#  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
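The example above leaves out the Species column, so the frequency table that summary() produces for categorical variables is not shown. A minimal sketch of that case, using only base R and the iris data already loaded (the variable name species_counts is just an illustration):

```r
# summary() on a factor returns the number of observations per level
species_counts <- summary(iris$Species)
print(species_counts)

# Calling summary() on the full data frame mixes both behaviours:
# numeric columns get quartiles and the mean, the factor gets level counts
print(summary(iris))
```

For a factor on its own, table(iris$Species) gives the same counts and is often the more common idiom.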
Using psych package
The describe() function is part of the psych package in R, which provides a variety of functions for psychological and psychometric research. The describe() function is used to obtain descriptive statistics for a data frame or matrix, including measures of central tendency, variability, skewness, and kurtosis.
library(psych)
describe(iris)
#              vars   n mean   sd median trimmed  mad min max range  skew
# Sepal.Length    1 150 5.84 0.83   5.80    5.81 1.04 4.3 7.9   3.6  0.31
# Sepal.Width     2 150 3.06 0.44   3.00    3.04 0.44 2.0 4.4   2.4  0.31
# Petal.Length    3 150 3.76 1.77   4.35    3.76 1.85 1.0 6.9   5.9 -0.27
# Petal.Width     4 150 1.20 0.76   1.30    1.18 1.04 0.1 2.5   2.4 -0.10
# Species*        5 150 2.00 0.82   2.00    2.00 1.48 1.0 3.0   2.0  0.00
#              kurtosis   se
# Sepal.Length    -0.61 0.07
# Sepal.Width      0.14 0.04
# Petal.Length    -1.42 0.14
# Petal.Width     -1.36 0.06
# Species*        -1.52 0.07
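Some of the less familiar columns in this output (se, trimmed, mad) can be approximated with base R, which may help when reading the table. The sketch below is not how psych computes them internally; it assumes that describe() uses a 10% trimmed mean by default and the usual scaled median absolute deviation, and the variable names are chosen here for illustration.

```r
x <- iris$Sepal.Length

n       <- length(x)
se      <- sd(x) / sqrt(n)       # standard error of the mean
trimmed <- mean(x, trim = 0.1)   # 10% trimmed mean (assumed to match describe()'s default)
mad_x   <- mad(x)                # median absolute deviation, scaled by 1.4826

print(round(c(n = n, se = se, trimmed = trimmed, mad = mad_x), 2))
```

The asterisk next to Species in the describe() output marks a non-numeric variable that was converted to numeric codes, so its mean and sd are not meaningful.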
Note that describe() also accepts arguments for customizing the output. In the call below, fast = TRUE skips the slower statistics such as skewness and kurtosis, ranges = FALSE drops the range statistics, and quant requests specific quantiles.
describe(
  iris,
  fast = TRUE,
  quant = c(0.25, 0.50, 0.75),
  ranges = FALSE
)
#              vars   n mean   sd   se Q0.25 Q0.5 Q0.75
# Sepal.Length    1 150 5.84 0.83 0.07   5.1 5.80   6.4
# Sepal.Width     2 150 3.06 0.44 0.04   2.8 3.00   3.3
# Petal.Length    3 150 3.76 1.77 0.14   1.6 4.35   5.1
# Petal.Width     4 150 1.20 0.76 0.06   0.3 1.30   1.8
# Species         5 150  NaN   NA   NA    NA   NA    NA
You can also get descriptive statistics grouped by a factor or other categorical variable using formula mode.
describe(
  iris ~ Species,
  fast = TRUE,
  quant = c(0.25, 0.50, 0.75),
  ranges = FALSE,
  omit = TRUE
)
#
#  Descriptive statistics by group
# group: setosa
#              vars  n mean   sd   se Q0.25 Q0.5 Q0.75
# Sepal.Length    1 50 5.01 0.35 0.05   4.8  5.0  5.20
# Sepal.Width     2 50 3.43 0.38 0.05   3.2  3.4  3.68
# Petal.Length    3 50 1.46 0.17 0.02   1.4  1.5  1.58
# Petal.Width     4 50 0.25 0.11 0.01   0.2  0.2  0.30
# Species         5 50  NaN   NA   NA    NA   NA    NA
# ------------------------------------------------------------
# group: versicolor
#              vars  n mean   sd   se Q0.25 Q0.5 Q0.75
# Sepal.Length    1 50 5.94 0.52 0.07  5.60 5.90   6.3
# Sepal.Width     2 50 2.77 0.31 0.04  2.52 2.80   3.0
# Petal.Length    3 50 4.26 0.47 0.07  4.00 4.35   4.6
# Petal.Width     4 50 1.33 0.20 0.03  1.20 1.30   1.5
# Species         5 50  NaN   NA   NA    NA   NA    NA
# ------------------------------------------------------------
# group: virginica
#              vars  n mean   sd   se Q0.25 Q0.5 Q0.75
# Sepal.Length    1 50 6.59 0.64 0.09  6.23 6.50  6.90
# Sepal.Width     2 50 2.97 0.32 0.05  2.80 3.00  3.18
# Petal.Length    3 50 5.55 0.55 0.08  5.10 5.55  5.88
# Petal.Width     4 50 2.03 0.27 0.04  1.80 2.00  2.30
# Species         5 50  NaN   NA   NA    NA   NA    NA
Using sapply function
In R, sapply() is a function that applies a given function to each element of a vector or list and returns the results as a vector or matrix. Because a data frame is a list of columns, sapply() can compute each descriptive statistic column by column; the results are then combined into a single data frame using the data.frame() function.
data.frame(
  mean = sapply(iris[-5], mean),
  median = sapply(iris[-5], median),
  sd = sapply(iris[-5], sd),
  Q1 = sapply(iris[-5], quantile)[2, ],
  Q3 = sapply(iris[-5], quantile)[4, ],
  min = sapply(iris[-5], min),
  max = sapply(iris[-5], max)
)
#                  mean median        sd  Q1  Q3 min max
# Sepal.Length 5.843333   5.80 0.8280661 5.1 6.4 4.3 7.9
# Sepal.Width  3.057333   3.00 0.4358663 2.8 3.3 2.0 4.4
# Petal.Length 3.758000   4.35 1.7652982 1.6 5.1 1.0 6.9
# Petal.Width  1.199333   1.30 0.7622377 0.3 1.8 0.1 2.5
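The call above runs sapply() once per statistic. An alternative sketch, assuming you prefer a single pass over the columns, wraps all the statistics in one helper and transposes the result. The helper name summarise_column is hypothetical and not part of base R.

```r
# Hypothetical helper: returns a named vector of statistics for one numeric column
summarise_column <- function(x) {
  c(
    mean   = mean(x),
    median = median(x),
    sd     = sd(x),
    Q1     = unname(quantile(x, probs = 0.25)),
    Q3     = unname(quantile(x, probs = 0.75)),
    min    = min(x),
    max    = max(x)
  )
}

# sapply() returns a matrix with one column per variable;
# t() flips it so that each variable becomes a row
stats <- as.data.frame(t(sapply(iris[-5], summarise_column)))
print(stats)
```

This keeps the list of statistics in one place, so adding another measure only requires one extra line in the helper.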
Using aggregate function
The aggregate() function in R is used to apply a function to subsets of data in a data frame. The function takes one or more variables, and groups the data based on those variables. The grouped data is then summarized using a function, such as mean, sum, or max.
# Mean
as.data.frame(aggregate(iris[-5], by = iris[5], FUN = 'mean'))
#      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1     setosa        5.006       3.428        1.462       0.246
# 2 versicolor        5.936       2.770        4.260       1.326
# 3  virginica        6.588       2.974        5.552       2.026
# Median
as.data.frame(aggregate(iris[-5], by = iris[5], FUN = 'median'))
#      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1     setosa          5.0         3.4         1.50         0.2
# 2 versicolor          5.9         2.8         4.35         1.3
# 3  virginica          6.5         3.0         5.55         2.0
# Minimum values
as.data.frame(aggregate(iris[-5], by = iris[5], FUN = 'min'))
#      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1     setosa          4.3         2.3          1.0         0.1
# 2 versicolor          4.9         2.0          3.0         1.0
# 3  virginica          4.9         2.2          4.5         1.4
# Maximum values
as.data.frame(aggregate(iris[-5], by = iris[5], FUN = 'max'))
#      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1     setosa          5.8         4.4          1.9         0.6
# 2 versicolor          7.0         3.4          5.1         1.8
# 3  virginica          7.9         3.8          6.9         2.5
# First quartile (Q1, 25%)
as.data.frame(aggregate(iris[-5], by = iris[5], FUN = quantile, probs = 0.25))
#      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1     setosa        4.800       3.200          1.4         0.2
# 2 versicolor        5.600       2.525          4.0         1.2
# 3  virginica        6.225       2.800          5.1         1.8
# Third quartile (Q3, 75%)
as.data.frame(aggregate(iris[-5], by = iris[5], FUN = quantile, probs = 0.75))
#      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1     setosa          5.2       3.675        1.575         0.3
# 2 versicolor          6.3       3.000        4.600         1.5
# 3  virginica          6.9       3.175        5.875         2.3
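aggregate() also has a formula interface, which some readers find easier to read than the by = argument. The sketch below shows the same grouped means written that way, plus one way to return several statistics in a single call; the object names group_means and sepal_stats are illustrative.

```r
# Formula interface: summarize every other column (".") within each Species
group_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(group_means)

# Several statistics for one variable in a single call.
# FUN returns a named vector, so the result column is a matrix
# with one sub-column per statistic.
sepal_stats <- aggregate(
  Sepal.Length ~ Species,
  data = iris,
  FUN = function(x) c(mean = mean(x), sd = sd(x))
)
print(sepal_stats)
```

Because the second result stores a matrix inside the data frame, wrapping it in do.call(data.frame, sepal_stats) is one way to flatten it into ordinary columns if needed.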