Data wrangling is the process of cleaning, transforming, and reshaping data into a more useful format for analysis. R is a popular programming language for data wrangling, with many built-in functions and packages designed specifically for this purpose. Here we shall use some common techniques for data wrangling.
Contents
Description of the data
We shall use sales data in this example. Each column represents a variable related to an individual’s personal information, such as ID, name, age, gender, education, salary, and employment start and end dates.
id
: A numeric variable representing the unique identifier for each observation.name
: A character variable representing the name of each individual.age
: A numeric variable representing the age of each individual.gender
: A categorical variable representing the gender of each individual (Male or Female).education
: A categorical variable representing the level of education of each individual (Graduate, Postgraduate, or Undergraduate).salary
: A numeric variable representing the salary of each individual.start_date
: A date variable representing the start date of employment for each individual.end_date
: A date variable representing the end date of employment for each individual.
The df_missing
and df_duplicate
objects are created by manipulating the original df
object. df_missing
object contains missing values in the third and fifth columns, in rows 3 and 7. df_duplicate
object contains duplicate rows of the original df
data frame with row 1, 3, and 5 repeated.
Importing data
The sales data was imported using read.csv()
function.
# Read the CSV file into a data frame
df <- read.csv("sales_data.csv", header = T)
head(df)
# id name age gender education salary start_date end_date # 1 1 John 25 Male Graduate 50000 2021-01-01 2023-12-31 # 2 2 Jane 31 Female Postgraduate 65000 2019-08-15 2022-07-31 # 3 3 David 42 Male Graduate 80000 2020-03-01 2023-06-30 # 4 4 Mary 26 Female Undergraduate 45000 2020-05-01 2022-12-31 # 5 5 Chris 39 Male Postgraduate 70000 2021-04-01 2024-05-31 # 6 6 Alex 28 Male Graduate 90000 2019-12-01 2023-11-30
We further created two more data sets using the df object as described below:
df_missing
is a data frame that contains missing values, or NA values, in some of its rows. Missing values can occur for various reasons, such as incomplete data collection, data entry errors, or data processing errors. Handling missing values is an important part of data wrangling, as missing values can affect the accuracy and validity of statistical analysis and modeling.
# Add missing values
df_missing <- df
df_missing[c(3,7), c(3,6)] <- NA
df_duplicate
is a data frame that contains duplicate rows, or rows with identical values in all their columns. Duplicate rows can occur for various reasons, such as data entry errors, data merging errors, or data processing errors. Handling duplicate rows is an important part of data cleaning, as duplicate rows can affect the accuracy and validity of statistical analysis and modeling, and can also waste storage space and computing resources.
# Add duplicate rows
df_duplicate <- rbind(df, df[c(1,3,5),])
Loading required packages
We shall use two popular data wrangling packages in R – dplyr
and tidyr
.
dplyr
package is used for data manipulation and provides a set of functions for selecting specific columns, filtering rows based on conditions, grouping rows and performing aggregations on grouped data, joining tables, and creating new variables. The main functions in dplyr include select()
, filter()
, arrange()
, mutate()
, group_by()
, summarise()
, and left_join()
, among others.
tidyr
package is used for data reshaping and cleaning. It provides a set of functions for converting between wide and long formats, filling in missing values, and separating or uniting variables. The main functions in tidyr include gather()
, spread()
, separate()
, unite()
, and fill()
, among others.
After loading these packages, their functions can be called and used in the R script for data wrangling tasks.
# Load required libraries
library(dplyr)
library(tidyr)
Selecting columns
The select()
function from the dplyr package is used to create a new data frame df_select
by selecting specific columns from the original data frame df. The column names name, age, and salary are passed as arguments to the select()
function to extract only these columns. The resulting data frame df_select contains only these columns, with all rows preserved. This operation is also known as column subsetting or projection.
# Select columns by name
df_select <- select(df, name, age, salary)
head(df_select)
# name age salary # 1 John 25 50000 # 2 Jane 31 65000 # 3 David 42 80000 # 4 Mary 26 45000 # 5 Chris 39 70000 # 6 Alex 28 90000
Filtering rows
The below code is filtering the rows of the df data frame using the filter()
function from the dplyr package based on a condition. Specifically, the code selects rows where the education column is Graduate and the gender column is Male. The resulting data frame is assigned to the df_filter object.
# Filter rows based on a condition
df_filter <- filter(df,
education == "Graduate",
gender == "Male")
df_filter
# id name age gender education salary start_date end_date # 1 1 John 25 Male Graduate 50000 2021-01-01 2023-12-31 # 2 3 David 42 Male Graduate 80000 2020-03-01 2023-06-30 # 3 6 Alex 28 Male Graduate 90000 2019-12-01 2023-11-30 # 4 9 Tom 33 Male Graduate 60000 2022-01-01 2024-12-31
The below code filters the data frame df by selecting rows where the value in the salary column is greater than 70000, and assigns the resulting data frame to a new object df_filter2. Essentially, this code extracts all the rows in df where the salary of the individual is greater than 70000.
# Filter rows based on a condition
df_filter2 <- filter(df, salary > 70000)
df_filter2
# id name age gender education salary start_date end_date # 1 3 David 42 Male Graduate 80000 2020-03-01 2023-06-30 # 2 6 Alex 28 Male Graduate 90000 2019-12-01 2023-11-30 # 3 8 Anna 29 Female Postgraduate 75000 2021-06-01 2024-05-31 # 4 10 Mike 47 Male Undergraduate 85000 2019-10-01 2022-09-30
Mutating columns
The below R code calculates the percentage of salary for each individual in the data frame df. The pipe operator %>%
is used to pass the data frame df to the mutate
function, which creates a new column percent_salary
by dividing the salary of each individual by the sum of all salaries and multiplying by 100. The resulting data frame df_percent contains the original columns of df along with the new column percent_salary. We printed the second and ninth columns (name and percent_salary) from df_percent object.
df_percent <- df %>%
mutate(percent_salary = salary / sum(salary) * 100)
df_percent[, c(2,9)]
# name percent_salary # 1 John 7.407407 # 2 Jane 9.629630 # 3 David 11.851852 # 4 Mary 6.666667 # 5 Chris 10.370370 # 6 Alex 13.333333 # 7 Lisa 8.148148 # 8 Anna 11.111111 # 9 Tom 8.888889 # 10 Mike 12.592593
Grouping and summarizing data
The below code groups the df data frame by the education column using the group_by()
function from the dplyr package. The resulting df_group object is then used with the summarize()
function to calculate summary statistics, such as the average salary and maximum age for each group of education.
The summary statistics are stored in the df_summary object. Specifically, the avg_salary
column contains the average salary for each group, and the max_age
column contains the maximum age for each group.
# Group by a column and calculate summary statistics
df_group <- group_by(df, education)
df_summary <- summarize(df_group,
avg_salary = mean(salary),
max_age = max(age))
df_summary
# # A tibble: 3 x 3 # education avg_salary max_age # <chr> <dbl> <int> # 1 Graduate 70000 42 # 2 Postgraduate 66250 39 # 3 Undergraduate 65000 47
Handling missing data
The below R code imputes the missing values in the df_missing
dataframe using mean imputation. It first creates a new dataframe called df_imputed
that is identical to df_missing
. Then, it uses the mutate()
function from the dplyr package to replace the missing values in the age and salary columns with their respective column means.
The ifelse()
function is used to check if a value is missing using the is.na()
function. If a value is missing, the function replaces it with the mean of the non-missing values in the column using the mean()
function. The na.rm = TRUE
argument is used to ignore missing values when calculating the mean. The resulting df_imputed dataframe has no missing values in the age and salary columns.
# Impute missing values with mean imputation
df_imputed <- df_missing %>%
mutate(age = ifelse(is.na(age),
mean(age, na.rm = TRUE), age),
salary = ifelse(is.na(salary),
mean(salary, na.rm = TRUE), salary))
df_imputed
# id name age gender education salary start_date end_date # 1 1 John 25.00 Male Graduate 50000 2021-01-01 2023-12-31 # 2 2 Jane 31.00 Female Postgraduate 65000 2019-08-15 2022-07-31 # 3 3 David 32.25 Male Graduate 67500 2020-03-01 2023-06-30 # 4 4 Mary 26.00 Female Undergraduate 45000 2020-05-01 2022-12-31 # 5 5 Chris 39.00 Male Postgraduate 70000 2021-04-01 2024-05-31 # 6 6 Alex 28.00 Male Graduate 90000 2019-12-01 2023-11-30 # 7 7 Lisa 32.25 Female Postgraduate 67500 2020-02-15 2022-12-31 # 8 8 Anna 29.00 Female Postgraduate 75000 2021-06-01 2024-05-31 # 9 9 Tom 33.00 Male Graduate 60000 2022-01-01 2024-12-31 # 10 10 Mike 47.00 Male Undergraduate 85000 2019-10-01 2022-09-30
Removing duplicate rows
The below R code removes duplicate rows from the data frame named df_duplicate
. The function used here is distinct()
which returns a data frame with unique rows from the input data frame. In this case, the function removes the duplicate rows that were created by concatenating the original data frame df and its subset with row indices 1, 3, and 5. The resulting data frame df_unique has the same columns as df_duplicate, but with only unique rows.
# Remove duplicate rows
df_unique <- distinct(df_duplicate)
Reshaping data
Join two data frames based on common column
The below code performs a left join between two data frames df and df2 based on a common column education. The df data frame contains information about individuals, including their education level. The df2 data frame contains information about the degrees obtained by individuals with different education levels. The resulting data frame df_join
contains all the columns from df and df2 combined into one data frame. In the resulting data frame, the values in the columns of df2 are merged with the corresponding values in df based on the common column education. If there is no match in df2 for a particular value in df, then the columns from df2 will have NA values in the corresponding rows of the resulting data frame. The left_join
function is used to perform a left join, which means that all the rows from df are retained in the resulting data frame, and only the matching rows from df2 are added to the resulting data frame.
# Join two data frames based on a common column
df2 <- data.frame(education = c("Graduate", "Postgraduate"),
degree = c("Bachelors", "Masters"))
df_join <- left_join(df, df2, by = "education")
df_join
# id name age gender education salary start_date end_date degree # 1 1 John 25 Male Graduate 50000 2021-01-01 2023-12-31 Bachelors # 2 2 Jane 31 Female Postgraduate 65000 2019-08-15 2022-07-31 Masters # 3 3 David 42 Male Graduate 80000 2020-03-01 2023-06-30 Bachelors # 4 4 Mary 26 Female Undergraduate 45000 2020-05-01 2022-12-31 <NA> # 5 5 Chris 39 Male Postgraduate 70000 2021-04-01 2024-05-31 Masters # 6 6 Alex 28 Male Graduate 90000 2019-12-01 2023-11-30 Bachelors # 7 7 Lisa 35 Female Postgraduate 55000 2020-02-15 2022-12-31 Masters # 8 8 Anna 29 Female Postgraduate 75000 2021-06-01 2024-05-31 Masters # 9 9 Tom 33 Male Graduate 60000 2022-01-01 2024-12-31 Bachelors # 10 10 Mike 47 Male Undergraduate 85000 2019-10-01 2022-09-30 <NA>
Reshape data from wide to long formt
The code reshapes the data frame df_wide from a wide to a long format using the pivot_longer()
function from the tidyr
library. The resulting data frame df_long has three columns: id
, variable
, and value.
The cols
argument specifies the columns to be converted from wide to long format. The names_to
argument specifies the name of the column that will contain the variable names, and the values_to
argument specifies the name of the column that will contain the values. In this case, the resulting df_long data frame has one row for each combination of id and variable from the original df_wide data frame.
# Reshape data from wide to long format
df_wide <- data.frame(id = c("A","B","C","D","E"),
var1 = c(2,5,3,4,1),
var2 = c(5,9,6,4,7),
var3 = c(3.5, 2.0, 1.8, 2.2, 3.9))
df_long <- pivot_longer(df_wide, cols = c("var1", "var2", "var3"),
names_to = "variable",
values_to = "value")
df_long
# # A tibble: 15 x 3 # id variable value # <chr> <chr> <dbl> # 1 A var1 2 # 2 A var2 5 # 3 A var3 3.5 # 4 B var1 5 # 5 B var2 9 # 6 B var3 2 # 7 C var1 3 # 8 C var2 6 # 9 C var3 1.8 # 10 D var1 4 # 11 D var2 4 # 12 D var3 2.2 # 13 E var1 1 # 14 E var2 7 # 15 E var3 3.9
Reshape data from long to wide formt
The below code is reshaping the df_long data frame from long format to wide format using the pivot_wider()
function. The resulting data frame df_wide2 will have id as the identifier column, variable as the column headers, and value as the values of the new columns. The names_from
parameter specifies the column to be used as column headers, and values_from
parameter specifies the column to be used as the values for each of the new columns. The id_cols
parameter is used to specify the column(s) to be kept as identifier columns.
# Reshape data from long to wide format
df_wide2 <- pivot_wider(df_long,
id_cols = "id",
names_from = "variable",
values_from = "value")
df_wide2
# # A tibble: 5 x 4 # id var1 var2 var3 # <chr> <dbl> <dbl> <dbl> # 1 A 2 5 3.5 # 2 B 5 9 2 # 3 C 3 6 1.8 # 4 D 4 4 2.2 # 5 E 1 7 3.9
Download R program — Click_here
Download R studio — Click_here