AGRON INFO TECH

Best Practices for Data Wrangling in R

Data wrangling is the process of cleaning, transforming, and reshaping data into a more useful format for analysis. R is a popular programming language for data wrangling, with many built-in functions and packages designed specifically for this purpose. Here we shall use some common techniques for data wrangling.

Description of the data

We shall use sales data in this example. Each column represents a variable related to an individual’s personal information, such as ID, name, age, gender, education, salary, and employment start and end dates.

  • id: A numeric variable representing the unique identifier for each observation.
  • name: A character variable representing the name of each individual.
  • age: A numeric variable representing the age of each individual.
  • gender: A categorical variable representing the gender of each individual (Male or Female).
  • education: A categorical variable representing the level of education of each individual (Graduate, Postgraduate, or Undergraduate).
  • salary: A numeric variable representing the salary of each individual.
  • start_date: A date variable representing the start date of employment for each individual.
  • end_date: A date variable representing the end date of employment for each individual.

The df_missing and df_duplicate objects are created by manipulating the original df object. df_missing object contains missing values in the third and fifth columns, in rows 3 and 7. df_duplicate object contains duplicate rows of the original df data frame with row 1, 3, and 5 repeated.

Importing data

The sales data was imported using read.csv() function.

# Read the CSV file into a data frame
df <- read.csv("sales_data.csv", header = T)
head(df)
#   id  name age gender     education salary start_date   end_date
# 1  1  John  25   Male      Graduate  50000 2021-01-01 2023-12-31
# 2  2  Jane  31 Female  Postgraduate  65000 2019-08-15 2022-07-31
# 3  3 David  42   Male      Graduate  80000 2020-03-01 2023-06-30
# 4  4  Mary  26 Female Undergraduate  45000 2020-05-01 2022-12-31
# 5  5 Chris  39   Male  Postgraduate  70000 2021-04-01 2024-05-31
# 6  6  Alex  28   Male      Graduate  90000 2019-12-01 2023-11-30

We further created two more data sets using the df object as described below:

  • df_missing is a data frame that contains missing values, or NA values, in some of its rows. Missing values can occur for various reasons, such as incomplete data collection, data entry errors, or data processing errors. Handling missing values is an important part of data wrangling, as missing values can affect the accuracy and validity of statistical analysis and modeling.
# Add missing values
df_missing <- df
df_missing[c(3,7), c(3,6)] <- NA
  • df_duplicate is a data frame that contains duplicate rows, or rows with identical values in all their columns. Duplicate rows can occur for various reasons, such as data entry errors, data merging errors, or data processing errors. Handling duplicate rows is an important part of data cleaning, as duplicate rows can affect the accuracy and validity of statistical analysis and modeling, and can also waste storage space and computing resources.
# Add duplicate rows
df_duplicate <- rbind(df, df[c(1,3,5),])

Loading required packages

We shall use two popular data wrangling packages in R – dplyr and tidyr.

dplyr package is used for data manipulation and provides a set of functions for selecting specific columns, filtering rows based on conditions, grouping rows and performing aggregations on grouped data, joining tables, and creating new variables. The main functions in dplyr include select()filter()arrange()mutate()group_by()summarise(), and left_join(), among others.

tidyr package is used for data reshaping and cleaning. It provides a set of functions for converting between wide and long formats, filling in missing values, and separating or uniting variables. The main functions in tidyr include gather()spread()separate()unite(), and fill(), among others.

After loading these packages, their functions can be called and used in the R script for data wrangling tasks.

# Load required libraries
library(dplyr)
library(tidyr)

Selecting columns

The select() function from the dplyr package is used to create a new data frame df_select by selecting specific columns from the original data frame df. The column names name, age, and salary are passed as arguments to the select() function to extract only these columns. The resulting data frame df_select contains only these columns, with all rows preserved. This operation is also known as column subsetting or projection.

# Select columns by name
df_select <- select(df, name, age, salary)
head(df_select)
#    name age salary
# 1  John  25  50000
# 2  Jane  31  65000
# 3 David  42  80000
# 4  Mary  26  45000
# 5 Chris  39  70000
# 6  Alex  28  90000

Filtering rows

The below code is filtering the rows of the df data frame using the filter() function from the dplyr package based on a condition. Specifically, the code selects rows where the education column is Graduate and the gender column is Male. The resulting data frame is assigned to the df_filter object.

# Filter rows based on a condition
df_filter <- filter(df, 
                    education == "Graduate", 
                    gender == "Male")

df_filter
#   id  name age gender education salary start_date   end_date
# 1  1  John  25   Male  Graduate  50000 2021-01-01 2023-12-31
# 2  3 David  42   Male  Graduate  80000 2020-03-01 2023-06-30
# 3  6  Alex  28   Male  Graduate  90000 2019-12-01 2023-11-30
# 4  9   Tom  33   Male  Graduate  60000 2022-01-01 2024-12-31

The below code filters the data frame df by selecting rows where the value in the salary column is greater than 70000, and assigns the resulting data frame to a new object df_filter2. Essentially, this code extracts all the rows in df where the salary of the individual is greater than 70000.

# Filter rows based on a condition
df_filter2 <- filter(df, salary > 70000)

df_filter2
#   id  name age gender     education salary start_date   end_date
# 1  3 David  42   Male      Graduate  80000 2020-03-01 2023-06-30
# 2  6  Alex  28   Male      Graduate  90000 2019-12-01 2023-11-30
# 3  8  Anna  29 Female  Postgraduate  75000 2021-06-01 2024-05-31
# 4 10  Mike  47   Male Undergraduate  85000 2019-10-01 2022-09-30

Mutating columns

The below R code calculates the percentage of salary for each individual in the data frame df. The pipe operator %>% is used to pass the data frame df to the mutate function, which creates a new column percent_salary by dividing the salary of each individual by the sum of all salaries and multiplying by 100. The resulting data frame df_percent contains the original columns of df along with the new column percent_salary. We printed the second and ninth columns (name and percent_salary) from df_percent object.

df_percent <- df %>% 
  mutate(percent_salary = salary / sum(salary) * 100)
df_percent[, c(2,9)]
#     name percent_salary
# 1   John       7.407407
# 2   Jane       9.629630
# 3  David      11.851852
# 4   Mary       6.666667
# 5  Chris      10.370370
# 6   Alex      13.333333
# 7   Lisa       8.148148
# 8   Anna      11.111111
# 9    Tom       8.888889
# 10  Mike      12.592593

Grouping and summarizing data

The below code groups the df data frame by the education column using the group_by() function from the dplyr package. The resulting df_group object is then used with the summarize() function to calculate summary statistics, such as the average salary and maximum age for each group of education.

The summary statistics are stored in the df_summary object. Specifically, the avg_salary column contains the average salary for each group, and the max_age column contains the maximum age for each group.

# Group by a column and calculate summary statistics
df_group <- group_by(df, education)
df_summary <- summarize(df_group, 
                        avg_salary = mean(salary), 
                        max_age = max(age))
df_summary
# # A tibble: 3 x 3
#   education     avg_salary max_age
#   <chr>              <dbl>   <int>
# 1 Graduate           70000      42
# 2 Postgraduate       66250      39
# 3 Undergraduate      65000      47

Handling missing data

The below R code imputes the missing values in the df_missing dataframe using mean imputation. It first creates a new dataframe called df_imputed that is identical to df_missing. Then, it uses the mutate() function from the dplyr package to replace the missing values in the age and salary columns with their respective column means.

The ifelse() function is used to check if a value is missing using the is.na() function. If a value is missing, the function replaces it with the mean of the non-missing values in the column using the mean() function. The na.rm = TRUE argument is used to ignore missing values when calculating the mean. The resulting df_imputed dataframe has no missing values in the age and salary columns.

# Impute missing values with mean imputation
df_imputed <- df_missing %>% 
          mutate(age = ifelse(is.na(age), 
                              mean(age, na.rm = TRUE), age),
                 salary = ifelse(is.na(salary), 
                                 mean(salary, na.rm = TRUE), salary))
df_imputed
#    id  name   age gender     education salary start_date   end_date
# 1   1  John 25.00   Male      Graduate  50000 2021-01-01 2023-12-31
# 2   2  Jane 31.00 Female  Postgraduate  65000 2019-08-15 2022-07-31
# 3   3 David 32.25   Male      Graduate  67500 2020-03-01 2023-06-30
# 4   4  Mary 26.00 Female Undergraduate  45000 2020-05-01 2022-12-31
# 5   5 Chris 39.00   Male  Postgraduate  70000 2021-04-01 2024-05-31
# 6   6  Alex 28.00   Male      Graduate  90000 2019-12-01 2023-11-30
# 7   7  Lisa 32.25 Female  Postgraduate  67500 2020-02-15 2022-12-31
# 8   8  Anna 29.00 Female  Postgraduate  75000 2021-06-01 2024-05-31
# 9   9   Tom 33.00   Male      Graduate  60000 2022-01-01 2024-12-31
# 10 10  Mike 47.00   Male Undergraduate  85000 2019-10-01 2022-09-30

Removing duplicate rows

The below R code removes duplicate rows from the data frame named df_duplicate. The function used here is distinct() which returns a data frame with unique rows from the input data frame. In this case, the function removes the duplicate rows that were created by concatenating the original data frame df and its subset with row indices 1, 3, and 5. The resulting data frame df_unique has the same columns as df_duplicate, but with only unique rows.

# Remove duplicate rows
df_unique <- distinct(df_duplicate)

Reshaping data

Join two data frames based on common column

The below code performs a left join between two data frames df and df2 based on a common column education. The df data frame contains information about individuals, including their education level. The df2 data frame contains information about the degrees obtained by individuals with different education levels. The resulting data frame df_join contains all the columns from df and df2 combined into one data frame. In the resulting data frame, the values in the columns of df2 are merged with the corresponding values in df based on the common column education. If there is no match in df2 for a particular value in df, then the columns from df2 will have NA values in the corresponding rows of the resulting data frame. The left_join function is used to perform a left join, which means that all the rows from df are retained in the resulting data frame, and only the matching rows from df2 are added to the resulting data frame.

# Join two data frames based on a common column
df2 <- data.frame(education = c("Graduate", "Postgraduate"), 
                  degree = c("Bachelors", "Masters"))
df_join <- left_join(df, df2, by = "education")
df_join
#    id  name age gender     education salary start_date   end_date    degree
# 1   1  John  25   Male      Graduate  50000 2021-01-01 2023-12-31 Bachelors
# 2   2  Jane  31 Female  Postgraduate  65000 2019-08-15 2022-07-31   Masters
# 3   3 David  42   Male      Graduate  80000 2020-03-01 2023-06-30 Bachelors
# 4   4  Mary  26 Female Undergraduate  45000 2020-05-01 2022-12-31      <NA>
# 5   5 Chris  39   Male  Postgraduate  70000 2021-04-01 2024-05-31   Masters
# 6   6  Alex  28   Male      Graduate  90000 2019-12-01 2023-11-30 Bachelors
# 7   7  Lisa  35 Female  Postgraduate  55000 2020-02-15 2022-12-31   Masters
# 8   8  Anna  29 Female  Postgraduate  75000 2021-06-01 2024-05-31   Masters
# 9   9   Tom  33   Male      Graduate  60000 2022-01-01 2024-12-31 Bachelors
# 10 10  Mike  47   Male Undergraduate  85000 2019-10-01 2022-09-30      <NA>

Reshape data from wide to long formt

The code reshapes the data frame df_wide from a wide to a long format using the pivot_longer() function from the tidyr library. The resulting data frame df_long has three columns: idvariable, and value. The cols argument specifies the columns to be converted from wide to long format. The names_to argument specifies the name of the column that will contain the variable names, and the values_to argument specifies the name of the column that will contain the values. In this case, the resulting df_long data frame has one row for each combination of id and variable from the original df_wide data frame.

# Reshape data from wide to long format
df_wide <- data.frame(id = c("A","B","C","D","E"), 
                      var1 = c(2,5,3,4,1), 
                      var2 = c(5,9,6,4,7), 
                      var3 = c(3.5, 2.0, 1.8, 2.2, 3.9))
df_long <- pivot_longer(df_wide, cols = c("var1", "var2", "var3"), 
                        names_to = "variable", 
                        values_to = "value")
df_long
# # A tibble: 15 x 3
#    id    variable value
#    <chr> <chr>    <dbl>
#  1 A     var1       2  
#  2 A     var2       5  
#  3 A     var3       3.5
#  4 B     var1       5  
#  5 B     var2       9  
#  6 B     var3       2  
#  7 C     var1       3  
#  8 C     var2       6  
#  9 C     var3       1.8
# 10 D     var1       4  
# 11 D     var2       4  
# 12 D     var3       2.2
# 13 E     var1       1  
# 14 E     var2       7  
# 15 E     var3       3.9

Reshape data from long to wide formt

The below code is reshaping the df_long data frame from long format to wide format using the pivot_wider() function. The resulting data frame df_wide2 will have id as the identifier column, variable as the column headers, and value as the values of the new columns. The names_from parameter specifies the column to be used as column headers, and values_from parameter specifies the column to be used as the values for each of the new columns. The id_cols parameter is used to specify the column(s) to be kept as identifier columns.

# Reshape data from long to wide format
df_wide2 <- pivot_wider(df_long, 
                        id_cols = "id", 
                        names_from = "variable", 
                        values_from = "value")
df_wide2
# # A tibble: 5 x 4
#   id     var1  var2  var3
#   <chr> <dbl> <dbl> <dbl>
# 1 A         2     5   3.5
# 2 B         5     9   2  
# 3 C         3     6   1.8
# 4 D         4     4   2.2
# 5 E         1     7   3.9

Download R program — Click_here

Download R studio — Click_here