dplyr is an R package for data manipulation, created and maintained by Hadley Wickham. It provides a small set of easy-to-use functions that are very handy for exploratory data analysis and data manipulation. For this article we will use the flights table from the nycflights13 package. It contains information about flights that departed from New York City in 2013: the flight number, the arrival and departure times, the origin and destination, and the carrier, air time and distance. Before we dive into the functions, let’s load the dplyr package.
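A minimal sketch of the setup, assuming both packages are already installed from CRAN (the install line is commented out and only needs to be run once):

```r
# install.packages(c("dplyr", "nycflights13"))  # run once if not installed

library(dplyr)         # data manipulation functions
library(nycflights13)  # provides the `flights` data frame

dim(flights)           # number of rows and columns in the dataset
```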
The head of the dataset looks like this:
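One way to inspect the first few rows is base R's head function, sketched here on the flights table:

```r
library(dplyr)
library(nycflights13)

# head() returns the first six rows by default
first_rows <- head(flights)
first_rows
```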
- Filter: The filter function returns all the rows that satisfy the given conditions. The example below returns all the rows where the month is September, the day is 2 and the origin of the flight is LGA.
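A sketch of the filter call described above; the result name sept2_lga is just an illustrative choice. Conditions separated by commas are combined with a logical AND.

```r
library(dplyr)
library(nycflights13)

# Keep only flights on 2 September that departed from LGA
sept2_lga <- filter(flights, month == 9, day == 2, origin == "LGA")
sept2_lga
```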
- Mutate: The mutate function adds new variables to the data. The example below adds a new column, overall_delay, computed by subtracting the departure delay from the arrival delay.
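A sketch of the mutate call described above, using the arr_delay and dep_delay columns of flights; the result name flights_delay is an illustrative choice.

```r
library(dplyr)
library(nycflights13)

# Add overall_delay = arrival delay - departure delay
flights_delay <- mutate(flights, overall_delay = arr_delay - dep_delay)

# Show only the columns involved, to make the new column easy to check
select(flights_delay, dep_delay, arr_delay, overall_delay)
```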
- Summarise: The summarise function collapses multiple values into a single value. It is very powerful when used in conjunction with the other functions in the dplyr package, as demonstrated below. Setting na.rm = TRUE removes all NA values when calculating the mean, sum and standard deviation of air time; without it, a single NA in the column would make each result NA.
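A sketch of the summarise call described above; the output column names (mean_air_time and so on) are illustrative choices.

```r
library(dplyr)
library(nycflights13)

# Collapse the whole table into one row of summary statistics for air_time
air_time_summary <- summarise(
  flights,
  mean_air_time  = mean(air_time, na.rm = TRUE),
  total_air_time = sum(air_time, na.rm = TRUE),
  sd_air_time    = sd(air_time, na.rm = TRUE)
)
air_time_summary
```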
- Group By: The group_by function groups data by one or more variables. The example below groups the mtcars data by the number of cylinders, and the summarise function then calculates the mean number of gears for each cylinder count.
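A sketch of this grouping, assuming the data in question is R's built-in mtcars dataset (whose cyl and gear columns match the description):

```r
library(dplyr)

# Group the cars by number of cylinders
by_cyl <- group_by(mtcars, cyl)

# One row per cylinder count, with the mean number of gears
gear_by_cyl <- summarise(by_cyl, mean_gear = mean(gear))
gear_by_cyl
```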
The following example groups all cars by the number of cylinders and then calculates the mean of gear and horsepower (hp).
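A sketch with two summary columns, again assuming the built-in mtcars dataset; the call to group_by is nested inside summarise here.

```r
library(dplyr)

# Mean gear and mean horsepower for each cylinder count
gear_hp_by_cyl <- summarise(
  group_by(mtcars, cyl),
  mean_gear = mean(gear),
  mean_hp   = mean(hp)
)
gear_hp_by_cyl
```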
- Sample: The sample_n and sample_frac functions select random rows from a table. The first line of code randomly selects ten rows from the dataset, and the second line randomly selects 20% of the rows from the original dataset.
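A sketch of the two sampling calls using dplyr's sample_n and sample_frac. In dplyr 1.0.0 and later these are superseded by slice_sample, though they still work. The set.seed call is an addition here so that the random selection is repeatable.

```r
library(dplyr)
library(nycflights13)

set.seed(42)  # arbitrary seed, only to make the sample repeatable

ten_rows       <- sample_n(flights, 10)      # ten random rows
twenty_percent <- sample_frac(flights, 0.2)  # a random 20% of the rows
```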
- Arrange: The arrange function sorts rows by one or more variables, in ascending or descending order. The example below sorts all the rows in descending order of departure time.
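A sketch of the sort described above. Wrapping a column in desc reverses the default ascending order; rows with a missing dep_time are placed at the end.

```r
library(dplyr)
library(nycflights13)

# Latest departure times first
by_dep_time <- arrange(flights, desc(dep_time))
by_dep_time
```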
- Pipe Operator: The pipe operator, written %>% (provided by the magrittr package and available once dplyr is loaded), chains function calls together. It is very useful when you are performing several operations on data and don’t want to save the output at each intermediate step. The example below shows how the pipe operator compares with nested calls and multiple assignments.
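To illustrate the two styles without the pipe, here is a sketch on an assumed task: the mean arrival delay per carrier for flights leaving LGA. The task and the variable names are illustrative choices, not taken from the original example.

```r
library(dplyr)
library(nycflights13)

# Style 1: multiple assignments, saving each intermediate result
lga            <- filter(flights, origin == "LGA")
lga_by_carrier <- group_by(lga, carrier)
delay_steps    <- summarise(lga_by_carrier,
                            mean_arr_delay = mean(arr_delay, na.rm = TRUE))

# Style 2: nested calls, which must be read from the inside out
delay_nested <- summarise(
  group_by(filter(flights, origin == "LGA"), carrier),
  mean_arr_delay = mean(arr_delay, na.rm = TRUE)
)
```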
The above can be done using the pipe operator as follows:
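As a sketch of a piped chain, assume we want the mean arrival delay per carrier for flights leaving LGA (an illustrative task chosen for this example). Each step passes its result to the next as the first argument, so the code reads top to bottom:

```r
library(dplyr)
library(nycflights13)

delay_piped <- flights %>%
  filter(origin == "LGA") %>%   # keep LGA departures
  group_by(carrier) %>%         # one group per carrier
  summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE))

delay_piped
```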
Conclusion: We saw how the dplyr package lets us manipulate data in R easily using a small number of functions.
Below are some exercises for you to solve using the airquality dataset, which contains air quality measurements taken in New York from May 1973 to September 1973.
Q1. Find all the records where the temperature is greater than 70 and the month is after June.
Q2. Add a new column Temp_C that displays the temperature in Celsius.
Q3. Summarise the Temperature column by calculating its mean, removing the NA values in that column.
Q4. Find the mean of Temperature for each Month by grouping the data based on the month.
Q5. Create two sample datasets: one by extracting 20 random rows from the dataset and one by extracting a 15% sample of the dataset.
Q6. Arrange the records in the dataset by descending order of Month and ascending order of Day.
Q7. Remove all the data corresponding to the 7th month, group the remaining data by Month and find the mean temperature for each month.
You can mail us your solution. Our mail id: firstname.lastname@example.org
About Vivek Singh & Avijeet Biswal:
Vivek Singh holds a B.Tech in Electrical Engineering. He is currently working as an Analyst Intern with NikhilGuru Consulting Analytics Service LLP (Nikhil Analytics), Bangalore.
Avijeet Biswal holds a B.Tech in Computer Science. He worked as an Analyst Intern with NikhilGuru Consulting Analytics Service LLP (Nikhil Analytics), Bangalore.