Data cleaning is the process of transforming raw data into consistent data that can be analyzed. It aims to improve both the content and the reliability of the statistical statements drawn from the data, and it may profoundly influence those statements.
R has a comprehensive set of tools designed to clean data effectively.
STEP 1: Initial Exploratory Analysis
The first step in the overall data cleaning process is an initial exploration of the data frame that you have just imported into R. It is important to understand how to import data into R and save it as a data frame.
data<-read.csv("Regression-Analysis-House Pricing.csv",na.strings = "")
The first thing that you should do is check the class of your data frame with class(data).
The output is "data.frame", which confirms that our dataset is saved as a data frame.
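A minimal sketch of the class check, assuming a small hypothetical data frame named houses in place of the imported file (the column values are made up for illustration):

```r
# Hypothetical stand-in for the imported house-pricing data
houses <- data.frame(
  Carpet = c(1000, 1250, 980),
  Parking = c("Open", "Covered", "Not Provided"),
  stringsAsFactors = FALSE
)

# class() reports how R stores the object
class(houses)

# is.data.frame() answers the same question with TRUE or FALSE
is.data.frame(houses)
```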
Next, we want to check the number of rows and columns the data frame has.
Calling dim(data) gives the following result:
[1] 932 10
Here we can see that the data frame has 932 rows and 10 columns.
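As a rough illustration, the same check can be run on a small hypothetical data frame (houses and its values are assumptions, not the article's dataset). dim() returns rows first and columns second, while nrow() and ncol() return each count on its own:

```r
# Hypothetical data frame with 4 rows and 3 columns
houses <- data.frame(
  Carpet = c(1000, 1250, 980, 1100),
  Floor = c(2, 5, 1, 3),
  Parking = c("Open", "Covered", "Open", "Not Provided"),
  stringsAsFactors = FALSE
)

# Number of rows followed by number of columns
dim(houses)

# The two counts individually
nrow(houses)
ncol(houses)
```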
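We can view the summary statistics for all the columns of the data frame with summary(data).
For each numeric column the output lists the minimum, first quartile, median, mean, third quartile and maximum, together with the number of missing values if there are any.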
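A small sketch of summary() on a hypothetical data frame (houses and its values are assumptions made for illustration). The function can be applied to the whole data frame or to a single column:

```r
# Hypothetical data with one missing value in Carpet
houses <- data.frame(
  Carpet = c(1000, 1250, NA, 1100),
  Parking = c("Open", "Covered", "Open", "Not Provided"),
  stringsAsFactors = FALSE
)

# Summary of every column at once
all_columns <- summary(houses)
print(all_columns)

# Summary of a single numeric column
carpet_summary <- summary(houses$Carpet)
print(carpet_summary)
```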
STEP 2: Visual Exploratory Analysis
There are two types of plots that you should use during your cleaning process: the histogram and the box plot.
The histogram is very useful for visualizing the overall distribution of a numeric column. We can see whether the distribution is roughly normal, unimodal, bimodal or of some other shape of interest. We can also use histograms to spot outliers in the numeric column under study. A histogram for a particular column can be plotted with the hist() function.
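A minimal sketch of a histogram call, assuming a hypothetical data frame houses with made-up Dist_Taxi values. The title and axis label are illustrative choices:

```r
# Hypothetical distances; the last value is deliberately extreme
houses <- data.frame(
  Dist_Taxi = c(5200, 6100, 7400, 8000, 8300, 8900, 9100, 9600, 10400, 20800)
)

# Draw the histogram and keep the returned object for inspection
h <- hist(houses$Dist_Taxi,
          main = "Distance to taxi stand",
          xlab = "Dist_Taxi")

# Bin edges and the number of observations in each bin
h$breaks
h$counts
```

A bar that sits far away from the rest of the distribution is a hint that the column contains an outlier.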
Box plots are very useful because they show the median along with the first and third quartiles, and they draw points that lie far outside the quartiles individually. This makes the box plot one of the most convenient ways of spotting outliers in your data frame. A box plot can be drawn with the boxplot() function.
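A minimal sketch of a box plot, assuming a hypothetical data frame houses with made-up Dist_Taxi values. boxplot() also returns the numbers behind the drawing, which makes the flagged outliers easy to read off:

```r
# Hypothetical distances; the last value is deliberately extreme
houses <- data.frame(
  Dist_Taxi = c(5200, 6100, 7400, 8000, 8300, 8900, 9100, 9600, 10400, 20800)
)

# Draw the box plot and keep the returned statistics
b <- boxplot(houses$Dist_Taxi,
             main = "Distance to taxi stand",
             ylab = "Dist_Taxi")

# Lower whisker, lower hinge, median, upper hinge, upper whisker
b$stats

# Points beyond 1.5 times the box height from the hinges: here 20800
b$out
```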
STEP 3: Correcting the errors!
This step focuses on the methods that you can use to correct the errors that you have found.
If we want to change the name of a column in our data frame, we can assign a new name to it, for instance through names() or colnames().
In this example the Carpet column is renamed to "Carpet_area".
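One way to do the rename in base R is sketched here on a hypothetical data frame named houses. This is an illustration rather than the article's original code, which is not available:

```r
# Hypothetical data frame with a Carpet column
houses <- data.frame(
  Carpet = c(1000, 1250, 980),
  Floor = c(2, 5, 1)
)

# Replace the name only where it currently equals "Carpet"
names(houses)[names(houses) == "Carpet"] <- "Carpet_area"

# Confirm the new column names
names(houses)
```

Matching on the old name instead of a column position keeps the code correct even if the column order changes.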
Sometimes columns have an incorrect type associated with them, for example a column of numbers that has been read in as text. In such a case we can change the type of the column with one of R's conversion functions.
There is a wide array of type conversions you can carry out in R, including as.character(), as.numeric(), as.integer(), as.factor(), as.logical() and as.Date().
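A short sketch of a few conversions, assuming a hypothetical data frame houses in which Floor was read in as text (all values are made up for illustration):

```r
# Hypothetical data: Floor holds numbers stored as text
houses <- data.frame(
  Floor = c("2", "5", "1"),
  City_Category = c("CAT A", "CAT B", "CAT A"),
  Carpet = c(1000, 1250, 980),
  stringsAsFactors = FALSE
)

# Text to numeric
houses$Floor <- as.numeric(houses$Floor)

# Text to factor (categorical)
houses$City_Category <- as.factor(houses$City_Category)

# Numeric to text
houses$Carpet <- as.character(houses$Carpet)

# Check the resulting type of every column
sapply(houses, class)
```

Note that as.numeric() turns any text that cannot be parsed as a number into NA with a warning, so it is worth checking for new missing values after a conversion.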
String manipulation in R comes in handy when you are working with datasets that have a lot of text-based elements.
In order to change all the text in a particular column to uppercase or lowercase, we can use the toupper() and tolower() functions respectively.
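A minimal sketch of the case conversion, assuming a hypothetical data frame houses with an inconsistently capitalized Parking column:

```r
# Hypothetical data with mixed capitalization
houses <- data.frame(
  Parking = c("Open", "covered", "Not Provided"),
  stringsAsFactors = FALSE
)

# Making all uppercase
houses$Parking_upper <- toupper(houses$Parking)

# Making all lowercase
houses$Parking_lower <- tolower(houses$Parking)

houses
```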
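If we want to trim the whitespace from the text in a column, we can first install and load a string-handling package such as stringr and then remove all leading and trailing whitespace, for example with its str_trim() function. Base R's trimws() does the same without an extra package.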
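The sketch below trims whitespace on a hypothetical data frame houses. It uses base R's trimws() so that no package has to be installed; stringr's str_trim() is an equivalent for those who prefer that package:

```r
# Hypothetical data with stray spaces around the text
houses <- data.frame(
  Parking = c("  Open", "Covered  ", " Not Provided "),
  stringsAsFactors = FALSE
)

# Remove leading and trailing whitespace; spaces inside the text are kept
houses$Parking <- trimws(houses$Parking)

houses$Parking
```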
If we want to replace a particular word or letter in a column, we can do so using the code below:
#Replacing "Not Provided" with "Not Available" (str_replace() comes from the stringr package)
library(stringr)
data$Parking<-str_replace(data$Parking,"Not Provided","Not Available")
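To replace the outliers with a summary statistic such as the median, the following code is used.
#Replacing the outliers of a particular column with median
#vec1 is assumed to hold the outlier values, for example those flagged by boxplot.stats()
vec1<-boxplot.stats(data$Dist_Taxi)$out
data$Dist_Taxi[data$Dist_Taxi %in% vec1]<-median(data$Dist_Taxi,na.rm = TRUE)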
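A worked sketch of the replacement on a hypothetical data frame houses. How the outlier vector is built is an assumption here: boxplot.stats() is used to flag values more than 1.5 times the box height away from the hinges, which is one common rule and not necessarily the one the original article used:

```r
# Hypothetical distances; 300 is far from the other values
houses <- data.frame(Dist_Taxi = c(10, 12, 11, 13, 12, 300))

# Values flagged as outliers by the box plot rule
outliers <- boxplot.stats(houses$Dist_Taxi)$out

# Median of the column, computed before anything is replaced
column_median <- median(houses$Dist_Taxi, na.rm = TRUE)

# Overwrite every flagged value with the median
houses$Dist_Taxi[houses$Dist_Taxi %in% outliers] <- column_median

houses$Dist_Taxi
```

Computing the median before the replacement keeps the result independent of the order in which the values are overwritten.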
Missing values can be inspected, removed or replaced. The common operations are:
#Checking for missing values in the entire dataframe
#Checking for the total number of missing values in the entire dataframe
#Checking for the total number of missing values in a particular column
#Eliminating missing values completely from the entire dataframe
#Eliminating missing values completely from a particular column
#Replacing the NAs in the entire dataframe with 0s
#Replacing the NAs in a particular column with 0s
#Replacing the NAs in a particular column with a summary statistic like the median
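The operations listed above can be sketched in base R as follows. The data frame houses and its values are hypothetical, and each result is stored under its own name so that the steps do not interfere with one another:

```r
# Hypothetical data with three missing values
houses <- data.frame(
  Carpet = c(1000, NA, 980, 1100),
  Dist_Taxi = c(5200, 6100, NA, NA)
)

# Checking for missing values in the entire data frame (TRUE marks an NA)
is.na(houses)

# Total number of missing values in the entire data frame
total_missing <- sum(is.na(houses))

# Total number of missing values in a particular column
taxi_missing <- sum(is.na(houses$Dist_Taxi))

# Dropping every row that has a missing value in any column
complete_rows <- na.omit(houses)

# Dropping only the rows where a particular column is missing
carpet_known <- houses[!is.na(houses$Carpet), ]

# Replacing the NAs in the entire data frame with 0
zero_filled <- houses
zero_filled[is.na(zero_filled)] <- 0

# Replacing the NAs in a particular column with 0
zero_column <- houses
zero_column$Carpet[is.na(zero_column$Carpet)] <- 0

# Replacing the NAs in a particular column with the column median
median_filled <- houses
median_filled$Carpet[is.na(median_filled$Carpet)] <-
  median(houses$Carpet, na.rm = TRUE)
```

Replacing with 0 is only sensible when zero is a meaningful value for the column; for measurements such as areas or distances, the median is usually the safer choice.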
If we want to unite two columns in our data frame, we can do so using the code shown below:
#Installing and loading the required package
install.packages("tidyr")
library(tidyr)
data1<-unite(data = data,col = city_category_with_parking,City_Category,Parking,sep = "-")
The unite() function is given the data frame, the name of the new column and the names of the two columns that you want to unite. Its sep argument sets the separator placed between the united values (an underscore by default).
Conversely, we can also separate a column as shown below:
data2<-separate(data = data1,col = city_category_with_parking,into = c("City_Category","Parking"),sep = "-")
The separate() function takes 4 arguments: the data frame, the column that we want to separate, the names of the new columns and the separator at which the column should be split.
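To make the round trip concrete without requiring tidyr, the sketch below imitates it in base R with paste() and strsplit(). This is a stand-in for illustration and not how unite() and separate() are implemented; the data frame houses and its values are hypothetical:

```r
# Hypothetical data with the two columns to combine
houses <- data.frame(
  City_Category = c("CAT A", "CAT B"),
  Parking = c("Open", "Covered"),
  stringsAsFactors = FALSE
)

# Uniting: glue the two columns together with "-" in between
houses$city_category_with_parking <-
  paste(houses$City_Category, houses$Parking, sep = "-")

# Separating: split at the same "-" and rebuild the two columns
parts <- strsplit(houses$city_category_with_parking, "-", fixed = TRUE)
split_back <- data.frame(
  City_Category = sapply(parts, `[`, 1),
  Parking = sapply(parts, `[`, 2),
  stringsAsFactors = FALSE
)

split_back
```

The split only recovers the original columns when the separator used for uniting and for separating is the same and does not occur inside the values themselves.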
Steps 1 to 3 above give you a relatively clean dataset. Keep exploring new ways to clean your data.
Chandana holds a B.E. She was working as an Analyst Intern with Nikhil Guru Consulting Analytics Service LLP (Nikhil Analytics), Bangalore.