Hot Topics in Analytics

# Correlation and its Application using R

Correlation

The word correlation is used in everyday life to denote some form of relationship. In statistical terms, however, correlation denotes the association between two quantitative variables. It is one of the most common and most useful statistics: a single number that describes the degree of relationship between two numeric variables. A positive correlation indicates the extent to which the variables increase or decrease in parallel; a negative correlation indicates the extent to which one variable increases as the other decreases.

Correlation is measured using the correlation coefficient (or “r”). The value of r ranges from -1.0 to +1.0. The closer r is to +1 or -1, the more closely the two variables are linearly related; a value near 0 indicates little or no linear relationship.

Usually, in statistics, we measure three types of correlations: Pearson correlation, Kendall rank correlation and Spearman correlation.

Pearson r correlation is widely used in statistics to measure the degree of the relationship between linearly related variables. Correlation works for quantifiable data in which numbers are meaningful, usually quantities of some sort. To test the association for purely categorical data, such as gender, brands purchased, or favorite color, we use the Chi-square test. Details of the Chi-square test will be included in an upcoming article.
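For reference, the Pearson coefficient described above is conventionally defined as follows for n paired observations. The symbols x_i, y_i (the paired values) and x̄, ȳ (their sample means) are notation introduced here for illustration, not taken from the article:

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

The numerator is positive when the two variables tend to lie on the same side of their means together, and the denominator rescales the result so that it always falls between -1 and +1.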

Correlation Measures Association, Not Causation

Causation means a cause-and-effect relationship. “Correlation does not imply causation” means that correlation alone cannot be used to infer a causal relationship between the variables. A simple example: sales of personal computers and athletic shoes have both risen strongly in the last several years and there is a high correlation between them, but you cannot assume that buying computers causes people to buy athletic shoes (or vice versa).
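The point can be illustrated with a small simulation. This is only a sketch on made-up numbers: the variable names, the trend coefficients, the noise levels and the seed are all assumptions chosen for illustration, not real sales figures. Both series depend on the year and on nothing else, yet they come out highly correlated.

```r
# Simulated, hypothetical data: both series are driven by the same time trend,
# and neither one influences the other.
set.seed(42)                       # arbitrary seed, for reproducibility only
year       <- 1:20
pc_sales   <- 100 + 10 * year + rnorm(20, sd = 5)
shoe_sales <-  50 +  4 * year + rnorm(20, sd = 3)

# The shared upward trend alone produces a correlation close to +1
r_trend <- cor(pc_sales, shoe_sales)
print(r_trend)
```

A high r here reflects only the common trend built into the simulation, which is why a correlation by itself says nothing about which variable, if any, causes the other.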

Correlation in R

R can perform correlation with the cor() function.

Syntax to get the correlation coefficient: cor(var1, var2, method = "method").

The default method is "pearson". Specify "kendall" or "spearman" to get the corresponding correlation coefficient.

EG1: Finding the correlation between the age and circumference of an orange tree
> data(Orange)   # Orange is part of R's built-in datasets package

> View(Orange)

> with(Orange, cor(age, circumference))

[1] 0.9135189

Here the r value of 0.9135189 shows that there is a strong positive correlation between the age and circumference of an orange tree.
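The method argument could be exercised on the same Orange data along these lines. This is a minimal sketch: the variable names (pearson, kendall, spearman, spearman_by_rank) are chosen here for illustration, and no output is shown for the rank-based coefficients.

```r
data(Orange)  # built-in dataset with columns Tree, age, circumference

# Same pair of variables, three different correlation coefficients
pearson  <- with(Orange, cor(age, circumference))                       # default method
kendall  <- with(Orange, cor(age, circumference, method = "kendall"))
spearman <- with(Orange, cor(age, circumference, method = "spearman"))

# Spearman's coefficient is Pearson's r computed on the ranks of the data
spearman_by_rank <- with(Orange, cor(rank(age), rank(circumference)))

print(c(pearson = pearson, kendall = kendall, spearman = spearman))
```

Kendall and Spearman work on ranks rather than raw values, so they describe how consistently one variable increases with the other, not how close the points lie to a straight line.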

Eg2: The file IQfile.txt contains the IQ scores of 10 mothers and their eldest daughters. Find the value of the sample correlation coefficient r.

mom_iq <- c(135, 127, 124, 120, 115, 112, 104, 96, 94, 85)

mom_iq

[1] 135 127 124 120 115 112 104  96  94  85

daughter_iq <- c(121, 131, 112, 115, 99, 118, 106, 89, 92, 90)

daughter_iq

[1] 121 131 112 115  99 118 106  89  92  90

r_iq <- cor(mom_iq, daughter_iq)

r_iq

[1] 0.8621791

Here r is 0.8621791, which means there is a strong positive correlation between the mother’s and daughter’s IQs.

Covariance

The covariance of two variables x and y in a data sample measures how the two variables are linearly related. A positive covariance indicates a positive linear relationship between the variables, and a negative covariance indicates the opposite.

Covariance in R

The cov() function is used to compute covariances: cov(x, y = NULL, method = "method")

EG: Applying the cov function to compute the covariance of the mother’s and daughter’s IQs

cov(mom_iq, daughter_iq)

[1] 201.0444

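Here the covariance of the mother’s and daughter’s IQs is 201.0444. The positive sign indicates a positive linear relationship between the two variables; the magnitude depends on the units of the data, so unlike r it does not by itself measure the strength of the relationship.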
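Covariance and correlation are closely linked: dividing the covariance by the product of the two standard deviations gives Pearson's r. The following sketch checks this on the IQ data from the example above; the names n, cov_manual and r_manual are introduced here for illustration only.

```r
# IQ scores from the example above
mom_iq      <- c(135, 127, 124, 120, 115, 112, 104, 96, 94, 85)
daughter_iq <- c(121, 131, 112, 115, 99, 118, 106, 89, 92, 90)

# Sample covariance: sum of products of deviations from the means, divided by n - 1
n <- length(mom_iq)
cov_manual <- sum((mom_iq - mean(mom_iq)) * (daughter_iq - mean(daughter_iq))) / (n - 1)

# Dividing by both standard deviations rescales the covariance to the range [-1, 1]
r_manual <- cov_manual / (sd(mom_iq) * sd(daughter_iq))

print(cov_manual)  # agrees with cov(mom_iq, daughter_iq)
print(r_manual)    # agrees with cor(mom_iq, daughter_iq)
```

This rescaling is what makes r comparable across datasets measured in different units, whereas the raw covariance is not.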

Scatter Plot: To detect a linear relationship

To get a visual impression of the relationship between two variables, we plot the corresponding values on a graph, taking one variable along the x-axis and the other along the y-axis. The resulting diagram, showing a collection of dots, is called a scatter diagram.

Syntax for generating a scatter plot in R

with (<dataframe>,plot(x,y))

or

plot(x,y)
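Applied to the mother and daughter IQ vectors from the earlier example, the plot() syntax above might look like the following sketch. The axis labels, the title and the optional fitted line are additions made here for readability; they are not part of the original example.

```r
# IQ scores from the earlier example
mom_iq      <- c(135, 127, 124, 120, 115, 112, 104, 96, 94, 85)
daughter_iq <- c(121, 131, 112, 115, 99, 118, 106, 89, 92, 90)

# One dot per mother-daughter pair
plot(mom_iq, daughter_iq,
     xlab = "Mother's IQ", ylab = "Daughter's IQ",
     main = "Mother vs. daughter IQ")

# Optional: overlay the least-squares line to make the linear trend easier to see
fit <- lm(daughter_iq ~ mom_iq)
abline(fit)
```

If the dots cluster around an upward-sloping line, as they should for data with a strong positive r, a Pearson correlation is a reasonable summary of the relationship.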

Eg: Draw a scatter diagram of the mother’s and daughter’s IQ data.

Correlation and Significance tests

We perform a hypothesis test of the “significance of the correlation coefficient” to decide whether the linear relationship in the sample data is strong enough to use to model the relationship in the population.

ρ = population correlation coefficient (unknown)

r = sample correlation coefficient (known; calculated from sample data)

The hypothesis test lets us decide whether the value of the population correlation coefficient ρ is “close to 0” or “significantly different from 0”. We decide this based on the sample correlation coefficient r and the sample size n.

Correlation significance in R

The function cor.test() is used to test whether the relationship is significant or not.

The following example shows the result of the correlation significance test performed on the mother’s and daughter’s IQ data.

cor.test(mom_iq,daughter_iq)

Pearson’s product-moment correlation

data:  mom_iq and daughter_iq

t = 4.8136, df = 8, p-value = 0.001332

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval: 0.5087021 0.9669150

sample estimates:

cor 0.8621791

Here the p-value is 0.001332, i.e., p < 0.05, so the correlation between the mother’s and daughter’s IQs is statistically significant: we reject the null hypothesis that the true correlation is 0 in favor of the alternative hypothesis.
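To see where the t, df and p-value in the output come from, the test can be reproduced by hand. This is a sketch of the standard t test for a Pearson correlation under the null hypothesis ρ = 0; the names t_stat, p_value and result are chosen here for illustration.

```r
# IQ scores from the example above
mom_iq      <- c(135, 127, 124, 120, 115, 112, 104, 96, 94, 85)
daughter_iq <- c(121, 131, 112, 115, 99, 118, 106, 89, 92, 90)

r <- cor(mom_iq, daughter_iq)   # sample correlation coefficient
n <- length(mom_iq)             # sample size

# Test statistic under H0: rho = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(c(t = t_stat, df = n - 2, p = p_value))

# The same quantities as reported by cor.test()
result <- cor.test(mom_iq, daughter_iq)
```

Because the statistic depends on both r and n, the same r can be significant in a large sample and not significant in a small one.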

Watch out for Partial Correlation in the next post.

Source: Kavitha P.S is an MCA. She is currently working as a Senior Analyst Intern with NikhilGuru Consulting Analytics Service LLP, Bangalore. She previously worked for 5 years with UST-Global, Trivandrum.
