__Correlation__

The word correlation is used in everyday life to denote some form of relationship. In statistics, however, correlation denotes an association between two quantitative variables. Correlation is one of the most common and most useful statistics: a single number that describes the degree of relationship between two numeric variables. A positive correlation indicates that the variables increase or decrease together; a negative correlation indicates that one variable increases as the other decreases.

Correlation is measured with the correlation coefficient (or “r”). The value of r ranges from -1.0 to +1.0; the closer r is to +1 or -1, the more closely the two variables are related.

Usually, in statistics, we measure three types of correlation: Pearson correlation, Kendall rank correlation, and Spearman correlation.

The Pearson r correlation is widely used in statistics to measure the degree of relationship between linearly related variables.

**Coefficient, r** (In statistics, the value of the correlation coefficient r varies between +1 and -1.)

Correlation works for quantifiable data in which the numbers are meaningful, usually quantities of some sort. To test the association between purely categorical variables, such as gender, brand purchased, or favorite color, we use the Chi-square test. Details of the Chi-square test will be covered in an upcoming article.

__Correlation Measures Association, Not Causation__

Causation means a cause-and-effect relationship. “Correlation does not imply causation” means that correlation alone cannot be used to infer a causal relationship between the variables. A simple example: sales of personal computers and athletic shoes have both risen strongly in recent years, and there is a high correlation between them, but you cannot conclude that buying computers causes people to buy athletic shoes (or vice versa).

__Correlation in R__

R can compute correlations with the *cor()* function.

Syntax to get the correlation coefficient: cor(var1, var2, method = "method")

The default method is "pearson". Pass "kendall" or "spearman" to get the corresponding correlation coefficient.
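As a quick illustration of the method argument (using two short, made-up numeric vectors), the same pair of variables can be passed to cor() with each of the three methods:

```r
# Two small illustrative vectors (made up for this example)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

cor(x, y)                       # Pearson (the default): 0.8
cor(x, y, method = "kendall")   # Kendall rank correlation: 0.6
cor(x, y, method = "spearman")  # Spearman rank correlation: 0.8
```

Pearson works on the raw values, while Kendall and Spearman work on ranks, which makes them less sensitive to outliers and to non-linear (but monotonic) relationships.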

*EG1: Finding the correlation between the age and circumference of orange trees*

> data(Orange)   # the Orange dataset ships with R's built-in datasets package

> View(Orange)

> with(Orange,cor(age,circumference))

[1] 0.9135189

Here the r value of 0.9135189 shows a strong positive correlation between the age and circumference of an orange tree.

*Eg2:- The file IQfile.txt represents the IQ scores of 10 mothers and their eldest daughters. Find the value of the sample correlation coefficient r.*

mom_iq<-c(135,127,124,120,115,112,104,96,94,85)

mom_iq

[1] 135 127 124 120 115 112 104 96 94 85

daughter_iq<-c(121,131,112,115,99,118,106,89,92,90)

daughter_iq

[1] 121 131 112 115 99 118 106 89 92 90

r_iq<-cor(mom_iq,daughter_iq)

r_iq

[1] 0.8621791

Here r is 0.8621791, which means there is a strong positive correlation between the mothers’ and daughters’ IQs.

__Covariance__

The covariance of two variables x and y in a data sample measures how the two variables are linearly related. A positive covariance indicates a positive linear relationship between the variables, and a negative covariance indicates the opposite.

__Covariance in R__

The cov() function is used to produce covariances: cov(x, y = NULL, method = "method")

*EG:- Applying the cov function to compute the covariance of the mothers’ and daughters’ IQs*

cov(mom_iq,daughter_iq)

[1] 201.0444

Here covariance of the mother’s and daughter’s IQs is 201.0444. It indicates a positive linear relationship between the two variables.
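The same number can be computed by hand from the definition of the sample covariance, sum((x - mean(x)) * (y - mean(y))) / (n - 1), using the IQ vectors defined above:

```r
mom_iq      <- c(135, 127, 124, 120, 115, 112, 104, 96, 94, 85)
daughter_iq <- c(121, 131, 112, 115, 99, 118, 106, 89, 92, 90)

n <- length(mom_iq)
# Sample covariance from its definition: average product of deviations
# from the means, divided by n - 1
manual_cov <- sum((mom_iq - mean(mom_iq)) *
                  (daughter_iq - mean(daughter_iq))) / (n - 1)
manual_cov   # 201.0444, matching cov(mom_iq, daughter_iq)
```

Note that, unlike r, the covariance is not unit-free: its size depends on the scales of x and y, which is why the correlation coefficient (covariance divided by the product of the standard deviations) is easier to interpret.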

__Scatter Plot: To Detect a Linear Relationship__

To visualize the relationship between two variables, we plot the corresponding pairs of values, taking one variable along the x-axis and the other along the y-axis. The resulting diagram, showing a collection of dots, is called a scatter diagram.

**Syntax** for generating a scatter plot in R:

with(<dataframe>, plot(x, y))

or

plot(x, y)

*Eg:-Draw a scatter diagram of mother’s and daughter’s IQs data.*
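Using the IQ vectors from Eg2, the scatter diagram can be drawn like this (the axis labels and title are our own additions):

```r
mom_iq      <- c(135, 127, 124, 120, 115, 112, 104, 96, 94, 85)
daughter_iq <- c(121, 131, 112, 115, 99, 118, 106, 89, 92, 90)

# Each dot is one mother/daughter pair; an upward-sloping cloud of
# points suggests a positive linear relationship
plot(mom_iq, daughter_iq,
     xlab = "Mother's IQ", ylab = "Daughter's IQ",
     main = "Mothers' vs. daughters' IQ scores")
```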

__Correlation and Significance tests__

We perform a hypothesis test of the “significance of the correlation coefficient” to decide whether the linear relationship in the sample data is strong enough to use to model the relationship in the population.

ρ = population correlation coefficient (unknown)

r = sample correlation coefficient (known; calculated from sample data)

The hypothesis test lets us decide whether the value of the population correlation coefficient ρ is “close to 0” or “significantly different from 0”. We decide this based on the sample correlation coefficient r and the sample size n.
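The test statistic behind this decision is t = r√(n − 2) / √(1 − r²), which follows a t distribution with n − 2 degrees of freedom when ρ = 0. Computed by hand for the IQ data:

```r
r <- 0.8621791   # sample correlation from the IQ example
n <- 10          # number of mother/daughter pairs

# t statistic for testing H0: rho = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
t_stat                            # 4.8136

# Two-sided p-value from the t distribution with n - 2 df
p_value <- 2 * pt(-abs(t_stat), df = n - 2)
p_value                           # 0.001332
```

These values match the t and p-value reported by cor.test() for this dataset.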

__Correlation significance in R__

The function cor.test() is used to test whether the correlation is statistically significant.

The following example shows the result of the correlation significance test performed on the mothers’ and daughters’ IQ data.

cor.test(mom_iq,daughter_iq)

Pearson’s product-moment correlation

data: mom_iq and daughter_iq

t = 4.8136, df = 8, p-value = 0.001332

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval: 0.5087021 0.9669150

sample estimates:

cor

0.8621791

Here the p-value is 0.001332, i.e., p < 0.05, so the correlation between the mothers’ and daughters’ IQs is statistically significant: we reject the null hypothesis that the true correlation is zero.

Watch out for Partial Correlation in the next post.

__Source__

- Statistics for Management (Seventh Edition), Richard I. Levin & David S. Rubin
- https://en.wikipedia.org/wiki/Correlation_and_dependence
- https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php

About Kavitha:

Kavitha P.S holds an MCA. She is currently working as a Senior Analyst Intern with NikhilGuru Consulting Analytics Service LLP, Bangalore. She previously worked for 5 years with UST Global, Trivandrum.
