__Correlation__

The word correlation is used in everyday life to denote some form of relationship. In statistics, however, correlation denotes an association between two quantitative variables. Correlation is one of the most common and most useful statistics: a single number that describes the degree of relationship between two numeric variables. A positive correlation indicates that the variables increase or decrease together; a negative correlation indicates that one variable increases as the other decreases.

Correlation is measured with the correlation coefficient (or “r”). The value of r ranges from -1.0 to +1.0; the closer r is to +1 or -1, the more closely the two variables are related.

Usually, in statistics, we measure three types of correlation: Pearson correlation, Kendall rank correlation, and Spearman correlation.

The Pearson r correlation is widely used in statistics to measure the degree of relationship between linearly related variables.

**Coefficient, r** (In statistics, the value of the correlation coefficient r varies between +1 and -1.)

Correlation works for quantifiable data in which the numbers are meaningful, usually quantities of some sort. To test the association between purely categorical variables, such as gender, brand purchased, or favorite color, we use the Chi-square test. Details of the Chi-square test will be covered in an upcoming article.

__Correlation Measures Association, Not Causation__

Causation means a cause-and-effect relationship. “Correlation does not imply causation” means that correlation alone cannot be used to infer a causal relationship between the variables. A simple example: sales of personal computers and athletic shoes have both risen strongly in recent years, and there is a high correlation between them, but you cannot conclude that buying computers causes people to buy athletic shoes (or vice versa).

__Correlation in R__

R can compute correlations with the *cor()* function.

Syntax to get the correlation coefficient: cor(var1, var2, method = "method")

The default method is "pearson". Pass "kendall" or "spearman" to get the corresponding correlation coefficient.
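As a quick illustration of the method argument (using two short, made-up numeric vectors), the same pair of variables can be passed to cor() with each of the three methods:

```r
# Two small illustrative vectors (made up for this example)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

cor(x, y)                       # Pearson (the default): 0.8
cor(x, y, method = "kendall")   # Kendall rank correlation: 0.6
cor(x, y, method = "spearman")  # Spearman rank correlation: 0.8
```

Pearson works on the raw values, while Kendall and Spearman work on ranks, which makes them less sensitive to outliers and to non-linear (but monotonic) relationships.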

*EG1: Finding the correlation between the age and circumference of orange trees*

> data(Orange)   # the Orange dataset ships with R's built-in datasets package

> View(Orange)

> with(Orange,cor(age,circumference))

[1] 0.9135189

Here the r value of 0.9135189 shows a strong positive correlation between the age and circumference of an orange tree.

*Eg2:- The file IQfile.txt represents the IQ scores of 10 mothers and their eldest daughters. Find the value of the sample correlation coefficient r.*

mom_iq<-c(135,127,124,120,115,112,104,96,94,85)

mom_iq

[1] 135 127 124 120 115 112 104 96 94 85

daughter_iq<-c(121,131,112,115,99,118,106,89,92,90)

daughter_iq

[1] 121 131 112 115 99 118 106 89 92 90

r_iq<-cor(mom_iq,daughter_iq)

r_iq

[1] 0.8621791

Here r is 0.8621791, which means there is a strong positive correlation between the mothers’ and daughters’ IQs.

__Covariance__

The covariance of two variables x and y in a data sample measures how the two variables are linearly related. A positive covariance indicates a positive linear relationship between the variables, and a negative covariance indicates the opposite.

__Covariance in R__

The cov() function is used to produce covariances: cov(x, y = NULL, method = "method")

*EG:- Applying the cov function to compute the covariance of the mothers’ and daughters’ IQs*

cov(mom_iq,daughter_iq)

[1] 201.0444

Here covariance of the mother’s and daughter’s IQs is 201.0444. It indicates a positive linear relationship between the two variables.
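The same number can be computed by hand from the definition of the sample covariance, sum((x - mean(x)) * (y - mean(y))) / (n - 1), using the IQ vectors defined above:

```r
mom_iq      <- c(135, 127, 124, 120, 115, 112, 104, 96, 94, 85)
daughter_iq <- c(121, 131, 112, 115, 99, 118, 106, 89, 92, 90)

n <- length(mom_iq)
# Sample covariance from its definition: average product of deviations
# from the means, divided by n - 1
manual_cov <- sum((mom_iq - mean(mom_iq)) *
                  (daughter_iq - mean(daughter_iq))) / (n - 1)
manual_cov   # 201.0444, matching cov(mom_iq, daughter_iq)
```

Note that, unlike r, the covariance is not unit-free: its size depends on the scales of x and y, which is why the correlation coefficient (covariance divided by the product of the standard deviations) is easier to interpret.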

__Scatter Plot: To Detect a Linear Relationship__

To visualize the relationship between two variables, we plot the corresponding pairs of values, taking one variable along the x-axis and the other along the y-axis. The resulting diagram, showing a collection of dots, is called a scatter diagram.

**Syntax** for generating a scatter plot in R:

with(<dataframe>, plot(x, y))

or

plot(x, y)

*Eg:-Draw a scatter diagram of mother’s and daughter’s IQs data.*
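Using the IQ vectors from Eg2, the scatter diagram can be drawn like this (the axis labels and title are our own additions):

```r
mom_iq      <- c(135, 127, 124, 120, 115, 112, 104, 96, 94, 85)
daughter_iq <- c(121, 131, 112, 115, 99, 118, 106, 89, 92, 90)

# Each dot is one mother/daughter pair; an upward-sloping cloud of
# points suggests a positive linear relationship
plot(mom_iq, daughter_iq,
     xlab = "Mother's IQ", ylab = "Daughter's IQ",
     main = "Mothers' vs. daughters' IQ scores")
```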

__Correlation and Significance tests__

We perform a hypothesis test of the “significance of the correlation coefficient” to decide whether the linear relationship in the sample data is strong enough to use to model the relationship in the population.

ρ = population correlation coefficient (unknown)

r = sample correlation coefficient (known; calculated from sample data)

The hypothesis test lets us decide whether the value of the population correlation coefficient ρ is “close to 0” or “significantly different from 0”. We decide this based on the sample correlation coefficient r and the sample size n.
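The test statistic behind this decision is t = r√(n − 2) / √(1 − r²), which follows a t distribution with n − 2 degrees of freedom when ρ = 0. Computed by hand for the IQ data:

```r
r <- 0.8621791   # sample correlation from the IQ example
n <- 10          # number of mother/daughter pairs

# t statistic for testing H0: rho = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
t_stat                            # 4.8136

# Two-sided p-value from the t distribution with n - 2 df
p_value <- 2 * pt(-abs(t_stat), df = n - 2)
p_value                           # 0.001332
```

These values match the t and p-value reported by cor.test() for this dataset.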

__Correlation significance in R__

The function cor.test() is used to test whether the correlation is statistically significant.

The following example shows the result of the correlation significance test performed on the mothers’ and daughters’ IQ data.

cor.test(mom_iq,daughter_iq)

Pearson’s product-moment correlation

data: mom_iq and daughter_iq

t = 4.8136, df = 8, p-value = 0.001332

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval: 0.5087021 0.9669150

sample estimates:

cor

0.8621791

Here the p-value is 0.001332, i.e., p < 0.05, so the correlation between the mothers’ and daughters’ IQs is statistically significant: we reject the null hypothesis that the true correlation is zero.

Watch out for Partial Correlation in the next post.

__Source__

- Statistics for Management (Seventh Edition), Richard I. Levin & David S. Rubin
- https://en.wikipedia.org/wiki/Correlation_and_dependence
- https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php

About Kavitha:

Kavitha P.S holds an MCA. She is currently working as a Senior Analyst Intern with NikhilGuru Consulting Analytics Service LLP, Bangalore. She previously worked for 5 years with UST Global, Trivandrum.
