Hot Topics in Analytics

# Chi-Square Test: Different Types and Their Application Using R

Chi-Square Test

The chi-square statistic is represented by χ². The tests associated with this statistic are used when your variables are at the nominal or ordinal level of measurement, that is, when your data are categorical. Briefly, chi-square tests provide a means of determining whether a set of observed frequencies deviates significantly from a set of expected frequencies.

Chi-square can be used at both univariate and bivariate levels. In its univariate form (the analysis of a single variable) it is associated with the ‘goodness of fit’. Goodness of fit is used to determine whether sample data are consistent with a hypothesized distribution.

When used for bivariate analysis – the analysis of two variables in conjunction with one another – it is called the chi-square test of association, or the chi-square test of independence, and sometimes the chi-square test of homogeneity.

Chi-square test of independence

The chi-square test of independence is applied when you have two categorical variables from a single population. It evaluates whether there is a significant association between the categories of the two variables.

The chi-square test of independence is used to analyze the frequency table (i.e. contingency table) formed by two categorical variables.

2 x 2 Contingency Table

There are several types of chi square tests depending on the way the data was collected and the hypothesis being tested. We’ll begin with the simplest case: a 2 x 2 contingency table. If we set the 2 x 2 table to the general notation, using the letters a, b, c, and d to denote the contents of the cells, then we would have the following table:

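General notation for a 2 x 2 contingency table:

             Column 1   Column 2   Total
Row 1        a          b          a + b
Row 2        c          d          c + d
Total        a + c      b + d      N = a + b + c + d

For a 2 x 2 contingency table, the chi-square statistic is calculated by the standard shortcut formula (without continuity correction):

$$\chi^2 = \frac{N\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}$$

This calculated chi-square statistic is compared to the critical value (obtained from statistical tables) with degrees of freedom df = (r−1) × (c−1) at a significance level of 0.05, where ‘r’ is the number of rows and ‘c’ is the number of columns in the contingency table. For a 2 x 2 table, df = 1.

Chi-square test of significance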

The chi-square test examines whether the rows and columns of a contingency table are statistically significantly associated.

• Null hypothesis (H0): There is no association between the two variables. That means the row and the column variables of the contingency table are independent.
• Alternative hypothesis (H1): There is an association between the two variables. That means the row and column variables are dependent.

For each cell of the table, we have to calculate the expected value under the null hypothesis.

The chi-square test is always testing what scientists call the null hypothesis, which states that there is no significant difference between the expected and observed result.
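In symbols, the expected value for each cell and the resulting statistic can be written using the standard definitions. The notation here ($O_{ij}$ for the observed count in row $i$ and column $j$, $E_{ij}$ for the expected count, $R_i$ and $C_j$ for the row and column totals, $N$ for the grand total) is introduced for illustration and does not come from the original text:

$$E_{ij} = \frac{R_i \, C_j}{N}, \qquad \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Each term measures how far one cell is from what independence would predict, scaled by the expected count, so cells with small expected counts contribute heavily to the total.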

If the calculated chi-square statistic is greater than the chi-square table value, we reject the null hypothesis and conclude that the row and column variables are related to each other. This implies that they are significantly associated.

A large chi-square statistic, say 20 for a table with only a few degrees of freedom, indicates a substantial difference between the observed values and the expected values. A chi-square statistic of zero, on the other hand, indicates that the observed frequencies exactly match the expected frequencies. The value of chi-square can never be negative because the differences between the observed and expected frequencies are always squared.
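The calculation and decision rule described above can be sketched by hand in R. This is a minimal illustration, not a prescribed procedure: the counts in `observed` are hypothetical and invented for the example, and the variable names are assumptions. The last step cross-checks the hand calculation against `chisq.test()` with Yates' continuity correction switched off.

```r
# Hypothetical 2 x 2 table of counts, invented purely for illustration.
observed <- matrix(
  c(20, 15, 10, 25), nrow = 2,
  dimnames = list(Group = c("Group 1", "Group 2"),
                  Outcome = c("Yes", "No"))
)

# Expected counts under independence: (row total * column total) / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected.
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom and the critical value at the 0.05 significance level.
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)
critical <- qchisq(0.95, df = dof)
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

# Decision rule: reject H0 when the statistic exceeds the critical value.
reject_h0 <- chi_sq > critical

# Cross-check against chisq.test() without Yates' continuity correction.
builtin <- chisq.test(observed, correct = FALSE)

print(c(statistic = chi_sq, critical = critical, p_value = p_value))
print(reject_h0)
print(builtin)
```

Using `qchisq()` replaces the lookup in a printed chi-square table; comparing the p-value with 0.05 leads to the same decision as comparing the statistic with the critical value.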

Chi-square Test in R
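A minimal sketch of the Cars93 example discussed in this section, assuming the MASS package is installed and that the airbag and car-type columns are named `AirBags` and `Type`, as in the data set shipped with MASS. The object names `airbag_by_type` and `result` are chosen here for illustration.

```r
library(MASS)  # provides the Cars93 data set

# Contingency table of airbag type (rows) by car type (columns).
airbag_by_type <- table(Cars93$AirBags, Cars93$Type)
print(airbag_by_type)

# Chi-square test of independence. R may warn that the approximation
# could be inaccurate because some expected counts are small.
result <- chisq.test(airbag_by_type)
print(result)
```

The printed result reports the statistic, the degrees of freedom and the p-value; `result$expected` holds the expected counts if you want to inspect them.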

In R, the chisq.test() function is used to test the association between two categorical variables.

Example: the Cars93 data set from the MASS library contains data on the sale of different types of cars in the USA in 1993. Using this data set, we test whether the type of airbags and the type of car sold have a significant relationship. If an association is observed, we can then estimate which types of cars sell better with which types of airbags.

The test gives a chi-squared value of 33.0009 and a p-value of 0.0002723. Since the p-value is less than the significance level of 0.05, we reject the null hypothesis and conclude that the two variables AirBags and Type have a significant relationship.

Chi Square Goodness of Fit (One Sample Test)

A chi-square goodness of fit test, sometimes called the chi-square one-sample test, tells us whether there is a difference between what was actually observed in our results and what we might expect to observe by chance. The test can therefore determine whether a single categorical variable fits a theoretical distribution. It lets us assess whether the frequencies across the categories of the variable are likely to reflect random variation or something more meaningful.

For the chi-square goodness of fit test to be useful, a number of assumptions first need to be met. The assumptions for the chi-square test of association are the same as those for the chi-square goodness of fit test.

As an absolute requirement, your data must satisfy the following conditions:

• The variable must be either nominal or ordinal and the data represented as counts/frequencies.
• Each count is independent. That is, one person or observation should not contribute more than once to the table and the total count of your scores should not be more than your sample size: one person = one count.

If your data do not satisfy these conditions, the test should not be used. In addition, your data should typically conform to the following:

• None of the expected frequencies in the cells should be less than 5.
• The sample size should be at least 20 – but more is better.

If the data in your sample do not satisfy these two criteria, the test becomes unreliable. That is, any inferences that you make about your data have a significantly higher likelihood of error. With a low sample size or very low expected frequencies, the chi-square statistic becomes inflated and no longer provides a useful summary of the data. If your expected frequencies are less than 5, it is probably worth collapsing your data into bigger categories or using a different test.
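These two guidelines can be checked in R before the result is interpreted. The sketch below is one possible way to do it, not a fixed recipe: the counts are hypothetical, the names `enough_data` and `expected_ok` are invented for the example, and falling back to `fisher.test()` is just one reasonable choice of "a different test".

```r
# Hypothetical 2 x 2 table of counts, invented for illustration.
counts <- matrix(c(12, 3, 5, 4), nrow = 2)

# Expected counts under the null hypothesis; warnings are suppressed here
# only because we inspect the expected counts ourselves.
expected <- suppressWarnings(chisq.test(counts))$expected

enough_data <- sum(counts) >= 20   # sample size guideline
expected_ok <- all(expected >= 5)  # expected-frequency guideline

if (enough_data && expected_ok) {
  print(chisq.test(counts))
} else {
  # Fall back to an exact test when the guidelines are not met.
  print(fisher.test(counts))
}
```

In this made-up table the sample size is adequate, but two expected counts fall below 5, so the sketch takes the exact-test branch.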

Chi-square Goodness of Fit Test in R

Example: in the survey data set from the MASS package, the Smoke column records each student’s survey response about their smoking habit. Suppose the campus smoking statistics are as below. Determine whether the sample data in survey support them at a 0.05 significance level.

Heavy   Never   Occas   Regul
4.5%    79.5%   8.5%    7.5%

Applying chisq.test() to the observed Smoke frequencies with these proportions gives a p-value of 0.991. As the p-value is greater than the 0.05 significance level, we do not reject the null hypothesis that the sample data in survey support the campus-wide smoking statistics.
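The smoking example can be sketched in R as follows, assuming the MASS package is installed. The names `smoke_freq` and `smoke_prob` are chosen here for illustration, and the probabilities are the campus-wide figures from the table above.

```r
library(MASS)  # provides the survey data set

# Observed frequencies of the Smoke column (missing values are dropped).
smoke_freq <- table(survey$Smoke)
print(smoke_freq)

# Hypothesized campus-wide proportions, named by category.
smoke_prob <- c(Heavy = 0.045, Never = 0.795, Occas = 0.085, Regul = 0.075)

# Goodness of fit test; indexing by name keeps the probabilities in the
# same order as the observed categories.
result <- chisq.test(smoke_freq, p = smoke_prob[names(smoke_freq)])
print(result)
```

Matching the probabilities to the categories by name is a small safeguard: `chisq.test()` pairs `p` with the counts by position, so a mismatched order would silently test the wrong hypothesis.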

Fisher’s exact test

Fisher’s exact test was proposed in the mid-1930s almost simultaneously by Fisher, Irwin and Yates. It is a statistical significance test used in the analysis of contingency tables for two nominal variables, when you want to see whether the proportions of one variable differ depending on the value of the other variable.

Fisher’s exact test is more accurate than the chi-squared test of independence when the expected numbers are small. When one of the expected values (note: not the observed values) in a 2 × 2 table is less than 5, and especially when it is less than 1, then Yates’ correction can be improved upon. In such cases the Fisher exact test is a better choice than the Chi-square.

Fisher’s exact test is generally used as a one-tailed test. However, it can be used as a two-tailed test as well.

• Null hypothesis (H0): The relative proportions of one variable are independent of the second variable; in other words, the proportions of one variable are the same for different values of the second variable. (H0: p1 = p2)
• Alternative hypothesis (H1): The relative proportions of one variable depend on the second variable. The alternative hypothesis can be either left-tailed (p1 < p2), right-tailed (p1 > p2), or two-tailed (p1 ≠ p2).

The data are arranged in what is called an “R×C table,” where R is the number of rows and C is the number of columns. If the columns represent the study group and the rows represent the outcome, then the null hypothesis can be interpreted as the probability of a particular outcome not being influenced by the study group, and the test evaluates whether the two study groups differ in the proportions with each outcome. An important assumption for all of the methods outlined, including Fisher’s exact test, is that the binary data are independent. If the proportions are correlated, then more advanced techniques should be applied.

Fisher’s Exact Test in R

The function fisher.test() is used to perform Fisher’s exact test when the sample size is small, to avoid using an approximation that is known to be unreliable for small samples.

Example: consider a trial comparing the performance of two boxers. Each boxer undertook the trial eight times and the number of successful trials was recorded. The hypothesis under investigation is that the performance of the two boxers is similar. If the first boxer was successful on only one trial and the second boxer was successful on four of the eight trials, can we discriminate between their performances?

The data are set up as a 2 x 2 matrix of successes and failures for each boxer and passed to fisher.test().

The p-value calculated for the test does not provide any evidence against the assumption of independence. In this example the association between rows and columns is not statistically significant, which means that we cannot confidently claim any difference in performance between the two boxers. We therefore fail to reject the null hypothesis.
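The boxer example can be sketched in R as follows. The counts come from the example itself (1 success out of 8 for the first boxer, 4 out of 8 for the second); the matrix layout and the names `trials`, `two_sided` and `one_sided` are assumptions made for illustration.

```r
# Rows are outcomes, columns are boxers; counts come from the example:
# boxer 1 had 1 success in 8 trials, boxer 2 had 4 successes in 8 trials.
trials <- matrix(
  c(1, 7, 4, 4), nrow = 2,
  dimnames = list(Outcome = c("Success", "Failure"),
                  Boxer = c("Boxer 1", "Boxer 2"))
)
print(trials)

# Two-tailed test (the default): is there any difference between the boxers?
two_sided <- fisher.test(trials)

# One-tailed test: is the odds ratio less than 1, i.e. are boxer 1's odds
# of success lower than boxer 2's?
one_sided <- fisher.test(trials, alternative = "less")

print(two_sided$p.value)
print(one_sided$p.value)
```

Both p-values are well above 0.05 for these counts, which agrees with the conclusion that the trial does not separate the two boxers.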

Source:

https://ww2.coastal.edu/kingw/statistics/R-tutorials/goodness.html

http://courses.statistics.com/software/R/Rchisq.htm

http://udel.edu/~mcdonald/statfishers.html

https://en.wikipedia.org/wiki/Fisher’s_exact_test 