In Unit 3 we learned that we can test if two proportions are different by comparing their values. To compare proportions we calculated their difference, scaled that difference to get a z-score, and checked if that value was bigger than 1.96 in absolute value. (We may have also calculated a p-value and checked if it was less than 5%.) But what if we have more than one pair of proportions? How do we check if there are differences?
Once we have more than one pair, we have to take a different approach. The z-test will no longer be adequate. What statisticians do when multiple proportions are involved is go back to the raw counts and calculate an expected value for each one. They then look at the differences between the raw values and the expected values, square them, scale each by its expected value, and sum the results. This is a very different formula, but the process to test for differences among multiple proportions is very similar to the process that compares just two.
This lesson introduces a different kind of test, the chi-squared test, whose implementation is very similar to a z-test or t-test. Thus a secondary purpose is to learn to generalize the hypothesis testing process.
After completing this lesson you should be able to:
The chi-squared distribution is a mathematical function that is derived from the normal distribution. Recall that a z-score is a standard normal random variable. If you square a z-score you still have a random variable, and its probabilities P(z² < c) can be looked up in the appropriate table. Even a sum of k of these squares can be analyzed. The distribution of a sum of squared z-scores is called the chi-squared distribution. It is based on the number of terms in the sum. For each independent term you add a degree of freedom. Typically the degrees of freedom of the distribution is the number of categories minus 1. What this means is that if you add a bunch of squared terms, each of which has been standardized (the mean subtracted and then divided by the standard deviation s), the result has values that can be analyzed and compared to values in a probability table. This table is called the chi-squared table.
The chi-squared distribution is used if you can put your data in a contingency table. Contingency tables are tabulated data where every data point is tallied exactly once. Once the data are tallied you can use the chi-square statistic (i.e. the formula) to test if one cell in the table has a higher (or lower) than expected tally, or in the case of groups, if one group has higher than expected tallies.
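As a quick check on the idea of looking up P(z² < c), the probability for a single squared z-score can be computed directly from the normal distribution. The sketch below uses only Python's standard library; the function name `prob_z_squared_below` is made up for illustration and is not part of the lesson.

```python
import math

def prob_z_squared_below(c):
    """P(z^2 < c) for a standard normal z (chi-squared with 1 degree of freedom)."""
    if c <= 0:
        return 0.0
    # z^2 < c is the same event as -sqrt(c) < z < sqrt(c),
    # and P(|z| < a) = erf(a / sqrt(2)) for a standard normal z.
    return math.erf(math.sqrt(c) / math.sqrt(2))

# 1.96 squared is about 3.84, so this should come out close to 0.95
print(round(prob_z_squared_below(1.96 ** 2), 3))
```

This ties the familiar 1.96 cutoff for a z-score to the chi-squared table: squaring the cutoff gives the matching cutoff for one degree of freedom.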
A bit of Trivia
The chi-squared distribution gets its name from the Greek letter "chi", Χ, which looks like an "x". Since it is based on squared variables they chose the name Χ², or chi-squared.
If you have a table of observed tabulated (count) values and you wish to compare the observed counts to those based on some sort of theory, then you can obtain a chi-square statistic as follows:
Χ² = sum over all cells of (Observed − Expected)²⁄Expected
Applications used in this lesson
There are two applications that you will be asked to learn. The first tests if two (or more) proportions are all the same. A typical application would be to test if a six-sided die has each face turn up with equal likelihood. Here the expected value for each side is one sixth of the number of rolls, and the degrees of freedom is 5, one less than the number of categories. The test statistic for the general case that has k possible outcomes is:
Χ² = (Obs1−Exp1)²⁄Exp1 + (Obs2−Exp2)²⁄Exp2 + … + (Obsk−Expk)²⁄Expk
where Expi = (1⁄k)·Total, and degrees of freedom = k−1
Example (Comparing multiple proportions)
A university holds an open house three times in the spring, and attendance is as follows: March, 55 visitors; April, 73 visitors; and May, 52 visitors. Test if each month had about a third of the visitors.
- Set up null hypothesis: Ho: Each month had 1/3 of the visitors.
- Get data and determine the expected number of visitors per month if each month had a third of the total.
Total visitors = 55 + 73 + 52 = 180. Thus the expected number per month = (1/3)180 = 60 visitors
- Choose test: Decide on the chi-square test with 2 degrees of freedom.
- Calculate the test statistic and p-value:
- Χ² = (55−60)²⁄60 + (73−60)²⁄60 +(52−60)²⁄60 = 4.3.
- p-value = P( Χ² > 4.3) = 0.116
- State Conclusion: the data do not show that one month is preferred over another.
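The arithmetic in this example can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `equal_proportions_test` is hypothetical, and the p-value line relies on a shortcut that holds only when there are exactly 2 degrees of freedom.

```python
import math

def equal_proportions_test(counts):
    """Chi-squared statistic and degrees of freedom for H0: all categories equally likely."""
    total = sum(counts)
    k = len(counts)
    expected = total / k  # same expected count in every category
    statistic = sum((obs - expected) ** 2 / expected for obs in counts)
    return statistic, k - 1

visitors = [55, 73, 52]  # March, April, May
statistic, df = equal_proportions_test(visitors)

# With exactly 2 degrees of freedom the right-tail probability has the
# closed form exp(-x/2); other degrees of freedom need a table or software.
p_value = math.exp(-statistic / 2)

print(f"chi-squared = {statistic:.1f}, df = {df}, p-value = {p_value:.3f}")
```

The same function would work for the die example by passing in six counts, though the p-value would then have to come from a table or calculator because the degrees of freedom would be 5.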
Note that even though I use the word "proportions", we actually only use the tallies or count values. A more common application of this statistical test is to determine if two groups have the same set of proportions across the categories of a categorical variable. In other words, this test can be used to determine if one trait (say a person's gender) has bearing on the outcome of another trait (how that person voted). The process for this test is similar to the previous one except that the expected values must be adjusted for the number of representatives in each group: each group's expected counts are its total multiplied by the overall proportion in each category.
Example (Comparing two distributions)
Is there a difference in how people vote depending on their gender? Here is a typical exit poll result from the 2008 election. Of the 200 male voters, 85 voted Democratic, 100 voted Republican, and the rest for another party. Of the 300 women voters polled, 165 voted Democratic and 130 Republican. Set up a contingency table and test if the two groups voted differently.
- Set up null hypothesis: Ho: The two groups voted in a similar distribution.
- Get data and determine the expected number of votes if the two groups were identical.
Observed Values
| ||Dem||Rep||Other||Total|
|Male||85||100||15||200|
|Female||165||130||5||300|
|Total||250||230||20||500|
Expected Values
| ||Dem||Rep||Other||Total|
|Male||200×(.5) = 100||200×(.46) = 92||200×(.04) = 8||200|
|Female||300×(.5) = 150||300×(.46) = 138||300×(.04) = 12||300|
|Total||250 (50%)||230 (46%)||20 (4%)||500|
- Choose test: Decide on the chi-square test with 2 degrees of freedom.
- Calculate the test statistic and p-value:
- Χ² = (85−100)²⁄100 + (100−92)²⁄92 + (15−8)²⁄8 + (165−150)²⁄150 + (130−138)²⁄138 + (5−12)²⁄12 = 15.1.
- p-value = P( Χ² > 15.1) = 0.0005
- State Conclusion: the data shows that there is a difference between the voting patterns of men and women.
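The steps of this example can be sketched in Python as a check on the hand calculation. The observed counts are taken from the contingency table in the example. Two things here are assumptions not spelled out in the lesson: the degrees of freedom are computed with the usual rule (rows − 1)×(columns − 1), which gives the 2 used in the example, and the p-value uses a closed form that is valid only for 2 degrees of freedom.

```python
import math

observed = [
    [85, 100, 15],   # Male:   Dem, Rep, Other
    [165, 130, 5],   # Female: Dem, Rep, Other
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count = (row total) x (column total) / (grand total)
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Sum (observed - expected)^2 / expected over every cell of the table
statistic = sum(
    (obs - exp) ** 2 / exp
    for obs_row, exp_row in zip(observed, expected)
    for obs, exp in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Closed form for the right tail, valid only because df is 2 here
p_value = math.exp(-statistic / 2)
print(f"chi-squared = {statistic:.1f}, df = {df}, p-value = {p_value:.4f}")
```

Computing the expected counts from the row and column totals gives the same numbers as multiplying each group's size by the overall percentages, since 250⁄500 = 50%, 230⁄500 = 46% and 20⁄500 = 4%.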
The chi-square test has other applications. Two common ones are:
You will not be held responsible for either of these procedures.
The chi-square distribution is easily calculated on your calculator or in Excel.
|Probability problem||Excel command||TI 83/84 command|
|P ( Χ² > c ) = ?||=CHIDIST( c, df )||Χ²cdf(c, 99999, df)*|
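If neither Excel nor a TI calculator is at hand, the same right-tail probability can be sketched in Python with only the standard library. The function name `chi2_right_tail` is hypothetical, and the formulas are the standard finite-sum forms that apply when the degrees of freedom is a whole number; they are not taken from this lesson.

```python
import math

def chi2_right_tail(c, df):
    """P(chi-squared > c) for a positive whole number of degrees of freedom."""
    if not isinstance(df, int) or df < 1:
        raise ValueError("df must be a positive whole number")
    if c <= 0:
        return 1.0
    half = c / 2
    if df % 2 == 0:
        # Even df: exp(-c/2) * sum_{j=0}^{df/2 - 1} (c/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: two-sided normal tail plus a finite correction series
    root = math.sqrt(c)
    tail = math.erfc(root / math.sqrt(2))
    density = math.exp(-half) / math.sqrt(2 * math.pi)
    term, total = 0.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        # term_r = root^(2r-1) / (1*3*5*...*(2r-1))
        term = root if r == 1 else term * c / (2 * r - 1)
        total += term
    return tail + 2 * density * total

# The two p-values from the worked examples in this lesson
print(round(chi2_right_tail(4.3, 2), 3))
print(round(chi2_right_tail(15.1, 2), 4))
```

Like the Excel and TI commands in the table, this returns the area to the right of c, which is the p-value used in the chi-square tests above.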
The expected value of row "i" column "j" is: Expected value = (total of row "i")×(total of column "j") / Grand total