Correct method for Chi-square testing for yes/no data

Question

I have the following data:

I am trying to analyze it by applying a chi-square test in Excel with the CHITEST(Data B, Data E) function:

I also tried using only the "Yes" row. Since totals are also mentioned, the "No" will be in the data:

With this method I get a different chi-square p-value.

Which of the above methods is correct, and why?

Edit: I am surprised that this question is closed. It is about an important and interesting concept for analysis of yes/no data. Especially note that no one has pointed to an existing similar question. It is not about debugging, programming, routine operations or datasets. It also has many high quality comments and a great answer. This question is not about Excel as I am actually doing this work in Libreoffice Calc, which has this identical function. For all these reasons, I request this question should be reopened.

As an aside, there's an error in your data: 19+17+3 doesn't equal 78. — J-J-J
– J-J-J, Commented Apr 6 at 12:40
You test different hypotheses, you get different answers; which is "correct" depends on what you wanted to know. — PBulls
– PBulls, Commented Apr 6 at 13:11
If your calculations had been correct then the first method would have been more appropriate: you should have got a $\chi^2_2$ statistic of about $0.243$ with Excel giving a $p$-value of about $0.886$ as the probability of exceeding that with $2$ degrees of freedom. — Henry
– Henry, Commented Apr 6 at 13:17
@J-J-J : Yes, a mistake has crept in but I hope my question is clear. If I change it now, it may affect contents of comments and answer. — rnso
– rnso, Commented Apr 6 at 15:56

Peter Flom · Accepted Answer · 2025-04-06 13:24:26Z

Expanding on PBulls's comment, if you include both rows, then you are doing a two-way chi-square (often just called chi-square) and testing whether group is associated with responses of yes or no. If you include only the top row, you are doing a one-way chi-square and testing whether the three columns are equally likely in that row (at least, that will be the default; it may be the only option in Excel).

As PBulls said, which one is right depends on what you are trying to test, but it seems likely that you want to test both rows; that's a much more common use of chi-square.

As an aside, if you are going to be doing statistics regularly, I highly recommend moving away from Excel and learning R or some other stats program (SAS, SPSS, MATLAB, Python stats, or whatever is used in your field).

Of course, all this is contingent on you getting the data right; the current values for "expected" are incorrect. One advantage of a program like R or whatever is that they would do this for you (why Excel requires you to enter a matrix of expected values is beyond me. But Excel is not a statistics program).

I normally use Python, but Excel/LibreOffice gives the advantage of being able to quickly replace data, and also the facility of adding rows with copied formulae.
– rnso, Commented Apr 6 at 18:44
@rnso a big disadvantage of Excel is that it can be awfully easy to make a mistake in copying formulas, or to forget to require a particular column or row to be kept constant with the $ notation, as is sometimes necessary. And that happens in a way that can be awfully difficult to troubleshoot. Excel might be quicker at first, but I've found that the difficulty in troubleshooting Excel formulas makes the initial extra time needed to set up and analyze data in R worth it.
– EdM, Commented Apr 6 at 18:56
+1. However, re: "testing whether the three columns are equally likely in that row", this is not the case for the one-way case. Excel tests the null hypothesis of proportions as defined in the "expected" values. (Here, after the question update, the null hypothesis is percentages of ~48% for Group A, ~43% for Group B, and ~9% for Group C, and not 1/3, 1/3, 1/3.)
– J-J-J, Commented Apr 6 at 19:13
Thanks @J-J-J. That sounds pretty dopey. But the whole Excel chi-square program seems that way.
– Peter Flom, Commented Apr 6 at 23:43
@PeterFlom I agree that Excel isn't really good for doing stats. Unfortunately, I have some colleagues who promote its use, because they feel it's easier than other tools for people learning stats (as people are often already familiar with Excel). I think this thread is a good example of why they're wrong.
– J-J-J, Commented Apr 7 at 5:59

J-J-J · Accepted Answer · 2025-04-08 14:48:06Z

To expand a bit on Peter Flom's answer, in the second case, the expected values do not really contain information about the counts in the "No" row, contrary to what you state:

the "No" will be in the data.

If someone gave you only the observed "Yes" values $15, 4, 2$ and their corresponding expected values $10.125, 9, 1.875$, and no other information, you wouldn't be able to reconstruct the original table with certainty. Yet these are the only pieces of information you give to Excel. It cannot deduce that there is a "No" row at all, or that you want to compare the groups you sampled on their different answers; maybe you want to compare the "Yes" row to a theoretical distribution that does not come from your sample, who knows? Even if Excel could tell that there was originally a "No" row, it couldn't deduce what its counts are, and those counts are a crucial piece of information.

Many different tables would be compatible with expected "Yes" values of $10.125, 9, 1.875$. For example, both of the following tables are:

Table 1

| | A | B | C |
|---|---|---|---|
| yes | 15 | 4 | 2 |
| no | 39 | 44 | 8 |

Table 2

| | A | B | C |
|---|---|---|---|
| yes | 15 | 4 | 2 |
| no | 999984 | 888884 | 185183 |

However, in table 2, "yes" seems to be a rare event, while in table 1 it seems much more common. If you randomly sampled 1 million additional observations from the population sampled in table 1, it's quite unlikely you'll end up with something like table 2. So despite having the same expected "yes" values, these tables don't tell the same story. They also don't yield the same p-values.

This is the reason why you should include the "No" row in your calculations (i.e. your first method, a chi-square test of independence), otherwise you omit important information. In the situation described in your question, I don't see a reason why you'd want to conduct a goodness-of-fit test (i.e. your second method) instead of a test of independence; it would be a waste of information, and could mislead you.
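
The comparison between the two tables can be checked numerically. The following is a minimal Python sketch, not what Excel does internally: the helper names are made up for illustration, and the closed-form p-value `exp(-x/2)` is used only because both tests here happen to have 2 degrees of freedom.

```python
import math

def expected_counts(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi2_statistic(observed, expected):
    """Pearson statistic: sum of (O - E)^2 / E over the supplied cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_df2(stat):
    """Upper-tail chi-square probability; this closed form holds for df = 2 only."""
    return math.exp(-stat / 2)

def p_values(table):
    """Return (p for the full-table test, p for the 'yes'-row-only test)."""
    exp = expected_counts(table)
    flat_obs = [o for row in table for o in row]
    flat_exp = [e for row in exp for e in row]
    p_independence = p_value_df2(chi2_statistic(flat_obs, flat_exp))
    p_yes_only = p_value_df2(chi2_statistic(table[0], exp[0]))
    return p_independence, p_yes_only

table_1 = [[15, 4, 2], [39, 44, 8]]
table_2 = [[15, 4, 2], [999984, 888884, 185183]]

for name, table in (("Table 1", table_1), ("Table 2", table_2)):
    print(name, "expected yes:", expected_counts(table)[0], "p-values:", p_values(table))
```

Both tables give the same expected "yes" counts, so the yes-row-only p-value is identical for the two (roughly 0.08), while the full-table p-value changes with the "no" row (roughly 0.04 for Table 1 and 0.08 for Table 2).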

StatsStudent · Accepted Answer · 2025-04-08 17:23:08Z

One thing that I think is missing from the answers here, but that is essential for understanding chi-squared ($\chi^2$) tests, is how the data are actually collected. Depending on the answer to this question, you end up carrying out differently named tests that describe the end goal of the analysis (even though the final answer would be the same).

In this type of categorical data analysis, $\chi^2$ tests generally fall into one of the three types described below.

**1. Pearson's $\chi^2$ test for Goodness of Fit**

Data Collection Setup: This test consists of data that is sampled from a population and then classified into a single level of some other characteristic.

Goal: You have a pre-specified model for the proportions/probabilities/percentages with which the categories occur in the population, and you want to test whether the observed data are consistent with that model.

Example: You might be interested in testing the eye color of offspring of parents with blue and brown eyes. For simplicity, assume the only outcomes are blue, brown, and green. You have a model specified a priori of the ratios of the eye color of the offspring to be 1:2:1. This would translate to a model and a test of the null hypothesis:

$$ H_0: p_{blue}=1/4, p_{brown}=1/2, p_{green}=1/4 $$
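
As a rough illustration of this test, here is a minimal Python sketch. The observed counts are hypothetical, invented only to make the arithmetic concrete; the only thing taken from the example is the 1:2:1 model. The p-value uses a closed form that is valid only for 2 degrees of freedom.

```python
import math

# Hypothetical counts of offspring eye colour, invented for illustration only.
observed = {"blue": 22, "brown": 55, "green": 23}

# Null-hypothesis proportions from the pre-specified 1:2:1 model.
null_probs = {"blue": 0.25, "brown": 0.50, "green": 0.25}

n = sum(observed.values())

# Expected counts come from the model, not from the data.
expected = {colour: n * p for colour, p in null_probs.items()}

stat = sum((observed[c] - expected[c]) ** 2 / expected[c] for c in observed)

# Three categories and no parameters estimated from the data: df = 3 - 1 = 2.
df = len(observed) - 1

# For df = 2 the chi-square upper-tail probability is exp(-x / 2).
p_value = math.exp(-stat / 2)

print(f"chi2 = {stat:.3f}, df = {df}, p = {p_value:.3f}")
```

Note that the expected counts are fixed by the model before looking at the data, which is what distinguishes this setup from the contingency-table tests.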

**2. $\chi^2$ test of Homogeneity for a Contingency Table with One Margin Fixed**

Data Collection Setup: From each group of interest in the population you take a random sample of a predetermined, fixed size, look at the resulting characteristic/response category and classify each observation into a single response category. This forms a contingency table where one classification refers to the population and the other to the characteristic/response category.

Goal: The objective is to test whether the populations are similar or homogeneous with respect to their individual cell probabilities. This translates to determining if the observed proportions/probabilities in each characteristic/response category are nearly the same for each population.

Example: You might be interested in testing whether two drugs, A and B, have similar proportions of people reporting mild, moderate, or severe side effects. You decide, again a priori, that you will sample, say, 100 patients who received drug A and then separately and independently (through stratified random sampling) sample 400 patients from those who received drug B. You are interested in testing whether there are significant differences between the drugs in the proportions of patients who reported mild, moderate, and severe symptoms. This would translate to a model and a test of the null hypothesis:

$$\begin{aligned} H_0: p_{DrugA-Mild}&=p_{DrugB-Mild}, \\ p_{DrugA-Moderate}&=p_{DrugB-Moderate},\\ p_{DrugA-Severe}&=p_{DrugB-Severe} \end{aligned} $$
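
A minimal Python sketch of this setup follows. The side-effect counts are hypothetical, invented for illustration; only the fixed sample sizes of 100 and 400 come from the example. Under the null hypothesis both drugs share one set of proportions, which the sketch estimates by pooling the two samples.

```python
import math

# Hypothetical side-effect counts, invented for illustration only.
# The row totals (100 and 400) are fixed by the sampling design.
counts = {
    "Drug A": {"mild": 60, "moderate": 30, "severe": 10},
    "Drug B": {"mild": 220, "moderate": 130, "severe": 50},
}
categories = ["mild", "moderate", "severe"]

row_totals = {drug: sum(row.values()) for drug, row in counts.items()}
n = sum(row_totals.values())

# Under H0 the drugs share common proportions, estimated from the pooled data.
pooled = {c: sum(counts[d][c] for d in counts) / n for c in categories}

stat = 0.0
for drug, row in counts.items():
    for c in categories:
        expected = row_totals[drug] * pooled[c]  # fixed sample size * pooled proportion
        stat += (row[c] - expected) ** 2 / expected

df = (len(counts) - 1) * (len(categories) - 1)

# Closed form for the upper tail, valid only because df == 2 here.
p_value = math.exp(-stat / 2)

print(f"chi2 = {stat:.3f}, df = {df}, p = {p_value:.3f}")
```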

**3. $\chi^2$ test of Independence for a Contingency Table with Neither Margin Fixed**

Data Collection Setup: This test consists of data that is sampled from a population of interest, and then the data is simultaneously classified into two characteristics/response categories after observing the data (i.e., neither margin is set as fixed a priori).

Goal: The data collection process results in a contingency table in which both marginal totals are random. You are interested in determining whether the two characteristics/response categories were seemingly generated through an independent process, or whether certain levels of one characteristic/response category tend to be associated with, or contingent on, the levels of the other.

Example: You might be interested in testing whether being Christian or Jewish alters a person's preference for a Conservative, Independent, or Liberal candidate for office. You conduct a survey, examine your data, and then cross-classify each respondent as Christian/Jewish and, simultaneously, by which candidate they prefer in an upcoming election. If being Christian and preferring the Conservative candidate are independent, then $$ P(\text{Christian and Conservative Preference}) = P(\text{Christian}) \times P(\text{Conservative Preference}) $$

Applying this logic to all the cross-classifications translates directly into the null hypothesis to be tested, which is simply $H_0$:

$\text{Each cell probability equals the product of the corresponding row and column marginal probabilities}$

and would simply be tested with:

$$ \chi^2 = \sum_{cells}{(O-E)^2\over{E}} $$

with degrees of freedom equal to $(r-1)(c-1)$, where $r$ is the number of rows and $c$ is the number of columns. This is because the number of degrees of freedom is initially $rc-1$, since there are $rc$ cells into which a single random sample can be classified, but from this we must subtract the number of estimated parameters, which is $(r-1)+(c-1)$: there are $r-1$ free parameters among the row margins and $c-1$ among the column margins. Therefore, the total degrees of freedom in this case is simply (which by algebra is equal to the degrees of freedom in the test of homogeneity):

$$\begin{aligned} rc-1 - (r-1) - (c-1)&=(r-1)(c-1) \\ &=(\text{Number of Rows} - 1) \times (\text{Number of Columns} - 1)\\ \end{aligned} $$
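
As a quick check of this bookkeeping, take a table with $r=2$ rows (yes/no) and $c=3$ columns (three groups), the shape discussed in this thread. There are $rc-1=5$ free cell probabilities, and the margins use up $(r-1)+(c-1)=3$ estimated parameters:

$$\begin{aligned} rc-1 - (r-1) - (c-1) &= 5 - 1 - 2 \\ &= 2 \\ &= (2-1)\times(3-1) \end{aligned} $$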

Last, but not Least: A Warning About Your Data

In the example data you provided, you show a contingency table in which at least two expected cell counts are less than 5, which is generally considered a violation of the assumptions of the chi-squared test, so your results may not be valid. In this case, you should consider collapsing categories to increase the expected cell counts, or use an alternative such as Fisher's exact test or a simulation-based test.
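
The simulation option can be sketched in Python as follows. This is a minimal permutation test under stated assumptions, not a definitive implementation: the question's own table is not reproduced in this thread, so Table 1 from J-J-J's answer is used as a stand-in, and the function names, the number of simulations, and the seed are arbitrary choices.

```python
import random

def expected_counts(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi2_stat(table):
    """Pearson chi-square statistic for a two-way table of counts."""
    exp = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, exp)
        for o, e in zip(obs_row, exp_row)
    )

def small_cells(table, threshold=5):
    """List (row, column, expected count) for cells whose expected count is below threshold."""
    exp = expected_counts(table)
    return [(i, j, e) for i, row in enumerate(exp) for j, e in enumerate(row) if e < threshold]

def permutation_p_value(table, n_sim=2000, seed=0):
    """Monte Carlo p-value: shuffle column labels across observations, keeping both margins."""
    rng = random.Random(seed)
    rows, cols = [], []
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            rows.extend([i] * count)
            cols.extend([j] * count)
    observed = chi2_stat(table)
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(cols)
        sim = [[0] * len(table[0]) for _ in table]
        for i, j in zip(rows, cols):
            sim[i][j] += 1
        if chi2_stat(sim) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_sim + 1)

# Stand-in data: Table 1 from J-J-J's answer, not the asker's original table.
table_1 = [[15, 4, 2], [39, 44, 8]]
print("cells with expected count < 5:", small_cells(table_1))
print("permutation p-value:", permutation_p_value(table_1))
```

Because the reference distribution is built from the data's own margins rather than from the $\chi^2$ approximation, this approach does not rely on the expected counts being large.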

"even though the final answer would be the same": Maybe I'm misinterpreting what you are saying here, but a goodness-of-fit test does not give the same final answer as a chi-square test of independence/homogeneity. // Relative to expected counts <5, it is known to be a too conservative rule, not sure if this is relevant here. // As an aside, in the situation described in the question, a GoF test is not adequate, given the collected data. It's also very dubious that the margins (row or columns) are fixed, given the counts (why choosing specifically 21 and 35, instead of balanced numbers?). — J-J-J
– J-J-J, Commented Apr 9 at 7:15
