(I wrote this paper in 2007 for a Statistics class I took while trying to do a PhD. I am sharing it here for posterity.)
McNemar’s test is a non-parametric method used on nominal data to determine whether the row and column marginal frequencies are equal. It is applied to 2×2 contingency tables with a dichotomous trait with matched pairs of subjects.
Simpson’s paradox is a statistical paradox in which the successes of several groups seem to be reversed when the groups are combined. This seemingly impossible result is encountered often in social science statistics, and occurs when a weighting variable, which is not relevant to the individual group assessment, must be used in the combined assessment.
The paper evaluates the potential effect of Simpson’s paradox in McNemar’s test results and conclusions.
McNemar’s test for significance of changes
The test is named after Quinn McNemar, who introduced it in 1947.
The McNemar test can be introduced as a variation of the sign test for the case when the data is nominal with two categories, and thus can be expressed as “0’s” (zeroes) and “1’s” (ones).
Consider a study of N subjects (i = 1..N) in which each subject i yields a pair of values (Xi, Yi), where Xi is the result of treatment X and Yi is the result of treatment Y on that subject. The set of values Xi and the set of values Yi constitute two paired samples, where each Xi and Yi can only take a value of 0 or 1.
The data can be presented in a 2 x 2 contingency table with the form:
Table 1. 2×2 contingency table
| ||Yi = 0||Yi = 1|
|Xi = 0||A = number of (0,0)||B = number of (0,1)|
|Xi = 1||C = number of (1,0)||D = number of (1,1)|
Where A+B+C+D = N
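As a minimal sketch of how Table 1 is filled in, the snippet below counts the four categories from a list of paired observations. The function name `contingency_counts` and the sample pairs are hypothetical and serve only as an illustration.

```python
from collections import Counter


def contingency_counts(pairs):
    """Count the four (Xi, Yi) categories for paired 0/1 observations."""
    counts = Counter(pairs)
    a = counts[(0, 0)]  # both treatments gave 0
    b = counts[(0, 1)]  # X gave 0, Y gave 1
    c = counts[(1, 0)]  # X gave 1, Y gave 0
    d = counts[(1, 1)]  # both treatments gave 1
    return a, b, c, d


# Hypothetical paired observations (Xi, Yi), for illustration only.
pairs = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 1), (1, 1), (0, 0), (0, 1)]
A, B, C, D = contingency_counts(pairs)
N = len(pairs)
print(A, B, C, D, N)
```

Only the discordant cells B and C carry information about a difference between the two treatments; A and D count the subjects for whom both treatments gave the same result.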
The data consists of the characteristics resulting from the treatments on N randomly selected subjects, denoted as (Xi, Yi).
The pairs (Xi, Yi) are mutually independent and the measurement scale is nominal. This results in four possible categories, presented above as (0,0), (0,1), (1,0) and (1,1).
We want to test the hypothesis that the treatments make a difference in the incidence of the characteristic. The null hypothesis therefore states that the treatment does not change the incidence of the characteristic or, equivalently, that the incidence of the characteristic is the same for both treatments.
Thus, we test:
H0: P(Xi = 0) = P(Yi = 0), or 
H0: P(Xi = 1) = P(Yi = 1)
Expressed in terms of proportions the null hypothesis is:
H0: p1 = p2 
In summary, all possible hypotheses expressed in proportions are:
Two-sided: H0: p1 = p2, H1: p1 ≠ p2
One-sided: H0: p1 ≥ p2, H1: p1 < p2
One-sided: H0: p1 ≤ p2, H1: p1 > p2
Since we are testing for p1 = p2, we can re-write to test for p1 – p2 = 0.
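Estimating P(Xi = 1) and P(Yi = 1) by the sample proportions p1 and p2 and using the values in Table 1, we say:
p1 = (C + D)/N
p2 = (B + D)/N
p1 – p2 = (C – B)/N
McNemar showed that, under the null hypothesis, C – B is approximately normally distributed with mean 0 and variance B + C when B+C>10, and then the appropriate test statistic is:
Z = (C – B) / √(B + C)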
As Conover (1999) points out, the two-tailed test of Z is equivalent (for large enough values of B+C) to the one-tailed test of Z², using a chi-squared distribution with 1 degree of freedom.
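The equivalence can be written out as a short worked identity, assuming the normal approximation holds. The symbol z_0 for the observed value of the statistic is introduced here only for illustration.

```latex
\[
Z^2 = \frac{(B - C)^2}{B + C}, \qquad
Z \sim N(0,1) \;\Longrightarrow\; Z^2 \sim \chi^2_1,
\]
\[
P\bigl(|Z| > |z_0|\bigr) = P\bigl(Z^2 > z_0^2\bigr).
\]
```

Because the square removes the sign of B – C, the chi-squared form can only serve the two-sided alternative.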
For the two-sided test, the p-value is two times the probability of finding a Z greater than the absolute value of the Z found. We reject the null hypothesis if the p-value is less than the level of significance desired.
For the one-sided test, the p-value is the probability of finding a Z greater than the Z found (or smaller than the Z found, when the alternative hypothesis points in the lower direction). We reject the null hypothesis if the p-value is less than the level of significance desired.
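The p-value rules above can be sketched in a few lines of Python. This is an illustration rather than a reference implementation: it assumes the usual normal-approximation form Z = (C – B)/√(B + C), with B and C the discordant counts of Table 1, and the function names and the counts B = 5, C = 15 are hypothetical.

```python
import math


def normal_upper_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def mcnemar_z(b, c):
    """Normal-approximation statistic, assumed here to be (C - B) / sqrt(B + C)."""
    if b + c == 0:
        raise ValueError("no discordant pairs: the statistic is undefined")
    return (c - b) / math.sqrt(b + c)


def mcnemar_p_values(b, c):
    """Return the two-sided, upper and lower one-sided p-values, plus the
    chi-squared (1 degree of freedom) p-value of Z squared."""
    z = mcnemar_z(b, c)
    return {
        "z": z,
        "two_sided": 2.0 * normal_upper_tail(abs(z)),
        "upper": normal_upper_tail(z),    # H1: p1 > p2
        "lower": normal_upper_tail(-z),   # H1: p1 < p2
        # Upper tail of chi-squared with 1 df at x equals erfc(sqrt(x / 2)).
        "chi2": math.erfc(math.sqrt(z * z / 2.0)),
    }


# Hypothetical discordant counts, for illustration only.
result = mcnemar_p_values(b=5, c=15)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

The `two_sided` and `chi2` entries agree, which is the equivalence between the two-tailed test of Z and the one-tailed test of Z². For small B + C the normal approximation is not appropriate and an exact binomial test on B is the usual alternative.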
Simpson’s paradox is the common name for a situation that may occur when two populations are analyzed with respect to the frequency of some characteristic: if the populations are separated into two categories, the population with higher frequency might show a lower frequency within each category.
The paradox arises when the following counter-intuitive relationships are true:
a/b < A/B
c/d < C/D, and 
(a+c)/(b+d) > (A+C)/(B+D)
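The three relationships can be checked mechanically. The sketch below uses exact fractions and assumes the combined rate pools numerators and denominators, that is (a+c)/(b+d) against (A+C)/(B+D). The function name and the counts are hypothetical and are not taken from Shapiro's example.

```python
from fractions import Fraction


def shows_simpson_reversal(a, b, c, d, A, B, C, D):
    """True when the lower-case population trails in both categories
    but leads once the categories are combined."""
    first = Fraction(a, b) < Fraction(A, B)
    second = Fraction(c, d) < Fraction(C, D)
    combined = Fraction(a + c, b + d) > Fraction(A + C, B + D)
    return first and second and combined


# Hypothetical counts: 70/80 < 19/20 and 4/20 < 24/80,
# yet 74/100 > 43/100 once the two categories are pooled.
print(shows_simpson_reversal(70, 80, 4, 20, 19, 20, 24, 80))

# With equal category sizes in both populations no reversal can occur.
print(shows_simpson_reversal(1, 2, 1, 2, 3, 4, 3, 4))
```

The reversal in the first call comes from the unequal category sizes: the lower-case population has most of its observations in the high-rate category, while the upper-case population has most of its observations in the low-rate one.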
A simple illustrative example (adapted from Shapiro, 1982):
Table of success rate of two treatments on men, women and both:
Table 2. Paradox example
Looking at the data for men and women separately, Treatment 1 seems more effective; looking at men and women combined, Treatment 2 seems more effective.
Strictly speaking, the above constitutes a 2x2x2 contingency table with three variables: treatment, sex and success rate. Simpson (1951) states that, besides the interactions between attributes (characteristics) in pairs, the statistical paradox is caused either by the “second-order” interaction of the three taken together or by the dependence of the collapsed variable on the other variables.
Aggregated contingency tables affected by Simpson’s paradox
Special care should be taken when testing and drawing conclusions from data that is analyzed and presented as a 2×2 contingency table when it is really the aggregation (collapse) of a 2x2x2 contingency table.
The second-order interaction among the three variables, or the dependence of the collapsed variable on the other two, can change the result of the overall test and lead to misleading conclusions.
Concretely, Simpson (1951) presents the 2x2x2 contingency situation in the following form:
Table 3. 2x2x2 contingency table
In the example given earlier, if A is treatment 1, B is men and C is success then a is 60, b is 100 and so forth.
According to Bartlett, as cited by Simpson, the condition for a zero second-order interaction is:
Which in the example given is true.
According to Simpson, the second condition, assuming zero second-order interaction, is that the collapsed variable, “sex”, is independent of treatment for both success and failure, or that it is independent of success for both treatments. Mathematically, the condition is:
, or 
Which in the example given are not true.
A practical example
Wardrop (1995) studies the effect of Simpson’s paradox on the perception of the “hot hand” in basketball: the fans’ belief that making a shot makes a player more likely to make the following shot.
He tests each player’s shooting data and the overall shooting data (two consecutive free throws) using McNemar’s test and finds that the overall results support the “hot hand”, leading fans to believe in it even though the results for individual players might not indicate the same.
Data is as follows:
Table 4. Hot-hand summary
|Larry Bird||Rick Robey||Total|
|Second shot||Second shot||Second shot|
|First shot||Hit||Miss||Tot||First shot||Hit||Miss||Tot||First shot||Hit||Miss||Tot|
The analysis of the probability of a hit after a hit (phh) and after a miss (pmh), and the p-value of the McNemar test for phh = pmh yields:
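To make the two quantities concrete, the sketch below computes phh and pmh from a first-shot by second-shot table and repeats the calculation on the pooled table. The counts are hypothetical, chosen only so that the pooled table reverses the direction seen for each player; they are not Wardrop's data, and the function names are illustrative.

```python
KEYS = [("H", "H"), ("H", "M"), ("M", "H"), ("M", "M")]  # (first shot, second shot)


def hit_rates(table):
    """Return (phh, pmh): P(second hit | first hit) and P(second hit | first miss)."""
    after_hit = table[("H", "H")] + table[("H", "M")]
    after_miss = table[("M", "H")] + table[("M", "M")]
    return table[("H", "H")] / after_hit, table[("M", "H")] / after_miss


def pool(*tables):
    """Collapse several players' tables into one by adding cell counts."""
    return {key: sum(t[key] for t in tables) for key in KEYS}


# Hypothetical counts for two players, for illustration only.
player_1 = {("H", "H"): 180, ("H", "M"): 20, ("M", "H"): 19, ("M", "M"): 1}
player_2 = {("H", "H"): 30, ("H", "M"): 20, ("M", "H"): 26, ("M", "M"): 14}
total = pool(player_1, player_2)

for name, table in [("Player 1", player_1), ("Player 2", player_2), ("Total", total)]:
    phh, pmh = hit_rates(table)
    print(f"{name}: phh = {phh:.3f}, pmh = {pmh:.3f}")
```

In this made-up data each player hits slightly less often after a hit than after a miss, yet the pooled table shows the opposite, because the stronger shooter contributes most of the "after a hit" attempts.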
Table 5. Test results
|Larry Bird||Rick Robey||Total|
The overall p-value = 0.022 supports rejecting the hypothesis that the probabilities are the same, leading one to believe in the hot-hand phenomenon. The individual data contradicts this conclusion.
Evaluating frequency data that might include collapsed variables can lead to erroneous conclusions. Special care should be used in analyzing the presence of multiple contingency situations or in analyzing the conditions defined by Bartlett and Simpson to prevent the emergence of the statistical paradox.
Conover, W. (1999), “Practical Nonparametric Statistics”, Third Edition, John Wiley & Sons.
Daniel, W. (1990), “Applied Nonparametric Statistics”, Second Edition, Duxbury Press, Pacific Grove, CA.
McNemar, Q. (1947), “Note on the sampling error of the difference between correlated proportions or percentages”, Psychometrika, Vol. 12, 153-157.
Simpson, E. H. (1951), “The Interpretation of Interaction in Contingency Tables”, Journal of the Royal Statistical Society, Ser. B, Vol. 13, 238-241.
Shapiro, S. (1982), “Collapsing contingency tables – A geometric approach”, The American Statistician, Vol. 36, No. 1, February 1982.
Wardrop, R. (1995), “Simpson’s Paradox and the Hot Hand in Basketball”, The American Statistician, Vol. 49, No. 1, February 1995.
 This is formally a representation of Yes/No data or any nominal data with two categories.
 Two different one-sided tests are also possible, testing whether the incidence of the characteristic is either increased or reduced after a treatment.
 McNemar’s test is also called the “test for related samples when data consists of frequencies”, so it makes sense to use proportions.
 Wardrop justifies not using a one-sided test by the overwhelming evidence against phh > pmh.