



Explained variance in a 2*2 table

Neil Pearce in the Int J Epidemiol 2011 reports the explained variance for data in a 2*2 table. As he did not report how he calculated it, I tried it myself. He used a simple linear regression. Is it fair to do this given the data structure?

Best wishes


Reply from: Jay Kaufman


There seem to be two issues here: 1) the value of an R2 measure and 2) the validity of a linear model applied to categorical data.  Both are more statistics topics than epidemiology topics, but worth discussing since they have implications for epidemiologic analysis.  My reactions follow.

1)    The R2 plays almost no role in epidemiology papers, and has been denigrated by authors such as Greenland (AJE 1987) as largely or wholly irrelevant.  It has been well established that for etiologic models, the ANOVA-partitioning of variance into “explained” and residual is of no consequence whatsoever, since trivial effects can occur in high R2 models and important effects can occur in low R2 models.  I take the Pearce essay to be reinforcing this same point, and so I think that you are actually in agreement with him in principle.
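To make the point about R2 concrete for a 2*2 table, here is a minimal sketch. The counts are hypothetical (they are not taken from the Pearce paper), and the function names are made up for illustration. For a binary outcome regressed on a binary exposure by OLS, the R2 equals the squared phi coefficient of the table, so it can be computed either from the cell counts directly or from the regression sums of squares.

```python
def r2_from_table(a, b, c, d):
    """R2 as the squared phi coefficient of a 2*2 table.

    a = exposed cases, b = exposed non-cases,
    c = unexposed cases, d = unexposed non-cases.
    """
    num = (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den


def r2_from_regression(a, b, c, d):
    """R2 as 1 - SSR/SST for OLS of outcome (0/1) on exposure (0/1).

    With a single binary regressor, the fitted values are simply
    the two group-specific risks.
    """
    n1, n0 = a + b, c + d
    n = n1 + n0
    p1, p0 = a / n1, c / n0          # fitted risks in each group
    ybar = (a + c) / n               # overall risk
    sst = (a + c) * (1 - ybar) ** 2 + (b + d) * ybar ** 2
    ssr = (a * (1 - p1) ** 2 + b * p1 ** 2
           + c * (1 - p0) ** 2 + d * p0 ** 2)
    return 1 - ssr / sst


# Hypothetical counts: 10/1000 exposed cases vs 1/1000 unexposed cases
a, b, c, d = 10, 990, 1, 999
risk_ratio = (a / (a + b)) / (c / (c + d))
r2 = r2_from_table(a, b, c, d)
r2_check = r2_from_regression(a, b, c, d)
print(f"risk ratio = {risk_ratio:.1f}, R2 = {r2:.4f}")
```

In this made-up table a tenfold risk ratio coexists with an R2 well below 1%, which is the sense in which "explained variance" says little about the size of an effect.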

2)    Linear models for categorical outcomes do surprisingly well in practice, even if they are not considered to be elegant from a theoretical perspective.  This is why the linear probability model (LPM) is still the mainstay of econometric analysis, rather than the alternatives of logit or probit regressions, which are more theoretically attractive for correctly specifying the error structure as binomial.  As the Lumley paper shows (Lumley et al. Annu Rev Public Health. 2002;23:151-69), OLS actually does a very good job of modeling the mean (proportion) for a binary outcome (or any other non-normal distribution) except in really extreme situations, such as what you might find if the probability butts right up against the logical bounds of 0 or 1.  Where OLS does more poorly is the variance, because the errors are not normal in small samples and because the homoscedasticity assumption must logically be false (the variance is a function of the single binomial parameter p).  It is for this reason that econometricians using OLS for binary or categorical outcomes always use the sandwich variance (the one employed in GEE) rather than the model-based (ML) variance from OLS.  The sandwich variance does not require homoscedasticity to be a consistent estimator, although the Diggle et al. textbook warns that it can be catastrophically inefficient in small samples.
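A small sketch of the variance issue, under stated assumptions: the data are hypothetical, `lpm_fit` is an illustrative helper rather than any library's API, and the robust variance shown is the simplest sandwich form (often called HC0), which is only one of several variants in use. With a single binary exposure, the OLS slope is just the risk difference; the two standard errors differ because the model-based one pools the residual variance across groups.

```python
import math


def lpm_fit(x, y):
    """Simple OLS of y on x with model-based and HC0 sandwich SEs for the slope."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Model-based variance: assumes one common residual variance
    s2 = sum(e ** 2 for e in resid) / (n - 2)
    model_se = math.sqrt(s2 / sxx)

    # HC0 sandwich variance: weights each squared residual individually,
    # so it does not rely on homoscedasticity
    meat = sum((xi - xbar) ** 2 * e ** 2 for xi, e in zip(x, resid))
    robust_se = math.sqrt(meat / sxx ** 2)
    return slope, model_se, robust_se


# Hypothetical data: 25/50 exposed cases (risk 0.50),
# 25/500 unexposed cases (risk 0.05)
x = [1] * 50 + [0] * 500
y = [1] * 25 + [0] * 25 + [1] * 25 + [0] * 475

slope, model_se, robust_se = lpm_fit(x, y)
print(f"risk difference = {slope:.3f}")
print(f"model-based SE  = {model_se:.4f}")
print(f"sandwich SE     = {robust_se:.4f}")
```

In this example the small exposed group has the larger outcome variance, so the pooled model-based standard error understates the uncertainty in the risk difference relative to the sandwich estimate.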


Dear Jay,

Thank you for this answer. It is very helpful for me. For sure, I strongly discourage the use of explained variance or correlation coefficients in epidemiology.

Congratulations to SER - the forum is a very helpful tool!

Best wishes