Quote from Suzanne Bevan on March 15, 2018, 11:53 am

**Reply for: Jay Kaufman**

Andreas,

There seem to be two issues here: 1) the value of an R2 measure and 2) the validity of a linear model applied to categorical data. Both are more statistics topics than epidemiology topics, but worth discussing since they have implications for epidemiologic analysis. My reactions follow.

1) The R2 plays almost no role in epidemiology papers, and has been denigrated by authors such as Greenland (AJE 1987) as largely or wholly irrelevant. It has been well established that for etiologic models, the ANOVA-partitioning of variance into “explained” and residual is of no consequence whatsoever, since trivial effects can occur in high R2 models and important effects can occur in low R2 models. I take the Pearce essay to be reinforcing this same point, and so I think that you are actually in agreement with him in principle.
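
To make point 1 concrete, here is a minimal simulated sketch. The data, the effect sizes, and the helper name `slope_and_r2` are all invented for illustration and are not taken from Greenland or Pearce. It fits a simple least-squares line in two scenarios: a tiny effect measured with almost no noise, and a large effect buried in a lot of outcome variation.

```python
import numpy as np

def slope_and_r2(x, y):
    """Fit y = a + b*x by least squares; return the slope b and R^2."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    r2 = 1.0 - resid.var() / y.var()
    return slope, r2

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 10_000
x = rng.normal(size=n)

# Scenario A (hypothetical): trivial effect (0.01 per unit), almost no noise.
y_trivial = 0.01 * x + rng.normal(scale=0.001, size=n)

# Scenario B (hypothetical): large effect (5 per unit), very noisy outcome.
y_large = 5.0 * x + rng.normal(scale=50.0, size=n)

slope_a, r2_a = slope_and_r2(x, y_trivial)
slope_b, r2_b = slope_and_r2(x, y_large)

print(f"A: slope={slope_a:.4f}, R2={r2_a:.3f}")
print(f"B: slope={slope_b:.4f}, R2={r2_b:.3f}")
```

Under these assumed settings, scenario A should give an R2 near 1 with a negligible slope, and scenario B an R2 near 0.01 with a slope hundreds of times larger, which is why R2 says little about the size of an etiologic effect.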

2) Linear models for categorical outcomes do surprisingly well in practice, even if they are not considered to be elegant from a theoretical perspective. This is why the linear probability model (LPM) is still the mainstay of econometric analysis, rather than the alternatives of logit or probit regressions, which are more theoretically attractive for correctly specifying the error structure as binomial. As that Lumley paper shows (Lumley et al Annu Rev Public Health. 2002;23:151-69), OLS actually does a very good job of modeling the mean (proportion) for a binary outcome (or any other non-normal distribution) except in really extreme situations, such as what you might find if the probability butts right up against the logical bounds of 0 or 1. What OLS does more poorly with is the variance, because the errors are not normal (which matters in small samples) and because the homoscedasticity assumption must logically be false: for a binary outcome the variance is p(1-p), a function of the single binomial parameter p, so it changes whenever the covariates change p. It is for this reason that econometricians using OLS for binary or categorical outcomes always use the sandwich variance (the one employed in GEE) rather than the model-based (ML) variance from OLS. The sandwich estimator does not require homoscedasticity to be consistent, although the Diggle et al textbook warns that it can be catastrophically inefficient in small samples.
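
For anyone who wants to see point 2 in action, here is a rough sketch using only NumPy. The simulated risks (0.05 and 0.30), the sample size, and the function name `lpm_fit` are my own assumptions for illustration, and the sandwich shown is the simplest version (often called HC0) with no small-sample correction. It fits an LPM by OLS for a binary outcome on a binary exposure and computes both the model-based and the sandwich standard errors.

```python
import numpy as np

def lpm_fit(x, y):
    """OLS of binary y on an intercept and x, with two variance estimates.

    Returns (beta, se_model, se_sandwich). The sandwich is the basic
    HC0 form: (X'X)^-1 X' diag(e^2) X (X'X)^-1.
    """
    X = np.column_stack([np.ones(len(y)), x])
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta

    # Model-based variance: assumes one common error variance.
    sigma2 = resid @ resid / (n - k)
    se_model = np.sqrt(np.diag(sigma2 * XtX_inv))

    # Sandwich variance: lets the error variance differ across observations.
    meat = X.T @ (X * resid[:, None] ** 2)
    se_sandwich = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return beta, se_model, se_sandwich

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible
n = 5_000
exposed = rng.integers(0, 2, size=n)        # hypothetical binary exposure
risk = np.where(exposed == 1, 0.30, 0.05)   # assumed true risks
y = rng.binomial(1, risk).astype(float)

beta, se_model, se_sandwich = lpm_fit(exposed, y)

# Familiar hand calculations for comparison.
p1, p0 = y[exposed == 1].mean(), y[exposed == 0].mean()
n1, n0 = (exposed == 1).sum(), (exposed == 0).sum()
se_unpooled = np.sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)

print(f"risk difference: OLS={beta[1]:.4f}, by hand={p1 - p0:.4f}")
print(f"SE: model-based={se_model[1]:.5f}, sandwich={se_sandwich[1]:.5f}, "
      f"unpooled binomial={se_unpooled:.5f}")
```

With a single binary exposure the model is saturated, so the OLS slope is exactly the difference in sample proportions, and the HC0 sandwich standard error matches the usual unpooled binomial standard error for a risk difference. The model-based standard error instead pools the two groups' variances, which is the homoscedasticity assumption that cannot hold here.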

-Jay
