StatBits: Of ANOVA and residuals

Today I am going to tell you a story. I am not good at presenting material, but if you keep reading you might be enlightened. Please also take a look at the links given toward the end of this post for a better understanding.

Some time ago I saw the following question posted to an email group:

I have a data set with 3 groups of 25 people each. Each group is given a different medication and some blood figures are recorded.
Some figures are measured 4 times before and after the medication, and some are measured only before and after the medication.
I thought of using split-plot (mixed design) ANOVA. My first question is: am I using the right test?

Secondly, from different resources on the internet I have read different assumptions of ANOVA, especially about normality. Some say that the residuals must be normal (which I suppose is more correct), some say that the DV must be normal for the whole sample or for each category. Which is correct? Some say that ANOVA is robust against some degree of non-normality.

Thirdly, about the other assumptions or tests prior to ANOVA (which are Box's M and Mauchly's test): if Box's test is significant, what should I do?
I suppose that if Mauchly's test is significant we can use the Greenhouse-Geisser or the other corrections.

Fourthly, in my data one repeated-measures variable is not normally distributed across the whole sample, and both Box's and Mauchly's tests are significant. Does this mean I can't rely on the results of ANOVA?

I made an ln transformation which made the data normal and applied ANOVA, but Box's and Mauchly's results didn't change. So what should I do?
Thanks

A prestigious statistician replied:

It is the residuals that must be Gaussian. I don't favor pre-testing because it alters the type I and type II error of the final test.

The older anova methods have somewhat been replaced by newer methods such as mixed effects linear models and generalized least squares. The latter is easier to understand than the former. A typical default correlation pattern is continuous time AR1, i.e., an exponential decline in correlation as time points are farther apart. A case study may be seen in the handouts at http://xxx.xxx.xxx. These methods assume Gaussian residuals though.

Upon hearing this answer, I made my way toward mixed models. Note that I have withheld the name of this statistician for privacy reasons.
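
The "continuous time AR1" pattern mentioned in the reply can be written down in a few lines. The sketch below is my own illustration in Python with NumPy, not something taken from the statistician's handouts: it builds a correlation matrix in which the correlation between two measurements is rho raised to the absolute time gap between them. The function name, the measurement times and the value rho = 0.6 are all made up for the example.

```python
import numpy as np


def car1_correlation(times, rho):
    """Return the matrix with entries rho ** |t_i - t_j| (continuous-time AR1)."""
    if not 0.0 < rho < 1.0:
        raise ValueError("rho is assumed to lie strictly between 0 and 1")
    t = np.asarray(times, dtype=float)
    # Absolute time gap between every pair of measurement occasions.
    gaps = np.abs(t[:, None] - t[None, :])
    # Correlation decays exponentially as the gap grows.
    return rho ** gaps


# Hypothetical, unequally spaced measurement times (e.g. weeks).
times = [0.0, 1.0, 2.0, 4.0]
corr = car1_correlation(times, rho=0.6)
print(np.round(corr, 3))
```

Because the exponent is the actual time gap rather than the number of steps, unequally spaced measurements are handled naturally: the first and last occasions here are four time units apart, so their correlation is 0.6 to the fourth power.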

Here is another excerpt, from an internet source I came across some time ago:

http://stackoverflow.com/questions/2933253/homoscedascity-test-for-two-way-anova

Answer to question:
Hypothesis testing is the wrong tool to use to assess the validity of model assumptions. If the sample size is small, you have no power to detect any variance differences, even if the variance differences are large. If you have a large sample size you have power to detect even the most trivial deviations from equal variance, so you will almost always reject the null. Simulation studies have shown that preliminary testing of model assumptions leads to unreliable type I errors.
Looking at the residuals across all cells is a good indicator, or if your data are normal, you can use the AIC or BIC with/without equal variances as a selection procedure.
If you think there are unequal variances, drop the assumption with something like:
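library(car)  # provides Anova() with heteroscedasticity-corrected tests
model.lm <- lm(formula = x ~ g1 + g2 + g1*g2, data = dat, na.action = na.omit)
Anova(model.lm, type = 'II', white.adjust = 'hc3')  # Type II tests using the HC3 robust covariance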
You don't lose much power with the robust method (heteroscedasticity-consistent covariance matrices), so if in doubt go robust.

Upon reading this answer, a light went on in my mind and I was enlightened.

#1 Model assumptions cannot be checked by hypothesis testing; they must be checked by other methods (such as analysing residuals).
#2 One must use an appropriate sample size so that hypotheses can be tested without distorting the type I and type II errors.
#3 This person repeats what that prestigious statistician had already underlined, i.e. preliminary testing of model assumptions leads to unreliable type I errors. (Oops, so is what is taught in most statistics books and courses incorrect or misleading?)

Please disregard the R-specific parts in the reply. They are irrelevant.
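
Since the R code itself is beside the point, here is the same idea as a small NumPy sketch. It is an illustration under my own assumptions, not the internals of car's Anova(): it fits ordinary least squares and computes HC3 heteroscedasticity-consistent standard errors, in which each squared residual is divided by (1 - h_i)^2, with h_i the leverage of observation i. The function name and the simulated two-group data (unequal group sizes and unequal variances) are hypothetical.

```python
import numpy as np


def ols_with_hc3(X, y):
    """Fit OLS and return coefficients, classical SEs and HC3 robust SEs."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape

    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta

    # Classical standard errors assume one common error variance.
    sigma2 = resid @ resid / (n - p)
    se_classical = np.sqrt(np.diag(sigma2 * xtx_inv))

    # Leverage h_i = x_i' (X'X)^-1 x_i for each observation.
    leverage = np.einsum("ij,jk,ik->i", X, xtx_inv, X)

    # HC3 sandwich: (X'X)^-1 X' diag(e_i^2 / (1 - h_i)^2) X (X'X)^-1.
    weights = (resid / (1.0 - leverage)) ** 2
    meat = X.T @ (weights[:, None] * X)
    se_hc3 = np.sqrt(np.diag(xtx_inv @ meat @ xtx_inv))

    return beta, se_classical, se_hc3


# Made-up data: 40 quiet observations in group 0, 10 noisy ones in group 1.
rng = np.random.default_rng(1)
g = np.r_[np.zeros(40), np.ones(10)]
noise_sd = np.where(g == 1, 4.0, 1.0)
y = 1.0 + 0.5 * g + rng.normal(0.0, noise_sd)
X = np.column_stack([np.ones_like(g), g])  # intercept and group indicator

beta, se_classical, se_hc3 = ols_with_hc3(X, y)
print("coefficients :", np.round(beta, 3))
print("classical SE :", np.round(se_classical, 3))
print("HC3 SE       :", np.round(se_hc3, 3))
```

When the smaller group is also the noisier one, as in this made-up data, the classical standard error for the group difference tends to be too small, and the HC3 version comes out larger.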

Once upon a time I came across the following stories on the internet about Unix programming and becoming a Unix hacker:
http://www.catb.org/~esr/writings/unix-koans/ and http://catb.org/esr/faqs/loginataka.html

The stories presuppose some knowledge of Unix, but they are great fun to read anyway.

Finally, the Way to Wizardhood is long, and winding, and Fraught with Risks.

Wednesday, December 30, 2015
