Thursday, December 31, 2015

Three types of people

Current research has shown that :)
there are three types of people:
i) those who do not know statistics and who are irrelevant
ii) those who do know that they do not know statistics
iii) those who do not know that they do not know statistics
The rest are outliers!

Wednesday, December 30, 2015

Of Anova and residuals

Today I am going to tell you a story. I am not good at presenting material, but if you keep reading, you might be enlightened. Please also take a look at the links given toward the end of this post to better understand.

I saw the following question posted to an email group some time ago:

I have data involving 3 groups of 25 people each. A different medication is given to each group and some blood values are recorded.
Some values are measured 4 times before and after the medication, and some are measured only before and after the medication.
I thought of using a split-plot (mixed design) ANOVA. My first question is: am I using the right test?

Secondly, from different resources on the internet I have read different statements of the assumptions of ANOVA, especially about normality. Some say that the residuals must be normal (which I suppose is more correct), some say that the DV must be normal for the whole sample or for each category. Which is correct? Some say that ANOVA is robust against some degree of non-normality.

Thirdly, about the other assumptions or tests applied prior to ANOVA (Box's M and Mauchly's test): if Box's test is significant, what should I do?
I suppose that if Mauchly's test is significant we can use the Greenhouse-Geisser or other corrections.

Fourthly, in my data one repeated-measures variable is not normally distributed across the whole sample, and both Box's and Mauchly's tests are significant. Would this mean I cannot rely on the results of the ANOVA?

I applied an ln transformation, which made the data normal, and ran the ANOVA again, but the Box's and Mauchly's results did not change. So what should I do?
Thanks


A prestigious statistician replied:


It is the residuals that must be Gaussian. I don't favor pre-testing because it alters the type I and type II error of the final test.

The older ANOVA methods have to some extent been replaced by newer methods such as mixed-effects linear models and generalized least squares. The latter is easier to understand than the former. A typical default correlation pattern is continuous-time AR(1), i.e., an exponential decline in correlation as time points get farther apart. A case study may be seen in the handouts at http://xxx.xxx.xxx. These methods assume Gaussian residuals, though.

Upon reading this answer, I made my way toward mixed models. Note that I have not named the statistician, for privacy reasons.
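For instance, the kind of generalized least squares fit the reply points to might look roughly like the following in R with the nlme package. This is only my own sketch: the data frame and variables (bloodwork, dv, group, time, subject) are hypothetical stand-ins for the questioner's data.

library(nlme)

# Continuous-time AR(1) correlation: correlation between repeated measurements
# on the same subject decays exponentially as the time points get farther apart.
fit <- gls(dv ~ group * time,
           data = bloodwork,
           correlation = corCAR1(form = ~ time | subject),
           na.action = na.omit)
summary(fit)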

Another excerpt, from an internet source I came across some time ago:


http://stackoverflow.com/questions/2933253/homoscedascity-test-for-two-way-anova

The answer to the question:
Hypothesis testing is the wrong tool to use to assess the validity of model assumptions. If the sample size is small, you have no power to detect any variance differences, even if the variance differences are large. If you have a large sample size, you have the power to detect even the most trivial deviations from equal variance, so you will almost always reject the null. Simulation studies have shown that preliminary testing of model assumptions leads to unreliable type I errors.
Looking at the residuals across all cells is a good indicator, or if your data are normal, you can use the AIC or BIC with/without equal variances as a selection procedure.
If you think there are unequal variances, drop the assumption with something like:
library(car)
model.lm <- lm(formula = x ~ g1 + g2 + g1*g2, data = dat, na.action = na.omit)
Anova(model.lm,type='II',white.adjust='hc3')
You don't lose much power with the robust method (heteroscedasticity-consistent covariance matrices), so if in doubt, go robust.

Upon reading this answer, some lights flashed in my mind and I was enlightened.

#1 Model assumptions cannot be checked by hypothesis testing; rather, they must be checked by other methods, such as analysing the residuals (see the sketch below).
#2 One must use the correct sample size to test a hypothesis without altering the type I & II errors.
#3 As previously underlined by that prestigious statistician, this person repeats a similar point, i.e. preliminary testing of model assumptions leads to unreliable type I errors. (Oops, so is what is taught in most statistics books/courses incorrect or misleading?)

Please disregard the R-specific parts of the reply if you do not use R; they are not essential to the point.
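For those who do use R, here is roughly what checking the residuals graphically, instead of pre-testing, might look like. This is a minimal sketch of mine on made-up data:

set.seed(1)
dat <- data.frame(group = gl(3, 25, labels = c("A", "B", "C")),
                  value = rnorm(75, mean = rep(c(10, 12, 11), each = 25)))
fit <- aov(value ~ group, data = dat)

par(mfrow = c(1, 2))
# A roughly straight QQ line suggests roughly Gaussian residuals
qqnorm(residuals(fit)); qqline(residuals(fit))
# Roughly constant vertical spread suggests homogeneous variances
plot(fitted(fit), residuals(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)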

Once upon a time I came across the following stories on the internet about Unix programming and becoming a Unix hacker:
http://www.catb.org/~esr/writings/unix-koans/ and  http://catb.org/esr/faqs/loginataka.html

The stories presuppose some knowledge of Unix, but they are great fun to read anyway.

Finally, the Way to Wizardhood is long, and winding, and Fraught with Risks.

Friday, November 27, 2015

Coefplot in Stata to represent coefficients of a model

Sometimes you need to display the coefficients or effect sizes you get from a model in a graphic. Odds ratios, relative risks, and IRRs are of that kind.
According to this file, Stata has some commands to do this: marginsplot and coefplot are two examples. The document I referred to explains them, with a special emphasis on coefplot.
After you fit a model, with a syntax like
coefplot, coeflabels(var1="label1" var2="label2") eform drop(_cons) xline(1) xtitle(Odds Ratios and Confidence Intervals)
you can display odds ratios and their confidence intervals obtained from a logistic regression model.
Here the eform option displays odds ratios; otherwise plain coefficients are shown.



Thursday, August 13, 2015

Installing Rmpi on Ubuntu

Rmpi is an R package allowing the use of MPI from R, as understood from its name. To use it, I first installed the r-cran-rmpi package, which has dependencies like openmpi-common and openmpi-bin but nothing like openmpi-dev.
On an Ubuntu machine I then tried to install it in R (version 3.2.1) with the install.packages("Rmpi") command. It gave me an error like "configure: error: Cannot find mpi.h header file", which seemed to be related to the absence of a dev package.
After a little searching I found out that the Ubuntu package libopenmpi-dev needed to be installed too. After installing libopenmpi-dev, Rmpi installed correctly.
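A sketch of the sequence described above; the spawn test at the end is only an optional sanity check of my own, not something required by the installation.

# In a shell first (provides mpi.h):
#   sudo apt-get install libopenmpi-dev
# Then, in R:
install.packages("Rmpi")          # should now compile against the MPI headers
library(Rmpi)
mpi.spawn.Rslaves(nslaves = 2)    # optional check: start two worker processes
mpi.remote.exec(Sys.getpid())     # each worker reports its process id
mpi.close.Rslaves()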

Wednesday, August 12, 2015

Using dopar command with foreach in R

Recently I was trying to run some R code written by a programmer about parallel programming in R. The code is here. I used the doParallel package (doMC works too) to make a cluster and installed each necessary package, like caret, e1071, etc., into my newly installed R 3.2.1 on Linux.
Sequential execution was working, but when it came to parallel execution I got an error like "Error in evaluation() : task 1 failed - "could not find function "knn3"". I understood that this happens because the workers in parallel execution do not know about the knn3 function.
After some research I found the .packages option of foreach in the foreach package's vignettes. After adding a character vector of the necessary package names to foreach's .packages option, it worked.
Here is the working code:
pr <- foreach(i = 1:runs, .combine = rbind, .packages = c("caret", "MASS", "nnet", "klaR", "e1071", "rpart")) %dopar% evaluation()
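For anyone who wants to reproduce the idea without the original programmer's script, here is a small self-contained sketch of my own. The evaluation() body below is just a toy stand-in that calls knn3 on iris, so that the .packages = "caret" argument is actually needed by the workers:

library(doParallel)
library(foreach)

cl <- makeCluster(2)              # two worker processes
registerDoParallel(cl)

evaluation <- function() {
  # toy stand-in: fit knn3 on iris and return the training accuracy
  fit <- knn3(Species ~ ., data = iris, k = 5)
  mean(predict(fit, iris, type = "class") == iris$Species)
}

runs <- 4
pr <- foreach(i = 1:runs, .combine = rbind,
              .packages = "caret") %dopar% evaluation()

stopCluster(cl)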

Friday, June 26, 2015

Bagging is a poor man's Bayes!

This is from a talk by Trevor Hastie. Bagging, or bootstrap aggregation, refits a noisy fitted function to many bootstrap samples and averages the results to reduce its variance.
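As a toy illustration of that sentence (my own sketch, not something from the talk), one can refit a single regression tree, a notoriously high-variance learner, to many bootstrap samples and average the predictions:

library(rpart)

set.seed(1)
n <- 200
dat <- data.frame(x = runif(n))
dat$y <- sin(4 * dat$x) + rnorm(n, sd = 0.3)

bag_predict <- function(newx, B = 50) {
  preds <- replicate(B, {
    boot <- dat[sample(n, replace = TRUE), ]        # bootstrap sample
    fit  <- rpart(y ~ x, data = boot)               # noisy single-tree fit
    predict(fit, newdata = data.frame(x = newx))
  })
  rowMeans(preds)                                   # bagging: average the B fits
}

yhat <- bag_predict(seq(0, 1, length.out = 100))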

Wednesday, June 10, 2015

Installing Reshape2 package into R on Kubuntu

I was trying to install the reshape2 package into R 3.1.2 on Kubuntu. I got error messages like
  shared object ‘stringi.so’ not found
and tried to install the "stringr" package. I found that I supposedly needed to install the "devtools" package and, after that, install stringr from GitHub using that package's install_github command.
I tried installing devtools, but it gave a lot of errors about dependencies on

libssl-dev,
libcurl4-openssl-dev
libxml2-dev
I installed these dependencies by hand; curl and the other devtools dependencies like git2r then installed, except for a few such as stringr.
After reading some web pages, I realised that I needed to install the "stringi" package.
In the end I succeeded in installing stringr and its dependents (roxygen2, httr) and finally devtools!

Anyway, you don't need to install devtools when you need reshape2; it suffices to install the stringi package before installing reshape2. I hadn't realised before that there is a package named "stringi"; I had thought I needed the "stringr" package to satisfy the stringi.so requirement.
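In other words, the short path (as far as I can tell with current CRAN versions) is just:

install.packages("stringi")    # provides the stringi.so shared object that stringr needs
install.packages("reshape2")   # should now install cleanly, pulling in stringr as well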

Sunday, January 25, 2015

Zero Inflated Models

Because of overdispersion, for many count data a negative binomial model fits better than a Poisson model. But some data also show zero inflation, which needs to be handled properly. Although negative binomial regression might fit such data reasonably well, zero-inflated models are another option.
The Vuong test can be used to compare the fit of such non-nested models.
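A minimal sketch of that comparison in R, using the pscl package and simulated data (my own toy example, not a real data set):

library(MASS)   # glm.nb
library(pscl)   # zeroinfl, vuong

set.seed(1)
dat <- data.frame(x = rnorm(200))
mu  <- exp(0.5 + 0.8 * dat$x)
dat$y <- ifelse(runif(200) < 0.3, 0, rpois(200, mu))   # inject extra zeros

m_pois <- glm(y ~ x, family = poisson, data = dat)
m_nb   <- glm.nb(y ~ x, data = dat)
m_zip  <- zeroinfl(y ~ x | 1, data = dat)              # zero-inflated Poisson

vuong(m_pois, m_zip)   # compares the Poisson fit against the zero-inflated fit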