Sunday, February 10, 2008

Finding a p-value by simulation from a sq. dist vs. chi^2

##Plot a qq-plot of the sq Mahalanobis distance
##vs. the corresponding chi-sq quantiles.
##Return a p-value for the sq. corr. efficient
##between the two based on simulation.
chisq.qqplot <- function(a.matrix, nsamples=10000) {
d.sq <- mahalanobis(a.matrix, colMeans(a.matrix), cov(a.matrix))
d.sq <- sort(d.sq)
rows <- dim(a.matrix)[1]
cols <- dim(a.matrix)[2]
q <- qchisq(((1:rows) - 0.5) / rows, df=cols)
plot(q, d.sq, xlab="theoretical", ylab="sq. distance")

qq.cor <- cor(d.sq, q)
sim.cors <- numeric(nsamples)
for (i in 1:nsamples) {
x <- matrix(rnorm(cols * rows), ncol=cols)
d2 <- mahalanobis(x, colMeans(x), cov(x))
d2 <- sort(d2)
sim.cors[i] <- cor(q, d2)
}

list(d.sq = d.sq,
qq.cor=qq.cor,
qq.cor.sq=qq.cor^2,
sim.cors=sim.cors,
p.value=sum(sim.cors^2 < qq.cor^2) / nsamples)
}
I learned about a couple of R functions in writing this: mahalanobis and colMeans. I wonder what else I'm missing. (On a hunch, I tried help.search("mahalanobis").)

Also: p-values... bleh. But how would deal with this kind of thing in a Bayesian (or post-Bayesian) way?

No comments: