Monday, February 25, 2008

mcmcpvalue and contrasts

I thought Steven McKinney had a really interesting message in "mcmcpvalue and contrasts" on the R-sig-ME list today. I'll just reproduce his R session here:

> library(SASmixed)
> names(Semiconductor)
> str(Semiconductor)

'data.frame': 48 obs. of 5 variables:
$ resistance: num 5.22 5.61 6.11 6.33 6.13 6.14 5.6 5.91 5.49 4.6 ...
$ ET : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
$ Wafer : Factor w/ 3 levels "1","2","3": 1 1 1 1 2 2 2 2 3 3 ...
$ position : Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
$ Grp : Factor w/ 12 levels "1/1","1/2","1/3",..: 1 1 1 1 2 2 2 2 3 3 ...
...
> options(contrasts=c("contr.treatment", "contr.poly"))
> lm1 <- lm(resistance ~ ET * position, data=Semiconductor)
> summary(lm1)


Call:
lm(formula = resistance ~ ET * position, data = Semiconductor)

Residuals:
Min 1Q Median 3Q Max
-1.01333 -0.25750 0.04333 0.28333 0.74667

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.61333 0.26891 20.874 <2e-16 ***
ET2 0.38000 0.38030 0.999 0.325
ET3 0.52333 0.38030 1.376 0.178
ET4 0.72667 0.38030 1.911 0.065 .
position2 -0.16333 0.38030 -0.429 0.670
position3 -0.06000 0.38030 -0.158 0.876
position4 0.27333 0.38030 0.719 0.478
ET2:position2 0.35667 0.53782 0.663 0.512
ET3:position2 0.37333 0.53782 0.694 0.493
ET4:position2 0.37667 0.53782 0.700 0.489
ET2:position3 -0.16667 0.53782 -0.310 0.759
ET3:position3 -0.30333 0.53782 -0.564 0.577
ET4:position3 -0.38333 0.53782 -0.713 0.481
ET2:position4 -0.35000 0.53782 -0.651 0.520
ET3:position4 -0.31667 0.53782 -0.589 0.560
ET4:position4 -0.07333 0.53782 -0.136 0.892
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4658 on 32 degrees of freedom
Multiple R-squared: 0.4211, Adjusted R-squared: 0.1498
F-statistic: 1.552 on 15 and 32 DF, p-value: 0.1449

> options(contrasts=c("contr.helmert", "contr.poly"))
> lm2 <- lm(resistance ~ ET * position, data=Semiconductor)
> summary(lm2)


Call:
lm(formula = resistance ~ ET * position, data = Semiconductor)

Residuals:
Min 1Q Median 3Q Max
-1.01333 -0.25750 0.04333 0.28333 0.74667

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.00292 0.06723 89.292 < 2e-16 ***
ET1 0.17000 0.09507 1.788 0.08324 .
ET2 0.09722 0.05489 1.771 0.08606 .
ET3 0.10986 0.03881 2.830 0.00797 **
position1 0.05667 0.09507 0.596 0.55535
position2 -0.11000 0.05489 -2.004 0.05360 .
position3 0.03542 0.03881 0.912 0.36834
ET1:position1 0.08917 0.13446 0.663 0.51197
ET2:position1 0.03250 0.07763 0.419 0.67826
ET3:position1 0.01667 0.05489 0.304 0.76337
ET1:position2 -0.05750 0.07763 -0.741 0.46427
ET2:position2 -0.03528 0.04482 -0.787 0.43700
ET3:position2 -0.02444 0.03169 -0.771 0.44617
ET1:position3 -0.05167 0.05489 -0.941 0.35363
ET2:position3 -0.01111 0.03169 -0.351 0.72818
ET3:position3 0.01125 0.02241 0.502 0.61909
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4658 on 32 degrees of freedom
Multiple R-squared: 0.4211, Adjusted R-squared: 0.1498
F-statistic: 1.552 on 15 and 32 DF, p-value: 0.1449

Note that R^2 is 0.4211 in both cases and that the F and corresponding p-values are also the same. "This is good - the
two models are mathematically equivalent, so the overall model fit statistics should not differ." (Well, I suppose I should paraphrase that too.)
> options(contrasts=c("contr.treatment", "contr.poly"))
> lmr1 <- lm(resistance ~ ET + position, data=Semiconductor)
> anova(lm1, lmr1)

Analysis of Variance Table

Model 1: resistance ~ ET * position
Model 2: resistance ~ ET + position
Res.Df RSS Df Sum of Sq F Pr(>F)
1 32 6.9421
2 41 7.7515 -9 -0.8095 0.4146 0.9178
> options(contrasts=c("contr.helmert", "contr.poly"))
> lmr2 <- lm(resistance ~ ET + position, data=Semiconductor)
> anova(lm2, lmr2)

Analysis of Variance Table

Model 1: resistance ~ ET * position
Model 2: resistance ~ ET + position
Res.Df RSS Df Sum of Sq F Pr(>F)
1 32 6.9421
2 41 7.7515 -9 -0.8095 0.4146 0.9178

Note that again the results are identical. If they were not, this would be a sign of a numerical issue. But this won't happen in R anyway because it's only in presenting the results that the differences come into play?


No longer following the e-mail: And what do the model matrices look like?
> a <- as.factor(rep(1:3, each=4))
> b <- as.factor(rep(1:2, each=6))
> y <- rnorm(12)
> options(contrasts=c("contr.treatment", "contr.poly"))
> model.matrix(y ~ a*b)

(Intercept) a2 a3 b2 a2:b2 a3:b2
1 1 0 0 0 0 0
2 1 0 0 0 0 0
3 1 0 0 0 0 0
4 1 0 0 0 0 0
5 1 1 0 0 0 0
6 1 1 0 0 0 0
7 1 1 0 1 1 0
8 1 1 0 1 1 0
9 1 0 1 1 0 1
10 1 0 1 1 0 1
11 1 0 1 1 0 1
12 1 0 1 1 0 1
attr(,"assign")
[1] 0 1 1 2 3 3
attr(,"contrasts")
attr(,"contrasts")$a
[1] "contr.treatment"

attr(,"contrasts")$b
[1] "contr.treatment"

> options(contrasts=c("contr.helmert", "contr.poly"))
> model.matrix(y ~ a*b)

(Intercept) a1 a2 b1 a1:b1 a2:b1
1 1 -1 -1 -1 1 1
2 1 -1 -1 -1 1 1
3 1 -1 -1 -1 1 1
4 1 -1 -1 -1 1 1
5 1 1 -1 -1 -1 1
6 1 1 -1 -1 -1 1
7 1 1 -1 1 1 -1
8 1 1 -1 1 1 -1
9 1 0 2 1 0 2
10 1 0 2 1 0 2
11 1 0 2 1 0 2
12 1 0 2 1 0 2
attr(,"assign")
[1] 0 1 1 2 3 3
attr(,"contrasts")
attr(,"contrasts")$a
[1] "contr.helmert"

attr(,"contrasts")$b
[1] "contr.helmert"

The help page for options reminds us that first element in the list for "contrasts" gives the function to be used with unordered factors while the second gives the function to use with ordered ones.
> b <- as.ordered(b)
> model.matrix(y ~ a*b)

(Intercept) a1 a2 b.L a1:b.L a2:b.L
1 1 -1 -1 -0.7071068 0.7071068 0.7071068
2 1 -1 -1 -0.7071068 0.7071068 0.7071068
3 1 -1 -1 -0.7071068 0.7071068 0.7071068
4 1 -1 -1 -0.7071068 0.7071068 0.7071068
5 1 1 -1 -0.7071068 -0.7071068 0.7071068
6 1 1 -1 -0.7071068 -0.7071068 0.7071068
7 1 1 -1 0.7071068 0.7071068 -0.7071068
8 1 1 -1 0.7071068 0.7071068 -0.7071068
9 1 0 2 0.7071068 0.0000000 1.4142136
10 1 0 2 0.7071068 0.0000000 1.4142136
11 1 0 2 0.7071068 0.0000000 1.4142136
12 1 0 2 0.7071068 0.0000000 1.4142136
attr(,"assign")
[1] 0 1 1 2 3 3
attr(,"contrasts")
attr(,"contrasts")$a
[1] "contr.helmert"

attr(,"contrasts")$b
[1] "contr.poly"


//Also, something to remember: options(contrasts=c("contr.SAS", "contr.poly")) sets the system to use contrasts in the style of SAS.

atanh

I was a little startled to find today that atanh's finite values seem to be bounded roughly between -18.71497 and 18.71497. These correspond to inputs of -0.999999999999999944444... and 0.99999999999999994444444... Stepping just a bit beyond these (changing any of those 4s to a 5) we get back an infinite value. I suppose I should work out analytically what's going on. Or just look at this:

> q <- function(x) (1+x) / (1 -x)
> q(0.99999999999999995)
[1] Inf
> q(0.99999999999999994)
[1] 1.80144e+16


The 9899 standard just says, "The atanh functions compute the arc hyperbolic tangent of x. A domain error occurs
for arguments not in the interval [-1, +1]. A range error may occur if the argument equals −1 or +1." Where should I look for more details?

Using R's pnorm I see there's roughly probability 1.869160e-78 beyond these points in the tails of a standard Gaussian distribution (and so the probability of seeing the points beyond them is indistinguishable from zero for any variances much less than one) so that this is of little importance for Fisher's z transform 0.5 atanh(r). Still it's a surprise to me not to see a smooth transition.

> atanh(tanh(18.71497))
[1] 18.71497
> atanh(tanh(18.71498))
[1] Inf

Sunday, February 10, 2008

correlation on subsets

I don't think anyone's subscribing to this so I'll just make this a new post. (The path to freedom lies in attaining obscurity!)
##Find the correlation between each pair of columns
##in d != base conditional on base being in the
##corresponding subset.
##
##d -- 3 col data.frame or matrix
##subset -- list of vectors
##base -- column to use in conditioning
cor.on.subsets <- function(d, subsets, base=2) {
stopifnot(dim(d)[2] == 3)
stopifnot(base %in% 1:3)
x <- setdiff(1:3, base)
n <- length(subsets)
r <- numeric(n)
for (i in 1:n) {
d1 <- d[d[, base] %in% subsets[[i]], ]
r[i] <- cor(d1[,x[1]], d1[, x[2]])
}
r
}

Would it be more stylish to use subset rather than direct row selection in [d[, base] %in% subsets[[i]], ]?

Finding a p-value by simulation from a sq. dist vs. chi^2

##Plot a qq-plot of the sq Mahalanobis distance
##vs. the corresponding chi-sq quantiles.
##Return a p-value for the sq. corr. efficient
##between the two based on simulation.
chisq.qqplot <- function(a.matrix, nsamples=10000) {
d.sq <- mahalanobis(a.matrix, colMeans(a.matrix), cov(a.matrix))
d.sq <- sort(d.sq)
rows <- dim(a.matrix)[1]
cols <- dim(a.matrix)[2]
q <- qchisq(((1:rows) - 0.5) / rows, df=cols)
plot(q, d.sq, xlab="theoretical", ylab="sq. distance")

qq.cor <- cor(d.sq, q)
sim.cors <- numeric(nsamples)
for (i in 1:nsamples) {
x <- matrix(rnorm(cols * rows), ncol=cols)
d2 <- mahalanobis(x, colMeans(x), cov(x))
d2 <- sort(d2)
sim.cors[i] <- cor(q, d2)
}

list(d.sq = d.sq,
qq.cor=qq.cor,
qq.cor.sq=qq.cor^2,
sim.cors=sim.cors,
p.value=sum(sim.cors^2 < qq.cor^2) / nsamples)
}
I learned about a couple of R functions in writing this: mahalanobis and colMeans. I wonder what else I'm missing. (On a hunch, I tried help.search("mahalanobis").)

Also: p-values... bleh. But how would deal with this kind of thing in a Bayesian (or post-Bayesian) way?

Saturday, February 9, 2008

printing a matrix in R

Okay. I know the code to do this exists -- latex in Hmisc. I like the output from this function better though at least for my present purpose.
printMatrix <- function(m, r=2) {
cat("\\left(\\begin{array}{",
rep("c", nrow(m)), "}\n", sep="")
cat(apply(m, 1,
function(row) paste(round(row, r),
collapse=" & ")),
sep=" \\\\\n");
cat("\\end{array}\\right)\n");
}

foo <- function(s3, r=2) {
s <- matrix(c(s3[1], s3[2], s3[2], s3[3]), ncol=2);
cat("$\n")
printMatrix(s, r=r)
cat("$\n\n");
cat("cor: $", cor(s)[1,2], "$\n\n", sep="");
cat("generalized variance: $", det(s), "$\n\n", sep="");
cat("total variance: $", sum(diag(s)), "$\n", sep="");
}

Saturday, February 2, 2008

Printing a matrix in Maxima, again

My adventure continues:
printMatrix(a) := block([as, body],
as : matrixmap(lambda([x],
printf(false, "~a", x)), a),
body: simplode(maplist(lambda([row],
simplode(row, " & ")),
as),
sconc(" ", "\\\\", newline)),
sconc("\\left(\\begin{array}{",
simplode(makelist("r", i, 1,
length(a[1]))), "}",
newline,
body,
newline,
"\\end{array}\\right)"))$
There's just got to be a function in Maxima somewhere that already does this. (There's also just got to be a function that creates a list based on repeating an element. makelist(x, i, 1, n) does the job though.)

(%i42) printMatrix(diag_matrix(1,2,3));
(%o42) \left(\begin{array}{rrr}
1 & 0 & 0 \\
0 & 2 & 0 \\
0 & 0 & 3
\end{array}\right)


Also in the spirit of duplicating functionality that's just gotta be out there: Escape HTML characters. I use this pretty often. Maybe I should make it spiffier.