## Monday, February 25, 2008

### mcmcpvalue and contrasts

I thought Steven McKinney had a really interesting message in "mcmcpvalue and contrasts" on the R-sig-ME list today. I'll just reproduce his R session here:

> library(SASmixed)> names(Semiconductor)> str(Semiconductor)'data.frame': 48 obs. of  5 variables: $resistance: num 5.22 5.61 6.11 6.33 6.13 6.14 5.6 5.91 5.49 4.6 ...$ ET        : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ... $Wafer : Factor w/ 3 levels "1","2","3": 1 1 1 1 2 2 2 2 3 3 ...$ position  : Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ... $Grp : Factor w/ 12 levels "1/1","1/2","1/3",..: 1 1 1 1 2 2 2 2 3 3 ......> options(contrasts=c("contr.treatment", "contr.poly"))> lm1 <- lm(resistance ~ ET * position, data=Semiconductor)> summary(lm1)Call:lm(formula = resistance ~ ET * position, data = Semiconductor)Residuals: Min 1Q Median 3Q Max -1.01333 -0.25750 0.04333 0.28333 0.74667 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 5.61333 0.26891 20.874 <2e-16 ***ET2 0.38000 0.38030 0.999 0.325 ET3 0.52333 0.38030 1.376 0.178 ET4 0.72667 0.38030 1.911 0.065 . position2 -0.16333 0.38030 -0.429 0.670 position3 -0.06000 0.38030 -0.158 0.876 position4 0.27333 0.38030 0.719 0.478 ET2:position2 0.35667 0.53782 0.663 0.512 ET3:position2 0.37333 0.53782 0.694 0.493 ET4:position2 0.37667 0.53782 0.700 0.489 ET2:position3 -0.16667 0.53782 -0.310 0.759 ET3:position3 -0.30333 0.53782 -0.564 0.577 ET4:position3 -0.38333 0.53782 -0.713 0.481 ET2:position4 -0.35000 0.53782 -0.651 0.520 ET3:position4 -0.31667 0.53782 -0.589 0.560 ET4:position4 -0.07333 0.53782 -0.136 0.892 ---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.4658 on 32 degrees of freedomMultiple R-squared: 0.4211, Adjusted R-squared: 0.1498 F-statistic: 1.552 on 15 and 32 DF, p-value: 0.1449 > options(contrasts=c("contr.helmert", "contr.poly"))> lm2 <- lm(resistance ~ ET * position, data=Semiconductor)> summary(lm2)Call:lm(formula = resistance ~ ET * position, data = Semiconductor)Residuals: Min 1Q Median 3Q Max -1.01333 -0.25750 0.04333 0.28333 0.74667 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.00292 0.06723 89.292 < 2e-16 ***ET1 0.17000 0.09507 1.788 0.08324 . ET2 0.09722 0.05489 1.771 0.08606 . ET3 0.10986 0.03881 2.830 0.00797 ** position1 0.05667 0.09507 0.596 0.55535 position2 -0.11000 0.05489 -2.004 0.05360 . position3 0.03542 0.03881 0.912 0.36834 ET1:position1 0.08917 0.13446 0.663 0.51197 ET2:position1 0.03250 0.07763 0.419 0.67826 ET3:position1 0.01667 0.05489 0.304 0.76337 ET1:position2 -0.05750 0.07763 -0.741 0.46427 ET2:position2 -0.03528 0.04482 -0.787 0.43700 ET3:position2 -0.02444 0.03169 -0.771 0.44617 ET1:position3 -0.05167 0.05489 -0.941 0.35363 ET2:position3 -0.01111 0.03169 -0.351 0.72818 ET3:position3 0.01125 0.02241 0.502 0.61909 ---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.4658 on 32 degrees of freedomMultiple R-squared: 0.4211, Adjusted R-squared: 0.1498 F-statistic: 1.552 on 15 and 32 DF, p-value: 0.1449 Note that R^2 is 0.4211 in both cases and that the F and corresponding p-values are also the same. "This is good - the two models are mathematically equivalent, so the overall model fit statistics should not differ." (Well, I suppose I should paraphrase that too.) > options(contrasts=c("contr.treatment", "contr.poly"))> lmr1 <- lm(resistance ~ ET + position, data=Semiconductor)> anova(lm1, lmr1)Analysis of Variance TableModel 1: resistance ~ ET * positionModel 2: resistance ~ ET + position Res.Df RSS Df Sum of Sq F Pr(>F)1 32 6.9421 2 41 7.7515 -9 -0.8095 0.4146 0.9178> options(contrasts=c("contr.helmert", "contr.poly"))> lmr2 <- lm(resistance ~ ET + position, data=Semiconductor)> anova(lm2, lmr2)Analysis of Variance TableModel 1: resistance ~ ET * positionModel 2: resistance ~ ET + position Res.Df RSS Df Sum of Sq F Pr(>F)1 32 6.9421 2 41 7.7515 -9 -0.8095 0.4146 0.9178 Note that again the results are identical. If they were not, this would be a sign of a numerical issue. But this won't happen in R anyway because it's only in presenting the results that the differences come into play? No longer following the e-mail: And what do the model matrices look like? > a <- as.factor(rep(1:3, each=4))> b <- as.factor(rep(1:2, each=6))> y <- rnorm(12)> options(contrasts=c("contr.treatment", "contr.poly"))> model.matrix(y ~ a*b) (Intercept) a2 a3 b2 a2:b2 a3:b21 1 0 0 0 0 02 1 0 0 0 0 03 1 0 0 0 0 04 1 0 0 0 0 05 1 1 0 0 0 06 1 1 0 0 0 07 1 1 0 1 1 08 1 1 0 1 1 09 1 0 1 1 0 110 1 0 1 1 0 111 1 0 1 1 0 112 1 0 1 1 0 1attr(,"assign")[1] 0 1 1 2 3 3attr(,"contrasts")attr(,"contrasts")$a[1] "contr.treatment"attr(,"contrasts")$b[1] "contr.treatment"> options(contrasts=c("contr.helmert", "contr.poly"))> model.matrix(y ~ a*b) (Intercept) a1 a2 b1 a1:b1 a2:b11 1 -1 -1 -1 1 12 1 -1 -1 -1 1 13 1 -1 -1 -1 1 14 1 -1 -1 -1 1 15 1 1 -1 -1 -1 16 1 1 -1 -1 -1 17 1 1 -1 1 1 -18 1 1 -1 1 1 -19 1 0 2 1 0 210 1 0 2 1 0 211 1 0 2 1 0 212 1 0 2 1 0 2attr(,"assign")[1] 0 1 1 2 3 3attr(,"contrasts")attr(,"contrasts")$a[1] "contr.helmert"attr(,"contrasts")$b[1] "contr.helmert" The help page for options reminds us that first element in the list for "contrasts" gives the function to be used with unordered factors while the second gives the function to use with ordered ones. > b <- as.ordered(b)> model.matrix(y ~ a*b) (Intercept) a1 a2 b.L a1:b.L a2:b.L1 1 -1 -1 -0.7071068 0.7071068 0.70710682 1 -1 -1 -0.7071068 0.7071068 0.70710683 1 -1 -1 -0.7071068 0.7071068 0.70710684 1 -1 -1 -0.7071068 0.7071068 0.70710685 1 1 -1 -0.7071068 -0.7071068 0.70710686 1 1 -1 -0.7071068 -0.7071068 0.70710687 1 1 -1 0.7071068 0.7071068 -0.70710688 1 1 -1 0.7071068 0.7071068 -0.70710689 1 0 2 0.7071068 0.0000000 1.414213610 1 0 2 0.7071068 0.0000000 1.414213611 1 0 2 0.7071068 0.0000000 1.414213612 1 0 2 0.7071068 0.0000000 1.4142136attr(,"assign")[1] 0 1 1 2 3 3attr(,"contrasts")attr(,"contrasts")$a[1] "contr.helmert"attr(,"contrasts")$b[1] "contr.poly" //Also, something to remember: options(contrasts=c("contr.SAS", "contr.poly")) sets the system to use contrasts in the style of SAS. ### atanh I was a little startled to find today that atanh's finite values seem to be bounded roughly between -18.71497 and 18.71497. These correspond to inputs of -0.999999999999999944444... and 0.99999999999999994444444... Stepping just a bit beyond these (changing any of those 4s to a 5) we get back an infinite value. I suppose I should work out analytically what's going on. Or just look at this: > q <- function(x) (1+x) / (1 -x) > q(0.99999999999999995) [1] Inf > q(0.99999999999999994) [1] 1.80144e+16 The 9899 standard just says, "The atanh functions compute the arc hyperbolic tangent of x. A domain error occurs for arguments not in the interval [-1, +1]. A range error may occur if the argument equals −1 or +1." Where should I look for more details? Using R's pnorm I see there's roughly probability 1.869160e-78 beyond these points in the tails of a standard Gaussian distribution (and so the probability of seeing the points beyond them is indistinguishable from zero for any variances much less than one) so that this is of little importance for Fisher's z transform 0.5 atanh(r). Still it's a surprise to me not to see a smooth transition. > atanh(tanh(18.71497)) [1] 18.71497 > atanh(tanh(18.71498)) [1] Inf ## Sunday, February 10, 2008 ### correlation on subsets I don't think anyone's subscribing to this so I'll just make this a new post. (The path to freedom lies in attaining obscurity!) ##Find the correlation between each pair of columns##in d != base conditional on base being in the##corresponding subset.####d -- 3 col data.frame or matrix##subset -- list of vectors##base -- column to use in conditioningcor.on.subsets <- function(d, subsets, base=2) { stopifnot(dim(d)[2] == 3) stopifnot(base %in% 1:3) x <- setdiff(1:3, base) n <- length(subsets) r <- numeric(n) for (i in 1:n) { d1 <- d[d[, base] %in% subsets[[i]], ] r[i] <- cor(d1[,x[1]], d1[, x[2]]) } r} Would it be more stylish to use subset rather than direct row selection in [d[, base] %in% subsets[[i]], ]? ### Finding a p-value by simulation from a sq. dist vs. chi^2 ##Plot a qq-plot of the sq Mahalanobis distance##vs. the corresponding chi-sq quantiles.##Return a p-value for the sq. corr. efficient##between the two based on simulation.chisq.qqplot <- function(a.matrix, nsamples=10000) { d.sq <- mahalanobis(a.matrix, colMeans(a.matrix), cov(a.matrix)) d.sq <- sort(d.sq) rows <- dim(a.matrix)[1] cols <- dim(a.matrix)[2] q <- qchisq(((1:rows) - 0.5) / rows, df=cols) plot(q, d.sq, xlab="theoretical", ylab="sq. distance") qq.cor <- cor(d.sq, q) sim.cors <- numeric(nsamples) for (i in 1:nsamples) { x <- matrix(rnorm(cols * rows), ncol=cols) d2 <- mahalanobis(x, colMeans(x), cov(x)) d2 <- sort(d2) sim.cors[i] <- cor(q, d2) } list(d.sq = d.sq, qq.cor=qq.cor, qq.cor.sq=qq.cor^2, sim.cors=sim.cors, p.value=sum(sim.cors^2 < qq.cor^2) / nsamples)} I learned about a couple of R functions in writing this: mahalanobis and colMeans. I wonder what else I'm missing. (On a hunch, I tried help.search("mahalanobis").) Also: p-values... bleh. But how would deal with this kind of thing in a Bayesian (or post-Bayesian) way? ## Saturday, February 9, 2008 ### printing a matrix in R Okay. I know the code to do this exists -- latex in Hmisc. I like the output from this function better though at least for my present purpose. printMatrix <- function(m, r=2) { cat("\\left(\\begin{array}{", rep("c", nrow(m)), "}\n", sep="") cat(apply(m, 1, function(row) paste(round(row, r), collapse=" & ")), sep=" \\\\\n"); cat("\\end{array}\\right)\n");}foo <- function(s3, r=2) { s <- matrix(c(s3[1], s3[2], s3[2], s3[3]), ncol=2); cat("$\n")   printMatrix(s, r=r)   cat("$\n\n"); cat("cor:$", cor(s)[1,2], "$\n\n", sep=""); cat("generalized variance:$", det(s), "$\n\n", sep=""); cat("total variance:$", sum(diag(s)), "$\n", sep="");} ## Saturday, February 2, 2008 ### Printing a matrix in Maxima, again My adventure continues: printMatrix(a) := block([as, body], as : matrixmap(lambda([x], printf(false, "~a", x)), a), body: simplode(maplist(lambda([row], simplode(row, " & ")), as), sconc(" ", "\\\\", newline)), sconc("\\left(\\begin{array}{", simplode(makelist("r", i, 1, length(a[1]))), "}", newline, body, newline, "\\end{array}\\right)"))$
There's just got to be a function in Maxima somewhere that already does this. (There's also just got to be a function that creates a list based on repeating an element. makelist(x, i, 1, n) does the job though.)

(%i42) printMatrix(diag_matrix(1,2,3));
(%o42) \left(\begin{array}{rrr}
1 & 0 & 0 \\
0 & 2 & 0 \\
0 & 0 & 3
\end{array}\right)

Also in the spirit of duplicating functionality that's just gotta be out there: Escape HTML characters. I use this pretty often. Maybe I should make it spiffier.