Saturday, June 21, 2008

static field shared between instances of generic

The documentation says that one implementation is shared. I guess it makes sense that this translates into one copy of the class variables being shared as well.
public class GenericStatic<T> {
    static String o = "";

    String oo() {
        return o += "o";
    }

    public static void main(String[] args) {
        GenericStatic<Object> a = new GenericStatic<Object>();
        GenericStatic<String> b = new GenericStatic<String>();
        System.out.println(a.oo());
        System.out.println(b.oo());
    }
}
$ javac GenericStatic.java
$ java GenericStatic
o
oo

Wednesday, June 18, 2008

R: keeping a 1-d subset of a matrix a matrix

By default, if a subsetting operation yields a one-dimensional matrix (a row or column vector), the dimensions are dropped.
> (m <- matrix(1:4, nrow=2))
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> (v <- m[1, 1:2])
[1] 1 3
This can cause problems when we want to go ahead and continue using the result in matrix operations:
> v %*% m %*% t(v)
Error in v %*% m %*% t(v) : non-conformable arguments


The solution is to explicitly tell R not to drop the dimensions:
> (v <- m[1, 1:2, drop=FALSE])
     [,1] [,2]
[1,]    1    3
> v %*% m %*% t(v)
     [,1]
[1,]   52


Something else to look at: removing a column, given its name in a variable: Chris Handorf, subsetting a dataframe

Friday, June 13, 2008

printf("%02d", n)

irb(main):005:0> printf "%02d\n", 3
03
=> nil
irb(main):006:0> printf "%02d\n", 13
13
=> nil
irb(main):007:0> printf "%02d\n", 113
113
=> nil

Use printf to format numbers, filling in with zeros for one-digit numbers.
(This is Ruby, but I'm marking this as C since that's where I'm most likely to end up using it.)
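For what it's worth, the same format string works unchanged in C; a minimal sketch:

#include <stdio.h>

int main()
{
    printf("%02d\n", 3);   /* 03 */
    printf("%02d\n", 13);  /* 13 */
    printf("%02d\n", 113); /* 113: the width pads but never truncates */
    return 0;
}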

Thursday, June 12, 2008

get flat output in Maxima

I think I've written about this before, but a search didn't turn it up. This just concerns getting "flat", all-on-one-line, non-fancy ASCII output from Maxima. (This is especially useful when the output is destined to be converted into C code.)

(%i8) display2d: false$
(%i9) %e^t;

(%o9) %e^t
(%i10) display2d: true$
(%i11) %e^t;
         t
(%o11) %e

Wednesday, June 11, 2008

Speaking game using python and Mac OS X's say

#!/usr/bin/env python

import os
import random

def say(word):
    os.system("say " + word)

class Score:
    def __init__(self):
        self.total = 0
        self.correct = 0

if __name__ == '__main__':
    pairs = [["bed", "bad"], ["sped", "speed"],
             ["wary", "very"], ["bull", "ball"],
             ["mall", "mole"], ["pale", "pull"],
             ["soup", "stoop"], ["file", "fill"]]
    scores = {}
    for pair in pairs:
        scores["".join(pair)] = Score()
    input = ""

    print("To quit, enter q")
    while input != "q":
        pair = random.choice(pairs)
        word = random.choice(pair)
        print("1) " + pair[0])
        print("2) " + pair[1])
        say(word)
        input = raw_input("What did the computer say?: ")
        input = input.strip()

        if input != "q":
            key = "".join(pair)
            score = scores[key]
            score.total += 1
            if input == word:
                score.correct += 1
                print("Correct!")
            else:
                print("Wrong!")
            print("You\'ve gotten the right answer for "
                  + " vs ".join(pair) + " "
                  + str(score.correct) + " out of "
                  + str(score.total) + " times")
Here's a little game to drill you on somewhat similar sounding pairs of English words. I'm sure much could be done to improve it. Is the use of a class here kosher with Python practice? What's contrary to idiom here?

simulate from AR (and general ARIMA)

> RSiteSearch("generate arima") #~~~~~~~($$$)
> layout(matrix(1:2))
> a1 <- ar(y)
> y1 <- arima.sim(list(ar=a1$ar), 9600)
> plot(y1, type="l")
> a2 <- arima0(y, order=c(10, 0, 5))
> y2 <- arima.sim(list(ar=a2$coef[1:10], ma=a2$coef[11:15]), 9600)
> plot(y2, type="l")

Tuesday, June 10, 2008

printing an array in gdb


(gdb) p *(bestSplit->param->mu)@9
$6 = {0, 0, 0, 0, 0, 0, 0, 0, 0}
(gdb) p *(bestSplit->param->p)@(bestSplit->param->numClusters)
$8 = {0, 0, 0, 0, 0, 0, 0, 0, 0}

Just what I've been dreaming of! The syntax: "*" <array name> "@" <number of elements>. See Debugging with GDB -- Arrays

-DBL_MAX

$ cat neg_dbl_max.c
#include <float.h>
#include <stdio.h>

int main()
{
    printf("%.50g\n", -DBL_MAX);
    return 0;
}
$ gcc -Wall neg_dbl_max.c
$ ./a.out
-1.7976931348623157081452742373170435679807056752584e+308
(This is running under Mac OS X 10.4 on an Intel Mac.) It'd be cool if gcc would recognize this constant and print it instead.

Monday, June 9, 2008

substitute expressions in Maxima

From the Maxima mailing list: Say you want to substitute one expression for another, also dealing with the case in which the expression to be replaced is multiplied by some constant.
(%i1) e : a^2 + b^2 + c^2;
        2    2    2
(%o1)  c  + b  + a
(%i2) ratsubst(r^2, a^2 + b^2 + c^2, e);
        2
(%o2)  r
(%i3) e : - 3 * a^2 - 3 * b^2 - 3 * c^2;
           2      2      2
(%o3) - 3 c  - 3 b  - 3 a
(%i4) ratsubst(r^2, a^2 + b^2 + c^2, e);
           2
(%o4) - 3 r

if every vector in a list has the same length, create a matrix

matrixIfAllSameLen <- function(a.list) {
    if (length(unique(unlist(lapply(a.list, length)))) == 1) {
        n <- length(a.list[[1]])
        return(matrix(unlist(a.list), ncol=n, byrow=TRUE))
    }
    else {
        return(a.list)
    }
}

Saturday, June 7, 2008

Getting a quiet NaN in C

nan.c:
#include <stdio.h>
#include <stdlib.h>

int main()
{
    double x = strtod("NAN", 0);
    printf("%f\n", x);
    printf("%d\n", (int)(x == x));
    return 0;
}
$ gcc nan.c
$ ./a.out
nan
0

This is portable, right?

nan1.c:
#include <math.h>
#include <stdio.h>

int main()
{
    printf("%f\n", NAN);
    return 0;
}
$ gcc nan1.c
$ ./a.out
nan

Is this portable??

I'm working under Mac OS X 10.4 here. Let's see if these programs work under Red Hat Linux. Yes and no:
vincent% gcc nan1.c
nan1.c: In function ‘main’:
nan1.c:6: error: ‘NAN’ undeclared (first use in this function)
nan1.c:6: error: (Each undeclared identifier is reported only once
nan1.c:6: error: for each function it appears in.)

So, strtod seems to be one way to go. How about that nice sounding nan function?

vincent% cat nan2.c 
#include <math.h>
#include <stdio.h>

int main()
{
    double x = nan("");
    printf("%f\n", x);
    printf("%d\n", (int)(x == x));
    return 0;
}
vincent% gcc nan2.c -lm
vincent% ./a.out
0.000000
1
(This run is also under Linux. I'm not really sure what string I should pass as the argument to nan; "hi!" gives the same results as above. My guess is that since NAN and nan() are C99 additions, this older gcc is implicitly declaring nan as returning int, which would explain the garbage result; -std=c99 might tell a different story. I'll stick with strtod for now.)

NaN

Is it safe to define a NaN using 0.0/0.0? Why doesn't the C standard define a function to return a quiet NaN? (Or does it??? I don't think so.) What do some of the open source implementations of the standard library's math functions do? Plan 9 sqrt.c -- ach, where does NaN() come from?

//

Well, well, well: I should've tried apropos nan right away:

man nan

But I'm still a little unclear on the portability of these. Is it safer from a portability perspective to use strtod?

\\


An article suggesting comparing floats in terms of the signed-magnitude integers corresponding to the bit representations of the floats: Comparing floating point numbers. Cool!
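A rough C sketch of what I take the idea to be (almostEqualUlps is my name, and the remapping is my reconstruction; the article has the authoritative version):

#include <stdint.h>
#include <string.h>

/* Reinterpret each float's bits as a 32-bit integer, remap the
 * sign-magnitude encoding so integer order matches float order, and
 * call two floats equal when they're within maxUlps representable
 * values of each other.  Assumes 32-bit IEEE 754 floats; NaNs and
 * infinities deserve extra checks in real use. */
int almostEqualUlps(float a, float b, int64_t maxUlps)
{
    int32_t ia, ib;
    int64_t diff;

    memcpy(&ia, &a, sizeof ia); /* get at the bits without aliasing tricks */
    memcpy(&ib, &b, sizeof ib);

    /* Negative floats are sign-magnitude; fold them below zero. */
    if (ia < 0) ia = INT32_MIN - ia;
    if (ib < 0) ib = INT32_MIN - ib;

    diff = (int64_t)ia - (int64_t)ib;
    return (diff < 0 ? -diff : diff) <= maxUlps;
}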

proportions p such that sum(round(100 * p)) != 100

> for (i in 1:10000) { u <- runif(3); p <- u / sum(u); k <- round(100 * p); if (sum(k) != 100) { print(p); print(k); break} }
[1] 0.03470525 0.45335042 0.51194434
[1] 3 45 51
> c(0.03470525, 0.45335042, 0.51194434)
[1] 0.03470525 0.45335042 0.51194434
> p <- c(0.03470525, 0.45335042, 0.51194434)
> sum(round(100 * p))
[1] 99


> p <- c(0.035, 0.45, 0.515)
> sum(round(100 * p))
[1] 101
> sum(p)
[1] 1
> 1 - sum(p)
[1] 0

Friday, June 6, 2008

partition number according to proportions

#include <assert.h>
#include <math.h>

/**
 * On exit,
 * counts[k] = number of points assigned to kth class,
 * k = 0:(numClasses - 1) out of numPoints. Ensures that counts[k] > 0
 * for every k while also sum(counts) = numPoints.
 */
void proportionsToCounts(const double* proportions,
                         int numClasses,
                         int numPoints,
                         int* counts)
{
    int total = 0;
    int numEmpty = 0;
    int i;

    assert(numPoints >= numClasses);

    for (i = 0; i < numClasses; ++i) {
        counts[i] = (int) round(numPoints * proportions[i]);
        total += counts[i];
        if (!counts[i]) {
            ++numEmpty;
        }
    }

    if (numEmpty > 0 || total != numPoints) {
        const double remaining = numPoints - numClasses;
        int assigned = 0;
        for (i = 1; i < numClasses; ++i) {
            const int extra = (int) (remaining * proportions[i]);
            assigned += extra;
            counts[i] = 1 + extra;
        }

        counts[0] = 1 + numPoints - numClasses - assigned;
    }
}
Earlier I had implemented this using only the second loop (now used only conditionally), with the difference that I first rounded remaining * proportions[i] before truncating it to an int. However, this caused problems with too many points getting allocated, leading to a negative remainder and counts[0] getting an invalid value. Switching to simply truncating toward zero, however, meant we sometimes failed to get exact results even when they were available. E.g., since (int) (97 * 0.5) = 48, the code under the if statement above divides 100 into (26, 49, 25) given proportions (0.25, 0.5, 0.25) rather than (25, 50, 25) as one would hope. Hence, I have added the first loop and delegated the "safer" code to be called only in case of emergencies.

I'm sure there's a better way to do this, but the above is probably more than good enough for my purposes.

Wednesday, June 4, 2008

Working with dates

Our cat is supposed to receive medicine every day for 60 days and then every other day, starting April 7. Working in Python:
>>> import datetime
>>> start = datetime.date(2008, 4, 7)
>>> d = start + datetime.timedelta(days=60)
>>> d
datetime.date(2008, 6, 6)
>>> d.ctime()
'Fri Jun 6 00:00:00 2008'

Saturday, May 31, 2008

Gibbs sampler for 1d Ising model

This depends on Matsumoto's Mersenne Twister.
gibbs1d.h:
#ifndef __GIBBS1D_H__
#define __GIBBS1D_H__

#include <vector>

/**
 * Produces a sequence of variates with P(x) propto exp(beta * H(nodes))
 * as its stationary distribution, where H(nodes) = the number of consecutive
 * pairs with equal state.
 */
class Gibbs1d {

public:

    /**
     * beta -- larger values imply larger affinities for forming clusters.
     * numNodes -- number of points along the line.
     * q -- number of states for each point.
     */
    Gibbs1d(double beta, int numNodes, int q = 2);

    /**
     * Update the state of each node by Gibbs sampling.
     */
    void iterate();

    /**
     * Returns a pointer to a numNodes length array giving
     * the current state of each node.
     */
    const int* current() const;

private:
    //Hide the copy ctor and assignment operator
    Gibbs1d(const Gibbs1d&);
    Gibbs1d& operator=(const Gibbs1d&);

    const double beta;
    const int numNodes;
    const int q;
    std::vector<int> states;
    const double probNotEq1;
    const double probNotEq2;
    const double probEq;
    const double prob1Out;
};

#endif

gibbs1d.cpp:
#include <algorithm>
#include <cmath>
#include "gibbs1d.h"
#include "mtrandom.h"
using namespace std;

/**
 * beta -- larger values imply larger affinities for forming clusters.
 * numNodes -- number of points along the line.
 * q -- number of states for each point.
 */
Gibbs1d::Gibbs1d(double beta, int numNodes, int q)
    : beta(beta),
      numNodes(numNodes),
      q(q),
      states(numNodes + 2),
      probNotEq1(exp(beta) / (2.0 * exp(beta) + q - 2)),
      probNotEq2(2.0 * probNotEq1),
      probEq(exp(2.0 * beta) / (exp(2.0 * beta) + q - 1)),
      prob1Out(exp(beta) / (exp(beta) + q - 1))
{
    //Add sentinel values to the ends to avoid special casing
    //the boundaries.
    states[0] = states[numNodes + 1] = -1;
}

/**
 * Returns an integer in the interval [0, q-1]
 * != exclude, assigning equal probability to each
 * allowed value.
 * Assumes exclude is in [0, q-1].
 */
static int uniformNotEq(int q, int exclude)
{
    int r = (int) ((q - 1) * genrand_real2());
    if (r >= exclude) {
        return r + 1;
    }
    else {
        return r;
    }
}

/**
 * Returns an integer in the interval [0, q-1]
 * != exclude1 or exclude2, assigning equal probability to each
 * allowed value.
 * Assumes exclude1 and exclude2 are in [0, q-1]
 * and that exclude1 != exclude2.
 */
static int uniformNotEq(int q, int exclude1, int exclude2)
{
    if (exclude1 > exclude2) {
        swap(exclude1, exclude2);
    }

    int r = (int) ((q - 2) * genrand_real2());
    if (r >= exclude1) {
        r = r + 1;
    }

    if (r >= exclude2) {
        r = r + 1;
    }

    return r;
}

/**
 * Returns the given state with the given probability.
 * Otherwise, chooses one of the other q - 1 states
 * uniformly.
 */
static int withProbElseUniformNotEq(double prob, int state, int q)
{
    const double u = genrand_real2();
    if (u <= prob) {
        return state;
    }
    else {
        return uniformNotEq(q, state);
    }
}

void Gibbs1d::iterate()
{
    for (int i = 1; i <= numNodes; ++i) {
        const int left = states[i - 1];
        const int right = states[i + 1];

        if (left >= 0) {
            if (right >= 0) {
                if (left != right) {
                    const double u = genrand_real2();
                    if (u <= probNotEq1) {
                        states[i] = left;
                    }
                    else if (u <= probNotEq2) {
                        states[i] = right;
                    }
                    else {
                        states[i] = uniformNotEq(q, left, right);
                    }
                }
                else { // left == right
                    states[i] = withProbElseUniformNotEq(probEq, left, q);
                }
            }
            else { //right < 0
                states[i] = withProbElseUniformNotEq(prob1Out, left, q);
            }
        }
        else {
            if (right >= 0) {
                states[i] = withProbElseUniformNotEq(prob1Out, right, q);
            }
            else {
                states[i] = (int) (q * genrand_real2());
            }
        }
    }
}

/**
 * Returns a pointer to a numNodes length array giving
 * the current state of each node.
 */
const int* Gibbs1d::current() const
{
    return &states[1];
}
I think there must be a much more elegant implementation of this. For one thing, the use of "sentinel" values isn't providing much of a payoff here. And the code for computing the probabilities is just very ugly. What's a better alternative to the nested if-statements?
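One alternative, sketched here under assumptions of my own (sampleState is my name, rand() stands in for genrand_real2(), and q is assumed small): skip the precomputed probabilities entirely, weight each candidate state by exp(beta * the number of neighbors it matches), and sample by inversion. The -1 sentinels then never match anything, so the boundary cases collapse into one code path.

#include <math.h>
#include <stdlib.h>

/* Weight each state by exp(beta * #neighbors equal to it), then draw
 * from the resulting discrete distribution.  left and right may be the
 * sentinel -1, which matches no state. */
static int sampleState(double beta, int q, int left, int right)
{
    double w[256]; /* the sketch assumes q <= 256 */
    double total = 0.0;
    double u;
    int s;

    for (s = 0; s < q; ++s) {
        w[s] = exp(beta * ((s == left) + (s == right)));
        total += w[s];
    }

    u = total * (rand() / (RAND_MAX + 1.0));
    for (s = 0; s < q - 1; ++s) {
        u -= w[s];
        if (u <= 0.0) {
            break;
        }
    }
    return s;
}

The cost is a couple of exp calls per site where the table-driven version has none; caching exp(beta * m) for m = 0, 1, 2 would get that back.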

//I really hate how Blogger cuts off the edges of wide elements. I suppose I can edit the style to fix this.

Friday, May 30, 2008

%e

From the Maxima mailing list today:
(%i4) load("simplify_sum")$

(%i5) simplify_sum(sum(1/n!,n,0,inf));
(%o5) %e


Someone had noted that sum(1/n!,n,0,inf),simpsum; wasn't evaluating to %e.

Someone else pointed out
(%i6) taylor(%e^n, n, 0, 6);
                  2    3    4     5     6
                 n    n    n     n     n
(%o6)/T/ 1 + n + -- + -- + -- + --- + --- + . . .
                 2    6    24   120   720
What else does Maxima know?
(%i8) sum(k, k, 1, n), simpsum;
       2
      n  + n
(%o8) ------
         2
...
(%i10) simplify_sum(sum((-1)^(k+1) / k, k, 1, inf));
         - 2 log(2) - %gamma   %gamma
(%o10) - ------------------- - ------
                  2              2
(%i11) ratsimp(%);
(%o11) log(2)
(%i12) simplify_sum(sum(1/k^2, k, 1, inf));
          2
       %pi
(%o12) ----
        6
(%i13) simplify_sum(sum(1/k^3, k, 1, inf));
(%o13) zeta(3)
...
(%i17) simplify_sum(sum(4^(-k) * z^(2*k)/(k!)^2, k, 0, inf));
       inf
       ====    2 k
       \      z
(%o17)  >    ------
       /      k   2
       ====  4  k!
       k = 0
... No Bessel functions?
(%i19) simplify_sum(sum(1/k^10, k, 1, inf));
          10
       %pi
(%o19) -----
       93555
(%i20) simplify_sum(sum(1/k^50, k, 1, inf));
                                                   50
                 39604576419286371856998202 %pi
(%o20) ---------------------------------------------------
       285258771457546764463363635252374414183254365234375
...
(%i22) simplify_sum(sum(1/(k^2 + 1), k, 1, inf));
       %i psi (1 - %i)   %i psi (%i + 1)
             0                 0
(%o22) --------------- - ---------------
              2                 2
... What is this?

Tuesday, May 27, 2008

Problem solving

I have enjoyed looking at Chris Chatfield's book "Problem solving: a statistician's guide." A few points from his summary section "How to be an effective statistician": "Look at the data... errors are inevitable when processing a large-scale data set, and steps must be taken to deal with them." [On the other hand, "large scale" often means "collected automatically", and the less human intervention, the less opportunity to screw it up.]
"Rather than asking 'What technique shall I use here?' it may be better to ask 'How can I summarize these data and understand them?'" "If an effect is 'clear', then the exact choice of analysis procedure may not be crucial." [Should we just be looking at testing for effects? Is there not an effect in the real world?] "A simple approach is often to be preferred to a complicated approach, as the former is easier for the 'client' to understand and is less likely to lead to serious blunders." [If something is not understandable, it's not verifiable. (But if it's not understandable, what good is it anyway?)]

Articles by Chatfield referenced in the book:
Teaching a course in applied statistics
The initial examination of data
Model Uncertainty, Data Mining and Statistical Inference
Avoiding Statistical Pitfalls

Also referenced:
Teaching and Examining Applied Statistics by A.G. Hawkes

Leaking some Internal primes

I'm still just getting a feel for the C++ code implementing the algorithm described in "Stochastic Graph Partition: Generalizing the Swendsen-Wang Method." (Author's post on LJ.) One thing I noticed right away though is that it depends on tr1! The readme mentioned it being compiled on a Mac using gcc 4.0.1, and I thought, "I have that! Do I already have TR1?!" A quick check with Spotlight revealed that it was indeed there. So, now here is some useless code depending on implementation details which shouldn't be used:
#include <algorithm>
#include <tr1/hashtable>
#include <iostream>
#include <iterator>

int main()
{
    Internal::X<0> x;
    std::copy(x.primes, x.primes + x.n_primes,
              std::ostream_iterator<unsigned long>(std::cout, "\n"));
}

Monday, May 26, 2008

compromise

Richard Morey provides a solution to the indentation wars (in R):
Forget indenting. I like to just make my code all one line, with commands separated by ";".


A sufficient condition for a mixture of two equal-variance normals to be unimodal:
abs(mu1 - mu2) <= 2 * sigma * sqrt(1 + abs(log(p) - log(1 - p)) / 2) -- Everitt and Hand, "Finite Mixture Distributions", referencing Behboodian 1970. [Is this the right article?] With p = 0.5 the bound reduces to abs(mu1 - mu2) <= 2 * sigma, which matches the simulations below (separations 1 and 3 with sigma = 1). Does estimation of the parameters become much harder when the threshold from bimodal to unimodal is crossed? What are similar conditions for a mixture with k components? For a mixture of k Ricians?

> sigma = 1
> # p = 0.5
> mu <- rep(c(0, 1), c(5000, 5000))
> y <- rnorm(10000, mean=mu, sd=sigma)
> hist(y) ##Appears to be unimodal
> mu <- rep(c(0, 3), c(5000, 5000))
> y <- rnorm(10000, mean=mu, sd=sigma)
> hist(y) ##Appears to be bimodal


> mu <- rep(c(0, 1), c(5000, 5000))
> y <- sqrt(rnorm(10000, mean=mu, sd=sigma)^2 + rnorm(10000, sd=sigma)^2)
> hist(y, nclass=100) ##Appears to be unimodal
> mu <- rep(c(0, 3), c(5000, 5000))
> y <- sqrt(rnorm(10000, mean=mu, sd=sigma)^2 + rnorm(10000, sd=sigma)^2)
> hist(y, nclass=100) ##Appears to be bimodal


The same pattern shows up using plot(density(...)) in place of hist(...).

Monday, May 19, 2008

random partition

#include <assert.h>
#include <math.h>
#include "partition.h"
#include <Rmath.h>
#include <stdlib.h>

/**
 * Returns an array of length numClasses normalized to sum
 * to 1.  The caller is responsible for freeing the
 * returned object.
 */
static double* randomProportions(int numClasses)
{
    double* props = malloc(sizeof(double) * numClasses);
    int i;
    double total = 0.0;

    for (i = 0; i < numClasses; ++i) {
        props[i] = runif(0.0, 1.0);
        total += props[i];
    }

    for (i = 0; i < numClasses; ++i) {
        props[i] /= total;
    }

    return props;
}

/**
 * Ensures at least one element is assigned to each
 * class. The caller is responsible for freeing the
 * returned object.
 */
int* randomCounts(int len, int numClasses)
{
    int* counts = malloc(sizeof(int) * numClasses);
    double* props = randomProportions(numClasses);
    const double remaining = len - numClasses;
    int total = 0;
    int i;

    for (i = 0; i < numClasses; ++i) {
        counts[i] = 1;
    }

    for (i = 1; i < numClasses; ++i) {
        const int extra = (int) round(remaining * props[i]);
        total += extra;
        counts[i] += extra;
    }

    counts[0] += remaining - total;

    free(props);

    return counts;
}

/**
 * On exit, classes[i] = the class assigned to element i,
 * i = 0:(len - 1). Assumes classes has length at least len.
 * Assumes that numClasses <= len. Assigns at least one element
 * to each class. The classes are assigned in blocks, ie, they
 * partition the indices 0:(len - 1) into numClasses classes,
 * ordered from class 0 and up.
 */
void randomPartition(int len, int numClasses, int* classes)
{
    int* counts = randomCounts(len, numClasses);
    int i;
    int currentClass = 0;

    assert(numClasses <= len);

    for (i = 0; i < len; ++i) {
        classes[i] = currentClass;

        --counts[currentClass];

        if (!counts[currentClass]) {
            ++currentClass;
        }
    }

    free(counts);
}
I'm not completely happy with this design. I should probably have made randomPartition take the counts as an argument (although then it wouldn't've been so "random") since, at least for my work, we're almost always interested in the counts (and will just have to recalculate them if they're not available). I wonder what standard implementations of this are available. I suppose I should be checking that the mallocs are succeeding as well. Are there any common platforms where accessing address 0 doesn't lead to a seg fault?

fast software development

How to Prototype a Game in Under 7 Days: Tips and Tricks from 4 Grad Students Who Made Over 50 Games in 1 Semester. I've had this tab open for a while (weeks?). I think there are some valuable lessons here anyway. How should we test such ideas more formally? Something like the following: randomly split a pool of developers into two groups. Instruct group A to follow development methodology X for x days and group B to follow methodology Y for y days. Then, after a rest period, instruct A to follow Y for y days and B to follow X for x days, finally releasing the products from every period simultaneously, taking steps to ensure equal exposure (a new random ordering on the public project page for each visit) and gathering the public's ratings of the products.

Saturday, May 17, 2008

a bug from using optim for 1d problem

Recently, a bug was described on r-devel, one introduced by using optim for a one-dimensional optimization problem (rather than optimize): HoltWinters fitted level parameter not bounded between 0 and 1. Looking at the source code for HoltWinters, I see the author used L-BFGS-B... so, maybe this is also an example of that method going out of bounds.

The R code leading to the bug:
...
error <- function(p) hw(p, beta, gamma)$SSE
sol <- optim(optim.start["alpha"], error, method = "L-BFGS-B",
             lower = 0, upper = 1, control = optim.control)
alpha <- sol$par
...

Wednesday, May 14, 2008

Besag

I think I will learn a lot from this: Markov Chain Monte Carlo for Statistical Inference (2000) (2001) by Julian Besag

[R] books about MCMC to use MCMC R packages?

---

Computing the mean in R (2.6.0):
    case REALSXP:
        PROTECT(ans = allocVector(REALSXP, 1));
        for (i = 0; i < n; i++) s += REAL(x)[i];
        s /= n;
        if (R_FINITE((double)s)) {
            for (i = 0; i < n; i++) t += (REAL(x)[i] - s);
            s += t/n;
        }
        REAL(ans)[0] = s;
        break;
where t and s were previously initialized to 0.0. (Again I wonder if it would make a noticeable difference to cache the result of REAL(x).) So, in exact math, at the end of the loop t = sum(x) - sum(x) = 0. What does this accomplish in finite precision math? My understanding: it's the classic two-pass corrected mean. The first pass's s carries rounding error; the computed t is approximately sum(x) - n*s, so adding t/n pulls s back toward the true mean.
> y <- seq(-1e6, 1e6, len=1000)
> mean(y)
[1] -1.433398e-11
> sum(y) / 1000
[1] -1.44355e-11

Hm.
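For what it's worth, here's the same correction as a self-contained C sketch (mean2pass is my name for it; the data mimics the seq example above):

#include <stdio.h>

double mean2pass(const double* x, int n)
{
    double s = 0.0;
    double t = 0.0;
    int i;

    for (i = 0; i < n; i++) s += x[i];
    s /= n;                                 /* naive mean */
    for (i = 0; i < n; i++) t += x[i] - s;  /* accumulated rounding error */
    return s + t / n;                       /* corrected mean */
}

int main()
{
    /* 1000 evenly spaced points over [-1e6, 1e6]; the exact mean is 0 */
    double x[1000];
    int i;
    for (i = 0; i < 1000; i++) {
        x[i] = -1e6 + i * (2e6 / 999);
    }
    printf("%g\n", mean2pass(x, 1000));
    return 0;
}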

And what does Recall (main/eval.c) look like on the inside?
SEXP attribute_hidden do_recall(SEXP call, SEXP op, SEXP args, SEXP rho)
{
    RCNTXT *cptr;
    SEXP s, ans;
    cptr = R_GlobalContext;
    /* get the args supplied */
    while (cptr != NULL) {
        if (cptr->callflag == CTXT_RETURN && cptr->cloenv == rho)
            break;
        cptr = cptr->nextcontext;
    }
    args = cptr->promargs;
    /* get the env recall was called from */
    s = R_GlobalContext->sysparent;
    while (cptr != NULL) {
        if (cptr->callflag == CTXT_RETURN && cptr->cloenv == s)
            break;
        cptr = cptr->nextcontext;
    }
    if (cptr == NULL)
        error(_("'Recall' called from outside a closure"));

    /* If the function has been recorded in the context, use it
       otherwise search for it by name or evaluate the expression
       originally used to get it.
    */
    if (cptr->callfun != R_NilValue)
        PROTECT(s = cptr->callfun);
    else if (TYPEOF(CAR(cptr->call)) == SYMSXP)
        PROTECT(s = findFun(CAR(cptr->call), cptr->sysparent));
    else
        PROTECT(s = eval(CAR(cptr->call), cptr->sysparent));
    ans = applyClosure(cptr->call, s, args, cptr->sysparent, R_BaseEnv);
    UNPROTECT(1);
    return ans;
}
Renaming a function screws up a recursive function defined in terms of its own name:
> fact <- function(n) if (n > 1) n * fact(n - 1) else 1
> fact(3)
[1] 6
> ffact <- fact
> fact <- function(x) 0 ##Whoops!
> ffact(3)
[1] 0

But not one defined in terms of Recall:
> fact <- function(n) if (n > 1) n * Recall(n - 1) else 1
> fact(3)
[1] 6
> ffact <- fact
> fact <- function(x) 0 ##Whoops!
> ffact(3)
[1] 6

It occurs to me that I've never written a recursive function in R before. It looks like Recall gets a modest amount of use in the base code though.

Tuesday, May 13, 2008

JavaScript: Fill in fields from given data

Fill in input elements

[The interactive demo was embedded here.]
I'm not happy with this code. I think it's much more verbose than it needs to be. More important, though, it's not flexible enough. And maybe most important, it doesn't allow you to undo its actions. I'm writing this with a view to making it easier to input data into WebCT. Not surprisingly, the current greasemonkey scripts for sprucing up WebCT seem to be aimed at making life more pleasant just for students.

//I wish Blogger was a little smarter about '<' in javascript URLs -- or do some browsers barf on this as well?

//I'll have to look at this some more later: http://www.quirksmode.org/

//And don't worry -- I've also been thinking about the Ising and Potts models. I'm just about to check out Grimmett's book and will pursue an idea on how to generate Ising-distributed variates in a bit. However, I cannot believe that others would've missed this method before so I suspect I've missed something. I will certainly learn something from the attempt, however. I'm looking at Besag's papers for some details on pseudo-likelihood. From the other students, I learned it's covered in 601 so I will glance at Kaiser's notes for that.

Our code seems to be spinning along nicely. I've started a second run on real data and will do some quick runs on all the remaining real data sets tonight.

Monday, May 12, 2008

NULL

NULL is defined in stddef.h

To preserve capitalization of elements of a title in a LaTeX bibliography, e.g., acronyms, surround them in curly braces.
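For example, in a made-up .bib entry,

title = {Debugging with {GDB}},

the inner braces keep "GDB" from being lowercased by bibliography styles that downcase titles.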

Generalized Linear Mixed Models, etc.

There has recently been a very interesting thread on generalized linear mixed models on R-sig-ME. Ken Beath writes, "For choosing the number of classes for mixture models it has been shown that BIC fails theoretically but works well in practice (proven with simulations) and compares well to results from parametric bootstrapping of the LRT." Where should one look for discussion of this?

Dimitris Rizopoulos writes, "Well, in Linear Mixed Models the integral over the random effects can be analytically evaluated and thus no approximation (i.e., AGQ or Laplace) is required. In GLMMs this is not the case and thus the log-likelihood needs to be calculated approximately. One method for approximating the integral is the AGQ, and in fact Laplace is AGQ with one quadrature point." Should a stat computing class cover all this?

Doug Bates writes, "That is, you gain more from using the Laplace approximation at the conditional modes of the random effects (which is the 'adaptive' part) than increasing the number of Gauss-Hermite quadrature points at the marginal modes. The tricky bit of AGQ or Laplace is determining the conditional modes and that part is already done." I'll carefully read through the source code for lme4 sometime soon I think. Doesn't this sound like fun?

Peter Dalgaard describes some issues related to generalized linear mixed models. The Laplace method is a trick from Bayesian methods based on the observation that the likelihood approaches a Gaussian -- "Approximate using value at peak and Hessian."

Some fun R functions:
> embed(1:5, dim=3)
     [,1] [,2] [,3]
[1,]    3    2    1
[2,]    4    3    2
[3,]    5    4    3
> embed(1:5, dim=2)
     [,1] [,2]
[1,]    2    1
[2,]    3    2
[3,]    4    3
[4,]    5    4
> as.integer(intToBits(as.integer(3)))
[1] 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Using intToBits, we can write a function to print out the bits of a floating point number (ignoring exceptional cases such as Inf):
intToBitString <- function(x, bits=32) {
    substring(paste(rev(as.integer(intToBits(as.integer(x)))),
                    collapse=""),
              32 - bits + 1, 32)
}

doubleToBitString <- function(x) {
    stopifnot(is.finite(x))
    e <- floor(log2(x))
    m <- (abs(x * 2 ^ -e) - 1)
    shift <- 8
    s.bits <- ""
    remaining <- .Machine$double.digits
    while (remaining > 0) {
        u <- min(remaining, shift)
        m <- m * 2 ^ u
        k <- floor(m)
        s.bits <- paste(s.bits, intToBitString(k, u), sep="")
        m <- m - k
        remaining <- remaining - u
    }

    mexp <- .Machine$double.exponent

    paste(if (x > 0) 0 else 1,
          intToBitString(e + 2^(mexp - 1) - 1, mexp),
          s.bits)
}
(We could easily avoid depending on the base being 2, but is it worth it?) The output for pi seems to match that given on the Wikipedia page.


> doubleToBitString(pi)
[1] "0 10000000000 10010010000111111011010101000100010000101101000110000"

Here, just for kicks, is an implementation using the hack of dynamically creating a shared library to accomplish this task. (I wrote this a couple years ago in response to a class assignment to print out the bits of a floating point number in R. Well, the code I actually turned in was pure R -- and way longer.)
create.so <- function() {
    writeLines(c("void doubleToBitString(const double* x,",
                 "                       int* length, char** result)",
                 "{",
                 "    const char* bytes = (const void*)x;",
                 "    int numBytes = sizeof(double);",
                 "    static char buffer[1024];",
                 "    int i;",
                 "    for (i = 0; i != numBytes; ++i) {",
                 "        int j;",
                 "        for (j = 0; j != 8; ++j) {",
                 "            int mask = 1 << j;",
                 "            buffer[8*i + j] = (mask & bytes[i]) ? \'1\' : \'0\';",
                 "        }",
                 "    }",
                 "    buffer[8*numBytes] = \'\\0\';",
                 "    *result = buffer;",
                 "    *length = numBytes;",
                 "}"),
               con = "bitstring.c")
    system("R CMD SHLIB bitstring.c", ignore.stderr=T)
}

doubleToBitString <- function(x) {
    if (!file.exists("bitstring.so")) {
        create.so()
    }
    dyn.load("bitstring.so")
    out <- .C("doubleToBitString",
              x=as.double(x),
              len=as.integer(1),
              bits=as.character('junk'))
    return(out$bits)
}
Whoa. Does this write into memory it shouldn't be touching?

Tuesday, May 6, 2008

C unit testing

I looked around at unit testing frameworks for C today. CppUnit caught my attention because it mentions being written by Michael Feathers. However, I'm working on code for someone who is not a fan of C++ so I decided it would be better to aim for a pure C solution. I'm trying out CuTest. It seems it will fulfill my needs. (Check sounded better to me and seems to have seen some love and care more recently, but the configure script failed to work on my computer, and I didn't want to devote much time to just getting the framework itself to compile.) CuTest seems all right so far. I wish the library functions made it a bit easier to programmatically access the test results, however. For instance, I would like the program to exit with status 1 if any of the tests fail (or make some other externally obvious sign that doesn't require parsing the text output). (Would it be bad style to use exit status for this?)
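If I'm reading CuTest.h right, the suite's failCount field is public, so a runner can at least surface failures through the exit status; a sketch (TestSanity is a made-up test, and the API details are from my reading of the header, so double-check them):

#include <stdio.h>
#include "CuTest.h"

void TestSanity(CuTest* tc) /* a trivial example test */
{
    CuAssertIntEquals(tc, 4, 2 + 2);
}

int main(void)
{
    CuSuite* suite = CuSuiteNew();
    CuString* output = CuStringNew();

    SUITE_ADD_TEST(suite, TestSanity);

    CuSuiteRun(suite);
    CuSuiteSummary(suite, output);
    CuSuiteDetails(suite, output);
    printf("%s\n", output->buffer);

    /* exit status 1 if anything failed */
    return suite->failCount ? 1 : 0;
}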

frailty/random effects in survreg

library(survival)
rsev <- function(n) log(-log(runif(n)))
x <- rnorm(1000)
v <- rep(1:100, each=10)
q <- rep(rnorm(100, sd=2), each=10)
y <- exp(1.0 + 3.0 * x + q + 5.0 * rsev(1000))
z <- rep(1, 1000)
d <- data.frame(x=x, v=as.factor(v))
f <- survreg(Surv(y, z) ~ x + frailty.gaussian(v), data=d)
f
Call:
survreg(formula = Surv(y, z) ~ x + frailty.gaussian(v), data = d)

                    coef se(coef) se2   Chisq    DF p
(Intercept)         1.20    0.254 0.156  22.3   1.0 2.4e-06
x                   2.96    0.166 0.162 319.3   1.0 0.0e+00
frailty.gaussian(v)                     176.6  63.2 1.2e-12

Scale= 4.7

Iterations: 7 outer, 23 Newton-Raphson
Variance of random effect= 4
Degrees of freedom for terms= 0.4 1.0 63.2 1.0
Likelihood ratio test=463 on 63.5 df, p=0 n= 1000

Saturday, May 3, 2008

reading numbers from a file -- less buggy?

My previous readArray implementation contained a bug. I had implemented it in terms of line-oriented input. However, this meant specifying a maximum line length (in characters). I thought 10000 characters was more than reasonable... well, it turns out my own code routinely prints out lines of numbers that are longer than that. (This is probably itself something that should change.) So, fgets was actually hitting this limit while in the midst of reading a number, splitting it into two parts, with the second part showing up on the next fgets call, thus occasionally introducing two invalid numbers. If this hadn't introduced an extra number (and my code hadn't been checking that it was getting the right number of numbers), I might have missed this.

One way to fix this would be to implement a function for line-oriented input that would guarantee the entire line had been read -- very close to the spirit of readArray. Instead though I've chosen to give up on line-oriented input and instead read things in a number at a time. The resulting code is a bit cleaner I think anyway (although there is still some duplication).

(Another fix would be to use a language for which reading a file full of a given data type is a feature of the standard library. I guess we almost have this with C++. Haskell must get this right, though, right?)
#include <stdio.h>
#include <stdlib.h>

/**
 * Reads doubles from in until EOF.
 * If len is not null, *len is set to
 * the number of elements read.
 * The caller is responsible for freeing the
 * memory of the returned object.
 * Returns null in case of an error.
 */
double* readArray(FILE* in, int* len)
{
    int capacity = 100;
    double* y = malloc(sizeof(double) * capacity);
    int n = 0;
    while (!ferror(in) && !feof(in)) {
        double v;
        int status = fscanf(in, "%lf", &v);
        if (status == 1) {
            if (capacity == n) {
                capacity *= 2;
                y = realloc(y, capacity * sizeof(double));
            }

            y[n] = v;
            ++n;
        }
        else if (status == 0) {
            fprintf(stderr,
                    "# Invalid input? (read %d values so far)\n",
                    n);
            free(y);
            return 0;
        }
        else if (status == EOF && !feof(in)) {
            fprintf(stderr,
                    "# Got error (read %d values so far)\n", n);
            perror("");
            free(y);
            return 0;
        }
    }

    if (len) {
        *len = n;
    }

    return y;
}

Saturday, April 26, 2008

logistic regression on a circle

> x <- rnorm(1000)
> y <- rnorm(1000)
> inside <- ifelse(x^2 + y^2 < 1, 1, 0)
> f <- glm(inside ~ I(y^2)+I(x^2)+x*y, family=binomial())
Warning messages:
1: In glm.fit(x = X, y = Y, weights = weights, start = start, etastart = etastart, :
algorithm did not converge
2: In glm.fit(x = X, y = Y, weights = weights, start = start, etastart = etastart, :
fitted probabilities numerically 0 or 1 occurred
> d <- expand.grid(x=seq(-1,1,len=100), y=seq(-1,1,len=100))
> p <- predict(f, type="response", newdata=d)
> image(matrix(p, ncol=100))


This seems to work pretty well, as one might expect.

Now let's apply this to the Mandelbrot Set:
mset.data <- function(width=200, iterations=100) {
    mset <- as.vector(outer(seq(-2, 2, len=width),
                            seq(-2, 2, len=width) * 1i, FUN="+"))
    cmset <- NULL
    z <- numeric(length(mset))
    for (i in 1:iterations) {
        z <- z^2 + mset
        outside <- Mod(z) > 2
        cmset <- append(cmset, mset[outside])
        inside <- !outside
        mset <- mset[inside]
        z <- z[inside]
    }
    all <- c(cmset, mset)
    all.next <- all + all^2
    inside <- rep(c(0, 1), c(length(cmset), length(mset)))
    x <- Re(all)
    y <- Im(all)
    data.frame(inside=inside, x1=Re(all), y1=Im(all),
               x2=Re(all.next), y2=Im(all.next))
}

mset.prob <- function(width=200) {
    d <- mset.data(width=width)
    f <- glm(inside ~ I(x1^2) + I(y1^2) +
                 x1*y1 + I(x2^2) + I(y2^2) + x2*y2,
             data=d, family=binomial())
    z1 <- as.vector(outer(seq(-2, 2, len=width),
                          seq(-2, 2, len=width) * 1i, FUN="+"))
    z2 <- z1 + z1^2
    p <- predict(f, type="response",
                 newdata=data.frame(x1=Re(z1), y1=Im(z1),
                                    x2=Re(z2), y2=Im(z2)))
    matrix(p, nrow=width, ncol=width)
}

> p <- mset.prob()
> image(seq(-2,2,len=200), seq(-2,2,len=200), p, xlab="x", ylab="y")


How're we doing in actually computing the M-Set?
> md <- mset.data()
> o <- md$inside==1
> plot(c(), xlim=c(-2,2), ylim=c(-2,2), type="n")
> points(md$x1[o], md$y1[o], pch=1)

Looks about right.

Monday, April 21, 2008

int division by zero

What output do you get if you get a division by zero in integer math in C?
$ cat int_div_by_zero.c 
#include <stdio.h>

int main()
{
    printf("%d\n", 3 / 0);
    return 0;
}
$ gcc int_div_by_zero.c
int_div_by_zero.c: In function 'main':
int_div_by_zero.c:5: warning: division by zero
$ ./a.out
Floating point exception
Kaboom! Nothing. This result immediately raises the question, "Is integer math division as slow as or slower than floating point division since it seems to be implemented in terms of it?" (Despite the message, though, no floating point need be involved: SIGFPE, whose default handler prints "Floating point exception", is the signal Unix uses for any fatal arithmetic error, integer division by zero included.)

(We get the same result with 0/0 and also even when we compile with -fno-trapping-math.)
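For contrast, floating point division by zero doesn't trap by default under IEEE 754 arithmetic; it quietly produces infinities and NaNs:

#include <stdio.h>

int main()
{
    double zero = 0.0; /* a variable, to discourage compile-time folding */
    printf("%f\n", 3.0 / zero);  /* inf */
    printf("%f\n", zero / zero); /* nan */
    return 0;
}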

Sunday, April 20, 2008

survreg

Just checking that I understand the model being fit by survreg from the survival package:

##Generate n variates from the standard (mu=0, sigma=1)
##Smallest Extreme Value distribution.
rsev <- function(n) log(-log(runif(n)))

##g -- RNG for the underlying standard distribution
##for the log-location-scale family.
##Defaults to rsev to produce Weibull variates.
gen.data <- function(x, beta0=3, beta1=9, sigma=4, g=rsev) {
    n <- length(x)
    exp(beta0 + beta1 * x + sigma * g(n))
}
> x <- rnorm(10000)
> Time <- gen.data(x=x)
> f <- survreg(Surv(Time, rep(1, 10000)) ~ x)
> f
Call:
survreg(formula = Surv(Time, rep(1, 10000)) ~ x)

Coefficients:
(Intercept)           x
   3.032166    9.037986

Scale= 3.968822

Loglik(model)= -37402.9 Loglik(intercept only)= -46087.6
Chisq= 17369.46 on 1 degrees of freedom, p= 0
n= 10000
> Time <- gen.data(x=x, g=rnorm)
> f <- survreg(Surv(Time, rep(1, 10000)) ~ x, dist="lognormal")
> f
Call:
survreg(formula = Surv(Time, rep(1, 10000)) ~ x, dist = "lognormal")

Coefficients:
(Intercept)           x
   3.018112    8.926638

Scale= 3.995970

Loglik(model)= -58522.6 Loglik(intercept only)= -67530.3
Chisq= 18015.41 on 1 degrees of freedom, p= 0
n= 10000
>


So, it looks like the model being fit is P(T <= t) = F((log(t) - mu) / sigma) where mu = X * beta. What's it doing in the case of dist="exponential", "gaussian", or "logistic" then? Oh, you can specify the transformation, e.g., log, via the "trans", "dtrans", and "itrans" fields of the distribution object, where leaving the fields unspecified seems to imply that the identity transformation should be used. (See survreg.distributions.)

Friday, April 11, 2008

weighted random sample in C

/**
 * Sample compile:
 * gcc -Wall -lRmath weighted_sample.c -I/Library/Frameworks/R.framework/Resources/include -L/Library/Frameworks/R.framework/Resources/lib
 * $ ./a.out
 * mean: 0.409
 */

#define MATHLIB_STANDALONE
#include <Rmath.h>

#include <stdio.h>
#include <stdlib.h>

/**
 * Compute the cumulative sum of y, normalized so that the last
 * element is 1, ie,
 * result[j] = (sum(i=0:j) y[i]) / sum(y).
 * The caller is responsible for freeing the returned
 * object.
 */
double* normedCumSum(const double* y, int len)
{
    double* result = malloc(len * sizeof(double));
    if (len > 0) {
        double norm;
        int i;

        result[0] = y[0];

        for (i = 1; i < len; ++i) {
            result[i] = result[i - 1] + y[i];
        }

        norm = 1.0 / result[len - 1];
        for (i = 0; i < len; ++i) {
            result[i] *= norm;
        }
    }

    return result;
}

/**
 * Sample reps elements from y (with replacement) with the given
 * weights w. The caller is responsible for freeing the returned
 * object.
 */
double* sample(const double* y, const double* w, int len, int reps)
{
    double* result = malloc(reps * sizeof(double));
    double* probs = normedCumSum(w, len);
    int i;

    for (i = 0; i < reps; ++i) {
        int j;

        const double u = runif(0.0, 1.0);
        for (j = 0; j < len; ++j) {
            if (u <= probs[j]) {
                result[i] = y[j];
                break;
            }
        }
    }

    free(probs);

    return result;
}

int main()
{
    double y[2] = {0, 1};
    double w[2] = {0.6, 0.4};
    double* x = sample(y, w, 2, 1000);
    double sum = 0.0;
    int i;
    for (i = 0; i < 1000; ++i) {
        sum += x[i];
    }
    printf("mean: %g\n", sum / 1000);

    return 0;
}
Could be quite a bit more efficient. It should work though.

pseudorandom samples in Matlab

>> randsample([1,2,3,4], 2)
ans =
2 3


From the help page:
RANDSAMPLE Random sample, with or without replacement.
Y = RANDSAMPLE(N,K) returns Y as a 1-by-K vector of values sampled
uniformly at random, without replacement, from the integers 1:N.

Y = RANDSAMPLE(POPULATION,K) returns K values sampled uniformly at
random, without replacement, from the values in the vector POPULATION.

Y = RANDSAMPLE(...,REPLACE) returns a sample taken with replacement if
REPLACE is true, or without replacement if REPLACE is false (the default).

Y = RANDSAMPLE(...,true,W) returns a weighted sample, using positive
weights W, taken with replacement. W is often a vector of probabilities.
This function does not support weighted sampling without replacement.

Example: Generate a random sequence of the characters ACGT, with
replacement, according to specified probabilities.

R = randsample('ACGT',48,true,[0.15 0.35 0.35 0.15])

See also rand, randperm.
I'm still learning how to use the help system. It's pretty silly, but I found this via Google search rather than via the program's internal search. ("sample", "random", and "random sample" all returned relatively huge lists of results, with the one I was interested in buried in the middle. Ah, putting quote marks around "random sample" brings randsample to near the top.)

Thursday, April 10, 2008

smbclient

smbclient \\\\[host name]\\[share]

http://www.faqs.org/docs/Linux-HOWTO/SMB-HOWTO.html

When scp failed me, smbclient saved the hour.

Tuesday, April 8, 2008

Row major

row_major.c:
#include <stdio.h>

int main() {
    char a[4][3][2];
    int i;

    for (i = 0; i < 4; ++i) {
        int j;
        for (j = 0; j < 3; ++j) {
            int k;
            for (k = 0; k < 2; ++k) {
                printf("(%d, %d, %d) ==> %d\n", i, j, k,
                       (int)(&a[i][j][k] - &a[0][0][0]));
            }
        }
    }
    return 0;
}

$ gcc -Wall row_major.c
$ ./a.out
(0, 0, 0) ==> 0
(0, 0, 1) ==> 1
(0, 1, 0) ==> 2
(0, 1, 1) ==> 3
(0, 2, 0) ==> 4
(0, 2, 1) ==> 5
(1, 0, 0) ==> 6
(1, 0, 1) ==> 7
(1, 1, 0) ==> 8
(1, 1, 1) ==> 9
(1, 2, 0) ==> 10
(1, 2, 1) ==> 11
(2, 0, 0) ==> 12
(2, 0, 1) ==> 13
(2, 1, 0) ==> 14
(2, 1, 1) ==> 15
(2, 2, 0) ==> 16
(2, 2, 1) ==> 17
(3, 0, 0) ==> 18
(3, 0, 1) ==> 19
(3, 1, 0) ==> 20
(3, 1, 1) ==> 21
(3, 2, 0) ==> 22
(3, 2, 1) ==> 23


This matches what the FFTW3 manual describes: for an n_1 x ... x n_d array, (i_1, ..., i_d) ==> i_d + n_d * (i_{d-1} + n_{d-1} * (... n_2 * i_1) ... ). But then i_d = the "row coordinate", since it is runs over this index that are grouped together? Having a[i][j] refer to the ith column and jth row will take some getting used to, although it is another thing I should've been aware of long ago. (On the other hand, I rarely use statically allocated multidimensional arrays and so have almost certainly never written anything incorrect as a result of this gap in knowledge.)

//Does Java use the same convention? How about the various packages for allocating arrays in C?

Saturday, April 5, 2008

print out the vector implementation

a=`locate stl_vector.h | head -n 1`; cat $a

Objective C doesn't have keyword/named arguments

This recently came up on Apple's Objective C list: Selectors vs named arguments. If Objective C really did have keyword arguments, then the following selector would be illegal: foo:foo:. But it's not. I know I've misspoken on this because I remember asking someone if there were any languages with nothing but keyword arguments, and Smalltalk came falsely to mind. So, are there any languages with nothing but keyword arguments, ie, where you must always use calls of the form foo(arg1="pizza")? This could of course seem a little irritating for functions with only one parameter. And now that I realize that Smalltalk is doing something else (duh!), the idea seems less appealing overall, because I really like what Smalltalk does.

Saturday, March 29, 2008

scanf character classes -- oops

There was a serious bug in the code I posted for charToIntRange earlier (which I'll fix momentarily). I had thought that a character range "%[...]" in scanf acts like "%c" (with default width 1, no NUL added) but actually it acts like "%s". So, although &divider had the right type -- char* -- I was corrupting memory leading to an odd crash later. I guess the lesson for me is to read the documentation more thoroughly. (This is the first time I've used character classes in C. scanf suddenly seems much cooler than before now that I know of them.) This is also an error the compiler could catch though. If the argument corresponding to "%[...]" is a pointer to a char variable, this is always a sign of error. Maybe this case isn't common enough to warrant special attention though.

charClass2.c:
#include <stdio.h>
#include <string.h>

int main(int numArgs, char** args)
{
    int i;
    for (i = 0; i < numArgs; ++i) {
        char x[3];
        char y[1000];
        if (sscanf(args[i], "%2[a]%999[a]", x, y) == 2) {
            printf("%zu %zu\n", strlen(x), strlen(y));
        }
        else {
            printf(":-(\n");
        }
    }
    return 0;
}
$ gcc -Wall charClass2.c
$ ./a.out a aa aaa aaaa baaaa
:-(
:-(
:-(
2 1
2 2
:-(

("%zu" is not documented in the man page for printf on my system. I found it after getting warnings with earlier code about an incompatibility with size_t. It looks like it's a C99 extension. Discussion.)

//Actually... I didn't need a character class at all. Character literals match themselves. Of course! Ach. Lesson for me: Don't forget stuff like this.

//And it's not portable to assume char* index(const char *s, int c) is defined in string.h. I've switched to strchr, which I should've been using from the beginning.

generalized logit

This is what I feel like writing now:

y = a f(x)/(b + c f(x))
y b + y c f(x) = a f(x)
b y = (a - y c) f(x)
x = f^-1(b y / (a - y c))

Learn Haskell in 10 minutes... Finally!


stripHtml.hs:
stripHtml :: String -> String
stripHtml ('<':xs) = eatTag xs
stripHtml (x:xs) = x : stripHtml xs
stripHtml [] = []

eatTag :: String -> String
eatTag ('>':xs) = stripHtml xs
eatTag (x:xs) = eatTag xs
eatTag [] = []
$ ghci
___ ___ _
/ _ \ /\ /\/ __(_)
/ /_\// /_/ / / | | GHC Interactive, version 6.6, for Haskell 98.
/ /_\\/ __ / /___| | http://www.haskell.org/ghc/
\____/\/ /_/\____/|_| Type :? for help.

Loading package base ... linking ... done.
Prelude> :l stripHtml.hs
[1 of 1] Compiling Main ( stripHtml.hs, interpreted )
Ok, modules loaded: Main.
*Main> stripHtml "<b><blink>What's</blink></b><h1> a better</h1> way to d<tt>o this</tt>?"
"What's a better way to do this?"
This looks like fun.

Friday, March 28, 2008

Parse an int range parameter

#include <stdio.h>
#include <string.h>

/**
 * If str is of the form "<m>:<n>",
 * set *min = m and *max = n.
 * Otherwise, if str is of the form "<n>",
 * set *min = defaultMin and *max = n.
 * Return the number of numbers
 * successfully parsed (1 or 2 indicating success,
 * 0 failure).
 */
int charToIntRange(const char* str,
                   int defaultMin,
                   int* min,
                   int* max)
{
    if (strchr(str, ':')) {
        if (sscanf(str, "%d:%d", min, max) == 2) {
            return 2;
        }
        else {
            return 0;
        }
    }
    else {
        *min = defaultMin;
        return sscanf(str, "%d", max);
    }
}

int main(int numArgs, char** args)
{
    int i;
    for (i = 1; i < numArgs; ++i) {
        int m;
        int n;
        if (charToIntRange(args[i], -1666, &m, &n)) {
            printf("%d:%d\n", m, n);
        }
        else {
            printf("failed to parse <<%s>>\n", args[i]);
        }
    }

    return 0;
}

$ gcc -Wall charToIntRange.c
$ ./a.out 1 155:7 -3:3 :1 1: 19:-1 19:1a t15:5
-1666:1
155:7
-3:3
failed to parse <<:1>>
failed to parse <<1:>>
19:-1
19:1
failed to parse <<t15:5>>


It should probably fail with an input of "19:1a", but I'll leave off worrying about that for now.

Is the format "%[:]" portable?

Wednesday, March 26, 2008

Rotate a matrix 90 degrees clockwise

R:

##Rotate a square matrix 90 degrees clockwise.
rot90 <- function(a) {
    n <- dim(a)[1]
    stopifnot(n == dim(a)[2])
    t(a[n:1, ])
}

JavaScript: fill in a default value for every blank text field.

[The interactive demo was embedded here.]

Matlab: Write numbers to a file:
f = fopen('test.txt', 'w');
fprintf(f, '%g\n', 1:50);
fclose(f);
(I'm sure there's a better way to do this; dlmwrite or save -ascii might be it.)

Wednesday, March 19, 2008

Marginal Ricians with correlated errors

First, what's a good way to generate errors with a desired correlation structure? Let's start with the following:
> d <- as.matrix(dist(1:4, diag=T, upper=T))
> s <- 0.3^d
> a <- chol(s)
> m <- matrix(nrow=10000,ncol=4)
> for (i in 1:10000) { x <- rnorm(4); m[i,] <- a %*% x}
> var(m)
           [,1]       [,2]      [,3]       [,4]
[1,] 1.09839578 0.30672920 0.1062927 0.03032918
[2,] 0.30672920 1.01055057 0.3031668 0.09055224
[3,] 0.10629268 0.30316682 1.0044863 0.27050059
[4,] 0.03032918 0.09055224 0.2705006 0.88942696
> s
      1    2    3     4
1 1.000 0.30 0.09 0.027
2 0.300 1.00 0.30 0.090
3 0.090 0.30 1.00 0.300
4 0.027 0.09 0.30 1.000
Is it surprising to find the sample variance such a relatively poor approximation to the true variance even with 10000 observations? (I guess the answer is "no!")

Now, let's use this to see what happens to a mixture of marginally Rician distributions as the correlation gets cranked up.
##Produce elements that are marginally Rician distributed but for
##which the noise between elements is correlated with the
##structure cor(X_i, X_j) = cor(Y_i, Y_j) = rho^abs(i-j)
##but {X_i} independent of {Y_i}, i = 1:length(mu) and
##result[i] = sqrt(X_i^2 + Y_i^2).
##The value is a matrix of draws from the given distribution,
##each row a single draw so that the ith column corresponds
##to mu[i]
corRice <- function(mu, n, rho, sigma) {
    k <- length(mu)
    d <- as.matrix(dist(1:k, diag=TRUE, upper=TRUE))
    r <- (rho ^ d)
    a <- sigma * chol(r)
    result <- matrix(nrow=n, ncol=k)
    for (i in 1:n) {
        x <- (a %*% rnorm(k)) + mu
        y <- a %*% rnorm(k)
        result[i, ] <- sqrt(x^2 + y^2)
    }
    result
}

> r0 <- corRice(rep(c(0, 50, 100), each=50), n=10000, rho=0, sigma=10)
> r1 <- corRice(rep(c(0, 50, 100), each=50), n=10000, rho=0.9, sigma=10)
> r2 <- corRice(rep(c(0, 50, 100), each=50), n=10000, rho=0.99, sigma=10)
> plot(quantile(r0, probs=(1:999)/1000), quantile(r1, probs=(1:999)/1000), xlab="quantiles with rho=0", ylab="quantiles with rho=0.9")
> plot(quantile(r0, probs=(1:999)/1000), quantile(r2, probs=(1:999)/1000), xlab="quantiles with rho=0", ylab="quantiles with rho=0.99")



Hmmmmmmmm.

Tuesday, March 18, 2008

R's char functions

Previously, I posted some functions for manipulating strings in R. I was surprised not to have found these in R. Obviously, I didn't look hard enough. RSiteSearch("strings") led me to a post mentioning the following:

> tolower("hI THerE, guYs!")
[1] "hi there, guys!"
> toupper("hI THerE, guYs!")
[1] "HI THERE, GUYS!"
> casefold("Hi")
[1] "hi"
> casefold("Hi", upper=TRUE)
[1] "HI"
> gsub("-", "", "what-what-what---?", fixed=TRUE)
[1] "whatwhatwhat?"

Well, I knew about gsub before at least.

S-PLUS's gsub doesn't recognize the fixed param but otherwise casefold and gsub work the same. (Of course, this is the point of having casefold in R.)

Monday, March 17, 2008

R special cases x^2 and x^0.5

Pulled from src/main/arithmetic.c:
static SEXP real_unary(ARITHOP_TYPE code, SEXP s1, SEXP lcall)
{
    int i, n;
    SEXP ans;

    switch (code) {
    case PLUSOP: return s1;
    case MINUSOP:
        ans = duplicate(s1);
        n = LENGTH(s1);
        for (i = 0; i < n; i++)
            REAL(ans)[i] = -REAL(s1)[i];
        return ans;
    default:
        errorcall(lcall, _("invalid unary operator"));
    }
    return s1; /* never used; to keep -Wall happy */
}
Would this be noticeably faster if we redefined the case for MINUSOP as
case MINUSOP:
    {
        const double* y1 = REAL(s1);
        double* yAns;
        ans = duplicate(s1);
        yAns = REAL(ans);
        n = LENGTH(s1);
        for (i = 0; i < n; ++i)
            yAns[i] = -y1[i];
        return ans;
    }
?
We have the following defines:
Rinternals.h: #define REAL(x) ((double *) DATAPTR(x))
Rinternals.h: #define DATAPTR(x) (((SEXPREC_ALIGN *) (x)) + 1)
So, it seems like this could theoretically save 2n additions per negation. Woo! Is gcc smart enough to notice the legality of this move and take advantage of it?

Poking around in arithmetic.c also shows that x^2.0 and x^0.5 are special-cased to use more efficient alternatives than pow while there's another special routine R_pow_di to handle raising a double to an integer power. Which is used by pi^3? Setting breakpoints on R_pow and R_pow_di shows R_pow is used. In fact, pi^as.integer(3) also uses R_pow.

I wonder what the distribution of powers used in production R code looks like. 2 is almost certainly the most popular. Would it be worthwhile special casing other integer powers or even including a generic check for an integer power? Do most implementations of pow do this already?

Friday, March 14, 2008

an assortment of C

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LINE_SIZE 10000

/**
 * Reads doubles from in until EOF.
 * If len is not null, *len is set to
 * the number of elements read.
 * The caller is responsible for freeing the
 * memory of the returned object.
 * Returns null in case of an error.
 */
double* readArray(FILE* in, int* len)
{
    const char* SPACES = " \t\r\n";
    char* line = malloc(LINE_SIZE);
    int capacity = 100;
    double* y = malloc(sizeof(double) * capacity);
    int n = 0;
    while (!ferror(in) && !feof(in)) {
        if (fgets(line, LINE_SIZE, in)) {
            const char* next = line;
            while (*next) {
                next += strspn(next, SPACES);
                if (*next) {
                    double v;
                    if (sscanf(next, "%lf", &v) == 1) {
                        if (capacity == n) {
                            capacity *= 2;
                            y = realloc(y, capacity * sizeof(double));
                        }

                        y[n] = v;
                        ++n;
                    }
                    else {
                        free(line);
                        free(y);
                        return 0;
                    }
                }
                next += strcspn(next, SPACES);
            }
        }
        else {
            if (!feof(in)) {
                free(line);
                free(y);
                return 0;
            }
        }
    }

    free(line);

    if (len) {
        *len = n;
    }

    return y;
}

/**
 * Reads doubles from file.
 * If len is not null, *len is set to the
 * number of elements read.
 * The caller is responsible for freeing the
 * memory of the returned object.
 * Returns null in case of an error.
 */
double* readArrayFromFile(const char* file, int* len)
{
    FILE* in = fopen(file, "r");
    double* result = 0;
    if (!in) {
        fprintf(stderr, "Problem opening %s\n", file);
        perror("");
    }
    else {
        result = readArray(in, len);
        fclose(in);
    }
    return result;
}

/**
 * Returns 1 if tail is a suffix of s, 0 otherwise.
 */
int endsWith(const char* s, const char* tail)
{
    const int sLen = strlen(s);
    const int tailLen = strlen(tail);
    if (sLen >= tailLen) {
        return strcmp(s + sLen - tailLen, tail) == 0;
    }
    else {
        return 0;
    }
}

int remove(double* y, const int len,
           const double toRemove)
{
    int newLen = 0;
    int i;

    for (i = 0; i < len; ++i) {
        const double yi = y[i];
        if (yi != toRemove) {
            y[newLen] = yi;
            ++newLen;
        }
    }
    return newLen;
}

Thursday, March 13, 2008

Change the working directory in SAS

Just double click the status bar on the bottom of the main window giving the location of the current working directory!
Change the SAS Working Folder
This was not obvious to me.

These look possibly useful:
proc discrim [PDF]
proc princomp [PDF]

Wednesday, March 12, 2008

letter counts per paragraph

Count up the number of each letter by paragraph.
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <strings.h>

int main(int numArgs, char** args)
{
    const int lineSize = 10000;

    if (numArgs != 2) {
        fprintf(stderr, "Usage: par_char_freq <text file>\n");
        exit(1);
    }
    else {
        char* line = malloc(lineSize);
        FILE* in = fopen(args[1], "r");
        int counts[26];
        int paragraphTotal = 0;

        if (!in) {
            fprintf(stderr, "Problem opening %s\n", args[1]);
            perror("");
            exit(1);
        }

        bzero(counts, 26 * sizeof(int));

        while (!feof(in) && !ferror(in)) {
            int lineTotal = 0;
            if (fgets(line, lineSize, in)) {
                int i;
                for (i = 0; line[i]; ++i) {
                    const char c = tolower(line[i]);
                    if ('a' <= c && c <= 'z') {
                        ++counts[c - 'a'];
                        ++lineTotal;
                    }
                }

                if (lineTotal) {
                    paragraphTotal += lineTotal;
                }
                else if (paragraphTotal) {
                    int i;
                    for (i = 0; i < 26; ++i) {
                        printf("%d ", counts[i]);
                    }
                    printf("\n");

                    paragraphTotal = 0;
                    bzero(counts, 26 * sizeof(int));
                }
            }
        }

        fclose(in);
        free(line);
    }
    return 0;
}

Tuesday, March 11, 2008

The most permutable word is "trace"?

Which English word has the most permutations which are also English words? (Of course each of the lexible permutations is also an answer. Is "lexible" a sensible term for having the property of being a word?)

Here's some C++ code to look at this:
most_permutable.cpp:
#include <algorithm>
#include <cctype>
#include <fstream>
#include <functional>
#include <iostream>
#include <map>
#include <string>
using namespace std;

template<typename T>
class select2nd {
public:
    typedef T argument_type;

    typename T::second_type operator()(const T& t) {
        return t.second;
    }
};

template<typename B, typename U1, typename U2>
class binary_compose {
public:
    typedef typename B::result_type result_type;
    typedef typename U1::argument_type arg1_type;
    typedef typename U2::argument_type arg2_type;

    binary_compose(const B& b, const U1& u1, const U2& u2) :
        b(b), u1(u1), u2(u2) {}

    result_type operator()(const arg1_type& x1,
                           const arg2_type& x2)
    {
        return b(u1(x1), u2(x2));
    }

private:
    B b;
    U1 u1;
    U2 u2;
};

template<typename B, typename U1, typename U2>
binary_compose<B, U1, U2>
compose2(B b, U1 u1, U2 u2)
{
    return binary_compose<B, U1, U2>(b, u1, u2);
}

typedef map<string, int>::iterator iter;

int main(int numArgs, char** args)
{
    map<string, int> permutationCounts;
    ifstream in("/usr/share/dict/words");
    while (in) {
        string word;
        in >> word;
        if (!word.empty() && islower(word[0])) {
            sort(word.begin(), word.end());
            iter p = permutationCounts.find(word);
            if (p != permutationCounts.end()) {
                ++(*p).second;
            }
            else {
                permutationCounts.insert(make_pair(word, 1));
            }
        }
    }

    iter m = max_element(permutationCounts.begin(),
                         permutationCounts.end(),
                         compose2(less<int>(),
                                  select2nd<pair<string, int> >(),
                                  select2nd<pair<string, int> >()));
    cout << (*m).first << " with " << (*m).second << '\n';
}

Drum roll:

$ g++ -Wall most_permutable.cpp
$ ./a.out
acert with 9


What are the 9?
find_perm.cpp:
#include <algorithm>
#include <cstdio>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main(int numArgs, char** args)
{
    if (numArgs != 3) {
        cerr << "Usage: find_perm <file> <letters>\n";
        return 1;
    }

    ifstream in(args[1]);
    if (!in) {
        cerr << "Problem opening " << args[1] << '\n';
        perror("");
        return 1;
    }

    string letters(args[2]);
    sort(letters.begin(), letters.end());
    const string::size_type n = letters.size();

    string word;
    while (in) {
        in >> word;

        if (n == word.size()) {
            string key(word);

            sort(key.begin(), key.end());

            if (key == letters) {
                cout << word << '\n';
            }
        }
    }

    in.close();
}


$ g++ -Wall -o find_perm find_perm.cpp
$ ./find_perm oops
Usage: find_perm <file> <letters>
$ ./find_perm oops oops
Problem opening oops
No such file or directory
$ ./find_perm /usr/share/dict/words trace
caret
carte
cater
crate
creat
creta
react
recta
trace


What the heck is "creat"? Hm. Google doesn't know about it. It is really there though:
$ grep -w -n creat /usr/share/dict/words
45105:creat

Saturday, March 8, 2008

strings again

getChars <- function(s) {
    n <- nchar(s)
    if (n > 0) substring(s, 1:n, 1:n) else character(0)
}

strip <- function(s, chars) {
    s.chars <- getChars(s)
    paste(s.chars[!(s.chars %in% chars)], collapse="")
}

tr <- function(s, from, to) {
    chars <- getChars(s)
    o <- match(chars, from)
    paste(ifelse(!is.na(o), to[o], chars), collapse="")
}

lower <- function(s) {
    tr(s, from=LETTERS, to=letters)
}

upper <- function(s) {
    tr(s, from=letters, to=LETTERS)
}
Another go: S-PLUS doesn't have strsplit so I use a different (and more efficient?) method for getting at the characters of a string.

> system.time(replicate(10000, strsplit("1234567890", "")[[1]]))
user system elapsed
0.102 0.004 0.105
> system.time(replicate(10000, substring("1234567890", 1:10, 1:10)))
user system elapsed
0.297 0.003 0.299


That's a surprise. Maybe I should try avoiding creating the index vector twice? Still, strsplit seems so much heavier.
//The source code for strsplit reveals that they make a special case of the pattern "". (See src/main/character.c.)
//Well, this is documented in the help page for strsplit as well.
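One way to test that guess (I haven't timed it here) is to hoist the index vector out of the call, so 1:10 isn't built twice per iteration:
idx <- 1:10
system.time(replicate(10000, substring("1234567890", idx, idx)))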

Monday, February 25, 2008

mcmcpvalue and contrasts

I thought Steven McKinney had a really interesting message in "mcmcpvalue and contrasts" on the R-sig-ME list today. I'll just reproduce his R session here:

> library(SASmixed)
> names(Semiconductor)
> str(Semiconductor)

'data.frame': 48 obs. of 5 variables:
$ resistance: num 5.22 5.61 6.11 6.33 6.13 6.14 5.6 5.91 5.49 4.6 ...
$ ET : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
$ Wafer : Factor w/ 3 levels "1","2","3": 1 1 1 1 2 2 2 2 3 3 ...
$ position : Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
$ Grp : Factor w/ 12 levels "1/1","1/2","1/3",..: 1 1 1 1 2 2 2 2 3 3 ...
...
> options(contrasts=c("contr.treatment", "contr.poly"))
> lm1 <- lm(resistance ~ ET * position, data=Semiconductor)
> summary(lm1)


Call:
lm(formula = resistance ~ ET * position, data = Semiconductor)

Residuals:
Min 1Q Median 3Q Max
-1.01333 -0.25750 0.04333 0.28333 0.74667

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.61333 0.26891 20.874 <2e-16 ***
ET2 0.38000 0.38030 0.999 0.325
ET3 0.52333 0.38030 1.376 0.178
ET4 0.72667 0.38030 1.911 0.065 .
position2 -0.16333 0.38030 -0.429 0.670
position3 -0.06000 0.38030 -0.158 0.876
position4 0.27333 0.38030 0.719 0.478
ET2:position2 0.35667 0.53782 0.663 0.512
ET3:position2 0.37333 0.53782 0.694 0.493
ET4:position2 0.37667 0.53782 0.700 0.489
ET2:position3 -0.16667 0.53782 -0.310 0.759
ET3:position3 -0.30333 0.53782 -0.564 0.577
ET4:position3 -0.38333 0.53782 -0.713 0.481
ET2:position4 -0.35000 0.53782 -0.651 0.520
ET3:position4 -0.31667 0.53782 -0.589 0.560
ET4:position4 -0.07333 0.53782 -0.136 0.892
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4658 on 32 degrees of freedom
Multiple R-squared: 0.4211, Adjusted R-squared: 0.1498
F-statistic: 1.552 on 15 and 32 DF, p-value: 0.1449

> options(contrasts=c("contr.helmert", "contr.poly"))
> lm2 <- lm(resistance ~ ET * position, data=Semiconductor)
> summary(lm2)


Call:
lm(formula = resistance ~ ET * position, data = Semiconductor)

Residuals:
Min 1Q Median 3Q Max
-1.01333 -0.25750 0.04333 0.28333 0.74667

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.00292 0.06723 89.292 < 2e-16 ***
ET1 0.17000 0.09507 1.788 0.08324 .
ET2 0.09722 0.05489 1.771 0.08606 .
ET3 0.10986 0.03881 2.830 0.00797 **
position1 0.05667 0.09507 0.596 0.55535
position2 -0.11000 0.05489 -2.004 0.05360 .
position3 0.03542 0.03881 0.912 0.36834
ET1:position1 0.08917 0.13446 0.663 0.51197
ET2:position1 0.03250 0.07763 0.419 0.67826
ET3:position1 0.01667 0.05489 0.304 0.76337
ET1:position2 -0.05750 0.07763 -0.741 0.46427
ET2:position2 -0.03528 0.04482 -0.787 0.43700
ET3:position2 -0.02444 0.03169 -0.771 0.44617
ET1:position3 -0.05167 0.05489 -0.941 0.35363
ET2:position3 -0.01111 0.03169 -0.351 0.72818
ET3:position3 0.01125 0.02241 0.502 0.61909
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4658 on 32 degrees of freedom
Multiple R-squared: 0.4211, Adjusted R-squared: 0.1498
F-statistic: 1.552 on 15 and 32 DF, p-value: 0.1449

Note that R^2 is 0.4211 in both cases and that the F and corresponding p-values are also the same. "This is good - the two models are mathematically equivalent, so the overall model fit statistics should not differ." (Well, I suppose I should paraphrase that too.)
> options(contrasts=c("contr.treatment", "contr.poly"))
> lmr1 <- lm(resistance ~ ET + position, data=Semiconductor)
> anova(lm1, lmr1)

Analysis of Variance Table

Model 1: resistance ~ ET * position
Model 2: resistance ~ ET + position
Res.Df RSS Df Sum of Sq F Pr(>F)
1 32 6.9421
2 41 7.7515 -9 -0.8095 0.4146 0.9178
> options(contrasts=c("contr.helmert", "contr.poly"))
> lmr2 <- lm(resistance ~ ET + position, data=Semiconductor)
> anova(lm2, lmr2)

Analysis of Variance Table

Model 1: resistance ~ ET * position
Model 2: resistance ~ ET + position
Res.Df RSS Df Sum of Sq F Pr(>F)
1 32 6.9421
2 41 7.7515 -9 -0.8095 0.4146 0.9178

Note that again the results are identical. If they were not, that would be a sign of numerical trouble. (Though can that really happen in R, given that the choice of contrasts only comes into play in how the fit is parameterized and presented?)


No longer following the e-mail: And what do the model matrices look like?
> a <- as.factor(rep(1:3, each=4))
> b <- as.factor(rep(1:2, each=6))
> y <- rnorm(12)
> options(contrasts=c("contr.treatment", "contr.poly"))
> model.matrix(y ~ a*b)

(Intercept) a2 a3 b2 a2:b2 a3:b2
1 1 0 0 0 0 0
2 1 0 0 0 0 0
3 1 0 0 0 0 0
4 1 0 0 0 0 0
5 1 1 0 0 0 0
6 1 1 0 0 0 0
7 1 1 0 1 1 0
8 1 1 0 1 1 0
9 1 0 1 1 0 1
10 1 0 1 1 0 1
11 1 0 1 1 0 1
12 1 0 1 1 0 1
attr(,"assign")
[1] 0 1 1 2 3 3
attr(,"contrasts")
attr(,"contrasts")$a
[1] "contr.treatment"

attr(,"contrasts")$b
[1] "contr.treatment"

> options(contrasts=c("contr.helmert", "contr.poly"))
> model.matrix(y ~ a*b)

(Intercept) a1 a2 b1 a1:b1 a2:b1
1 1 -1 -1 -1 1 1
2 1 -1 -1 -1 1 1
3 1 -1 -1 -1 1 1
4 1 -1 -1 -1 1 1
5 1 1 -1 -1 -1 1
6 1 1 -1 -1 -1 1
7 1 1 -1 1 1 -1
8 1 1 -1 1 1 -1
9 1 0 2 1 0 2
10 1 0 2 1 0 2
11 1 0 2 1 0 2
12 1 0 2 1 0 2
attr(,"assign")
[1] 0 1 1 2 3 3
attr(,"contrasts")
attr(,"contrasts")$a
[1] "contr.helmert"

attr(,"contrasts")$b
[1] "contr.helmert"

The help page for options reminds us that the first element in the list for "contrasts" gives the function to be used with unordered factors, while the second gives the function to use with ordered ones.
> b <- as.ordered(b)
> model.matrix(y ~ a*b)

(Intercept) a1 a2 b.L a1:b.L a2:b.L
1 1 -1 -1 -0.7071068 0.7071068 0.7071068
2 1 -1 -1 -0.7071068 0.7071068 0.7071068
3 1 -1 -1 -0.7071068 0.7071068 0.7071068
4 1 -1 -1 -0.7071068 0.7071068 0.7071068
5 1 1 -1 -0.7071068 -0.7071068 0.7071068
6 1 1 -1 -0.7071068 -0.7071068 0.7071068
7 1 1 -1 0.7071068 0.7071068 -0.7071068
8 1 1 -1 0.7071068 0.7071068 -0.7071068
9 1 0 2 0.7071068 0.0000000 1.4142136
10 1 0 2 0.7071068 0.0000000 1.4142136
11 1 0 2 0.7071068 0.0000000 1.4142136
12 1 0 2 0.7071068 0.0000000 1.4142136
attr(,"assign")
[1] 0 1 1 2 3 3
attr(,"contrasts")
attr(,"contrasts")$a
[1] "contr.helmert"

attr(,"contrasts")$b
[1] "contr.poly"


//Also, something to remember: options(contrasts=c("contr.SAS", "contr.poly")) sets the system to use contrasts in the style of SAS.

atanh

I was a little startled to find today that atanh's finite values seem to be bounded roughly between -18.71497 and 18.71497. These correspond to inputs of -0.999999999999999944444... and 0.99999999999999994444444... Stepping just a bit beyond these (changing any of those 4s to a 5) we get back an infinite value. I suppose I should work out analytically what's going on. Or just look at this:

> q <- function(x) (1+x) / (1 -x)
> q(0.99999999999999995)
[1] Inf
> q(0.99999999999999994)
[1] 1.80144e+16


The C99 standard (ISO/IEC 9899) just says, "The atanh functions compute the arc hyperbolic tangent of x. A domain error occurs for arguments not in the interval [-1, +1]. A range error may occur if the argument equals −1 or +1." Where should I look for more details?
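Working it out analytically after all: atanh(x) = 0.5 log((1 + x)/(1 - x)), and the largest double below 1 is 1 - 2^-53, so the largest finite value atanh can return for a double argument is about 0.5 log(2^54) = 27 log 2, which matches the observed cutoff:

> 27 * log(2)
[1] 18.71497
> atanh(1 - 2^-53)
[1] 18.71497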

Using R's pnorm I see there's roughly probability 1.869160e-78 beyond these points in the tails of a standard Gaussian distribution (so the probability of seeing points beyond them is indistinguishable from zero for any variance much less than one), which means this is of little importance for Fisher's z transform, atanh(r) = 0.5 log((1 + r)/(1 - r)). Still, it's a surprise to me not to see a smooth transition.

> atanh(tanh(18.71497))
[1] 18.71497
> atanh(tanh(18.71498))
[1] Inf

Sunday, February 10, 2008

correlation on subsets

I don't think anyone's subscribing to this so I'll just make this a new post. (The path to freedom lies in attaining obscurity!)
## Find the correlation between the two columns of d
## other than base, conditional on the base column
## falling in each of the given subsets.
##
## d -- 3-column data.frame or matrix
## subsets -- list of vectors
## base -- column to use in conditioning
cor.on.subsets <- function(d, subsets, base=2) {
    stopifnot(dim(d)[2] == 3)
    stopifnot(base %in% 1:3)
    x <- setdiff(1:3, base)
    n <- length(subsets)
    r <- numeric(n)
    for (i in 1:n) {
        d1 <- d[d[, base] %in% subsets[[i]], ]
        r[i] <- cor(d1[, x[1]], d1[, x[2]])
    }
    r
}

Would it be more stylish to use subset rather than direct row selection in [d[, base] %in% subsets[[i]], ]?
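For comparison, the subset version of that row selection would look something like this (just a sketch, not benchmarked):
d1 <- subset(d, d[, base] %in% subsets[[i]])
It reads about the same, and the help page warns that subset is a convenience function intended mainly for interactive use, so the direct indexing seems fine here.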

Finding a p-value by simulation from a sq. dist vs. chi^2

## Plot a qq-plot of the sq. Mahalanobis distance
## vs. the corresponding chi-sq quantiles.
## Return a p-value for the sq. correlation coefficient
## between the two based on simulation.
chisq.qqplot <- function(a.matrix, nsamples=10000) {
    d.sq <- mahalanobis(a.matrix, colMeans(a.matrix), cov(a.matrix))
    d.sq <- sort(d.sq)
    rows <- dim(a.matrix)[1]
    cols <- dim(a.matrix)[2]
    q <- qchisq(((1:rows) - 0.5) / rows, df=cols)
    plot(q, d.sq, xlab="theoretical", ylab="sq. distance")

    qq.cor <- cor(d.sq, q)
    sim.cors <- numeric(nsamples)
    for (i in 1:nsamples) {
        x <- matrix(rnorm(cols * rows), ncol=cols)
        d2 <- mahalanobis(x, colMeans(x), cov(x))
        d2 <- sort(d2)
        sim.cors[i] <- cor(q, d2)
    }

    list(d.sq=d.sq,
         qq.cor=qq.cor,
         qq.cor.sq=qq.cor^2,
         sim.cors=sim.cors,
         p.value=sum(sim.cors^2 < qq.cor^2) / nsamples)
}
I learned about a couple of R functions in writing this: mahalanobis and colMeans. I wonder what else I'm missing. (On a hunch, I tried help.search("mahalanobis").)
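For the record, a quick run on simulated data looks like this (sizes picked arbitrarily):
x <- matrix(rnorm(50 * 4), ncol=4)
out <- chisq.qqplot(x, nsamples=1000)
out$p.value
Under the null (multivariate Gaussian data) the returned p-value should be roughly uniform across runs.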

Also: p-values... bleh. But how would one deal with this kind of thing in a Bayesian (or post-Bayesian) way?

Saturday, February 9, 2008

printing a matrix in R

Okay. I know the code to do this exists -- latex in Hmisc. I like the output from this function better, though, at least for my present purpose.
printMatrix <- function(m, r=2) {
    ## One "c" column specifier per column of m.
    cat("\\left(\\begin{array}{",
        rep("c", ncol(m)), "}\n", sep="")
    cat(apply(m, 1,
              function(row) paste(round(row, r),
                                  collapse=" & ")),
        sep=" \\\\\n")
    cat("\\end{array}\\right)\n")
}

foo <- function(s3, r=2) {
    s <- matrix(c(s3[1], s3[2], s3[2], s3[3]), ncol=2)
    cat("$\n")
    printMatrix(s, r=r)
    cat("$\n\n")
    ## cov2cor converts the covariance matrix to a correlation
    ## matrix; cor(s) would wrongly treat s itself as data.
    cat("cor: $", cov2cor(s)[1, 2], "$\n\n", sep="")
    cat("generalized variance: $", det(s), "$\n\n", sep="")
    cat("total variance: $", sum(diag(s)), "$\n", sep="")
}
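For example (the covariance values here are made up):
foo(c(1, 0.5, 2))
This prints the 2x2 covariance matrix as a LaTeX array, then the correlation, generalized variance, and total variance lines.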

Saturday, February 2, 2008

Printing a matrix in Maxima, again

My adventure continues:
printMatrix(a) := block([as, body],
    as : matrixmap(lambda([x], printf(false, "~a", x)), a),
    body : simplode(maplist(lambda([row], simplode(row, " & ")), as),
                    sconc(" ", "\\\\", newline)),
    sconc("\\left(\\begin{array}{",
          simplode(makelist("r", i, 1, length(a[1]))), "}",
          newline,
          body,
          newline,
          "\\end{array}\\right)"))$
There's just got to be a function in Maxima somewhere that already does this. (There's also just got to be a function that creates a list based on repeating an element. makelist(x, i, 1, n) does the job though.)

(%i42) printMatrix(diag_matrix(1,2,3));
(%o42) \left(\begin{array}{rrr}
1 & 0 & 0 \\
0 & 2 & 0 \\
0 & 0 & 3
\end{array}\right)


Also in the spirit of duplicating functionality that's just gotta be out there: Escape HTML characters. I use this pretty often. Maybe I should make it spiffier.

Thursday, January 31, 2008

C++ static arrays are initialized every time

The language would be very broken if this weren't true, but I'm happier having some verification of it anyway:
#include <iostream>
using namespace std;

int main()
{
    for (int i = 0; i < 3; ++i) {
        int a[1] = {13};
        cout << a[0] << endl;
    }
    return 0;
}
$ g++ array_init_every_time.cpp
$ ./a.out
13
13
13

Wednesday, January 30, 2008

Do I think in closures?

R:
> a <- "Yes"
> f <- function() { return(a) }
> a <- "No"
> f()
[1] "No"
Okay. That's not what I was expecting.
Scheme:
> (define a "Yes")
> (define f (lambda () a))
> (define a "No")
> (f)
"No"
Hm. But (in R):
> f <- (function() {a <- "What?"; list(yes = function() a <- "Yes!", huh = function() cat(a, "\n"))})()
> f$huh()
What?
> f$yes()
> f$huh()
What?

I guess I'll have to take a look at the manual to see exactly what <- means.
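From the manual, it turns out <- always assigns in the function's own evaluation environment, so f$yes() above quietly creates a new local a rather than updating the one shared with huh. The superassignment operator <<- assigns in the enclosing environment instead, which gives the behavior I expected:
> f <- (function() {a <- "What?"; list(yes = function() a <<- "Yes!", huh = function() cat(a, "\n"))})()
> f$yes()
> f$huh()
Yes!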

Monday, January 28, 2008

Another way to print a matrix in Maxima

I've found Maxima's provided method for printing a matrix as TeX to produce code that isn't usable in LaTeX without a good amount of work. There are apparently conflicts with more recent versions of amsmath. Here're the beginnings of a replacement:
printMatrix(a) := block([as],
    as : matrixmap(lambda([x], printf(false, "~a", x)), a),
    simplode(maplist(lambda([row], simplode(row, " & ")), as),
             sconc(" ", "\\\\", newline)))$
Naturally, again I have to wonder very strongly if there isn't already a better way to do this.

Looking back at diag_matrix, I'm a little embarrassed to see it is in the manual. I could've sworn I looked at all the matches for "diagonal" in the manual.

Sunday, January 27, 2008

Create a diagonal matrix in Maxima

There's got to be a better way to do this! But how?
diag(d) := block([n],
    n : length(d),
    genmatrix(lambda([i, j], if i = j then d[i] else 0),
              n, n))$


---

January 28: Yes, there is a better way! In reply to a question on the Maxima list, Barton Willis pointed out a couple of functions: diag in package diag, and diag_matrix, which is included in the environment loaded by default:
(%i90) diag_matrix(1,2,3);
[ 1 0 0 ]
[ ]
(%o90) [ 0 2 0 ]
[ ]
[ 0 0 3 ]

Saturday, January 26, 2008

Approaching normality


qqnorm(rt(10000, df=1), xlab="Gaussian(0,1) quantiles", ylab="t_1 quantiles", main="t_1 v. Gaussian QQ-Plot")


qqnorm(rt(10000, df=10), xlab="Gaussian(0,1) quantiles", ylab="t_10 quantiles", main="t_10 v. Gaussian QQ-Plot")
abline(0, 1)



qqnorm(rt(10000, df=100), xlab="Gaussian(0,1) quantiles", ylab="t_100 quantiles", main="t_100 v. Gaussian QQ-Plot")
abline(0, 1)



y <- sqrt(rnorm(10000, mean=1)^2 + rnorm(10000)^2)
qqnorm(y, xlab="Gaussian(0,1) quantiles", ylab="Rice(1,1) quantiles", main="Rice(1,1) v. Gaussian QQ-Plot")
qqline(y)



y <- sqrt(rnorm(10000, mean=1, sd=10)^2 + rnorm(10000, sd=10)^2)
qqnorm(y, xlab="Gaussian(0,1) quantiles", ylab="Rice(1,100) quantiles", main="Rice(1,100) v. Gaussian QQ-Plot")
qqline(y)


Oh, I've got this backwards. Duh.


y <- sqrt(rnorm(10000, mean=100, sd=10)^2 + rnorm(10000, sd=10)^2)
qqnorm(y, xlab="Gaussian(0,1) quantiles", ylab="Rice(100,100) quantiles", main="Rice(100,100) v. Gaussian QQ-Plot")
qqline(y)


But there's more to it than signal-to-noise ratio, it seems. At least the relation between the quantiles doesn't look nearly so linear for Rice(1,1) as it does for Rice(100,100). It doesn't seem surprising that the distance of mu from the origin would play a role, however. This would probably be a good time to break out some math. But it's past my bedtime.
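A start on that math, for the record (a sketch, to be checked): writing the Rice variable as sqrt((nu + X)^2 + Y^2) with X and Y independent Gaussian(0, sigma^2), a Taylor expansion for nu much larger than sigma gives sqrt((nu + X)^2 + Y^2) ≈ nu + X + Y^2/(2 nu), i.e., a Gaussian plus a correction whose typical size relative to the noise scale sigma is about sigma/(2 nu). That suggests the ratio nu/sigma, the signal-to-noise ratio, is the controlling quantity after all: it's 1 for Rice(1,1) but 10 for Rice(100,100) (nu = 100, sigma = 10), which matches the plots.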