Saturday, April 26, 2008

logistic regression on a circle

> x <- rnorm(1000)
> y <- rnorm(1000)
> inside <- ifelse(x^2 + y^2 < 1, 1, 0)
> f <- glm(inside ~ I(y^2)+I(x^2)+x*y, family=binomial())
Warning messages:
1: In glm.fit(x = X, y = Y, weights = weights, start = start, etastart = etastart, :
algorithm did not converge
2: In glm.fit(x = X, y = Y, weights = weights, start = start, etastart = etastart, :
fitted probabilities numerically 0 or 1 occurred
> d <- expand.grid(x=seq(-1,1,len=100), y=seq(-1,1,len=100))
> p <- predict(f, type="response", newdata=d)
> image(matrix(p, ncol=100))


This seems to work pretty well, as one might expect.
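
A quick way to put a number on "pretty well" (a sketch using the objects from the session above) is to compare the fitted probabilities against the true labels:

> table(inside, fitted(f) > 0.5)
> mean((fitted(f) > 0.5) == inside)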

Now let's apply this to the Mandelbrot Set:
mset.data <- function(width=200, iterations=100) {
  ## Grid of candidate points c in the complex plane.
  mset <- as.vector(outer(seq(-2, 2, len=width),
                          seq(-2, 2, len=width) * 1i, FUN="+"))
  cmset <- NULL
  z <- numeric(length(mset))
  for (i in 1:iterations) {
    z <- z^2 + mset
    ## Points whose orbit has escaped past modulus 2 are outside the set.
    outside <- Mod(z) > 2
    cmset <- append(cmset, mset[outside])
    inside <- !outside
    mset <- mset[inside]
    z <- z[inside]
  }
  ## cmset holds the escapees, mset the points still inside after
  ## `iterations` steps; all.next is one application of z -> z^2 + c.
  all <- c(cmset, mset)
  all.next <- all + all^2
  inside <- rep(c(0, 1), c(length(cmset), length(mset)))
  data.frame(inside=inside, x1=Re(all), y1=Im(all),
             x2=Re(all.next), y2=Im(all.next))
}

mset.prob <- function(width=200) {
  d <- mset.data(width=width)
  f <- glm(inside ~ I(x1^2) + I(y1^2) +
             x1*y1 + I(x2^2) + I(y2^2) + x2*y2,
           data=d, family=binomial())
  z1 <- as.vector(outer(seq(-2, 2, len=width),
                        seq(-2, 2, len=width) * 1i, FUN="+"))
  z2 <- z1 + z1^2
  p <- predict(f, type="response",
               newdata=data.frame(x1=Re(z1), y1=Im(z1),
                                  x2=Re(z2), y2=Im(z2)))
  matrix(p, nrow=width, ncol=width)
}

> p <- mset.prob()
> image(seq(-2,2,len=200), seq(-2,2,len=200), p, xlab="x", ylab="y")


How're we doing in actually computing the M-Set?
> md <- mset.data()
> o <- md$inside==1
> plot(c(), xlim=c(-2,2), ylim=c(-2,2), type="n")
> points(md$x1[o], md$y1[o], pch=1)

Looks about right.
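
To see how the logistic fit lines up with those points (a sketch continuing the same session, with p from mset.prob() above), one can overlay the p = 0.5 contour of the fitted probabilities:

> contour(seq(-2,2,len=200), seq(-2,2,len=200), p, levels=0.5, add=TRUE)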

Monday, April 21, 2008

int division by zero

What output do you get if you get a division by zero in integer math in C?
$ cat int_div_by_zero.c 
#include <stdio.h>

int main()
{
    printf("%d\n", 3 / 0);
    return 0;
}
$ gcc int_div_by_zero.c
int_div_by_zero.c: In function 'main':
int_div_by_zero.c:5: warning: division by zero
$ ./a.out
Floating point exception
Kaboom! No output from printf at all: the program is killed by SIGFPE, which the shell reports as "Floating point exception". This result immediately raises the question, "Is integer division as slow as or slower than floating point division, since it seems to be implemented in terms of it?"

(We get the same result with 0/0, and also when we compile with -fno-trapping-math.)

Sunday, April 20, 2008

survreg

Just checking that I understand the model being fit by survreg from the survival package:

##Generate n variates from the standard (mu=0, sigma=1)
##Smallest Extreme Value distribution.
rsev <- function(n) log(-log(runif(n)))

##g -- RNG for the underlying standard distribution
##for the log-location-scale family.
##Defaults to rsev to produce Weibull variates.
gen.data <- function(x, beta0=3, beta1=9, sigma=4, g=rsev) {
  n <- length(x)
  exp(beta0 + beta1 * x + sigma * g(n))
}
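
## A quick sanity check on gen.data: with x = 0 and the default g = rsev,
## the generated times should be Weibull with shape 1/sigma = 1/4 and
## scale exp(beta0) = exp(3), e.g. (a sketch):
##   ks.test(gen.data(x=rep(0, 10000)), "pweibull", shape=1/4, scale=exp(3))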
> x <- rnorm(10000)
> Time <- gen.data(x=x)
> f <- survreg(Surv(Time, rep(1, 10000)) ~ x)
> f
Call:
survreg(formula = Surv(Time, rep(1, 10000)) ~ x)

Coefficients:
(Intercept) x
3.032166 9.037986

Scale= 3.968822

Loglik(model)= -37402.9 Loglik(intercept only)= -46087.6
Chisq= 17369.46 on 1 degrees of freedom, p= 0
n= 10000
> Time <- gen.data(x=x, g=rnorm)
> f <- survreg(Surv(Time, rep(1, 10000)) ~ x, dist="lognormal")
> f
Call:
survreg(formula = Surv(Time, rep(1, 10000)) ~ x, dist = "lognormal")

Coefficients:
(Intercept) x
3.018112 8.926638

Scale= 3.995970

Loglik(model)= -58522.6 Loglik(intercept only)= -67530.3
Chisq= 18015.41 on 1 degrees of freedom, p= 0
n= 10000
>


So, it looks like the model being fit is P(T <= t) = F((log(t) - mu) / sigma) where mu = X * beta. What's it doing in the case of dist="exponential", "gaussian", or "logistic" then? Oh, you can specify the transformation, e.g., log, via the "trans", "dtrans", and "itrans" fields of the distribution object, and leaving those fields unspecified seems to imply that the identity transformation is used. (See survreg.distributions.)
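
For instance, if I'm reading survreg.distributions right, the lognormal entry is just the gaussian one plus a log transformation, which is easy enough to inspect:

> names(survreg.distributions)
> str(survreg.distributions$lognormal)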

Friday, April 11, 2008

weighted random sample in C

/**
 * Sample compile:
 * gcc -Wall -lRmath weighted_sample.c -I/Library/Frameworks/R.framework/Resources/include -L/Library/Frameworks/R.framework/Resources/lib
 * $ ./a.out
 * mean: 0.409
 */

#define MATHLIB_STANDALONE
#include <Rmath.h>

#include <stdio.h>
#include <stdlib.h>

/**
 * Compute the normalized cumulative sum of y, ie,
 * result[j] = (sum(i=0:j) y[i]) / (sum(i=0:(len-1)) y[i]),
 * so that result[len - 1] is (up to rounding) 1.
 * The caller is responsible for freeing the returned
 * object.
 */
double* normedCumSum(const double* y, int len)
{
    double* result = malloc(len * sizeof(double));
    if (len > 0) {
        double norm;
        int i;

        result[0] = y[0];

        for (i = 1; i < len; ++i) {
            result[i] = result[i - 1] + y[i];
        }

        norm = 1.0 / result[len - 1];
        for (i = 0; i < len; ++i) {
            result[i] *= norm;
        }
    }

    return result;
}

/**
 * Sample reps elements from y (with replacement) with the given
 * weights w. The caller is responsible for freeing the returned
 * object.
 */
double* sample(const double* y, const double* w, int len, int reps)
{
    double* result = malloc(reps * sizeof(double));
    double* probs = normedCumSum(w, len);
    int i;

    for (i = 0; i < reps; ++i) {
        int j;

        const double u = runif(0.0, 1.0);
        /* Default to the last element in case rounding leaves
           probs[len - 1] slightly below u. */
        result[i] = y[len - 1];
        for (j = 0; j < len; ++j) {
            if (u <= probs[j]) {
                result[i] = y[j];
                break;
            }
        }
    }

    free(probs);

    return result;
}

int main()
{
    double y[2] = {0, 1};
    double w[2] = {0.6, 0.4};
    double* x = sample(y, w, 2, 1000);
    double sum = 0.0;
    int i;
    for (i = 0; i < 1000; ++i) {
        sum += x[i];
    }
    printf("mean: %g\n", sum / 1000);

    free(x);
    return 0;
}
Could be quite a bit more efficient. It should work though.
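
As a cross-check, R's built-in sample() does the same weighted draw in one line, and the mean should likewise come out near 0.4:

> mean(sample(c(0, 1), 1000, replace=TRUE, prob=c(0.6, 0.4)))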

pseudorandom samples in Matlab

>> randsample([1,2,3,4], 2)
ans =
2 3


From the help page:
RANDSAMPLE Random sample, with or without replacement.
Y = RANDSAMPLE(N,K) returns Y as a 1-by-K vector of values sampled
uniformly at random, without replacement, from the integers 1:N.

Y = RANDSAMPLE(POPULATION,K) returns K values sampled uniformly at
random, without replacement, from the values in the vector POPULATION.

Y = RANDSAMPLE(...,REPLACE) returns a sample taken with replacement if
REPLACE is true, or without replacement if REPLACE is false (the default).

Y = RANDSAMPLE(...,true,W) returns a weighted sample, using positive
weights W, taken with replacement. W is often a vector of probabilities.
This function does not support weighted sampling without replacement.

Example: Generate a random sequence of the characters ACGT, with
replacement, according to specified probabilities.

R = randsample('ACGT',48,true,[0.15 0.35 0.35 0.15])

See also rand, randperm.
I'm still learning how to use the help system. It's pretty silly, but I found this via a Google search rather than via the program's internal search. ("sample", "random", and "random sample" all returned huge lists of results, with the one I was interested in buried somewhere in the middle. Ah, putting quote marks around "random sample" brings randsample up near the top.)
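
For comparison, the R analogue of that ACGT example would be something like:

> paste(sample(c("A","C","G","T"), 48, replace=TRUE, prob=c(0.15, 0.35, 0.35, 0.15)), collapse="")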

Thursday, April 10, 2008

smbclient

smbclient \\\\[host name]\\[share]

http://www.faqs.org/docs/Linux-HOWTO/SMB-HOWTO.html

When scp failed me, smbclient saved the hour.

Tuesday, April 8, 2008

Row major

row_major.c:
#include <stdio.h>

int main() {
    char a[4][3][2];
    int i;

    for (i = 0; i < 4; ++i) {
        int j;
        for (j = 0; j < 3; ++j) {
            int k;
            for (k = 0; k < 2; ++k) {
                printf("(%d, %d, %d) ==> %d\n", i, j, k,
                       (int)(&a[i][j][k] - &a[0][0][0]));
            }
        }
    }
    return 0;
}

$ gcc -Wall row_major.c
$ ./a.out
(0, 0, 0) ==> 0
(0, 0, 1) ==> 1
(0, 1, 0) ==> 2
(0, 1, 1) ==> 3
(0, 2, 0) ==> 4
(0, 2, 1) ==> 5
(1, 0, 0) ==> 6
(1, 0, 1) ==> 7
(1, 1, 0) ==> 8
(1, 1, 1) ==> 9
(1, 2, 0) ==> 10
(1, 2, 1) ==> 11
(2, 0, 0) ==> 12
(2, 0, 1) ==> 13
(2, 1, 0) ==> 14
(2, 1, 1) ==> 15
(2, 2, 0) ==> 16
(2, 2, 1) ==> 17
(3, 0, 0) ==> 18
(3, 0, 1) ==> 19
(3, 1, 0) ==> 20
(3, 1, 1) ==> 21
(3, 2, 0) ==> 22
(3, 2, 1) ==> 23


This matches what the FFTW3 manual describes: for an n_1 x ... x n_d array, (i_1, ..., i_d) ==> i_d + n_d * (i_{d-1} + n_{d-1} * (... + n_2 * i_1)). So it is the last index, i_d, that varies fastest; elements that differ only in i_d sit next to each other in memory. Coming from column-major languages, where it is the row index that varies fastest, this makes a[i][j] feel as though j were the "row coordinate", which will take some getting used to, although it is another thing I should've been aware of long ago. (On the other hand, I rarely use statically allocated multidimensional arrays and so have almost certainly never written anything incorrect as a result of this gap in knowledge.)
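
For contrast, R (like Fortran and Matlab) stores arrays in column-major order, so there it is the first index that varies fastest, as a quick experiment shows:

> a <- array(0:23, dim=c(4, 3, 2))
> a[2, 1, 1]  ##element 1 of the underlying data (0-based)
[1] 1
> a[1, 2, 1]  ##element 4
[1] 4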

//Does Java use the same convention? How about the various packages for allocating arrays in C?

Saturday, April 5, 2008

print out the vector implementation

a=`locate stl_vector.h | head -n 1`; cat $a

Objective C doesn't have keyword/named arguments

This recently came up on Apple's Objective C list: selectors vs named arguments. If Objective C really did have keyword arguments, then the selector foo:foo: would be illegal. But it's not. I know I've misspoken on this, because I remember asking someone whether there were any languages with nothing but keyword arguments, and Smalltalk falsely came to mind. So, are there any languages with nothing but keyword arguments, i.e., where you must always use calls of the form foo(arg1="pizza")? That could of course be a little irritating for functions with only one parameter. And now that I realize Smalltalk is doing something else (duh!), the idea seems less appealing overall, because I really like what Smalltalk does.