Monday, May 19, 2008

random partition

#include <assert.h>
#include <math.h>
#include "partition.h"
#include <Rmath.h>
#include <stdlib.h>

/**
* Returns a array of length numClasses normalized to sum
* to 1.
*/
static double* randomProportions(int numClasses)
{
double* props = malloc(sizeof(double) * numClasses);
int i;
double total = 0.0;

for (i = 0; i < numClasses; ++i) {
props[i] = runif(0.0, 1.0);
total += props[i];
}

for (i = 0; i < numClasses; ++i) {
props[i] /= total;
}

return props;
}

/**
* Ensures at least one element is assigned to each
* class. The caller is responsible for freeing the
* returned object.
*/
int* randomCounts(int len, int numClasses)
{
int* counts = malloc(sizeof(int) * numClasses);
double* props = randomProportions(numClasses);
const double remaining = len - numClasses;
int total = 0;
int i;

for (i = 0; i < numClasses; ++i) {
counts[i] = 1;
}

for (i = 1; i < numClasses; ++i) {
const int extra = (int) round(remaining * props[i]);
total += extra;
counts[i] += extra;
}

counts[0] += remaining - total;

free(props);

return counts;
}

/**
* On exit, classes[i] = the class assigned to the kth class,
* k = 0:(numClasses - 1). Assumes classes has length at least len.
* Assumes that numClasses <= len. Assigns at least one element
* to each class. The classes are assigned in blocks, ie, they
* partition the indeces 0:(len - 1) into numClasses classes,
* ordered from class 0 and up.
*/
void randomPartition(int len, int numClasses, int* classes)
{
int* counts = randomCounts(len, numClasses);
int i;
int currentClass = 0;

assert(numClasses <= len);

for (i = 0; i < len; ++i) {
classes[i] = currentClass;

--counts[currentClass];

if (!counts[currentClass]) {
++currentClass;
}
}

free(counts);
}
I'm not completely happy with this design. I should probably have made "randomPartition" take the counts as an argument (although then it wouldn't've been so "random") since at least for my work, we're almost always interested in the counts (and will just have to recalculate them if they're not available). I wonder what standard implementations of this are available. I suppose I should be checking that the mallocs are succeeding as well. Are there any common platforms where accessing address 0 doesn't lead to a seg fault? Or what platforms don't segfault on accessing 0?

No comments: