Quotidian: March 2008

Saturday, March 29, 2008

scanf character classes -- oops

There was a serious bug in the code I posted for charToIntRange earlier (which I'll fix momentarily). I had thought that a character range "%[...]" in scanf acts like "%c" (with default width 1, no NUL added) but actually it acts like "%s". So, although &divider had the right type -- char* -- I was corrupting memory leading to an odd crash later. I guess the lesson for me is to read the documentation more thoroughly. (This is the first time I've used character classes in C. scanf suddenly seems much cooler than before now that I know of them.) This is also an error the compiler could catch though. If the argument corresponding to "%[...]" is a pointer to a char variable, this is always a sign of error. Maybe this case isn't common enough to warrant special attention though.

charClass2.c:
#include <stdio.h>
#include <string.h>

int main(int numArgs, char** args)
{
  int i;
  for (i = 0; i < numArgs; ++i) {
    char x[3];
    char y[1000];
    if (sscanf(args[i], "%2[a]%999[a]", x, y) == 2) {
      printf("%zu %zu\n", strlen(x), strlen(y));
    }
    else {
      printf(":-(\n");
    }
  }
  return 0;
}

$ gcc -Wall charClass2.c $ ./a.out a aa aaa aaaa baaaa :-( :-( :-( 2 1 2 2 :-(
("%zu" is not documented in the man page for printf on my system. I found it after getting warnings with earlier code about an incompatibility with size_t. It looks like it's a C99 extension. Discussion.)

//Actually... I didn't need a character class at all. Character literals match themselves. Of course! Ach. Lesson for me: Don't forget stuff like this.

//And it's not portable to assume char* index(const char *s, int c) is defined in string.h. I've switched to strchr, which I should've been using from the beginning.

generalized logit

This is what I feel like writing now:

y = a f(x)/(b + c f(x))
y b + y c f(x) = a f(x)
b y = (a - yc) f(x)
x = f^-1(b y / (a - y c))

Learn Haskell in 10 minutes... Finally!

stripHtml.hs: stripHtml :: String -> String stripHtml ('<':xs) = eatTag xs stripHtml (x:xs) = x : stripHtml xs stripHtml [] = [] eatTag :: String -> String eatTag ('>':xs) = stripHtml xs eatTag (x:xs) = eatTag xs eatTag [] = []

$ ghci
   ___         ___ _
  / _ \ /\  /\/ __(_)
 / /_\// /_/ / /  | |      GHC Interactive, version 6.6, for Haskell 98.
/ /_\\/ __  / /___| |      http://www.haskell.org/ghc/
\____/\/ /_/\____/|_|      Type :? for help.

Loading package base ... linking ... done.
Prelude> :l stripHtml.hs
[1 of 1] Compiling Main             ( stripHtml.hs, interpreted )
Ok, modules loaded: Main.
*Main> stripHtml "<b><blink>What's</blink></b><h1> a better</h1> way to d<tt>o this</tt>?"
"What's a better way to do this?"

This looks like fun.

Friday, March 28, 2008

Parse an int range parameter

#include <stdio.h>
#include <string.h>

/**
 * If str is of the form "<m>:<n>",
 * set *min = m
 * and *max = n. Otherwise,
 * if str if of the form "<n>",
 * set *min = defaultMin and
 * *max = n.
 * Return the number of numbers
 * successfully parsed (1 or 2 indicating success,
 * 0 failure).
 */
int charToIntRange(const char* str,
     int defaultMin,
     int* min,
     int* max)
{
  if (strchr(str, ':')) {
    if (sscanf(str, "%d:%d", min, max) == 2) {
      return 2;
    }
    else {
      return 0;
    }
  }
  else {
    *min = defaultMin;
    return sscanf(str, "%d", max);
  }
}

int main(int numArgs, char** args)
{
  int i;
  for (i = 1; i < numArgs; ++i) {
    int m;
    int n;
    if (charToIntRange(args[i], -1666, &m, &n)) {
      printf("%d:%d\n", m, n);
    }
    else {
      printf("failed to parse <<%s>>\n", args[i]);
    }
  }

  return 0;
}

$ gcc -Wall charToIntRange.c $ ./a.out 1 155:7 -3:3 :1 1: 19:-1 19:1a t15:5 -1666:1 155:7 -3:3 failed to parse <<:1>> failed to parse <<1:>> 19:-1 19:1 failed to parse <<t15:5>>

It should probably fail with an input of "19:1a", but I'll leave off worrying about that for now.

Is the format "%[:]" portable?

Wednesday, March 26, 2008

Rotate a matrix 90 degrees clockwise


##Rotate a square matrix 90 degrees clockwise.
rot90 <- function(a) {
  n <- dim(a)[1]
  stopifnot(n==dim(a)[2])
  t(a[n:1, ])
}

JavaScript:
Fill in a default value for every blank text field

Hello.

Matlab: Write numbers to a file:

f = fopen('test.txt', 'w');
fprintf(f, '%g\n', 1:50);
fclose(f);

(I'm sure there's a better way to do this.)

Wednesday, March 19, 2008

Marginal Ricians with correlated errors

First, what's a good way to generate errors with a desired correlation structure? Let's start with the following:

> d <- as.matrix(dist(1:4, diag=T, upper=T))
> s <- 0.3^d
> a <- chol(s)
> m <- matrix(nrow=10000,ncol=4)
> for (i in 1:10000) { x <- rnorm(4); m[i,] <- a %*% x}
> var(m)
           [,1]       [,2]      [,3]       [,4]
[1,] 1.09839578 0.30672920 0.1062927 0.03032918
[2,] 0.30672920 1.01055057 0.3031668 0.09055224
[3,] 0.10629268 0.30316682 1.0044863 0.27050059
[4,] 0.03032918 0.09055224 0.2705006 0.88942696
> s
      1    2    3     4
1 1.000 0.30 0.09 0.027
2 0.300 1.00 0.30 0.090
3 0.090 0.30 1.00 0.300
4 0.027 0.09 0.30 1.000

Is it surprising to find the sample variance such a relatively poor approximation to the true variance even with 10000 observations? (I guess the answer is "no!")

Now, let's use this to see what happens to a mixture of marginally Rician distributions as the correlation gets cranked up.

##Produce elements that are marginally Rician distributed but for
##which the noise between elements is correlated with the
##structure cor(X_i, X_j) = cor(Y_i, Y_j) =  rho^abs(i-j)
##but {X_i} independent of {Y_i}, i = 1:length(mu) and
##result[i] = sqrt(X_i^2 + Y_i^2).
##The value is a matrix of draws from the given distribution,
##each row a single draw so that the ith column corresponds
##to mu[i]
corRice <- function(mu, n, rho, sigma) {
  k <- length(mu)
  d <- as.matrix(dist(1:k, diag=TRUE, upper=TRUE))
  r <- (rho ^ d)
  a <- sigma * chol(r)
  result <- matrix(nrow=n, ncol=k)
  for (i in 1:n) {
    x <- (a %*% rnorm(k)) + mu
    y <- a %*% rnorm(k)
    result[i, ] <- sqrt(x^2 + y^2)
  }
  result
}

> r0 <- corRice(rep(c(0, 50, 100), each=50), n=10000, rho=0, sigma=10) > r1 <- corRice(rep(c(0, 50, 100), each=50), n=10000, rho=0.9, sigma=10) > r2 <- corRice(rep(c(0, 50, 100), each=50), n=10000, rho=0.99, sigma=10) > plot(quantile(r0, probs=(1:999)/1000), quantile(r1, probs=(1:999)/1000), xlab="quantiles with rho=0", ylab="quantiles with rho=0.9") > plot(quantile(r0, probs=(1:999)/1000), quantile(r2, probs=(1:999)/1000), xlab="quantiles with rho=0", ylab="quantiles with rho=0.99")

Hmmmmmmmm.

Tuesday, March 18, 2008

R's char functions

Previously, I posted some functions for manipulating strings in R. I was surprised not to have found these in R. Obviously, I didn't look hard enough. RSiteSearch("strings") led me to a post mentioning the following:
> tolower("hI THerE, guYs!") [1] "hi there, guys!" > toupper("hI THerE, guYs!") [1] "HI THERE, GUYS!" > casefold("Hi") [1] "hi" > casefold("Hi", upper=TRUE) [1] "HI" > gsub("-", "", "what-what-what---?", fixed=TRUE) [1] "whatwhatwhat?"
Well, I knew about gsub before at least.

S-PLUS's gsub doesn't recognize the fixed param but otherwise casefold and gsub work the same. (Of course, this is the point of having casefold in R.)

Monday, March 17, 2008

R special cases x^2 and x^0.5

Pulled from src/main/arithmetic.c:

static SEXP real_unary(ARITHOP_TYPE code, SEXP s1, SEXP lcall)
{
    int i, n;
    SEXP ans;

    switch (code) {
    case PLUSOP: return s1;
    case MINUSOP:
 ans = duplicate(s1);
 n = LENGTH(s1);
 for (i = 0; i < n; i++)
     REAL(ans)[i] = -REAL(s1)[i];
 return ans;
    default:
 errorcall(lcall, _("invalid unary operator"));
    }
    return s1; /* never used; to keep -Wall happy */
}

Would this be noticeably faster if we redefined the case for MINUSOP as

case MINUSOP:
  {
    const double* y1 = REAL(s1)[i];
    double* yAns;
    ans = duplicate(s1);
    yAns = REAL(ans);
    n = LENGTH(s1);
    for (i = 0; i < n; ++i)
      yAns[i] = -y1[i];
    return ans;
  }

?
We have the following defines:
Rinternals.h: #define REAL(x) ((double *) DATAPTR(x))
Rinternals.h: #define DATAPTR(x) (((SEXPREC_ALIGN *) (x)) + 1)
So, it seems like this could theoretically save 2n additions per negation. Woo! Is gcc smart enough to notice the legality of this move and take advantage of it?

Poking around in arithmetic.c also shows that x^2.0 and x^0.5 are special-cased to use more efficient alternatives than pow while there's another special routine R_pow_di to handle raising a double to an integer power. Which is used by pi^3? Setting breakpoints on R_pow and R_pow_di shows R_pow is used. In fact, pi^as.integer(3) also uses R_pow.

I wonder what the distribution of powers used in production R code looks like. 2 is almost certainly the most popular. Would it be worthwhile special casing other integer powers or even including a generic check for an integer power? Do most implementations of pow do this already?

Friday, March 14, 2008

an assortment of C

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LINE_SIZE 10000

/**
 * Reads doubles from in until EOF.
 * If le n is not null, *len is set to
 * the number of elements read.
 * The caller is responsible for freeing the 
 * memory of the returned object.
 * Returns null in case of an error.
 */
double* readArray(FILE* in, int* len)
{
  const char* SPACES = " \t\r\n";
  char* line = malloc(LINE_SIZE);
  int capacity = 100;
  double* y = malloc(sizeof(double) * capacity);
  int n = 0;
  while (!ferror(in) && !feof(in)) {
    if (fgets(line, LINE_SIZE, in)) {
      const char* next = line;
      while (*next) {
 next += strspn(next, SPACES);
 if (*next) {
   double v;
   if (sscanf(next, "%lf", &v) == 1) {
     if (capacity == n) {
       capacity *= 2;
       y = realloc(y, capacity * sizeof(double));
     }

     y[n] = v;
     ++n;
   }
   else {
     free(line);
     free(y);
     return 0;
   }
 }
 next += strcspn(next, SPACES);
      }
    }
    else {
      if (!feof(in)) {
 free(line);
 free(y);
 return 0;
      }
    }
  }

  if (len) {
    *len = n;
  }

  return y;
}

/**
 * Reads doubles from file.
 * If len is not null, *len is set to the
 * number of elements read.
 * The caller is responsible for freeing the 
 * memory of the returned object.
 * Returns null in case of an error.
 */
double* readArrayFromFile(const char* file, int* len)
{
  FILE* in = fopen(file, "r");
  double* result = 0;
  if (!in) {
    fprintf(stderr, "Problem opening %s\n", file);
    perror("");
  }
  else {
    result = readArray(in, len);
    fclose(in);
  }
  return result;
}

/**
 * Returns 1 if tail is a suffix of s, 0 otherwise.
 */
int endsWith(const char* s, const char* tail)
{
  const int sLen = strlen(s);
  const int tailLen = strlen(tail);
  if (sLen >= tailLen) {
    return strcmp(s + sLen - tailLen, tail) == 0;
  }
  else {
    return 0;
  }
}

int remove(double* y, const int len,
const double toRemove)
{
  int newLen = 0;
  int i;
  
  for (i = 0; i < len; ++i) {
    const double yi = y[i];
    if (yi != toRemove) {
      y[newLen] = yi;
      ++newLen;
    }
  }
  return newLen;
}

Thursday, March 13, 2008

Change the working directory in SAS

Just double click the status bar on the bottom of the main window giving the location of the current working directory!
Change the SAS Working Folder
This was not obvious to me.

These look possibly useful:
proc discrim [PDF]
proc princomp [PDF]

Wednesday, March 12, 2008

letter counts per paragraph

Count up the number of each letter by paragraph.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <strings.h>

int main(int numArgs, char** args)
{
  const int lineSize = 10000;

  if (numArgs != 2) {
    fprintf(stderr, "Usage: par_char_freq <text file>\n");
    exit(1);
  }
  else {
    char* line = malloc(lineSize);
    FILE* in = fopen(args[1], "r");
    int counts[26];
    int paragraphTotal = 0;

    if (!in) {
      fprintf(stderr, "Problem opening %s\n", args[1]);
      perror("");
      exit(1);
    }

    bzero(counts, 26 * sizeof(int));

    while (!feof(in) && !ferror(in)) {
      int lineTotal = 0;
      if (fgets(line, lineSize, in)) {
 int i;
 for (i = 0; line[i]; ++i) {
   const char c = tolower(line[i]);
   if ('a' <= c && c <= 'z') {
     ++counts[c - 'a'];
     ++lineTotal;
   }
 }
 
 if (lineTotal) {
   paragraphTotal += lineTotal;
 }
 else if (paragraphTotal) {
   int i;
   for (i = 0; i < 26; ++i) {
     printf("%d ", counts[i]);
   }
   printf("\n");

   paragraphTotal = 0;
   bzero(counts, 26 * sizeof(int));
 }
      }
    }

    fclose(in);
    free(line);
  }
  return 0;
}

Tuesday, March 11, 2008

The most permutable word is "trace"?

Which English word has the most permutations which are also English words? (Of course each of the lexible permutations is also an answer. Is "lexible" a sensible term for having the property of being a word?)

Here's some C++ code to look at this:

most_permutable.cpp:
#include <algorithm>
#include <cctype>
#include <fstream>
#include <functional>
#include <iostream>
#include <map>
#include <string>
using namespace std;

template<typename T>
class select2nd {
public:
  typedef T argument_type;
  
  typename T::second_type operator()(const T& t) {
    return t.second;
  }
};

template<typename B, typename U1, typename U2>
class binary_compose {
  public:
  typedef typename B::result_type result_type;
  typedef typename U1::argument_type arg1_type;
  typedef typename U2::argument_type arg2_type;
  binary_compose(const B& b, const U1& u1, const U2& u2) :
    b(b), u1(u1), u2(u2) {}

  result_type operator()(const arg1_type& x1,
    const arg2_type& x2)
  {
    return b(u1(x1), u2(x2));
  }

private:
  B b;
  U1 u1;
  U2 u2;
};

template<typename B, typename U1, typename U2>
binary_compose<B, U1, U2>
compose2(B b, U1 u1, U2 u2)
{
  return binary_compose<B, U1, U2>(b, u1, u2);
}

typedef map<string, int>::iterator iter;

int main(int numArgs, char** args)
{
  map<string, int> permutationCounts;
  ifstream in("/usr/share/dict/words");
  while (in) {
    string word;
    in >> word;
    if (!word.empty() && islower(word[0])) {
      sort(word.begin(), word.end());
      iter p = permutationCounts.find(word);
      if (p != permutationCounts.end()) {
 ++(*p).second;
      }
      else {
 permutationCounts.insert(make_pair(word, 1));
      }
    }
  }

  iter m = max_element(permutationCounts.begin(),
         permutationCounts.end(),
         compose2(less<int>(),
    select2nd<pair<string, int> >(),
    select2nd<pair<string, int> >()));
  cout << (*m).first << " with " << (*m).second << '\n';
}

Drum roll:
$ g++ -Wall most_permutable.cpp $ ./a.out acert with 9

What are the 9?

find_perm.cpp:
#include <algorithm>
#include <cstdio>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main(int numArgs, char** args)
{
  if (numArgs != 3) {
    cerr << "Usage: find_perm <file> <letters>\n";
    return 1;
  }

  ifstream in(args[1]);
  if (!in) {
    cerr << "Problem opening " << args[1] << '\n';
    perror("");
    return 1;
  }

  string letters(args[2]);
  sort(letters.begin(), letters.end());
  const string::size_type n = letters.size();

  string word;
  while (in) {
    in >> word;

    if (n == word.size()) {
      string key(word);

      sort(key.begin(), key.end());
      
      if (key == letters) {
 cout << word << '\n';
      }
    }
  }

  in.close();

}

$ g++ -Wall -o find_perm find_perm.cpp $ ./find_perm oops Usage: find_perm <file> <letters> $ ./find_perm oops oops Problem opening oops No such file or directory $ ./find_perm /usr/share/dict/words trace caret carte cater crate creat creta react recta trace

What the heck is "creat"? Hm. Google doesn't know about it. It is really there though:
$ grep -w -n creat /usr/share/dict/words 45105:creat

Saturday, March 8, 2008

strings again

getChars <- function(s) {
  n <- nchar(s)
  if (n > 0) substring(s, 1:n, 1:n) else character(0)
}

strip <- function(s, chars) {
  s.chars <- getChars(s)
  paste(s.chars[!(s.chars %in% chars)], collapse="")
}

tr <- function(s, from, to) {
  chars <- getChars(s)
  o <- match(chars, from)
  paste(ifelse(!is.na(o), to[o], chars), collapse="")
}

lower <- function(s) {
  tr(s, from=LETTERS, to=letters)
}

upper <- function(s) {
  tr(s, from=letters, to=LETTERS)
}

Another go: S-PLUS doesn't have strsplit so I use a different (and more efficient?) method for getting at the characters of a string.

> system.time(replicate(10000, strsplit("1234567890", "")[[1]])) user system elapsed 0.102 0.004 0.105 > system.time(replicate(10000, substring("1234567890", 1:10, 1:10))) user system elapsed 0.297 0.003 0.299

That's a surprise. Maybe I should try avoiding creating the index list twice? Still strsplit seems so much heavier.
//The source code for strsplit reveals that they make a special case of the pattern "". (See src/main/character.c.)
//Well, this is documented in the help page for strsplit as well.