Saturday, March 8, 2008

strings again

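# Split a string into a vector of single characters.
getChars <- function(s) {
  n <- nchar(s)
  if (n > 0) substring(s, 1:n, 1:n) else character(0)
}

# Remove every character of s that appears in chars.
strip <- function(s, chars) {
  s.chars <- getChars(s)
  paste(s.chars[!(s.chars %in% chars)], collapse="")
}

# Translate characters: each character found in from is replaced by the
# character at the same position in to; all others are left unchanged.
tr <- function(s, from, to) {
  chars <- getChars(s)
  o <- match(chars, from)
  paste(ifelse(!is.na(o), to[o], chars), collapse="")
}

lower <- function(s) {
  tr(s, from=LETTERS, to=letters)
}

upper <- function(s) {
  tr(s, from=letters, to=LETTERS)
}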
Another go: S-PLUS doesn't have strsplit, so I use a different (and more efficient?) method for getting at the characters of a string: substring with vectors of start and stop positions.

> system.time(replicate(10000, strsplit("1234567890", "")[[1]]))
   user  system elapsed
  0.102   0.004   0.105
> system.time(replicate(10000, substring("1234567890", 1:10, 1:10)))
   user  system elapsed
  0.297   0.003   0.299


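That's a surprise: in R, substring is about three times slower than strsplit here (0.299 s vs. 0.105 s elapsed). Maybe I should try to avoid creating the index vector twice? Still, strsplit seems like it should be so much heavier.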
//The source code for strsplit reveals that it special-cases the pattern "". (See src/main/character.c.)
//This is documented in the help page for strsplit as well.
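
The idea of building the index vector only once could be sketched as below. This is R only (seq_len and strsplit may not be available in S-PLUS), and getCharsOnce is a hypothetical name, not part of the code above. Whether it closes the gap to strsplit is untested here; the timing call is included so you can check on your own machine.

```r
# Hypothetical variant of getChars: build the index vector once and
# pass the same vector as both start and stop positions.
getCharsOnce <- function(s) {
  n <- nchar(s)
  # Keep the empty-string guard from getChars.
  if (n == 0) return(character(0))
  idx <- seq_len(n)
  substring(s, idx, idx)
}

# Same benchmark shape as above; no expected numbers are claimed.
timing <- system.time(replicate(10000, getCharsOnce("1234567890")))
print(timing)
```

For non-empty strings the result should match what strsplit(s, "")[[1]] returns, so the two are interchangeable wherever strsplit exists.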
