Hi,

We've had several solutions, and I was curious about their relative efficiency. Here's a test with a moderately large data vector:

> library("microbenchmark")
> set.seed(123) # for reproducibility
> x <- sample(xc, 1e4, replace=TRUE) # "data"
> microbenchmark(John = John <- xn[x],
+                Rich = Rich <- xn[match(x, xc)],
+                Jeff = Jeff <- {
+                    n <- as.integer( sub( "[a-i]$", "", x ) )
+                    d <- match( sub( "^\\d+", "", x ), letters[1:9] )
+                    d[ is.na( d ) ] <- 0
+                    n + d / 10
+                },
+                David = David <- as.numeric(gsub("a", ".3",
+                                            gsub("b", ".5",
+                                                 gsub("c", ".7", x)))),
+                times=1000L
+ )
Unit: microseconds
  expr       min        lq       mean     median         uq       max neval cld
  John   228.816   345.371   513.5614   503.5965   533.0635  10829.08  1000 a
  Rich   217.395   343.035   534.2074   489.0075   518.3260  15388.96  1000 a
  Jeff 10325.471 13070.737 15387.2545 15397.9790 17204.0115 153486.94  1000  b
 David 14256.673 18148.492 20185.7156 20170.3635 22067.6690  34998.95  1000   c
> all.equal(John, Rich)
[1] TRUE
> all.equal(John, David)
[1] "names for target but not for current"
> all.equal(John, Jeff)
[1] "names for target but not for current" "Mean relative difference: 0.1498243"

Of course, efficiency isn't the only consideration, and aesthetically (and no doubt subjectively) I prefer Rich Heiberger's solution. OTOH, Jeff's solution is more general in that it generates the correspondence between letters and numbers. The argument for Jeff's solution would, however, be stronger if it gave the desired answer.

Best,
 John

> On Jul 10, 2020, at 3:28 PM, David Carlson <dcarlson at tamu.edu> wrote:
>
> Here is a different approach:
>
> xc <- c("1", "1a", "1b", "1c", "2", "2a", "2b", "2c")
> xn <- as.numeric(gsub("a", ".3", gsub("b", ".5", gsub("c", ".7", xc))))
> xn
> # [1] 1.0 1.3 1.5 1.7 2.0 2.3 2.5 2.7
>
> David L Carlson
> Professor Emeritus of Anthropology
> Texas A&M University
>
> On Fri, Jul 10, 2020 at 1:10 PM Fox, John <jfox at mcmaster.ca> wrote:
> Dear Jean-Louis,
>
> There must be many ways to do this.
> Here's one simple way (with no claim of optimality!):
>
> > xc <- c("1", "1a", "1b", "1c", "2", "2a", "2b", "2c")
> > xn <- c(1, 1.3, 1.5, 1.7, 2, 2.3, 2.5, 2.7)
> >
> > set.seed(123) # for reproducibility
> > x <- sample(xc, 20, replace=TRUE) # "data"
> >
> > names(xn) <- xc
> > z <- xn[x]
> >
> > data.frame(z, x)
>      z  x
> 1  2.5 2b
> 2  2.5 2b
> 3  1.5 1b
> 4  2.3 2a
> 5  1.5 1b
> 6  1.3 1a
> 7  1.3 1a
> 8  2.3 2a
> 9  1.5 1b
> 10 2.0  2
> 11 1.7 1c
> 12 2.3 2a
> 13 2.3 2a
> 14 1.0  1
> 15 1.3 1a
> 16 1.5 1b
> 17 2.7 2c
> 18 2.0  2
> 19 1.5 1b
> 20 1.5 1b
>
> I hope this helps,
>  John
>
> -----------------------------
> John Fox, Professor Emeritus
> McMaster University
> Hamilton, Ontario, Canada
> Web: http::/socserv.mcmaster.ca/jfox
>
> > On Jul 10, 2020, at 1:50 PM, Jean-Louis Abitbol <abitbol at sent.com> wrote:
> >
> > Dear All
> >
> > I have a character vector, representing histology stages, such as for example:
> > xc <- c("1", "1a", "1b", "1c", "2", "2a", "2b", "2c")
> >
> > and this goes on to 3, 3a etc in various order for each patient. I do have of course a pre-established classification available which does change according to the histology criteria under assessment.
> >
> > I would want to convert xc, for plotting reasons, to a numeric vector such as
> >
> > xn <- c(1, 1.3, 1.5, 1.7, 2, 2.3, 2.5, 2.7)
> >
> > Unfortunately I have no clue on how to do that.
> >
> > Thanks for any help and apologies if I am missing the obvious way to do it.
> >
> > JL
> > --
> > Verif30042020
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
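
As a hedged illustration of the comparison above (not part of the original messages), the sketch below runs the two lookup approaches on the same sample and checks that they agree once names are dropped. It also shows one possible repair of the suffix-parsing idea, in which the letter is mapped through an explicit offset table. The `stage_to_num()` helper and its `offsets` argument are assumptions introduced here, with the offsets chosen to reproduce the desired `xn`.

```r
# Lookup table taken from the thread
xc <- c("1", "1a", "1b", "1c", "2", "2a", "2b", "2c")
xn <- c(1, 1.3, 1.5, 1.7, 2, 2.3, 2.5, 2.7)

set.seed(123)
x <- sample(xc, 20, replace = TRUE)   # "data"

# Named-vector lookup: index by the stage labels themselves
z_named <- setNames(xn, xc)[x]

# match() lookup: find each label's position in xc, then index xn
z_match <- xn[match(x, xc)]

# The values agree; only the names attribute differs
all.equal(unname(z_named), z_match)

# Hypothetical helper: parse "integer + optional letter" and map the
# letter through an offset table (assumed values, not from the thread)
stage_to_num <- function(x, offsets = c(a = 0.3, b = 0.5, c = 0.7)) {
  n <- as.integer(sub("[a-z]$", "", x))   # leading integer part
  suffix <- sub("^[0-9]+", "", x)         # trailing letter, "" if absent
  d <- unname(offsets[suffix])            # NA where there is no known suffix
  d[is.na(d)] <- 0
  n + d
}

all.equal(stage_to_num(xc), xn)
```

The lookup-table versions need every label listed in advance, while the parsing version extends to "3", "3a" and so on without edits, at the cost of assuming a fixed label pattern.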
Many thanks to all. This help-list is wonderful.

I have used Rich Heiberger's solution using match and found something to learn in each answer.

Off topic, I also enjoyed very much his 2008 paper on the graphical presentation of safety data....

Best wishes.

On Fri, Jul 10, 2020, at 10:02 PM, Fox, John wrote:
> Hi,
>
> We've had several solutions, and I was curious about their relative
> efficiency. Here's a test with a moderately large data vector:

--
Verif30042020
Hello Jean-Louis,

Noting the subject line of your post I thought the first answer would have been encoding histology stages as factors, and "unclass-ing" them to obtain integers that then can be mathematically manipulated. You can get a lot of work done with all the commands listed on the "factor" help page:

?factor

# simulated data: 36 samples, stage labels repeated 1 to 8 times
samples <- 1:36
values <- runif(length(samples), min=1, max=length(samples))
hist <- rep(c("1", "1a", "1b", "1c", "2", "2a", "2b", "2c"), times=1:8)
data1 <- data.frame("samples" = samples, "values" = values, "hist" = hist )

# encode the stages as a factor with explicitly ordered levels
(data1$hist <- factor(data1$hist, levels=c("1", "1a", "1b", "1c", "2", "2a", "2b", "2c")) )
unclass(data1$hist)   # underlying integer codes

# color by stage
library(RColorBrewer); pal_1 <- brewer.pal(8, "Pastel2")
barplot(data1$values, beside=TRUE, col=pal_1[data1$hist])
plot(data1$hist, data1$values, col=pal_1)
pal_2 <- brewer.pal(8, "Dark2")
plot(unclass(data1$hist)/4, data1$values, pch=19, col=pal_2[data1$hist] )

# conditioning on a grouping variable
group <- c(rep(0,10),rep(1,26)); data1$group <- group
library(lattice); dotplot(hist ~ values | group, data=data1, xlim=c(0,36) )

HTH, Bill.

W. Michels, Ph.D.

On Fri, Jul 10, 2020 at 1:41 PM Jean-Louis Abitbol <abitbol at sent.com> wrote:
>
> Many thanks to all. This help-list is wonderful.
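
The factor idea above can be combined with the numeric positions requested in the original question. The following is only a sketch under stated assumptions: the `obs` vector is made-up example data, and `xn` is the plotting position for each level, taken from the original post.

```r
# Stage labels in their intended order, with one plotting position per level
stages <- c("1", "1a", "1b", "1c", "2", "2a", "2b", "2c")
xn     <- c(1, 1.3, 1.5, 1.7, 2, 2.3, 2.5, 2.7)

# Hypothetical observations (not from the thread)
obs <- c("2b", "1", "1c", "2a", "1")

# Explicit levels fix the ordering regardless of how the data arrive
f <- factor(obs, levels = stages)

# Integer codes: the position of each observation's level in `stages`
codes <- as.integer(f)

# The codes index directly into the vector of plotting positions
pos <- xn[codes]

data.frame(obs, codes, pos)
```

Keeping the factor around has the side benefit that `levels(f)` can label the axis while `pos` supplies the coordinates.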
On Sat, Jul 11, 2020 at 8:04 AM Fox, John <jfox at mcmaster.ca> wrote:
> We've had several solutions, and I was curious about their relative efficiency. Here's a test

Am I the only person on this mailing list who learnt to program with ASCII...?

In theory, the most ***efficient*** solution, is to get the ASCII/UTF8/etc values.
Then use a simple (math) formula.
No matching, no searching, required ...

Here's one possibility:

xc <- c ("1", "1a", "1b", "1c", "2", "2a", "2b", "2c")

I <- (nchar (xc) == 2)
xn <- as.integer (substring (xc, 1, 1) )
xn [I] <- xn [I] + (utf8ToInt (paste (substring (xc [I], 2, 2), collapse="") ) - 96) / 4
xn

Unfortunately, this makes R look bad.
The corresponding C implementation is simpler and presumably the performance winner.
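
To make the arithmetic in the code-point approach explicit, here is a small sketch of the same idea. It assumes single-digit stages and a single lowercase ASCII suffix, as the code above does. The variable names are illustrative, and the second mapping (through an explicit offset vector) is an addition for comparison rather than part of the message.

```r
xc <- c("1", "1a", "1b", "1c", "2", "2a", "2b", "2c")

has_suffix <- nchar(xc) == 2
suffix <- substring(xc[has_suffix], 2, 2)

# utf8ToInt() takes one string, so the suffixes are pasted together first.
# "a" is code point 97, so subtracting it and adding 1 gives a = 1, b = 2, ...
idx <- utf8ToInt(paste(suffix, collapse = "")) - utf8ToInt("a") + 1L

base <- as.numeric(substring(xc, 1, 1))

# Dividing by 4, as in the message: suffixes become .25, .50, .75
quarters <- base
quarters[has_suffix] <- quarters[has_suffix] + idx / 4

# Assumed alternative: use idx to pick the offsets asked for originally
tenths <- base
tenths[has_suffix] <- tenths[has_suffix] + c(0.3, 0.5, 0.7)[idx]

rbind(quarters, tenths)
```

The division by 4 spaces the sub-stages evenly but yields 1.25 and 1.75 rather than the 1.3 and 1.7 of the requested `xn`; whether that matters depends on the plot.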
I'll admit that I cut my teeth on ASCII, but I worried about your reliance on that ancient typographic ordering. I wrote a little function:

# convert one label such as "1b" to "1.2" (digit, then position of the letter)
al2num_sub<-function(x) {
 xspl<-unlist(strsplit(x,""))
 if(length(xspl) > 1)
  xspl<-paste(xspl[1],which(letters==xspl[2]),sep=".")
 return(xspl)
}

# apply the function to each element of xc
unlist(sapply(xc,al2num_sub))

that does the trick with ASCII, but there was a nagging worry that it wouldn't work for any ordering apart from the Roman alphabet. Unfortunately I couldn't find any way to substitute something for "letters" that would allow me to plug in a more general solution like:

alpha.set<-c("letters","greek",...)

Maybe someone else can crack that one.

Jim

On Sun, Jul 12, 2020 at 9:07 AM Abby Spurdle <spurdle.a at gmail.com> wrote:
>
> Am I the only person on this mailing list who learnt to program with ASCII...?
>
> In theory, the most ***efficient*** solution, is to get the
> ASCII/UTF8/etc values.
> Then use a simple (math) formula.
> No matching, no searching, required ...
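
One possible way to approach the open question about plugging in a different alphabet is to pass the ordered suffix set as an argument and locate each suffix with `match()`. This is a sketch, not a tested general solution: the function name `al2num()`, its `alphabet` argument, and the three-letter `greek` vector are all assumptions made for illustration.

```r
# Hypothetical generalization: the ordered suffix alphabet is a parameter
al2num <- function(x, alphabet = letters) {
  stage  <- sub("[^0-9]+$", "", x)   # leading digits
  suffix <- sub("^[0-9]+", "", x)    # trailing suffix, "" if none
  pos    <- match(suffix, alphabet)  # position in the alphabet, NA if none

  # A non-empty suffix that is not in the alphabet is an error
  if (any(suffix != "" & is.na(pos))) stop("suffix not found in alphabet")

  out <- ifelse(is.na(pos), stage, paste(stage, pos, sep = "."))
  as.numeric(out)
}

xc <- c("1", "1a", "1b", "1c", "2", "2a", "2b", "2c")
al2num(xc)

# Same function with an assumed Greek suffix set (alpha, beta, gamma)
greek   <- c("\u03b1", "\u03b2", "\u03b3")
x_greek <- paste0(c("1", "1", "2"), c("", greek[1], greek[3]))
al2num(x_greek, alphabet = greek)
```

Because the position is pasted after a decimal point, this only stays unambiguous for up to nine suffixes: a tenth suffix would give "1.10", which converts to the same number as "1.1".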