Suppose I have a vector of strings: c("A1B2","A3C4","B5","C6A7B8") [1] "A1B2" "A3C4" "B5" "C6A7B8" where each string is a sequence of <column><value> pairs (fixed width, in this example both value and name are 1 character, in reality the column name is 6 chars and value is 2 digits). I need to convert it to a data frame: data.frame(A=c(1,3,0,7),B=c(2,0,5,8),C=c(0,4,0,6)) A B C 1 1 2 0 2 3 0 4 3 0 5 0 4 7 8 6 how do I do that? thanks. -- Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000 http://mideasttruth.com http://jihadwatch.org http://pmw.org.il http://openvotingconsortium.org http://iris.org.il http://memri.org What's the difference between Apathy & Ignorance? -I don't know and don't care!
To be clear, I can do that with nested for loops: v <- c("A1B2","A3C4","B5","C6A7B8") l <- strsplit(gsub("(.{2})","\\1,",v),",") d <- data.frame(A=vector(length=4,mode="integer"), B=vector(length=4,mode="integer"), C=vector(length=4,mode="integer")) for (i in 1:length(l)) { l1 <- l[[i]] for (j in 1:length(l1)) { d[[substring(l1[j],1,1)]][i] <- as.numeric(substring(l1[j],2,2)) } } but I am afraid that handling 1,000,000 (=length(unlist(l))) strings in a loop will kill me.> * Sam Steingold <fqf at tah.bet> [2012-02-08 15:34:38 -0500]: > > Suppose I have a vector of strings: > c("A1B2","A3C4","B5","C6A7B8") > [1] "A1B2" "A3C4" "B5" "C6A7B8" > where each string is a sequence of <column><value> pairs > (fixed width, in this example both value and name are 1 character, in > reality the column name is 6 chars and value is 2 digits). > I need to convert it to a data frame: > data.frame(A=c(1,3,0,7),B=c(2,0,5,8),C=c(0,4,0,6)) > A B C > 1 1 2 0 > 2 3 0 4 > 3 0 5 0 > 4 7 8 6 > > how do I do that? > thanks.-- Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000 http://palestinefacts.org http://iris.org.il http://camera.org http://ffii.org http://www.PetitionOnline.com/tap12009/ An elephant is a mouse with an operating system.
I suspect there are cleverer ways to do it, especially using packages like stringr and gsubfn, but using base tools, you can hack it without too much effort: ?gregexpr is the key. To get started (x is your example vector of character strings):> gregexpr("[[:alpha:]]+[[:digit:]]+",x)[[1]] [1] 1 3 attr(,"match.length") [1] 2 2 attr(,"useBytes") [1] TRUE [[2]] [1] 1 3 attr(,"match.length") [1] 2 2 attr(,"useBytes") [1] TRUE [[3]] [1] 1 attr(,"match.length") [1] 2 attr(,"useBytes") [1] TRUE [[4]] [1] 1 3 5 attr(,"match.length") [1] 2 2 2 attr(,"useBytes") [1] TRUE The components of the result give you indices of the start and stop values for each "entry" in your final matrix/data frame. You can thus lapply() on this list to get the column name-value pairs substrings and decode them. Alternatively, if all your names are really 6 characters and all your values are really two digits, ?nchar and ?substring will get you the name-value substrings directly. I leave the niggling details to you (or to other helpeRs -- especially those who can suggest a more elegant approach). -- Bert On Wed, Feb 8, 2012 at 12:34 PM, Sam Steingold <sds at gnu.org> wrote:> Suppose I have a vector of strings: > c("A1B2","A3C4","B5","C6A7B8") > [1] "A1B2" ? "A3C4" ? "B5" ? ? "C6A7B8" > where each string is a sequence of <column><value> pairs > (fixed width, in this example both value and name are 1 character, in > reality the column name is 6 chars and value is 2 digits). > I need to convert it to a data frame: > data.frame(A=c(1,3,0,7),B=c(2,0,5,8),C=c(0,4,0,6)) > ?A B C > 1 1 2 0 > 2 3 0 4 > 3 0 5 0 > 4 7 8 6 > > how do I do that? > thanks. > > -- > Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000 > http://mideasttruth.com http://jihadwatch.org http://pmw.org.il > http://openvotingconsortium.org http://iris.org.il http://memri.org > What's the difference between Apathy & Ignorance? -I don't know and don't care! > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.-- Bert Gunter Genentech Nonclinical Biostatistics Internal Contact Info: Phone: 467-7374 Website: http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
On Wed, Feb 08, 2012 at 03:56:12PM -0500, Sam Steingold wrote:> To be clear, I can do that with nested for loops: > > v <- c("A1B2","A3C4","B5","C6A7B8") > l <- strsplit(gsub("(.{2})","\\1,",v),",") > d <- data.frame(A=vector(length=4,mode="integer"), > B=vector(length=4,mode="integer"), > C=vector(length=4,mode="integer")) > > for (i in 1:length(l)) { > l1 <- l[[i]] > for (j in 1:length(l1)) { > d[[substring(l1[j],1,1)]][i] <- as.numeric(substring(l1[j],2,2)) > } > }Hi. The inner loop may be vectorized. d <- as.matrix(d) for (i in 1:length(l)) { l1 <- l[[i]] d[i, substring(l1,1,1)] <- as.numeric(substring(l1,2,2)) } A B C 1 1 2 0 2 3 0 4 3 0 5 0 4 7 8 6 If the number of rows of d is large, a matrix is probably more efficient. Hope this helps. Petr Savicky.
When compute time is important it often helps to loop over columns instead of over rows (assuming there are fewer columns than rows, the usual case). E.g., putting your code into a function f0 and the column-looping version into f1: f0 <- function(v) { n <- length(v) d <- data.frame(A=vector(length=n,mode="integer"), B=vector(length=n,mode="integer"), C=vector(length=n,mode="integer")) l <- strsplit(gsub("(.{2})","\\1,",v),",") for (i in seq_along(l)) { l1 <- l[[i]] for (j in seq_along(l1)) { d[[substring(l1[j],1,1)]][i] <- as.integer(substring(l1[j],2,2)) } } d } f1 <- function(v) { n <- length(v) letters <- c("A", "B", "C") names(letters) <- letters data.frame(lapply(letters, function(letter) { retval <- integer(n) hasLetter <- grepl(letter, v) retval[hasLetter] <- as.integer( gsub(sprintf("^.*%s([[:digit:]]+).*$", letter), "\\1", v[hasLetter])) retval })) } I get the following times for a 10,000 long v like yours (and the results are the same):> vv <- rep(v, len=10000) > system.time(r1 <- f1(vv))user system elapsed 0.13 0.00 0.14> system.time(r0 <- f0(vv))user system elapsed 10.75 0.19 10.99> all.equal(r0, r1)[1] TRUE If I double the length of v, your code takes 53 seconds (5x slower, quadratic behavior?) while mine takes 0.17 (less than linear, suggesting that its time is still dominated by the function call overhead for such small input vectors). Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com> -----Original Message----- > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Sam Steingold > Sent: Wednesday, February 08, 2012 12:56 PM > To: r-help at r-project.org > Subject: Re: [R] "unsparse" a vector > > To be clear, I can do that with nested for loops: > > v <- c("A1B2","A3C4","B5","C6A7B8") > l <- strsplit(gsub("(.{2})","\\1,",v),",") > d <- data.frame(A=vector(length=4,mode="integer"), > B=vector(length=4,mode="integer"), > C=vector(length=4,mode="integer")) > > for (i in 1:length(l)) { > l1 <- l[[i]] > for (j in 1:length(l1)) { > d[[substring(l1[j],1,1)]][i] <- as.numeric(substring(l1[j],2,2)) > } > } > > > but I am afraid that handling 1,000,000 (=length(unlist(l))) strings in > a loop will kill me. > > > > * Sam Steingold <fqf at tah.bet> [2012-02-08 15:34:38 -0500]: > > > > Suppose I have a vector of strings: > > c("A1B2","A3C4","B5","C6A7B8") > > [1] "A1B2" "A3C4" "B5" "C6A7B8" > > where each string is a sequence of <column><value> pairs > > (fixed width, in this example both value and name are 1 character, in > > reality the column name is 6 chars and value is 2 digits). > > I need to convert it to a data frame: > > data.frame(A=c(1,3,0,7),B=c(2,0,5,8),C=c(0,4,0,6)) > > A B C > > 1 1 2 0 > > 2 3 0 4 > > 3 0 5 0 > > 4 7 8 6 > > > > how do I do that? > > thanks. > > -- > Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000 > http://palestinefacts.org http://iris.org.il http://camera.org > http://ffii.org http://www.PetitionOnline.com/tap12009/ > An elephant is a mouse with an operating system. > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.