Dear all,

I have a data.frame with a column like the x shown below:

  myDF<-data.frame(cbind(x=c("[[1, 0, 0], [0, 1]]",
     "[[1, 1, 0], [0, 1]]","[[1, 0, 0], [1, 1]]",
     "[[0, 0, 1], [0, 1]]")))
  > myDF
                      x
  1 [[1, 0, 0], [0, 1]]
  2 [[1, 1, 0], [0, 1]]
  3 [[1, 0, 0], [1, 1]]
  4 [[0, 0, 1], [0, 1]]

As you can see, my x column is composed of strings enclosed in [[ ]], with commas separating the "fields".

I need to identify the number of groups inside the outer [ ] and label each group with a different sequential string. For the example above I would like to have:

    A         B
  1 [1, 0, 0] [0, 1]
  2 [1, 1, 0] [0, 1]
  3 [1, 0, 0] [1, 1]
  4 [0, 0, 1] [0, 1]

Although I have only two groups here, my real dataset will have many more (~30). After identifying the groups I would like to identify the subgroups:

    A1 A2 A3 B1 B2
  1  1  0  0  0  1
  2  1  1  0  0  1
  3  1  0  0  1  1
  4  0  0  1  0  1

Any hints are welcome.

milton ribeiro
On Thu, Feb 18, 2010 at 8:29 AM, milton ruser <milton.ruser at gmail.com> wrote:
> Dear all,
>
> I have a data.frame with a column like the x shown below
> myDF<-data.frame(cbind(x=c("[[1, 0, 0], [0, 1]]",
>    "[[1, 1, 0], [0, 1]]","[[1, 0, 0], [1, 1]]",
>    "[[0, 0, 1], [0, 1]]")))
>> myDF
>                     x
> 1 [[1, 0, 0], [0, 1]]
> 2 [[1, 1, 0], [0, 1]]
> 3 [[1, 0, 0], [1, 1]]
> 4 [[0, 0, 1], [0, 1]]
>
> As you can see, my x column is composed of
> strings enclosed in [[ ]], with commas separating
> the "fields".
>
> I need to identify the number of
> groups inside the outer [ ] and label each
> group with a different sequential string.
> For the example above I would like to have:
>
>   A         B
> 1 [1, 0, 0] [0, 1]
> 2 [1, 1, 0] [0, 1]
> 3 [1, 0, 0] [1, 1]
> 4 [0, 0, 1] [0, 1]
>
> Although I have only two groups here, my
> real dataset will have many more (~30).
> After identifying the groups I would like
> to identify the subgroups:
>
>   A1 A2 A3 B1 B2
> 1  1  0  0  0  1
> 2  1  1  0  0  1
> 3  1  0  0  1  1
> 4  0  0  1  0  1
>
> Any hints are welcome.

This looks like the same syntax as JSON, so you might be able to use the fromJSON function from the rjson package:

  > x="[[1, 0, 0], [0, 1]]"
  > library(rjson)
  > fromJSON(x)
  [[1]]
  [1] 1 0 0

  [[2]]
  [1] 0 1

  > unlist(fromJSON(x))
  [1] 1 0 0 0 1

 - so just apply that over your first dataframe and collect it all up in a new dataframe. The plyr package may help.

Every row has to end up under the same set of column names, so you only need to parse the first one to work out your naming system. In this case you can get it from the length of the list and its elements:

  > l = fromJSON(x)
  > unlist(lapply(l,length))
  [1] 3 2

so you want A1 to A3 and B1 to B2. Not sure what you want when you get to the 27th group.... You can generate this with a bit of rep and paste functionality. Bit early in the day to get my head round that at the moment. But rjson will parse and split up your grouped numbers anyway.
There are probably other solutions using split and sub and gsub.

Barry

--
blog: http://geospaced.blogspot.com/
web: http://www.maths.lancs.ac.uk/~rowlings
web: http://www.rowlingson.com/
twitter: http://twitter.com/geospacedman
pics: http://www.flickr.com/photos/spacedman
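For what it's worth, the gsub/strsplit route mentioned above might look roughly like the sketch below. It is base R only and is not the rjson approach; it assumes every row holds the same number of values and that the values are non-negative integers. The names flat, vals and res are just illustrative, and the group boundaries are thrown away, so the columns come out as V1..V5 rather than A1..B2.

```r
# Same example strings as in the original post, kept as plain character data
x <- c("[[1, 0, 0], [0, 1]]", "[[1, 1, 0], [0, 1]]",
       "[[1, 0, 0], [1, 1]]", "[[0, 0, 1], [0, 1]]")

# Drop everything except digits and commas:
# "[[1, 0, 0], [0, 1]]" becomes "1,0,0,0,1"
flat <- gsub("[^0-9,]", "", x)

# Split each string on commas and convert the pieces to numbers
vals <- lapply(strsplit(flat, ",", fixed = TRUE), as.numeric)

# Stack the rows (assumes all rows have the same length)
res <- as.data.frame(do.call(rbind, vals))
print(res)
```

If the column starts out as a factor, as in myDF, wrap it in as.character() first.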
On Thu, Feb 18, 2010 at 8:29 AM, milton ruser <milton.ruser at gmail.com> wrote:
> Dear all,
>
> I have a data.frame with a column like the x shown below
> myDF<-data.frame(cbind(x=c("[[1, 0, 0], [0, 1]]",
>    "[[1, 1, 0], [0, 1]]","[[1, 0, 0], [1, 1]]",
>    "[[0, 0, 1], [0, 1]]")))
>> myDF

> After identifying the groups I would like
> to identify the subgroups:
>
>   A1 A2 A3 B1 B2
> 1  1  0  0  0  1
> 2  1  1  0  0  1
> 3  1  0  0  1  1
> 4  0  0  1  0  1

Maybe it's not too early in the morning. Given your myDF above:

  # how is the first one structured?
  > lets = unlist(lapply(fromJSON(as.character(myDF[1,])),length))
  # 3 then 2:
  > lets
  [1] 3 2
  # make the letters (fails for >26 groups)
  > rep(LETTERS[1:length(lets)],lets)
  [1] "A" "A" "A" "B" "B"
  # handy sequence function makes the numbers:
  > sequence(lets)
  [1] 1 2 3 1 2
  # splat them together:
  > paste(rep(LETTERS[1:length(lets)],lets),sequence(lets),sep="")
  [1] "A1" "A2" "A3" "B1" "B2"

Then you can just make this the column names of your new dataframe.

I think the morning coffee has got through the blood-brain barrier now.

Barry
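Since the real data has about 30 groups and LETTERS stops at 26, one possible way round the limit is to carry on with two-letter labels (AA, AB, ...), spreadsheet-style. This is only a sketch: group_labels is a made-up helper, not something from rjson or base R, and the lets vector here is an invented 30-group structure used purely for illustration.

```r
# Hypothetical helper: labels A..Z, then AA, AB, ..., ZZ (up to 702 groups)
group_labels <- function(n) {
  # outer() fills column by column, so this yields AA, AB, ..., AZ, BA, ...
  two <- as.vector(outer(LETTERS, LETTERS, function(a, b) paste0(b, a)))
  c(LETTERS, two)[seq_len(n)]
}

# Invented structure: a group of 3, a group of 2, then 28 groups of 1
lets <- c(3, 2, rep(1, 28))

# Same rep/sequence/paste recipe as above, with the longer label set
cols <- paste0(rep(group_labels(length(lets)), lets), sequence(lets))
print(cols)
```

A plainer alternative would be to number the groups instead of lettering them, e.g. G1.1, G1.2, G2.1, which never runs out.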
Here is a solution using strapply in the gsubfn package.

First we define a proto object p containing a single method, i.e. function, called fun. fun takes one [...] construct, splits it into the numeric vector v using strsplit, and assigns names to it. strapply has a built-in variable, count, that is maintained automatically in the proto object; it is used here to determine which letter to use.

Using strapply, we apply fun in p to each substring matching the regexp "\\[([01, ]*)\\]". This regexp matches [ followed by a string of characters made up of 0, 1, comma and space, followed by ], and strapply applies p$fun to each such occurrence. (Modify the regexp appropriately if the true problem has different characteristics.)

Finally, simplify = rbind will cause the resulting vectors to be rbind'ed together. (If the different rows of myDF do not have the same structure then omit the simplify = rbind argument of strapply to get out a list.)

  library(gsubfn)  # proto is loaded along with gsubfn

  p <- proto(fun = function(this, x) {
     v <- as.numeric(strsplit(x, ",")[[1]])
     names(v) <- paste(LETTERS[count], seq_along(v), sep = "")
     v
  })
  strapply(as.character(myDF[[1]]), "\\[([01, ]*)\\]", p, simplify = rbind)

Here is what the output looks like:

  > strapply(as.character(myDF[[1]]), "\\[([01, ]*)\\]", p, simplify = rbind)
       A1 A2 A3 B1 B2
  [1,]  1  0  0  0  1
  [2,]  1  1  0  0  1
  [3,]  1  0  0  1  1
  [4,]  0  0  1  0  1

See http://gsubfn.googlecode.com and the gsubfn vignette for more info.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
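For the case where rows do not all share the same structure, a list per row can also be built without any add-on package. The sketch below uses regmatches and gregexpr from base R rather than strapply, so it is a swapped-in technique and not the gsubfn method itself. parse_row is a made-up helper name, the second input string is an invented ragged example, and the LETTERS lookup still breaks beyond 26 groups.

```r
# First string is from the original post; the second is an invented ragged row
x <- c("[[1, 0, 0], [0, 1]]", "[[1, 1], [0, 1, 1], [1]]")

# Hypothetical helper: parse one string into a named numeric vector
parse_row <- function(s) {
  # Pull out each inner "[...]" group made of digits, commas and spaces
  groups <- regmatches(s, gregexpr("\\[[0-9, ]*\\]", s))[[1]]
  # Strip brackets and spaces, split on commas, convert to numbers
  nums <- lapply(groups, function(g) {
    as.numeric(strsplit(gsub("[^0-9,]", "", g), ",", fixed = TRUE)[[1]])
  })
  sizes <- lengths(nums)
  v <- unlist(nums)
  # Letter per group, running number within each group
  names(v) <- paste0(rep(LETTERS[seq_along(nums)], sizes), sequence(sizes))
  v
}

# One named vector per row; rows may differ in length
rows <- lapply(x, parse_row)
print(rows)
```

When all rows do have the same structure, do.call(rbind, rows) collapses the list into a matrix with the A1, A2, ... column names.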