Hi,
I have a data frame with one column containing strings of the form "ABC...|XYZ..."
where ABC etc are fields of 6 alphanumeric characters each
and XYZ etc are fields of 8 alphanumeric characters each;
"|" is a mandatory separator;
I do not know in advance how many fields of each kind each row will contain.
I need to extract these fields from the string.

=== How do I do that?

first I need to split the string in 2 on '|' - how?
then I need to split the two strings by 6/8 characters -- how?
then I need to convert each 6/8 character string into an integer base 36
or 64 (depending on the field) - how?

=== What do I do with them once I extract them?

First thing I want to do is to have a count table of them.
Then I thought of adding an extra column for each field value and
putting 0/1 there, e.g., frame
1,AB
2,BCD
will turn into
1,1,1,0,0
2,0,1,1,1
however this would work only if the number of different field values is
manageable.
What do people do?
Can I have a column of "sets" in a data frame?
Does R support the "set" data type?

Thanks!

PS. thanks to Sarah Goslee who answered my previous question in so much detail!

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000
http://camera.org http://openvotingconsortium.org http://iris.org.il
http://mideasttruth.com http://memri.org http://honestreporting.com
Don't take life too seriously, you'll never get out of it alive!

Reproducible example, please. This doesn't make a whole lot of sense
otherwise.

On Fri, Jan 20, 2012 at 1:52 PM, Sam Steingold <sds at gnu.org> wrote:
> Hi,
> I have a data frame with one column containing string of the form "ABC...|XYZ..."
> where ABC etc are fields of 6 alphanumeric characters each
> and XYZ etc are fields of 8 alphanumeric characters each;
> "|" is a mandatory separator;
> I do not know in advance how many fields of each kind will each row contain.
> I need to extract these fields from the string.

This is already a data frame, so you don't need to import it into R,
just process it?

> === How do I do that?
>
> first I need to split the string in 2 on '|' - how?

strsplit()

> then I need to split the two strings by 6/8 characters -- how?

substring() perhaps

> then I need to convert each 6/8 character string into an integer base 36
> or 64 (depending on the field) - how?

base 36? Really? How are you representing that? Somehow I think you
mean something other than what you said. Either way, please clarify.

> === What do I do with them once I extract them?

I don't know. Save them as a list, most likely.

> First thing I want to do is to have a count table of them.
> Then I thought of adding an extra column for each field value and
> putting 0/1 there, e.g., frame
> 1,AB
> 2,BCD

I thought we had integers at this point?

> will turn into
> 1,1,1,0,0
> 2,0,1,1,1
> however this would work only if the number of different field values is
> manageable.

But we have no idea, because you haven't told us.

> What do people do?
> Can I have a columns of "sets" in data frame?
> Does R support the "set" data type?

factor() seems to be what you're looking for.

> PS. thanks to Sarah Goslee who answered my previous question in so much detail!

You're welcome, but you'd be even more welcome if you'd listened to
the parts of my reply about reproducible examples, clear problem
statements, and reading the posting guide.

Sarah

-- 
Sarah Goslee
http://www.functionaldiversity.org
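
A rough sketch of the "count table" and "0/1 columns" ideas from the question, keeping the extracted fields as a list as suggested above. The data and the names `fields`, `counts`, `indicator` and `df` are made up for illustration, and the values mirror the 1,AB / 2,BCD example. This is only one way to do it, assuming the number of distinct field values is small enough for a dense matrix.

```r
# Hypothetical data: each row's field values after extraction.
fields <- list(c("A", "B"), c("B", "C", "D"))

# Count table over all field values in all rows.
counts <- table(unlist(fields))

# 0/1 indicator matrix: one row per input row, one column per distinct value.
lev <- sort(unique(unlist(fields)))
indicator <- do.call(rbind, lapply(fields, function(f) as.integer(lev %in% f)))
colnames(indicator) <- lev

# A data frame can also keep the raw values in a list column,
# one character vector per row.
df <- data.frame(id = 1:2)
df$fields <- fields

print(counts)
print(indicator)
```

Base R has no dedicated set type, but union(), intersect(), setdiff() and %in% treat ordinary vectors as sets, so a list column of character vectors can often stand in for a column of sets.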

Sam:

On Fri, Jan 20, 2012 at 10:52 AM, Sam Steingold <sds at gnu.org> wrote:
> Hi,
> I have a data frame with one column containing string of the form "ABC...|XYZ..."
> where ABC etc are fields of 6 alphanumeric characters each
> and XYZ etc are fields of 8 alphanumeric characters each;
> "|" is a mandatory separator;
> I do not know in advance how many fields of each kind will each row contain.
> I need to extract these fields from the string.
>
> === How do I do that?
>
> first I need to split the string in 2 on '|' - how?

?strsplit
strsplit(thecolumn, "|", fixed=TRUE)

> then I need to split the two strings by 6/8 characters -- how?

This makes no sense to me. strsplit takes care of this.

> then I need to convert each 6/8 character string into an integer base 36
> or 64 (depending on the field) - how?

No clue. Depends on the encoding AFAICS.

-- Bert

-- 
Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
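
A minimal sketch of the two splitting steps, combining the strsplit(..., fixed=TRUE) call above with the substring() suggestion from earlier in the thread. The sample strings and the helper name `chunk` are invented for illustration. It assumes each side of the "|" has a length that is a multiple of the field width.

```r
# Hypothetical input in the format described in the question:
# 6-character fields, then "|", then 8-character fields.
x <- c("ABC123DEF456|XYZ12345", "ABC123|XYZ12345QRS67890")

# Cut one string into consecutive pieces of `width` characters.
chunk <- function(s, width) {
  if (is.na(s) || nchar(s) == 0) return(character(0))
  starts <- seq(1, nchar(s), by = width)
  substring(s, starts, starts + width - 1)
}

# Split on a literal "|"; fixed = TRUE is needed because "|" is a
# regular-expression metacharacter.
halves <- strsplit(x, "|", fixed = TRUE)

# One character vector of fields per row, for each side of the separator.
left  <- lapply(halves, function(h) chunk(h[1], 6))
right <- lapply(halves, function(h) chunk(h[2], 8))

print(left)
print(right)
```

strsplit() alone handles only the "|" separator; the fixed-width pieces have no delimiter between them, which is why the sketch computes start positions and uses substring().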

On Fri, Jan 20, 2012 at 14:05, Sarah Goslee <sarah.goslee at gmail.com> wrote:
>> then I need to convert each 6/8 character string into an integer base 36
>> or 64 (depending on the field) - how?
>
> base 36?

10 decimal digits + 26 English letters = 36.
ThusThisLongWordWithLettersAndDigitsFrom0to9isAnIntegerBase36
(case insensitive).
So, how do I convert the above long word to a bignum?

Actually, my numbers will fit into int64, so no bignum support is necessary.
Thanks.

-- 
Sam Steingold <http://sds.podval.org>

On Fri, Jan 20, 2012 at 03:14:21PM -0500, Sam Steingold wrote:
> On Fri, Jan 20, 2012 at 14:05, Sarah Goslee <sarah.goslee at gmail.com> wrote:
> >> then I need to convert each 6/8 character string into an integer base 36
> >> or 64 (depending on the field) - how?
> >
> > base 36?
>
> 10 decimal digits + 26 english characters = 36.
> ThusThisLongWordWithLettersAndDigitsFrom0to9isAnIntegerBase36
> (case insensitive).
> So, how do I convert the above long word to a bignum?

Hi.

Try the following.

  # map each character of the lower-cased word to its base-36 digit value
  x <- tolower("ThusThisLongWordWithLettersAndDigitsFrom0to9isAnIntegerBase36")
  x <- strsplit(x, "")[[1]]
  digits <- 0:35
  names(digits) <- c(0:9, letters)
  y <- digits[x]

  # solution using gmp package
  library(gmp)
  b <- as.bigz(36)
  sum(y * b^(length(y):1 - 1))

  [1] "70455190722800243410669999246294410591724807773749367607882253153084991978813070206061584038994"

  # solution using Rmpfr package
  library(Rmpfr)
  b <- mpfr(36, precBits=500)
  sum(y * b^(length(y):1 - 1))

  [1] 70455190722800243410669999246294410591724807773749367607882253153084991978813070206061584038994

> actually, my numbers will fit into int64, no bignum support is necessary.

The default R numeric data type is double precision, which has a 53-bit
significand, so every integer up to 2^53 is represented exactly; above
that, not every integer is representable. The integer type is 32 bits.

Hope this helps.

Petr Savicky.
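
For the short fields in the original question, plain doubles may be enough: a 6-character base-36 value is below 36^6 (about 2.2e9) and an 8-character base-64 value is below 64^8 = 2^48, both under the 2^53 limit mentioned above. The sketch below is one possible way to do the conversion without extra packages. The function name `to_number` is made up, and the base-64 alphabet is an assumption, since the thread never says which one is used.

```r
# Convert one string to a number using double arithmetic (Horner's rule).
# `alphabet` lists the digit characters in order of value (first = 0).
to_number <- function(s, alphabet) {
  chars <- strsplit(s, "")[[1]]
  digits <- match(chars, alphabet) - 1
  if (anyNA(digits)) stop("invalid digit in: ", s)
  base <- length(alphabet)
  Reduce(function(acc, d) acc * base + d, digits, 0)
}

# Base 36: digits then letters; lower-case the input first (case insensitive).
base36 <- c(0:9, letters)

# ASSUMPTION: RFC 4648 ordering, used here only as an example alphabet.
base64 <- c(LETTERS, letters, 0:9, "+", "/")

print(to_number(tolower("ZZZZZZ"), base36))   # largest 6-character base-36 value
print(to_number("////////", base64))          # largest 8-character value, assumed alphabet

# strtoi() also accepts base = 36, but it returns the 32-bit integer type,
# so large 6-character values do not fit.
print(strtoi("zzzzzz", base = 36L))
```

Horner's rule is used instead of summing powers so that every intermediate value is an exact integer in double precision.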