Patrick Connolly
2002-Oct-28  22:32 UTC
[R] subsetting character vector into groups of numerics
I'm sure there's a simple way to do this, but I can only think of
complicated ones.

I have a number of character vectors that look something like this:

"12 78 23 9 76 43 2 15 41 81 92 5(92 12) (81 78 5 76 9 41) (23 2 15 43)"

I wish to get it into a list of numerical vectors like this:

$Group
[1] 12 78 23 9 76 43 2 15 41 81 92 5

$Subgroup1
[1] 92 12

$Subgroup2
[1] 81 78 5 76 9 41

$Subgroup3
[1] 23 2 15 43

I can't rely on the closing parenthesis as the last character in the
vector, though the subgroup could be clearly defined without it.
Numbers are obvious to the eye, but are not always separated from one
another consistently. Part of the reason for this exercise is to
check that the Group is made up of the Subgroups with no elements
missing, so getting Group is not simply a matter of concatenating the
subgroups.

Ideas appreciated.

-- 
Patrick Connolly
HortResearch
Mt Albert
Auckland
New Zealand
Ph: +64-9 815 4200 x 7188
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~
I have the world`s largest collection of seashells. I keep it on all
the beaches of the world ... Perhaps you`ve seen it.  ---Steven Wright
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~

______________________________________________________
The contents of this e-mail are privileged and/or confidential to the
named recipient and are not to be used by any other person and/or
organisation. If you have received this e-mail in error, please notify
the sender and delete all material pertaining to this e-mail.
______________________________________________________

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
Peter Dalgaard BSA
2002-Oct-28  23:56 UTC
[R] subsetting character vector into groups of numerics
Patrick Connolly <p.connolly at hortresearch.co.nz> writes:

> I'm sure there's a simple way to do this, but I can only think of
> complicated ones.
>
> I have a number of character vectors that look something like this:
>
> "12 78 23 9 76 43 2 15 41 81 92 5(92 12) (81 78 5 76 9 41) (23 2 15 43)"
>
> I wish to get it into a list of numerical vectors like this:
>
> $Group
> [1] 12 78 23 9 76 43 2 15 41 81 92 5
>
> $Subgroup1
> [1] 92 12
>
> $Subgroup2
> [1] 81 78 5 76 9 41
>
> $Subgroup3
> [1] 23 2 15 43
>
> I can't rely on the closing parenthesis as the last character in the
> vector, though the subgroup could be clearly defined without it.
> Numbers are obvious to the eye, but are not always separated from one
> another consistently. Part of the reason for this exercise is to
> check that the Group is made up of the Subgroups with no elements
> missing, so getting Group is not simply a matter of concatenating the
> subgroups.
>
> Ideas appreciated.

Hmm... You seem to be telling us what the format is not. If you want
us to come up with something for the machine to do, it's not too
useful that things are "obvious to the eye"!

If the format is consistently like the above with subgroups in (),
then you could start with using some of the deeper magic of gsub() to
turn the format into something which would be easier to split into
individual vectors, e.g.

> x <- "12 78 23 9 76 43 2 15 41 81 92 5(92 12) (81 78 5 76 9 41) (23 2 15 43)"
> gsub("\\(([^)]*)\\)", "/\\1", x)
[1] "12 78 23 9 76 43 2 15 41 81 92 5/92 12 /81 78 5 76 9 41 /23 2 15 43"

[What was that? Well, "(" is a special grouping operator in regular
expressions; it isn't part of the RE as such, but things inside (..)
can be referred to with backreferences like \1, which of course needs
to be entered as "\\1". \( is an actual left parenthesis, again
written with the doubled backslash. [^)]* is a sequence consisting of
any characters except a right parenthesis (which is not a grouping
operator when it sits within square brackets). So we're finding the
bits of text delimited by ( and ) and replacing them with a / and the
content of the parentheses. Got it? Don't worry if you don't, I didn't
get it right till the 12th try either! The important thing is knowing
that this kind of stuff is possible if you stare at it long enough.]

Now that it is in an easier format we can use strsplit to get
individual parts:

> s <- strsplit(gsub("\\(([^)]*)\\)", "/\\1", x),"/")
> s[[1]]
[1] "12 78 23 9 76 43 2 15 41 81 92 5" "92 12 "
[3] "81 78 5 76 9 41 "                 "23 2 15 43"

and once we have those we might use scan() on each string to get the
numbers. This requires the use of a text connection, like this

> lapply(s[[1]], function(x)scan(textConnection(x)))
Read 12 items
Read 2 items
Read 6 items
Read 4 items
[[1]]
[1] 12 78 23 9 76 43 2 15 41 81 92 5

[[2]]
[1] 92 12

[[3]]
[1] 81 78 5 76 9 41

[[4]]
[1] 23 2 15 43

...

Your turn!

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907
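The three steps above (gsub, strsplit, scan) can be collected into one
function that also names the list elements as in the original request.
What follows is only a sketch under some assumptions: the function name
parse_groups is made up for illustration, the input is assumed to have
every "(" matched by a ")", and scan(text = ...) (an argument available
in current versions of R) stands in for textConnection() so that there
are no connections to close afterwards.

```r
# Sketch only: parse_groups is a hypothetical helper, not from the thread.
# Assumes every "(" in x has a matching ")".
parse_groups <- function(x) {
  # Turn each "(...)" into "/..." so that "/" separates the pieces.
  flat <- gsub("\\(([^)]*)\\)", "/\\1", x)
  parts <- strsplit(flat, "/")[[1]]
  # Read the whitespace-separated numbers out of each piece.
  out <- lapply(parts, function(p) scan(text = p, quiet = TRUE))
  # First piece is the Group, the rest are Subgroup1, Subgroup2, ...
  names(out) <- c("Group", paste0("Subgroup", seq_len(length(out) - 1)))
  out
}

x <- "12 78 23 9 76 43 2 15 41 81 92 5(92 12) (81 78 5 76 9 41) (23 2 15 43)"
res <- parse_groups(x)
str(res)
```

For several such strings, lapply(strings, parse_groups) would give one
named list per string.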
Patrick Connolly
2002-Oct-29  02:29 UTC
[R] subsetting character vector into groups of numerics
On Tue, 29-Oct-2002 at 12:56AM +0100, Peter Dalgaard BSA wrote:

|> Patrick Connolly <p.connolly at hortresearch.co.nz> writes:
|>

[...]

|> > I can't rely on the closing parenthesis as the last character in the
|> > vector, though the subgroup could be clearly defined without it.
|> > Numbers are obvious to the eye, but are not always separated from one
|> > another consistently. Part of the reason for this exercise is to
|> > check that the Group is made up of the Subgroups with no elements
|> > missing, so getting Group is not simply a matter of concatenating the
|> > subgroups.
|> >
|> > Ideas appreciated.
|>
|> Hmm... You seem to be telling us what the format is not. If you want
|> us to come up with something for the machine to do, it's not too
|> useful that things are "obvious to the eye"!

Sorry. Trying to keep down the verbosity, I made it too brief. My
main point was that the number of spaces was not always consistent, so
the method couldn't rely on, say, beginning with a '(' character, with
the subgroups separated by ') (' and the end defined by a ')'.

|>
|> If the format is consistently like the above with subgroups in (),
|> then you could start with using some of the deeper magic of gsub() to
|> turn the format into something which would be easier to split into
|> individual vectors, e.g.
|>
|> > gsub("\\(([^)]*)\\)", "/\\1", x)
|> [1] "12 78 23 9 76 43 2 15 41 81 92 5/92 12 /81 78 5 76 9 41 /23 2 15 43"

In any case, that method will work for

.... 92 5(92 12) (....

and

.... 92 5 (92 12) (....

so the space before the "(" character is not critical. I was
concerned it would throw a spanner in the works. When I do a check to
see that the Group is made up of all the Subgroups, I'll be able to
detect if there are any cases of a '(' without a succeeding ')'. It's
so hard to get good data-entry help these days. :-)

|>
|> [What was that? Well, "(" is a special grouping operator in regular
|> expressions; it isn't part of the RE as such, but things inside (..)
|> can be referred to with backreferences like \1, which of course needs
|> to be entered as "\\1". \( is an actual left parenthesis, again
|> written with the doubled backslash. [^)]* is a sequence consisting of
|> any characters except a right parenthesis (which is not a grouping
|> operator when it sits within square brackets). So we're finding the
|> bits of text delimited by ( and ) and replacing them with a / and the
|> content of the parentheses. Got it? Don't worry if you don't, I didn't
|> get it right till the 12th try either! The important thing is knowing
|> that this kind of stuff is possible if you stare at it long enough.]

In my case, I needed a bit more help.

That solution is brilliant. Thanks for the explanation of it too.

It covers everything I can think of except the occasion where a '(' or
')' is missing. I know the final ")" is absent in a few places. It's
probably easiest for me to do a test and add that character if
required before using gsub, then check if the Groups tally with the
subgroups to determine if there is anything missing. Those should be
rare enough to fix in the data file instead of trying to come up with
a generic method of detecting them and making the requisite
modifications.

|>
|> Now that it is in an easier format we can use strsplit to get
|> individual parts:
|>
|> > s <- strsplit(gsub("\\(([^)]*)\\)", "/\\1", x),"/")

I probably would have got that if I'd got that far.

[...]

|> and once we have those we might use scan() on each string to get the
|> numbers. This requires the use of a text connection, like this
|>
|> > lapply(s[[1]], function(x)scan(textConnection(x)))

I'd never had occasion to use textConnection before and was completely
ignorant of its existence. Certainly simpler than my idea of
exporting text files and then using a Perl script and then importing
back in.

|> Read 12 items
|> Read 2 items
|> Read 6 items
|> Read 4 items
|> [[1]]
|> [1] 12 78 23 9 76 43 2 15 41 81 92 5
|>
|> [[2]]
|> [1] 92 12
|>
|> [[3]]
|> [1] 81 78 5 76 9 41
|>
|> [[4]]
|> [1] 23 2 15 43
|>
|> ...
|>
|> Your turn!

Can't improve on that! It's so close to what I require we could call
it a day.

Thanks again.

best

-- 
Patrick Connolly
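The plan described in this follow-up (add a missing final ")" before
using gsub, then check that the Group tallies with the Subgroups) might
look something like the sketch below. All helper names are hypothetical
and chosen only for illustration. The repair step handles just the one
case mentioned, a last "(" that is never closed; other missing
parentheses would still need fixing in the data file. The tally uses
setdiff(), so it reports missing elements but does not notice
duplicated ones.

```r
# Sketch only: all function names here are hypothetical.

# Append ")" when the last "(" is not closed before the end of the string.
close_last_paren <- function(x) sub("(\\([^)]*)$", "\\1)", x)

# Same gsub/strsplit/scan steps as in the reply above, with named output.
parse_groups <- function(x) {
  parts <- strsplit(gsub("\\(([^)]*)\\)", "/\\1", x), "/")[[1]]
  out <- lapply(parts, function(p) scan(text = p, quiet = TRUE))
  names(out) <- c("Group", paste0("Subgroup", seq_len(length(out) - 1)))
  out
}

# Elements found in Group but in no subgroup, and the other way round.
tally <- function(g) {
  subs <- unlist(g[-1], use.names = FALSE)
  list(missing_from_subgroups = setdiff(g$Group, subs),
       missing_from_group     = setdiff(subs, g$Group))
}

# Example input with the final ")" left off.
x_bad <- "12 78 23 9 76 43 2 15 41 81 92 5(92 12) (81 78 5 76 9 41) (23 2 15 43"
g <- parse_groups(close_last_paren(x_bad))
chk <- tally(g)
```

Both components of chk being empty would indicate that the Group and
the Subgroups agree for that string.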