thr3ads.net - R help - [R] subsetting character vector into groups of numerics [Oct 2002]

Patrick Connolly

2002-Oct-28 22:32 UTC

[R] subsetting character vector into groups of numerics

I'm sure there's a simple way to do this, but I can only think of
complicated ones.


I have a number of character vectors that look something like this:

"12 78 23 9 76 43 2 15 41 81 92 5(92 12) (81 78 5 76 9 41) (23 2 15 43)"

I wish to get it into a list of numerical vectors like this:

$Group
[1] 12 78 23 9 76 43 2 15 41 81 92 5

$Subgroup1
[1] 92 12

$Subgroup2
[1] 81 78 5 76 9 41

$Subgroup3
[1] 23 2 15 43

I can't rely on the closing parenthesis as the last character in the
vector, though the subgroup could be clearly defined without it.
Numbers are obvious to the eye, but are not always separated from one
another consistently.  Part of the reason for this exercise is to
check that the Group is made up of the Subgroups with no elements
missing, so getting Group is not simply a matter of concatenating the
subgroups.


Ideas appreciated.



-- 
Patrick Connolly
HortResearch
Mt Albert
Auckland
New Zealand 
Ph: +64-9 815 4200 x 7188
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~
I have the world`s largest collection of seashells. I keep it on all
the beaches of the world ... Perhaps you`ve seen it.  ---Steven Wright 
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~


______________________________________________________
The contents of this e-mail are privileged and/or confidential to the
named recipient and are not to be used by any other person and/or
organisation. If you have received this e-mail in error, please notify 
the sender and delete all material pertaining to this e-mail.
______________________________________________________
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at
stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

Peter Dalgaard BSA

2002-Oct-28 23:56 UTC

[R] subsetting character vector into groups of numerics

Patrick Connolly <p.connolly at hortresearch.co.nz> writes:
> I'm sure there's a simple way to do this, but I can only think of
> complicated ones.
> 
> 
> I have a number of character vectors that look something like this:
> 
> "12 78 23 9 76 43 2 15 41 81 92 5(92 12) (81 78 5 76 9 41) (23 2 15 43)"
> 
> I wish to get it into a list of numerical vectors like this:
> 
> $Group
> [1] 12 78 23 9 76 43 2 15 41 81 92 5
> 
> $Subgroup1
> [1] 92 12
> 
> $Subgroup2
> [1] 81 78 5 76 9 41
> 
> $Subgroup3
> [1] 23 2 15 43
> 
> I can't rely on the closing parenthesis as the last character in the
> vector, though the subgroup could be clearly defined without it.
> Numbers are obvious to the eye, but are not always separated from one
> another consistently.  Part of the reason for this exercise is to
> check that the Group is made up of the Subgroups with no elements
> missing, so getting Group is not simply a matter of concatenating the
> subgroups.
> 
> 
> Ideas appreciated.
Hmm... You seem to be telling us what the format is not. If you want
us to come up with something for the machine to do, it's not too
useful that things are "obvious to the eye"! 

If the format is consistently like the above with subgroups in (),
then you could start with using some of the deeper magic of gsub() to
turn the format into something which would be easier to split into
individual vectors, e.g.
> gsub("\\(([^)]*)\\)", "/\\1", x)
[1] "12 78 23 9 76 43 2 15 41 81 92 5/92 12 /81 78 5 76 9 41 /23 2 15 43"

[What was that? Well, "(" is a special grouping operator in regular
expressions; it isn't part of the RE as such, but things inside (..)
can be referred to with backreferences like \1, which of course needs
to be entered as "\\1". \( is an actual left parenthesis, again
written with the doubled backslash. [^)]* is a sequence consisting of
any characters except right parentheses (")" is not a grouping
operator when it sits within square brackets). So we're finding the
bits of text delimited by ( and ) and replacing them with a / and the
content of the parentheses. Got it? Don't worry if you don't, I didn't
get it right till the 12th try either! The important thing is knowing
that this kind of stuff is possible if you stare at it long enough.]
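
To see what each part of the pattern does, here is a minimal sketch that pulls the parenthesised pieces out on their own. It assumes x holds the example string from the original post; the names pattern, matched and inner are made up for illustration.

```r
x <- "12 78 23 9 76 43 2 15 41 81 92 5(92 12) (81 78 5 76 9 41) (23 2 15 43)"

# \\( and \\) match literal parentheses; [^)]* matches everything up to
# (but not including) the next ")".
pattern <- "\\(([^)]*)\\)"

# Extract every match, parentheses included.
matched <- regmatches(x, gregexpr(pattern, x))[[1]]

# "\\1" in the replacement stands for whatever the inner (...) captured,
# so this strips the enclosing parentheses from each piece.
inner <- sub(pattern, "\\1", matched)
```

Here matched keeps the parentheses and inner does not, which is the same substitution the gsub() call above performs in place.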

Now that it is in an easier format we can use strsplit to get
individual parts:
> s <- strsplit(gsub("\\(([^)]*)\\)", "/\\1", x), "/")
> s[[1]]
[1] "12 78 23 9 76 43 2 15 41 81 92 5" "92 12 "
[3] "81 78 5 76 9 41 "                 "23 2 15 43"

and once we have those we might use scan() on each string to get the
numbers. This requires the use of a text connection, like this
> lapply(s[[1]], function(x)scan(textConnection(x)))
Read 12 items
Read 2 items
Read 6 items
Read 4 items
[[1]]
 [1] 12 78 23  9 76 43  2 15 41 81 92  5

[[2]]
[1] 92 12

[[3]]
[1] 81 78  5 76  9 41

[[4]]
[1] 23  2 15 43

...

Your turn!
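
One possible way to finish the job, as a sketch only: wrap the steps above in a function and name the pieces as in the original request. The function name parse_groups is hypothetical, and it uses the text argument of scan() in place of a text connection, which assumes a reasonably recent version of R.

```r
parse_groups <- function(x) {
  # Turn each "(...)" into "/..." so every piece is separated by "/".
  pieces <- strsplit(gsub("\\(([^)]*)\\)", "/\\1", x), "/")[[1]]

  # Read the numbers straight out of each string; quiet = TRUE
  # suppresses the "Read n items" messages.
  out <- lapply(pieces, function(p) scan(text = p, quiet = TRUE))

  # First piece is the Group, the rest are Subgroup1, Subgroup2, ...
  names(out) <- c("Group", sprintf("Subgroup%d", seq_along(out[-1])))
  out
}

x <- "12 78 23 9 76 43 2 15 41 81 92 5(92 12) (81 78 5 76 9 41) (23 2 15 43)"
res <- parse_groups(x)
```

Because scan() splits on any run of whitespace, extra or missing spaces around the parentheses should not change the result.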

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907

Patrick Connolly

2002-Oct-29 02:29 UTC

[R] subsetting character vector into groups of numerics

On Tue, 29-Oct-2002 at 12:56AM +0100, Peter Dalgaard BSA wrote:

|> Patrick Connolly <p.connolly at hortresearch.co.nz> writes:
|> 

[...]

|> > I can't rely on the closing parenthesis as the last character in the
|> > vector, though the subgroup could be clearly defined without it.
|> > Numbers are obvious to the eye, but are not always separated from one
|> > another consistently.  Part of the reason for this exercise is to
|> > check that the Group is made up of the Subgroups with no elements
|> > missing, so getting Group is not simply a matter of concatenating the
|> > subgroups.
|> > 
|> > 
|> > Ideas appreciated.
|> 
|> Hmm... You seem to be telling us what the format is not. If you want
|> us to come up with something for the machine to do, it's not too
|> useful that things are "obvious to the eye"! 

Sorry.  Trying to keep down the verbosity, I made it too brief.  My
main point was that the number of spaces was not always consistent so
the method couldn't rely on, say beginning with a '(' character, and
the subgroups separated by ') (' with the end defined by a ')'.

|> 
|> If the format is consistently like the above with subgroups in (),
|> then you could start with using some of the deeper magic of gsub() to
|> turn the format into something which would be easier to split into
|> individual vectors, e.g.
|> 
|> > gsub("\\(([^)]*)\\)", "/\\1", x)
|> [1] "12 78 23 9 76 43 2 15 41 81 92 5/92 12 /81 78 5 76 9 41 /23 2 15 43"

In any case, that method will work for 

.... 92 5(92 12) (....
and
.... 92 5 (92 12) (....

so the space before the "(" character is not critical.  I was
concerned it would throw a spanner in the works.  When I do a check to
see that the Group is made up of all the Subgroups, I'll be able to
detect if there are any cases of a '(' without a succeeding ')'.  It's
so hard to get good data-entry help these days. :-)
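
The tally check mentioned above might look something like this. It is only a sketch: the list res is typed in by hand from the example output, and the variable names are illustrative.

```r
res <- list(
  Group     = c(12, 78, 23, 9, 76, 43, 2, 15, 41, 81, 92, 5),
  Subgroup1 = c(92, 12),
  Subgroup2 = c(81, 78, 5, 76, 9, 41),
  Subgroup3 = c(23, 2, 15, 43)
)

# Pool every subgroup into one vector.
pooled <- unlist(res[-1], use.names = FALSE)

# Numbers in Group that appear in no subgroup, and the reverse.
missing_from_subgroups <- setdiff(res$Group, pooled)
missing_from_group     <- setdiff(pooled, res$Group)

# TRUE only when both hold exactly the same numbers, repeats included.
tallies <- identical(sort(res$Group), sort(pooled))
```

A record with a lost parenthesis would be expected to show up as a non-empty setdiff() result or as tallies being FALSE.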

|> 
|> [What was that? Well, "(" is a special grouping operator in regular
|> expressions; it isn't part of the RE as such, but things inside (..)
|> can be referred to with backreferences like \1, which of course needs
|> to be entered as "\\1". \( is an actual left parenthesis, again
|> written with the doubled backslash. [^)]* is a sequence consisting of
|> any characters except right parentheses (")" is not a grouping
|> operator when it sits within square brackets). So we're finding the
|> bits of text delimited by ( and ) and replacing them with a / and the
|> content of the parentheses. Got it? Don't worry if you don't, I didn't
|> get it right till the 12th try either! The important thing is knowing
|> that this kind of stuff is possible if you stare at it long enough.]

In my case, I needed a bit more help.  That solution is brilliant.
Thanks for the explanation of it too.  It covers everything I can
think of except the occasion where a '(' or ')' is missing.  I know
the final ")" is absent in a few places.  It's probably easiest for me
to do a test and add that character if required before using gsub,
then check if the Groups tally with the subgroups to determine if
there is anything missing.  Those should be rare enough to fix in the
data file instead of trying to come up with a generic method of
detecting them and making the requisite modifications.
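
A rough sketch of that pre-check, assuming the only defect is a missing final ")" (a missing "(" would still have to be fixed in the data file). The function name add_missing_close is made up for illustration.

```r
add_missing_close <- function(x) {
  # Count occurrences of a literal character in each string;
  # gregexpr() returns -1 when there is no match, hence the m > 0 test.
  count <- function(ch, s) {
    sapply(gregexpr(ch, s, fixed = TRUE), function(m) sum(m > 0))
  }
  needs_close <- count("(", x) > count(")", x)

  # Drop trailing whitespace and append ")" only where one is owed.
  ifelse(needs_close, paste0(sub("[[:space:]]+$", "", x), ")"), x)
}

fixed <- add_missing_close(c("1 2(1) (2", "1 2(1) (2)"))
```

Strings that already balance are returned unchanged, so the function can be applied to every vector before the gsub() step.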

|> 
|> Now that it is in an easier format we can use strsplit to get
|> individual parts:
|> 
|> > s <- strsplit(gsub("\\(([^)]*)\\)", "/\\1", x), "/")

I probably would have got that if I'd got that far.

[...]

|> and once we have those we might use scan() on each string to get the
|> numbers. This requires the use of a text connection, like this
|> 
|> > lapply(s[[1]], function(x)scan(textConnection(x)))

I'd never had occasion to use textConnection before and was completely
ignorant of its existence.  Certainly simpler than my idea of
exporting text files and then using a Perl script and then importing
back in.
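
One detail worth noting when calling textConnection() in a loop: as far as I know, each call opens a connection that scan() does not close for you, so R may later warn about unused connections. A small sketch that closes each one explicitly (read_numbers is a hypothetical helper name):

```r
read_numbers <- function(s) {
  con <- textConnection(s)
  on.exit(close(con))      # release the connection even if scan() fails
  scan(con, quiet = TRUE)  # quiet = TRUE drops the "Read n items" lines
}

nums <- lapply(c("92 12 ", "23 2 15 43"), read_numbers)
```

The result is the same list of numeric vectors as before, just without connections left open behind it.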

|> Read 12 items
|> Read 2 items
|> Read 6 items
|> Read 4 items
|> [[1]]
|>  [1] 12 78 23  9 76 43  2 15 41 81 92  5
|> 
|> [[2]]
|> [1] 92 12
|> 
|> [[3]]
|> [1] 81 78  5 76  9 41
|> 
|> [[4]]
|> [1] 23  2 15 43
|> 
|> ...
|> 
|> Your turn!

Can't improve on that!  It's so close to what I require we could call
it a day.  Thanks again.

best

-- 
Patrick Connolly
HortResearch
Mt Albert
Auckland
New Zealand 
Ph: +64-9 815 4200 x 7188
