thr3ads.net - R help - [R] Row exclude [Jan 2022]

Avi Gross

2022-Jan-29 18:04 UTC

[R] Row exclude

There are many creative ways to solve problems and some may get you in trouble
if you present them in class while even in some work situations, they may be
hard for most to understand, let alone maintain and make changes.

This group is amorphous enough that we have people who want "help" who
are new to the language, but also people who know plenty and encounter a new
kind of problem, and of course people who want to make use of what they see as
free labor.

Rui presented a very interesting idea and I like some aspects. But if presented
to most people, they might have to start looking up things.

But I admit I liked some of the ideas he uses and am adding them to my bag of
tricks. Some were overkill for this particular requirement but that also makes
them more general and useful.

First, was the use of locale-independent regular expressions like [[:alpha:]]
that match any combination of [:lower:] and [:upper:] and thus are not
restricted to ASCII characters. Since I do lots of my activities in languages
other than English and well might include names with characters not normally
found in English, or not even using an overlapping alphabet, I can easily
encounter items in the Name column that might not match [A-Za-z] but will match
with [:alpha:].

I don't know if using [:digit:] has benefits over [0-9] and I do note there
was no requirement to match more complex numbers than integers so no need to
allow periods or scientific notation and so on.

Then there is the use of mapply. The more general version of the problem
presented would include a data.frame with any number of columns, where a subset
of the columns might need to be checked for conditions that vary across the
columns but may include some broad categories of conditions that might be
re-used. If all the conditions are regular expression matches you can build,
then you can extend the list Rui used to have more items and also include
expressions that always match so that some columns are effectively ignored:

   regex <- list("[[:digit:]]", "[[:alpha:]]",
                 "[[:alpha:]]", "[.*]")

So this generalizes to N columns as long as you supply exactly N patterns in the
list, albeit mapply does recycle arguments if needed as in the simplest case
where you want all columns checked the same way.
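
A minimal sketch of that N-column generalization, under some assumptions of mine: the helper name flag_bad, the extra Note column and the use of NA to mean "do not check this column" are all hypothetical and not from the thread. One caveat, as far as I can tell: with the rowSums(i) == 0L filter a pattern that always matches would flag every row as bad, so to skip a column its check has to never fire, which is what the NA convention does here.

```r
# Sample data from the thread, plus a hypothetical free-text column.
dat1 <- data.frame(
  Name   = c("Alex", "Bob", "Carol", "John", "Katy", "Jack3"),
  Age    = c("20", "25", "24", "3BC", "35", "34"),
  Weight = c("13X", "142", "120", "175", "160", "140"),
  Note   = c("ok", "seen 2x", "n/a", "call back", "vip1", "-"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one pattern per column, TRUE marks an offending cell.
# An NA pattern means the column is not checked at all.
flag_bad <- function(df, patterns) {
  stopifnot(length(patterns) == ncol(df))
  mapply(function(x, r) {
    if (is.na(r)) rep(FALSE, length(x)) else grepl(r, x)
  }, df, patterns)
}

patterns <- c("[[:digit:]]", "[[:alpha:]]", "[[:alpha:]]", NA)
bad      <- flag_bad(dat1, patterns)
kept     <- dat1[rowSums(bad) == 0L, ]   # rows with no offending cell
kept
```

The sketch assumes the data frame has more than one row, so that mapply simplifies its result to a matrix.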
Rui then uses an anonymous function to pass to mapply() and that is a newish
feature added recently to R, I think. It was perhaps meant specifically to be
used with the new pipe symbol, but can be used anywhere but perhaps not in older
versions of R.

   \(x, r) grepl(r, x)

I note Rui also uses grepl() which returns a logical vector. I will show my
first attempt at the end where I used grep() to return index numbers of matches
instead. For this context, though, he made use of the fact that mapply in this
case returns a matrix of type logical:

i <- mapply(\(x, r) grepl(r, x), dat1, regex)
> i
      Name   Age Weight
[1,] FALSE FALSE   TRUE
[2,] FALSE FALSE  FALSE
[3,] FALSE FALSE  FALSE
[4,] FALSE  TRUE  FALSE
[5,] FALSE FALSE  FALSE
[6,]  TRUE FALSE  FALSE

And since R treats TRUE as 1 and FALSE as 0, then summing the rows gives you a
small integer between 0 and the number of columns, inclusive, and only rows with
no TRUE in them are wanted for this purpose:

dat1[rowSums(i) == 0L, ]

All in all, nicely done, but not trivial to read without comments, LOL!

And, yes, it could be made even more obscure as a one-liner.
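
For readers who do want the comments, here is the same idea spelled out step by step. This is only my annotated reading of Rui's code, with intermediate names (bad, n_bad, clean) that I made up and the older function(x, r) spelling so it also runs on older versions of R.

```r
# Sample data from the thread; every column is read as character.
dat1 <- data.frame(
  Name   = c("Alex", "Bob", "Carol", "John", "Katy", "Jack3"),
  Age    = c("20", "25", "24", "3BC", "35", "34"),
  Weight = c("13X", "142", "120", "175", "160", "140"),
  stringsAsFactors = FALSE
)

# One "this should NOT appear" pattern per column, in column order.
regex <- list("[[:digit:]]", "[[:alpha:]]", "[[:alpha:]]")

# Walk over columns and patterns in parallel; TRUE marks an offending cell.
bad <- mapply(function(x, r) grepl(r, x), dat1, regex)

# TRUE counts as 1, so this is the number of offending cells in each row.
n_bad <- rowSums(bad)

# Keep only the rows in which nothing was flagged.
clean <- dat1[n_bad == 0L, ]
clean
```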
My first attempt was a bit more focused on the specific needs described. I am
not sure how the HTML destroyer in this mailing list might wreck it, but I made
it a two-statement version that is formatted on multiple lines. An explanation
first.
I looked at using grep() on one column at a time to look for what should NOT be
there and ask it to invert the answer so it effectively tells me which rows to
keep. So it tests column 1 ($Name) to see if it has digits in it and returns
FALSE if it finds them which later means toss this row. It returns TRUE if that
entry, so far, makes the row valid. But note since I am not using grepl() it
does not return TRUE/FALSE at all. Rather it returns index numbers of the ones
that now inverted are TRUE. What goes in is a vector of individual items from a
column of the data. What goes out is the indices of which ones I want to keep
that can be used to index the entire data.frame. Based on the sample data, it
returns 1:5 as row 6 has a digit in "Jack3".

  grep("[0-9]", dat1$Name, invert = TRUE)
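
To make the difference between the two return types concrete, here is a small sketch on the Name values alone. The variable names are mine; the point is only that grep(..., invert = TRUE) hands back positions while !grepl(...) hands back one TRUE/FALSE per element, and that for this data the two select the same entries.

```r
name <- c("Alex", "Bob", "Carol", "John", "Katy", "Jack3")

# Positions of the entries that do NOT contain a digit.
keep_idx <- grep("[0-9]", name, invert = TRUE)

# The same information as a logical vector, one value per entry.
keep_lgl <- !grepl("[0-9]", name)

keep_idx                              # 1 2 3 4 5
identical(keep_idx, which(keep_lgl))  # TRUE
identical(name[keep_idx], name[keep_lgl])
```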

Similarly, two other grep() statements test if the second and third columns
contain any characters in "[a-zA-Z]" and return a similar index vector
if they are OK.

What I would then have are three numeric vectors, not a matrix. Each contains a
subset of all the indices:

> grep("[0-9]", dat1$Name, invert = TRUE)
[1] 1 2 3 4 5
> grep("[a-zA-Z]", dat1$Age, invert = TRUE)
[1] 1 2 3 5 6
> grep("[a-zA-Z]", dat1$Weight, invert = TRUE)
[1] 2 3 4 5 6

This set of data was designed to toss out one of each column so they all are of
the same length but need not be. Like Rui, my condition for deciding which rows
to keep is that all three of the index vectors have a particular entry. He
summed them as logicals, but my choice has small integers so the way I combine
them to exclude any not in all three is to use a sort of set intersect method.
The one built-in to R only handles two at a time so I nested two calls to
intersect but in a more general case, I would use some package (or build my own
function) that handles intersecting any number of such items.
Here is the full code, minus the initialization.

rows.keep <-
intersect(intersect(grep("[0-9]", dat1$Name, invert = TRUE),
                    grep("[a-zA-Z]", dat1$Age, invert = TRUE)),
          grep("[a-zA-Z]", dat1$Weight, invert = TRUE))
result <- dat1[rows.keep,]

-----Original Message-----
From: Rui Barradas <ruipbarradas at sapo.pt>
To: David Carlson <dcarlson at tamu.edu>; Bert Gunter <bgunter.4567 at
gmail.com>
Cc: r-help at R-project.org (r-help at r-project.org) <r-help at
r-project.org>
Sent: Sat, Jan 29, 2022 3:46 am
Subject: Re: [R] Row exclude

Hello,

Getting creative, here is another way with mapply.

regex <- list("[[:digit:]]", "[[:alpha:]]",
"[[:alpha:]]")

i <- mapply(\(x, r) grepl(r, x), dat1, regex)
dat1[rowSums(i) == 0L, ]

#   Name Age Weight
#2   Bob  25    142
#3 Carol  24    120
#5  Katy  35    160

Hope this helps,

Rui Barradas

At 06:30 on 29/01/2022, David Carlson via R-help wrote:
> Given that you know which columns should be numeric and which should be
> character, finding characters in numeric columns or numbers in character
> columns is not difficult. Your data frame consists of three character
> columns so you can use regular expressions as Bert mentioned. First you
> should strip the whitespace out of your data:
>
> dat1 <-read.table(text="Name, Age, Weight
>     Alex,  20,  13X
>     Bob,  25,  142
>     Carol, 24,  120
>     John,  3BC,  175
>     Katy,  35,  160
>     Jack3, 34,  140",sep=",", header=TRUE, stringsAsFactors=FALSE,
> strip.white=TRUE)
>
> Now check to see if all of the fields are character as expected.
>
> sapply(dat1, typeof)
> #        Name         Age      Weight
> # "character" "character" "character"
>
> Now identify character variables containing numbers and numeric variables
> containing characters:
>
> BadName <- which(grepl("[[:digit:]]", dat1$Name))
> BadAge <- which(grepl("[[:alpha:]]", dat1$Age))
> BadWeight <- which(grepl("[[:alpha:]]", dat1$Weight))
>
> Next remove those rows:
>
> (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ])
> #    Name Age Weight
> #  2   Bob  25    142
> #  3 Carol  24    120
> #  5  Katy  35    160
>
> You still need to convert Age and Weight to numeric, e.g. dat2$Age <-
> as.numeric(dat2$Age).
>
> David Carlson
>
>
> On Fri, Jan 28, 2022 at 11:59 PM Bert Gunter <bgunter.4567 at gmail.com> wrote:
>
>> As character 'polluted' entries will cause a column to be read in (via
>> read.table and relatives) as factor or character data, this sounds like a
>> job for regular expressions. If you are not familiar with this subject,
>> time to learn. And, yes, some heavy lifting will be required.
>> See ?regexp for a start maybe? Or the stringr package?
>>
>> Cheers,
>> Bert
>>
>>
>>
>>
>> On Fri, Jan 28, 2022, 7:08 PM Val <valkremk at gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I want to remove rows that contain a character string in an integer
>>> column or a digit in a character column.
>>>
>>> Sample data
>>>
>>> dat1 <-read.table(text="Name, Age, Weight
>>>  Alex,  20,  13X
>>>  Bob,  25,  142
>>>  Carol, 24,  120
>>>  John,  3BC,  175
>>>  Katy,  35,  160
>>>  Jack3, 34,  140",sep=",",header=TRUE,stringsAsFactors=F)
>>>
>>> If the Age/Weight column contains any character(s) then remove that row;
>>> if the Name column contains a digit then remove that row.
>>> Desired output
>>>
>>>    Name   Age weight
>>> 1  Bob     25    142
>>> 2  Carol   24    120
>>> 3  Katy    35    160
>>>
>>> Thank you,
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more,
see
>>>
https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-help__;!!KwNVnqRv!QW1WPKY5eSNT7sMW28dnAKV7IXWvIc4UwOwUHkJgJ8uuGUrIAXvRjZWVXhZB_0c$
>>> PLEASE do read the posting guide
>>>
https://urldefense.com/v3/__http://www.R-project.org/posting-guide.html__;!!KwNVnqRv!QW1WPKY5eSNT7sMW28dnAKV7IXWvIc4UwOwUHkJgJ8uuGUrIAXvRjZWVRmZSfcI$
>>> and provide commented, minimal, self-contained, reproducible code.
>>>

Rui Barradas

2022-Jan-29 18:33 UTC

[R] Row exclude

Hello,

Thanks for the comments, a few others inline.

At 18:04 on 29/01/2022, Avi Gross wrote:
> There are many creative ways to solve problems and some may get you in 
> trouble if you present them in class while even in some work situations, 
> they may be hard for most to understand, let alone maintain and make 
> changes.
> 
> This group is amorphous enough that we have people who want "help" who
> are new to the language, but also people who know plenty and encounter a 
> new kind of problem, and of course people who want to make use of what 
> they see as free labor.
> 
> Rui presented a very interesting idea and I like some aspects. But if 
> presented to most people, they might have to start looking up things.
> 
> But I admit I liked some of the ideas he uses and am adding them to my 
> bag of tricks. Some were overkill for this particular requirement but 
> that also makes them more general and useful.
> 
> First, was the use of locale-independent regular expressions like 
> [[:alpha:]] that match any combination of [:lower:] and [:upper:] and 
> thus are not restricted to ASCII characters. Since I do lots of my 
> activities in languages other than English and well might include names 
> with characters not normally found in English, or not even using an 
> overlapping alphabet, I can easily encounter items in the Name column 
> that might not match [A-Za-z] but will match with [:alpha:].
> 
> I don't know if using [:digit:] has benefits over [0-9] and I do note 
> there was no requirement to match more complex numbers than integers so 
> no need to allow periods or scientific notation and so on.

Yes, I used locale-independent regular expressions. It's a habit I 
acquired a while ago. It took some time to stop using character ranges, 
but once that habit was gone I became more comfortable with the use of 
classes like [:alpha:] and [:digit:].
[After all, my native language (Portuguese) has 
cedillas(ES)/cedilhas(PT) and accented letters.]
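
A small sketch of the difference, with names I made up for illustration (the accented letters are written as \u escapes so the code does not depend on the file encoding). As far as I understand ?regex, what the named classes match still depends on the locale and on the regex engine, so I deliberately do not promise an output for the accented names; in a UTF-8 locale I would expect [[:alpha:]] to accept them.

```r
# Hypothetical names: two with Portuguese letters, one plain, one with digits.
nomes <- c("Jo\u00e3o", "Concei\u00e7\u00e3o", "Bob", "R2D2")

# Character range: written with ASCII letters in mind.
ascii_only <- grepl("^[A-Za-z]+$", nomes)

# Character class: letters as defined by the current locale.
any_alpha <- grepl("^[[:alpha:]]+$", nomes)

# Side by side; compare the first two rows on your own system.
data.frame(nomes, ascii_only, any_alpha)
```

Note that the class has to sit inside a bracket expression, i.e. "[[:alpha:]]" rather than "[:alpha:]".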
> 
> Then there is the use of mapply. The more general version of the problem 
> presented would include a data.frame with any number of columns, where a 
> subset of the columns might need to be checked for conditions that vary 
> across the columns but may include some broad categories of conditions 
> that might be re-used. If all the conditions are regular expression 
> matches you can build, then you can extend the list Rui used to have 
> more items and also include expressions that always match so that some 
> columns are effectively ignored:
> 
> 
> regex <- list("[[:digit:]]", "[[:alpha:]]",
>               "[[:alpha:]]", "[.*]")
> 
> 
> So this generalizes to N columns as long as you supply exactly N 
> patterns in the list, albeit mapply does recycle arguments if needed as 
> in the simplest case where you want all columns checked the same way.
> 
> Rui then uses an anonymous function to pass to mapply() and that is a 
> newish feature added recently to R, I think. It was perhaps meant 
> specifically to be used with the new pipe symbol, but can be used 
> anywhere but perhaps not in older versions of R.
> 
> 
> \(x, r) grepl(r, x)
> 
No, the new anonymous function wasn't specifically meant to be used with 
the new pipe operator, it was meant to be a short-hand notation for 
anonymous functions and used interchangeably with the old notation.

mapply(\(x, r), etc)
mapply(function(x, r) etc)
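
To check the interchangeability claim on the thread's data, a minimal sketch: both spellings are run and the results compared. The \(x, r) form needs R 4.1.0 or later (the release that also brought the native pipe); on older versions only the function(x, r) line will parse. The names i_new and i_old are mine.

```r
dat1 <- data.frame(
  Name   = c("Alex", "Bob", "Carol", "John", "Katy", "Jack3"),
  Age    = c("20", "25", "24", "3BC", "35", "34"),
  Weight = c("13X", "142", "120", "175", "160", "140"),
  stringsAsFactors = FALSE
)
regex <- list("[[:digit:]]", "[[:alpha:]]", "[[:alpha:]]")

# Shorthand notation (R >= 4.1.0) ...
i_new <- mapply(\(x, r) grepl(r, x), dat1, regex)

# ... and the classic notation.
i_old <- mapply(function(x, r) grepl(r, x), dat1, regex)

identical(i_new, i_old)  # TRUE
```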

> 
> I note Rui also uses grepl() which returns a logical vector. I will show 
> my first attempt at the end where I used grep() to return index numbers 
> of matches instead. For this context, though, he made use of the fact 
> that mapply in this case returns a matrix of type logical:
> 
> i <- mapply(\(x, r) grepl(r, x), dat1, regex)
> 
>> i
> 
>       Name   Age Weight
> [1,] FALSE FALSE   TRUE
> [2,] FALSE FALSE  FALSE
> [3,] FALSE FALSE  FALSE
> [4,] FALSE  TRUE  FALSE
> [5,] FALSE FALSE  FALSE
> [6,]  TRUE FALSE  FALSE
> 
> And since R treats TRUE as 1 and FALSE as 0, then summing the rows gives 
> you a small integer between 0 and the number of columns, inclusive, and 
> only rows with no TRUE in them are wanted for this purpose:

And rowSums is a fast function.

> 
> 
> dat1[rowSums(i) == 0L, ]
> 
> All in all, nicely done, but not trivial to read without comments, LOL!
> 
> And, yes, it could be made even more obscure as a one-liner.
> 
> My first attempt was a bit more focused on the specific needs described. 
> I am not sure how the HTML destroyer in this mailing list might wreck 
> it, but I made it a two-statement version that is formatted on multiple 
> lines. An explanation first.
> 
> I looked at using grep() on one column at a time to look for what should 
> NOT be there and ask it to invert the answer so it effectively tells me 
> which rows to keep. So it tests column 1 ($Name) to see if it has digits 
> in it and returns FALSE if it finds them which later means toss this 
> row. It returns TRUE if that entry, so far, makes the row valid. But 
> note since I am not using grepl() it does not return TRUE/FALSE at all. 
> Rather it returns index numbers of the ones that now inverted are TRUE. 
> What goes in is a vector of individual items from a column of the data. 
> What goes out is the indices of which ones I want to keep that can be 
> used to index the entire data.frame. Based on the sample data, it returns 
> 1:5 as row 6 has a digit in "Jack3".
> 
> 
>   grep("[0-9]", dat1$Name, invert = TRUE)
> 
> 
> Similarly, two other grep() statements test if the second and third 
> columns contain any characters in "[a-zA-Z]" and return a similar index 
> vector if they are OK.
> 
> What I would then have are three numeric vectors, not a matrix. Each 
> contains a subset of all the indices:
> 
> 
>> grep("[0-9]", dat1$Name, invert = TRUE)
> [1] 1 2 3 4 5
>> grep("[a-zA-Z]", dat1$Age, invert = TRUE)
> [1] 1 2 3 5 6
>> grep("[a-zA-Z]", dat1$Weight, invert = TRUE)
> [1] 2 3 4 5 6
> 
> This set of data was designed to toss out one of each column so they all 
> are of the same length but need not be. Like Rui, my condition for 
> deciding which rows to keep is that all three of the index vectors have 
> a particular entry. He summed them as logicals, but my choice has small 
> integers so the way I combine them to exclude any not in all three is to 
> use a sort of set intersect method. The one built-in to R only handles 
> two at a time so I nested two calls to intersect but in a more general 
> case, I would use some package (or build my own function) that handles 
> intersecting any number of such items.
> 
> Here is the full code, minus the initialization.
> 
> 
> rows.keep <-
> intersect(intersect(grep("[0-9]", dat1$Name, invert = TRUE),
>                     grep("[a-zA-Z]", dat1$Age, invert = TRUE)),
>           grep("[a-zA-Z]", dat1$Weight, invert = TRUE))
> result <- dat1[rows.keep,]
> 
> 

Using the same idea, another two options, both with Reduce.

The 1st uses Avi's grep and regexes; the latter could be the character 
classes "[[:alpha:]]" and "[[:digit:]]", but this code is inspired by 
his. The results are put in a list and Reduce intersects the list 
members. Then subsetting is as usual.

The 2nd uses the fact that Map is a wrapper for mapply that defaults to 
not simplifying its output. grep/invert will find the non-matches and 
Reduce intersects the result list, as above.

From ?Map:

Map is a simple wrapper to mapply which does not attempt to simplify the 
result, similar to Common Lisp's mapcar (with arguments being recycled, 
however). Future versions may allow some control of the result type.

# 1st
grep_list <- list(
   grep("[0-9]", dat1$Name, invert = TRUE),
   grep("[a-zA-Z]", dat1$Age, invert = TRUE),
   grep("[a-zA-Z]", dat1$Weight, invert = TRUE)
)
keep1 <- Reduce(intersect, grep_list)
dat1[keep1,]

# 2nd
keep2 <- Map(\(x, r) grep(r, x, invert = TRUE), dat1, regex)
keep2 <- Reduce(intersect, keep2)

identical(keep1, keep2)
#[1] TRUE
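
As David noted earlier in the thread, the surviving rows still hold Age and Weight as character. A self-contained sketch of the whole path, from the raw data to numeric columns, using the Map/Reduce variant; the names keep and dat2 are only for illustration, and I use function(x, r) so it does not depend on the R version.

```r
dat1 <- data.frame(
  Name   = c("Alex", "Bob", "Carol", "John", "Katy", "Jack3"),
  Age    = c("20", "25", "24", "3BC", "35", "34"),
  Weight = c("13X", "142", "120", "175", "160", "140"),
  stringsAsFactors = FALSE
)
regex <- list("[[:digit:]]", "[[:alpha:]]", "[[:alpha:]]")

# Per column: positions of the clean entries; then keep positions common to all.
keep <- Reduce(intersect,
               Map(function(x, r) grep(r, x, invert = TRUE), dat1, regex))

dat2 <- dat1[keep, ]

# Only now is it safe to convert the numeric-looking columns.
dat2$Age    <- as.numeric(dat2$Age)
dat2$Weight <- as.numeric(dat2$Weight)

sapply(dat2, class)
```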


Hope this helps,

Rui Barradas

Avi Gross

2022-Jan-29 18:40 UTC

[R] Row exclude

[NOTE: This is a re-send. I see it mangled multiple lines of code in sequence
and so I shifted my temporary email sender to not use any form of rich text.
Below would be the message I intended to send including code looking normal. As
many other messages I create benefit from HTML, I may have to flip back and
forth.]

There are many creative ways to solve problems and some may get you in trouble
if you present them in class while even in some work situations, they may be
hard for most to understand, let alone maintain and make changes.

This group is amorphous enough that we have people who want "help" who
are new to the language, but also people who know plenty and encounter a new
kind of problem, and of course people who want to make use of what they see as
free labor.

Rui presented a very interesting idea and I like some aspects. But if presented
to most people, they might have to start looking up things.

But I admit I liked some of the ideas he uses and am adding them to my bag of
tricks. Some were overkill for this particular requirement but that also makes
them more general and useful.

First, was the use of locale-independent regular expressions like [[:alpha:]]
that match any combination of [:lower:] and [:upper:] and thus are not
restricted to ASCII characters. Since I do lots of my activities in languages
other than English and well might include names with characters not normally
found in English, or not even using an overlapping alphabet, I can easily
encounter items in the Name column that might not match [A-Za-z] but will match
with [:alpha:].

I don't know if using [:digit:] has benefits over [0-9] and I do note there
was no requirement to match more complex numbers than integers so no need to
allow periods or scientific notation and so on.
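
If decimals or scientific notation ever did need to be allowed, one alternative to writing a bigger regular expression would be to let R attempt the conversion and flag whatever fails. This is a different technique from the one used in the thread, shown only as a sketch on made-up values; note that as.numeric() also accepts things like "Inf" and hexadecimal strings, which may or may not be wanted.

```r
# Hypothetical column values: integers, a decimal, scientific notation, junk.
x <- c("20", "3BC", "1.5", "2e3", "13X")

# Failed conversions become NA; the warning about that is expected here.
as_num <- suppressWarnings(as.numeric(x))

# TRUE where the string parsed as a number.
is_num <- !is.na(as_num)

data.frame(x, as_num, is_num)
```

A value that was already missing (NA) in the input would also come out as "not numeric" with this test, so it would need separate handling.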

Then there is the use of mapply. The more general version of the problem
presented would include a data.frame with any number of columns, where a subset
of the columns might need to be checked for conditions that vary across the
columns but may include some broad categories of conditions that might be
re-used. If all the conditions are regular expression matches you can build,
then you can extend the list Rui used to have more items and also include
expressions that always match so that some columns are effectively ignored:


   regex <- list("[[:digit:]]", "[[:alpha:]]",
                 "[[:alpha:]]", "[.*]")


So this generalizes to N columns as long as you supply exactly N patterns in the
list, albeit mapply does recycle arguments if needed as in the simplest case
where you want all columns checked the same way.

Rui then uses an anonymous function to pass to mapply() and that is a newish
feature added recently to R, I think. It was perhaps meant specifically to be
used with the new pipe symbol, but can be used anywhere but perhaps not in older
versions of R.


   \(x, r) grepl(r, x)


I note Rui also uses grepl() which returns a logical vector. I will show my
first attempt at the end where I used grep() to return index numbers of matches
instead. For this context, though, he made use of the fact that mapply in this
case returns a matrix of type logical:

i <- mapply(\(x, r) grepl(r, x), dat1, regex)
> i
      Name   Age Weight
[1,] FALSE FALSE   TRUE
[2,] FALSE FALSE  FALSE
[3,] FALSE FALSE  FALSE
[4,] FALSE  TRUE  FALSE
[5,] FALSE FALSE  FALSE
[6,]  TRUE FALSE  FALSE

And since R treats TRUE as 1 and FALSE as 0, then summing the rows gives you a
small integer between 0 and the number of columns, inclusive, and only rows with
no TRUE in them are wanted for this purpose:


dat1[rowSums(i) == 0L, ]

All in all, nicely done, but not trivial to read without comments, LOL!

And, yes, it could be made even more obscure as a one-liner.

My first attempt was a bit more focused on the specific needs described. I am
not sure how the HTML destroyer in this mailing list might wreck it, but I made
it a two-statement version that is formatted on multiple lines. An explanation
first.

I looked at using grep() on one column at a time to look for what should NOT be
there and ask it to invert the answer so it effectively tells me which rows to
keep. So it tests column 1 ($Name) to see if it has digits in it and returns
FALSE if it finds them which later means toss this row. It returns TRUE if that
entry, so far, makes the row valid. But note since I am not using grepl() it
does not return TRUE/FALSE at all. Rather it returns index numbers of the ones
that now inverted are TRUE. What goes in is a vector of individual items from a
column of the data. What goes out is the indices of which ones I want to keep
that can be used to index the entire data.frame. Based on the sample data, it
returns 1:5 as row 6 has a digit in "Jack3".


  grep("[0-9]", dat1$Name, invert = TRUE)


Similarly, two other grep() statements test if the second and third columns
contain any characters in "[a-zA-Z]" and return a similar index vector
if they are OK.

What I would then have are three numeric vectors, not a matrix. Each contains a
subset of all the indices:

> grep("[0-9]", dat1$Name, invert = TRUE)
[1] 1 2 3 4 5
> grep("[a-zA-Z]", dat1$Age, invert = TRUE)
[1] 1 2 3 5 6
> grep("[a-zA-Z]", dat1$Weight, invert = TRUE)
[1] 2 3 4 5 6

This set of data was designed to toss out one of each column so they all are of
the same length but need not be. Like Rui, my condition for deciding which rows
to keep is that all three of the index vectors have a particular entry. He
summed them as logicals, but my choice has small integers so the way I combine
them to exclude any not in all three is to use a sort of set intersect method.
The one built-in to R only handles two at a time so I nested two calls to
intersect but in a more general case, I would use some package (or build my own
function) that handles intersecting any number of such items.

Here is the full code, minus the initialization.


rows.keep <-
intersect(intersect(grep("[0-9]", dat1$Name, invert = TRUE),
                    grep("[a-zA-Z]", dat1$Age, invert = TRUE)),
          grep("[a-zA-Z]", dat1$Weight, invert = TRUE))
result <- dat1[rows.keep,]
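
For the more general case of any number of index vectors, base R may already be enough: Reduce() can fold the two-argument intersect() across a list. The following is only a sketch. The dat1 values and the names bad and keep_by_col are assumptions made for illustration, since the initialization is not shown in this message.

```r
# Hypothetical stand-in for dat1 (the real initialization is not shown here).
dat1 <- data.frame(
  Name   = c("Alex", "Bob", "Carol", "John", "Katy", "Jack3"),
  Age    = c("20", "25", "24", "3BC", "35", "34"),
  Weight = c("13X", "142", "120", "175", "160", "140"),
  stringsAsFactors = FALSE
)

# One "bad" pattern per column, keyed by column name.
bad <- c(Name = "[0-9]", Age = "[a-zA-Z]", Weight = "[a-zA-Z]")

# For each column, the row indices that do NOT match its "bad" pattern.
keep_by_col <- lapply(names(bad), function(col) {
  grep(bad[[col]], dat1[[col]], invert = TRUE)
})

# Reduce() applies intersect() pairwise across the whole list.
rows.keep <- Reduce(intersect, keep_by_col)
result <- dat1[rows.keep, ]
```

Adding a fourth or fifth column then only means adding an entry to bad, with no further nesting of intersect() calls.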




-----Original Message-----
From: Avi Gross via R-help <r-help at r-project.org>
To: ruipbarradas at sapo.pt <ruipbarradas at sapo.pt>; dcarlson at
tamu.edu <dcarlson at tamu.edu>; bgunter.4567 at gmail.com
<bgunter.4567 at gmail.com>
Cc: r-help at r-project.org <r-help at r-project.org>
Sent: Sat, Jan 29, 2022 1:04 pm
Subject: Re: [R] Row exclude

There are many creative ways to solve problems and some may get you in trouble
if you present them in class while even in some work situations, they may be
hard for most to understand, let alone maintain and make changes.
This group is amorphous enough that we have people who want "help" who
are new to the language, but also people who know plenty and encounter a new
kind of problem, and of course people who want to make use of what they see as
free labor.
Rui presented a very interesting idea and I like some aspects. But if presented
to most people, they might have to start looking up things.
But I admit I liked some of the ideas he uses and am adding them to my bag of
tricks. Some were overkill for this particular requirement but that also makes
them more general and useful.
First, was the use of locale-independent regular expressions like [[:alpha:]]
that match any combination of [:lower:] and [:upper:] and thus are not
restricted to ASCII characters. Since I do lots of my activities in languages
other than English and well might include names with characters not normally
found in English, or not even using an overlapping alphabet, I can easily
encounter items in the Name column that might not match [A-Za-z] but will match
with [:alpha:].
I don't know if using [:digit:] has benefits over [0-9] and I do note there
was no requirement to match more complex numbers than integers so no need to
allow periods or scientific notation and so on.
Then there is the use of mapply. The more general version of the problem
presented would include a data.frame with any number of columns, where a subset
of the columns might need to be checked for conditions that vary across the
columns but may include some broad categories of conditions that might be
re-used. If all the conditions are regular expression matches you can build,
then you can extend the list Rui used to have more items and also include
expressions that always match so that some columns are effectively ignored:

  regex <- list("[[:digit:]]", "[[:alpha:]]",
                "[[:alpha:]]", "[.*]")


So this generalizes to N columns as long as you supply exactly N patterns in the
list, albeit mapply does recycle arguments if needed as in the simplest case
where you want all columns checked the same way.
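
One caveat may be worth spelling out. In Rui's scheme a TRUE from grepl() flags a cell as bad, so a column is effectively ignored when its pattern never matches, not when it always matches. The sketch below uses NA as a "skip this column" marker. That convention, the helper flag_bad, the extra City column and all the data values are assumptions invented for this example.

```r
# Hypothetical data: three checked columns plus one that should be ignored.
dat1 <- data.frame(
  Name   = c("Alex", "Bob", "Carol", "John", "Katy", "Jack3"),
  Age    = c("20", "25", "24", "3BC", "35", "34"),
  Weight = c("13X", "142", "120", "175", "160", "140"),
  City   = c("Austin", "Lisbon", "Boston", "Paris", "Rome", "Oslo"),
  stringsAsFactors = FALSE
)

# One pattern per column; NA means "do not check this column".
regex <- list("[[:digit:]]", "[[:alpha:]]", "[[:alpha:]]", NA)

# Flag bad cells; a skipped column contributes FALSE for every row.
flag_bad <- function(x, r) {
  if (is.na(r)) rep(FALSE, length(x)) else grepl(r, x)
}

i <- mapply(flag_bad, dat1, regex)
result <- dat1[rowSums(i) == 0L, ]
```

With an always-matching pattern such as ".*" in the fourth slot, every row would be flagged and nothing would be kept.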
Rui then uses an anonymous function to pass to mapply() and that is a newish
feature added recently to R, I think. It was perhaps meant specifically to be
used with the new pipe symbol, but can be used anywhere but perhaps not in older
versions of R.

  \(x, r) grepl(r, x)
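
For readers on older installations: the backslash form is the function shorthand that arrived in R 4.1.0, the same release as the native pipe, and it is parsed as an ordinary function(...) definition. A small sketch, with made-up input values and hypothetical names f_short and f_long, showing that the two spellings behave the same:

```r
# Shorthand lambda (needs R >= 4.1.0).
f_short <- \(x, r) grepl(r, x)

# The long form, which works in any version of R.
f_long <- function(x, r) grepl(r, x)

# Made-up sample values for illustration.
names_in <- c("Bob", "Jack3")

# Both spellings give the same logical vector.
identical(f_short(names_in, "[[:digit:]]"), f_long(names_in, "[[:digit:]]"))
```

So Rui's mapply() call can be written with function(x, r) if the code has to run on a version before 4.1.0.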


I note Rui also uses grepl() which returns a logical vector. I will show my
first attempt at the end where I used grep() to return index numbers of matches
instead. For this context, though, he made use of the fact that mapply in this
case returns a matrix of type logical:

i <- mapply(\(x, r) grepl(r, x), dat1, regex)
> i
      Name   Age Weight
[1,] FALSE FALSE   TRUE
[2,] FALSE FALSE  FALSE
[3,] FALSE FALSE  FALSE
[4,] FALSE  TRUE  FALSE
[5,] FALSE FALSE  FALSE
[6,]  TRUE FALSE  FALSE

And since R treats TRUE as 1 and FALSE as 0, then summing the rows gives you a
small integer between 0 and the number of columns, inclusive, and only rows with
no TRUE in them are wanted for this purpose:

dat1[rowSums(i) == 0L, ]
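
The row-sum step can be unpacked as follows. This is a sketch with assumed data: the dat1 values are made up to reproduce the matrix described above, and the long function(x, r) form is used so that it also runs on older versions of R.

```r
# Hypothetical stand-in for dat1 (the real initialization is not shown here).
dat1 <- data.frame(
  Name   = c("Alex", "Bob", "Carol", "John", "Katy", "Jack3"),
  Age    = c("20", "25", "24", "3BC", "35", "34"),
  Weight = c("13X", "142", "120", "175", "160", "140"),
  stringsAsFactors = FALSE
)
regex <- list("[[:digit:]]", "[[:alpha:]]", "[[:alpha:]]")

# Logical matrix: one row per record, one column per checked field.
i <- mapply(function(x, r) grepl(r, x), dat1, regex)

# TRUE counts as 1, so this is the number of bad cells in each row.
bad_per_row <- rowSums(i)

# Keep only rows with no bad cells; any() per row is an equivalent test.
keep <- bad_per_row == 0L
same <- identical(unname(keep), !apply(i, 1, any))
dat1[keep, ]
```

The any() version may read more directly as "no bad cell in this row", while rowSums() also tells you how many cells failed.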

All in all, nicely done, but not trivial to read without comments, LOL!
And, yes, it could be made even more obscure as a one-liner.
My first attempt was a bit more focused on the specific needs described. I am
not sure how the HTML destroyer in this mailing list might wreck it, but I made
it a two-statement version that is formatted on multiple lines. An explanation
first.
I looked at using grep() on one column at a time to look for what should NOT be
there and ask it to invert the answer so it effectively tells me which rows to
keep. So it tests column 1 ($Name) to see if it has digits in it and returns
FALSE if it finds them which later means toss this row. It returns TRUE if that
entry, so far, makes the row valid. But note since I am not using grepl() it
does not return TRUE/FALSE at all. Rather it returns index numbers of the ones
that now inverted are TRUE. What goes in is a vector of individual items from a
column of the data. What goes out is the indices of which ones I want to keep
that can be used to index the entire data.frame. Based on the sample data, it
returns 1:5 as row 6 has a digit in "Jack3".

  grep("[0-9]", dat1$Name, invert = TRUE)


Similarly, two other grep() statements test if the second and third columns
contain any characters in "[a-zA-Z]" and return a similar index vector
if they are OK.
What I would then have are three numeric vectors, not a matrix. Each contains a
subset of all the indices:

> grep("[0-9]", dat1$Name, invert = TRUE)
[1] 1 2 3 4 5
> grep("[a-zA-Z]", dat1$Age, invert = TRUE)
[1] 1 2 3 5 6
> grep("[a-zA-Z]", dat1$Weight, invert = TRUE)
[1] 2 3 4 5 6

This set of data was designed to toss out one of each column so they all are of
the same length but need not be. Like Rui, my condition for deciding which rows
to keep is that all three of the index vectors have a particular entry. He
summed them as logicals, but my choice has small integers so the way I combine
them to exclude any not in all three is to use a sort of set intersect method.
The one built-in to R only handles two at a time so I nested two calls to
intersect but in a more general case, I would use some package (or build my own
function) that handles intersecting any number of such items.
Here is the full code, minus the initialization.

rows.keep <-
intersect(intersect(grep("[0-9]", dat1$Name, invert = TRUE),
                    grep("[a-zA-Z]", dat1$Age, invert = TRUE)),
          grep("[a-zA-Z]", dat1$Weight, invert = TRUE))
result <- dat1[rows.keep,]

-----Original Message-----
From: Rui Barradas <ruipbarradas at sapo.pt>
To: David Carlson <dcarlson at tamu.edu>; Bert Gunter <bgunter.4567 at
gmail.com>
Cc: r-help at R-project.org (r-help at r-project.org) <r-help at
r-project.org>
Sent: Sat, Jan 29, 2022 3:46 am
Subject: Re: [R] Row exclude

Hello,

Getting creative, here is another way with mapply.


regex <- list("[[:digit:]]", "[[:alpha:]]",
"[[:alpha:]]")

i <- mapply(\(x, r) grepl(r, x), dat1, regex)
dat1[rowSums(i) == 0L, ]

#   Name Age Weight
#2   Bob  25    142
#3 Carol  24    120
#5  Katy  35    160


Hope this helps,

Rui Barradas
