thr3ads.net - R help - [R] Row exclude [Jan 2022]

If this information is useful, please help other people find it:
Share via:

Val

2022-Jan-30 02:32 UTC

[R] Row exclude

Thank you David for your help.

I just have one question on this. What is the purpose of  using the
"unique" function on this?
  (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ])

I got the same result without using it.
       (dat2 <- dat1[-(c(BadName, BadAge, BadWeight)), ])

My concern is when I am applying this for the large data set the
"unique"
function may consume resources(time  and memory).

Thank you.

On Sat, Jan 29, 2022 at 12:30 AM David Carlson <dcarlson at tamu.edu>
wrote:
> Given that you know which columns should be numeric and which should be
> character, finding characters in numeric columns or numbers in character
> columns is not difficult. Your data frame consists of three character
> columns so you can use regular expressions as Bert mentioned. First you
> should strip the whitespace out of your data:
>
> dat1 <-read.table(text="Name, Age, Weight
>   Alex,  20,  13X
>   Bob,  25,  142
>   Carol, 24,  120
>   John,  3BC,  175
>   Katy,  35,  160
>   Jack3, 34,  140",sep=",", header=TRUE,
stringsAsFactors=FALSE,
> strip.white=TRUE)
>
> Now check to see if all of the fields are character as expected.
>
> sapply(dat1, typeof)
> #        Name         Age      Weight
> # "character" "character" "character"
>
> Now identify character variables containing numbers and numeric variables
> containing characters:
>
> BadName <- which(grepl("[[:digit:]]", dat1$Name))
> BadAge <- which(grepl("[[:alpha:]]", dat1$Age))
> BadWeight <- which(grepl("[[:alpha:]]", dat1$Weight))
>
> Next remove those rows:
>
> (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ])
> #    Name Age Weight
> #  2   Bob  25    142
> #  3 Carol  24    120
> #  5  Katy  35    160
>
> You still need to convert Age and Weight to numeric, e.g. dat2$Age <-
> as.numeric(dat2$Age).
>
> David Carlson
>
>
> On Fri, Jan 28, 2022 at 11:59 PM Bert Gunter <bgunter.4567 at
gmail.com>
> wrote:
>
>> As character 'polluted' entries will cause a column to be read
in (via
>> read.table and relatives) as factor or character data, this sounds like
a
>> job for regular expressions. If you are not familiar with this subject,
>> time to learn. And, yes, ZjQcmQRYFpfptBannerStart
>> This Message Is From an External Sender
>> This message came from outside your organization.
>> ZjQcmQRYFpfptBannerEnd
>>
>> As character 'polluted' entries will cause a column to be read
in (via
>> read.table and relatives) as factor or character data, this sounds like
a
>> job for regular expressions. If you are not familiar with this subject,
>> time to learn. And, yes, some heavy lifting will be required.
>> See ?regexp for a start maybe? Or the stringr package?
>>
>> Cheers,
>> Bert
>>
>>
>>
>>
>> On Fri, Jan 28, 2022, 7:08 PM Val <valkremk at gmail.com> wrote:
>>
>> > Hi All,
>> >
>> > I want to remove rows that contain a character string in an
integer
>> > column or a digit in a character column.
>> >
>> > Sample data
>> >
>> > dat1 <-read.table(text="Name, Age, Weight
>> >  Alex,  20,  13X
>> >  Bob,   25,  142
>> >  Carol, 24,  120
>> >  John,  3BC,  175
>> >  Katy,  35,  160
>> >  Jack3, 34, 
140",sep=",",header=TRUE,stringsAsFactors=F)
>> >
>> > If the Age/Weight column contains any character(s) then remove
>> > if the Name  column contains an digit then remove that row
>> > Desired output
>> >
>> >    Name   Age weight
>> > 1   Bob     25    142
>> > 2   Carol   24    120
>> > 3   Katy    35    160
>> >
>> > Thank you,
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more,
see
>> >
https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-help__;!!KwNVnqRv!QW1WPKY5eSNT7sMW28dnAKV7IXWvIc4UwOwUHkJgJ8uuGUrIAXvRjZWVXhZB_0c$
>> > PLEASE do read the posting guide
>> >
https://urldefense.com/v3/__http://www.R-project.org/posting-guide.html__;!!KwNVnqRv!QW1WPKY5eSNT7sMW28dnAKV7IXWvIc4UwOwUHkJgJ8uuGUrIAXvRjZWVRmZSfcI$
>> > and provide commented, minimal, self-contained, reproducible code.
>> >
>>
>> 	[[alternative HTML version deleted]]
>>
>> ______________________________________________R-help at r-project.org
mailing list -- To UNSUBSCRIBE and more,
seehttps://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-help__;!!KwNVnqRv!QW1WPKY5eSNT7sMW28dnAKV7IXWvIc4UwOwUHkJgJ8uuGUrIAXvRjZWVXhZB_0c$
>> PLEASE do read the posting guide
https://urldefense.com/v3/__http://www.R-project.org/posting-guide.html__;!!KwNVnqRv!QW1WPKY5eSNT7sMW28dnAKV7IXWvIc4UwOwUHkJgJ8uuGUrIAXvRjZWVRmZSfcI$
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
	[[alternative HTML version deleted]]

Avi Gross

2022-Jan-30 03:56 UTC

head link

[R] Row exclude

Val,

In this special case the unique() is extra overhead of sorts. 

in a normal positive scenario of dat[c(1,2,3,2,3,1)] you would get duplicate
rows shown as many times as they are invoked.

In the similar but different negative scenario,  dat[-c(1,2,3,2,3,1)] the
functionality seems to be different as you can subtract a row once and then
subtract it again and again with no further effect. It is sort of like creating
a set and then subtracting from the set any values if they exist and doing
nothing if they, don't.

Now in your 7 row case, there was one problem in each of three rows and there
was no overlap but the following will remove only rows 2 and 3 no matter how
often it comes up:
> dat1[-c(2,3,3,2,3,2,3,2,3,2),]   Name   Age Weight
1  Alex    20    13X
4  John   3BC    175
5  Katy    35    160
6 Jack3    34    140

Order and even redundancies do not matter, so no unique is needed. I suspect the
overhead of a few duplicates is minor.

But having said that, I believe the implementation of unique does not use tons
of resources as it returns the results in the same order it sees then and dos
not do a sort:
> unique(c(6,2,3,3,2,3,2,3,2,3,2,6))[1] 6 2 3

So in the positive case where you often want one copy of everything, it may make
sense to use and even to sort it afterward.

You were not clear about your need to handle large amounts of data so some
suggested solutions may scale up better than others. You still have not shared
if you plan on using more columns too or if any columns are meant to be ignored
or dealt with yet another way in checking them. You also did not specify what if
anything should be done if a column entry is NA or even possibly a few other
things like Inf.

Any suggestions you got are not offered as working well if you add additional
requirements and constraints. As another example, in R you can enter some
numbers, including as integers, several other ways:

mixed <- c(111, 222L, 0x29A, 1E3)
[1]  111  222  666 1000

I changed your dat1 to look like this:

dat1 <-read.table(text="Name, Age, Weight
Alex,  20,  13
Bob,  25,  222L
Carol, 24, 0x29A
John,  3BC, 1E3
Katy,  35,  160
Jack3, 34,  140",sep=",",header=TRUE,stringsAsFactors=F,
strip.white=TRUE)

I also stripped whitespace and it looks like this:
> dat1   Name Age Weight
1  Alex  20     13
2   Bob  25   222L
3 Carol  24  0x29A
4  John 3BC    1E3
5  Katy  35    160
6 Jack3  34    140

Now look at column 2, Weight. As it is all seen as character, it retained those
constructs but will R be able to convert them if I ask for as.integer?
> as.integer(dat1$Weight)[1]   13   NA  666 1000  160  140
Warning message:
NAs introduced by coercion 

It looks like adding an L fails but the hexadecimal notation works and so does
the scientific notation.

So you have every right to insist the only thing to be entered in a numeric
column are digits from 0 to 9. Realistically, any entering the forms above, as
well as numbers like 3.14, would otherwise work fine and if converted to an
integer, would survive, albeit the latter would be truncated. Writing a regular
expression that matches these is not straightforward but can be done. But the
suggestions people made assume you are restricting it to standard decimal
notation and that is fine.

Now on to size. If you have a million rows in your data, the various algorithms
proposed vary in what they do in between and some make all kinds of data
structures that store integers or logicals in size up to or close to a million.
But the nature of your requirement turns out to be amenable to working in
smaller batches if that makes you happy. You could operate on something like a
thousand rows at a time, and calculate which rows to keep and set them aside.
Repeat for another thousand at a time till done and along the way merge what you
are keeping. The memory used for each batch will be limited and reused regularly
and garbage collection will deal with the rest. You do not even have to do the
concatenation incrementally if you simply keep track of your offset at the start
of each batch and add that back to the vector of indices of rows to keep and
keep extending the vector of rows. At the end, you can use that to index all the
data and shrink it.

There are many such variations if you want to consider efficiency of memory or
CPU time, albeit you probably can handle amounts like a million rows, but not
billions, on most machines. For much larger sets of data, you could read in
lines (or all at once but process one row at a time) and then selectively write
the good ones back to disk or to a database one at a time (or in small batches)
and then you can delete all current uses of memory and read it in again, but
this time asking read.table to read in certain columns as integers, or even
smaller integers, and the text column as character. If lots of rows are being
skipped, you now have a smaller memory footprint for additional operations that
presumably follow.

Good luck.



-----Original Message-----
From: Val <valkremk at gmail.com>
To: David Carlson <dcarlson at tamu.edu>
Cc: r-help at R-project.org (r-help at r-project.org) <r-help at
r-project.org>
Sent: Sat, Jan 29, 2022 9:32 pm
Subject: Re: [R] Row exclude

Thank you David for your help.

I just have one question on this. What is the purpose of? using the
"unique" function on this?
? (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ])

I got the same result without using it.
? ? ?  (dat2 <- dat1[-(c(BadName, BadAge, BadWeight)), ])

My concern is when I am applying this for the large data set the
"unique"
function may consume resources(time? and memory).

Thank you.

de.html
and provide commented, minimal, self-contained, reproducible code.

David Carlson

2022-Jan-30 04:15 UTC

head link

[R] Row exclude

It is possible that there would be errors on the same row for different
columns. This does not happen in your example. If row 4 was "John6, 3BC,
175X" then row 4 would be included 3 times, but we only need to remove it
once. Removing the duplicates is not necessary since R would not get
confused, but length(unique(c(BadName, BadAge, BadWeight)) indicates how
many lines are being removed.

David

On Sat, Jan 29, 2022 at 8:32 PM Val <valkremk at gmail.com> wrote:
> Thank you David for your help. I just have one question on this. What is
> the purpose of  using the "unique" function on this?   (dat2
<-
> dat1[-unique(c(BadName, BadAge, BadWeight)), ])   I got the same result
> without using it. ZjQcmQRYFpfptBannerStart
> This Message Is From an External Sender
> This message came from outside your organization.
> ZjQcmQRYFpfptBannerEnd
> Thank you David for your help.
>
> I just have one question on this. What is the purpose of  using the
> "unique" function on this?
>   (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ])
>
> I got the same result without using it.
>        (dat2 <- dat1[-(c(BadName, BadAge, BadWeight)), ])
>
> My concern is when I am applying this for the large data set the
"unique"
> function may consume resources(time  and memory).
>
> Thank you.
>
> On Sat, Jan 29, 2022 at 12:30 AM David Carlson <dcarlson at tamu.edu>
wrote:
>
>> Given that you know which columns should be numeric and which should be
>> character, finding characters in numeric columns or numbers in
character
>> columns is not difficult. Your data frame consists of three character
>> columns so you can use regular expressions as Bert mentioned. First you
>> should strip the whitespace out of your data:
>>
>> dat1 <-read.table(text="Name, Age, Weight
>>   Alex,  20,  13X
>>   Bob,  25,  142
>>   Carol, 24,  120
>>   John,  3BC,  175
>>   Katy,  35,  160
>>   Jack3, 34,  140",sep=",", header=TRUE,
stringsAsFactors=FALSE,
>> strip.white=TRUE)
>>
>> Now check to see if all of the fields are character as expected.
>>
>> sapply(dat1, typeof)
>> #        Name         Age      Weight
>> # "character" "character" "character"
>>
>> Now identify character variables containing numbers and numeric
variables
>> containing characters:
>>
>> BadName <- which(grepl("[[:digit:]]", dat1$Name))
>> BadAge <- which(grepl("[[:alpha:]]", dat1$Age))
>> BadWeight <- which(grepl("[[:alpha:]]", dat1$Weight))
>>
>> Next remove those rows:
>>
>> (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ])
>> #    Name Age Weight
>> #  2   Bob  25    142
>> #  3 Carol  24    120
>> #  5  Katy  35    160
>>
>> You still need to convert Age and Weight to numeric, e.g. dat2$Age
<-
>> as.numeric(dat2$Age).
>>
>> David Carlson
>>
>>
>> On Fri, Jan 28, 2022 at 11:59 PM Bert Gunter <bgunter.4567 at
gmail.com>
>> wrote:
>>
>>> As character 'polluted' entries will cause a column to be
read in (via
>>> read.table and relatives) as factor or character data, this sounds
like a
>>> job for regular expressions. If you are not familiar with this
subject,
>>> time to learn. And, yes, ZjQcmQRYFpfptBannerStart
>>> This Message Is From an External Sender
>>> This message came from outside your organization.
>>> ZjQcmQRYFpfptBannerEnd
>>>
>>> As character 'polluted' entries will cause a column to be
read in (via
>>> read.table and relatives) as factor or character data, this sounds
like a
>>> job for regular expressions. If you are not familiar with this
subject,
>>> time to learn. And, yes, some heavy lifting will be required.
>>> See ?regexp for a start maybe? Or the stringr package?
>>>
>>> Cheers,
>>> Bert
>>>
>>>
>>>
>>>
>>> On Fri, Jan 28, 2022, 7:08 PM Val <valkremk at gmail.com>
wrote:
>>>
>>> > Hi All,
>>> >
>>> > I want to remove rows that contain a character string in an
integer
>>> > column or a digit in a character column.
>>> >
>>> > Sample data
>>> >
>>> > dat1 <-read.table(text="Name, Age, Weight
>>> >  Alex,  20,  13X
>>> >  Bob,   25,  142
>>> >  Carol, 24,  120
>>> >  John,  3BC,  175
>>> >  Katy,  35,  160
>>> >  Jack3, 34, 
140",sep=",",header=TRUE,stringsAsFactors=F)
>>> >
>>> > If the Age/Weight column contains any character(s) then remove
>>> > if the Name  column contains an digit then remove that row
>>> > Desired output
>>> >
>>> >    Name   Age weight
>>> > 1   Bob     25    142
>>> > 2   Carol   24    120
>>> > 3   Katy    35    160
>>> >
>>> > Thank you,
>>> >
>>> > ______________________________________________
>>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and
more, see
>>> >
https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-help__;!!KwNVnqRv!QW1WPKY5eSNT7sMW28dnAKV7IXWvIc4UwOwUHkJgJ8uuGUrIAXvRjZWVXhZB_0c$
>>> > PLEASE do read the posting guide
>>> >
https://urldefense.com/v3/__http://www.R-project.org/posting-guide.html__;!!KwNVnqRv!QW1WPKY5eSNT7sMW28dnAKV7IXWvIc4UwOwUHkJgJ8uuGUrIAXvRjZWVRmZSfcI$
>>> > and provide commented, minimal, self-contained, reproducible
code.
>>> >
>>>
>>> 	[[alternative HTML version deleted]]
>>>
>>> ______________________________________________R-help at
r-project.org mailing list -- To UNSUBSCRIBE and more,
seehttps://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-help__;!!KwNVnqRv!QW1WPKY5eSNT7sMW28dnAKV7IXWvIc4UwOwUHkJgJ8uuGUrIAXvRjZWVXhZB_0c$
>>> PLEASE do read the posting guide
https://urldefense.com/v3/__http://www.R-project.org/posting-guide.html__;!!KwNVnqRv!QW1WPKY5eSNT7sMW28dnAKV7IXWvIc4UwOwUHkJgJ8uuGUrIAXvRjZWVRmZSfcI$
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
	[[alternative HTML version deleted]]

R help - Jan 2022 - Row exclude

[R] Row exclude

[R] Row exclude

[R] Row exclude