Thank you David for your help. I just have one question on this. What is the purpose of using the "unique" function on this? (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ]) I got the same result without using it. (dat2 <- dat1[-(c(BadName, BadAge, BadWeight)), ]) My concern is when I am applying this for the large data set the "unique" function may consume resources(time and memory). Thank you. On Sat, Jan 29, 2022 at 12:30 AM David Carlson <dcarlson at tamu.edu> wrote:> Given that you know which columns should be numeric and which should be > character, finding characters in numeric columns or numbers in character > columns is not difficult. Your data frame consists of three character > columns so you can use regular expressions as Bert mentioned. First you > should strip the whitespace out of your data: > > dat1 <-read.table(text="Name, Age, Weight > Alex, 20, 13X > Bob, 25, 142 > Carol, 24, 120 > John, 3BC, 175 > Katy, 35, 160 > Jack3, 34, 140",sep=",", header=TRUE, stringsAsFactors=FALSE, > strip.white=TRUE) > > Now check to see if all of the fields are character as expected. > > sapply(dat1, typeof) > # Name Age Weight > # "character" "character" "character" > > Now identify character variables containing numbers and numeric variables > containing characters: > > BadName <- which(grepl("[[:digit:]]", dat1$Name)) > BadAge <- which(grepl("[[:alpha:]]", dat1$Age)) > BadWeight <- which(grepl("[[:alpha:]]", dat1$Weight)) > > Next remove those rows: > > (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ]) > # Name Age Weight > # 2 Bob 25 142 > # 3 Carol 24 120 > # 5 Katy 35 160 > > You still need to convert Age and Weight to numeric, e.g. dat2$Age <- > as.numeric(dat2$Age). > > David Carlson > > > On Fri, Jan 28, 2022 at 11:59 PM Bert Gunter <bgunter.4567 at gmail.com> > wrote: > >> As character 'polluted' entries will cause a column to be read in (via >> read.table and relatives) as factor or character data, this sounds like a >> job for regular expressions. If you are not familiar with this subject, >> time to learn. And, yes, ZjQcmQRYFpfptBannerStart >> This Message Is From an External Sender >> This message came from outside your organization. >> ZjQcmQRYFpfptBannerEnd >> >> As character 'polluted' entries will cause a column to be read in (via >> read.table and relatives) as factor or character data, this sounds like a >> job for regular expressions. If you are not familiar with this subject, >> time to learn. And, yes, some heavy lifting will be required. >> See ?regexp for a start maybe? Or the stringr package? >> >> Cheers, >> Bert >> >> >> >> >> On Fri, Jan 28, 2022, 7:08 PM Val <valkremk at gmail.com> wrote: >> >> > Hi All, >> > >> > I want to remove rows that contain a character string in an integer >> > column or a digit in a character column. >> > >> > Sample data >> > >> > dat1 <-read.table(text="Name, Age, Weight >> > Alex, 20, 13X >> > Bob, 25, 142 >> > Carol, 24, 120 >> > John, 3BC, 175 >> > Katy, 35, 160 >> > Jack3, 34, 140",sep=",",header=TRUE,stringsAsFactors=F) >> > >> > If the Age/Weight column contains any character(s) then remove >> > if the Name column contains an digit then remove that row >> > Desired output >> > >> > Name Age weight >> > 1 Bob 25 142 >> > 2 Carol 24 120 >> > 3 Katy 35 160 >> > >> > Thank you, >> > >> > ______________________________________________ >> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >> > https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-help__;!!KwNVnqRv!QW1WPKY5eSNT7sMW28dnAKV7IXWvIc4UwOwUHkJgJ8uuGUrIAXvRjZWVXhZB_0c$ >> > PLEASE do read the posting guide >> > https://urldefense.com/v3/__http://www.R-project.org/posting-guide.html__;!!KwNVnqRv!QW1WPKY5eSNT7sMW28dnAKV7IXWvIc4UwOwUHkJgJ8uuGUrIAXvRjZWVRmZSfcI$ >> > and provide commented, minimal, self-contained, reproducible code. >> > >> >> [[alternative HTML version deleted]] >> >> ______________________________________________R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, seehttps://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-help__;!!KwNVnqRv!QW1WPKY5eSNT7sMW28dnAKV7IXWvIc4UwOwUHkJgJ8uuGUrIAXvRjZWVXhZB_0c$ >> PLEASE do read the posting guide https://urldefense.com/v3/__http://www.R-project.org/posting-guide.html__;!!KwNVnqRv!QW1WPKY5eSNT7sMW28dnAKV7IXWvIc4UwOwUHkJgJ8uuGUrIAXvRjZWVRmZSfcI$ >> and provide commented, minimal, self-contained, reproducible code. >> >>[[alternative HTML version deleted]]
Val, In this special case the unique() is extra overhead of sorts. in a normal positive scenario of dat[c(1,2,3,2,3,1)] you would get duplicate rows shown as many times as they are invoked. In the similar but different negative scenario, dat[-c(1,2,3,2,3,1)] the functionality seems to be different as you can subtract a row once and then subtract it again and again with no further effect. It is sort of like creating a set and then subtracting from the set any values if they exist and doing nothing if they, don't. Now in your 7 row case, there was one problem in each of three rows and there was no overlap but the following will remove only rows 2 and 3 no matter how often it comes up:> dat1[-c(2,3,3,2,3,2,3,2,3,2),]Name Age Weight 1 Alex 20 13X 4 John 3BC 175 5 Katy 35 160 6 Jack3 34 140 Order and even redundancies do not matter, so no unique is needed. I suspect the overhead of a few duplicates is minor. But having said that, I believe the implementation of unique does not use tons of resources as it returns the results in the same order it sees then and dos not do a sort:> unique(c(6,2,3,3,2,3,2,3,2,3,2,6))[1] 6 2 3 So in the positive case where you often want one copy of everything, it may make sense to use and even to sort it afterward. You were not clear about your need to handle large amounts of data so some suggested solutions may scale up better than others. You still have not shared if you plan on using more columns too or if any columns are meant to be ignored or dealt with yet another way in checking them. You also did not specify what if anything should be done if a column entry is NA or even possibly a few other things like Inf. Any suggestions you got are not offered as working well if you add additional requirements and constraints. As another example, in R you can enter some numbers, including as integers, several other ways: mixed <- c(111, 222L, 0x29A, 1E3) [1] 111 222 666 1000 I changed your dat1 to look like this: dat1 <-read.table(text="Name, Age, Weight Alex, 20, 13 Bob, 25, 222L Carol, 24, 0x29A John, 3BC, 1E3 Katy, 35, 160 Jack3, 34, 140",sep=",",header=TRUE,stringsAsFactors=F, strip.white=TRUE) I also stripped whitespace and it looks like this:> dat1Name Age Weight 1 Alex 20 13 2 Bob 25 222L 3 Carol 24 0x29A 4 John 3BC 1E3 5 Katy 35 160 6 Jack3 34 140 Now look at column 2, Weight. As it is all seen as character, it retained those constructs but will R be able to convert them if I ask for as.integer?> as.integer(dat1$Weight)[1] 13 NA 666 1000 160 140 Warning message: NAs introduced by coercion It looks like adding an L fails but the hexadecimal notation works and so does the scientific notation. So you have every right to insist the only thing to be entered in a numeric column are digits from 0 to 9. Realistically, any entering the forms above, as well as numbers like 3.14, would otherwise work fine and if converted to an integer, would survive, albeit the latter would be truncated. Writing a regular expression that matches these is not straightforward but can be done. But the suggestions people made assume you are restricting it to standard decimal notation and that is fine. Now on to size. If you have a million rows in your data, the various algorithms proposed vary in what they do in between and some make all kinds of data structures that store integers or logicals in size up to or close to a million. But the nature of your requirement turns out to be amenable to working in smaller batches if that makes you happy. You could operate on something like a thousand rows at a time, and calculate which rows to keep and set them aside. Repeat for another thousand at a time till done and along the way merge what you are keeping. The memory used for each batch will be limited and reused regularly and garbage collection will deal with the rest. You do not even have to do the concatenation incrementally if you simply keep track of your offset at the start of each batch and add that back to the vector of indices of rows to keep and keep extending the vector of rows. At the end, you can use that to index all the data and shrink it. There are many such variations if you want to consider efficiency of memory or CPU time, albeit you probably can handle amounts like a million rows, but not billions, on most machines. For much larger sets of data, you could read in lines (or all at once but process one row at a time) and then selectively write the good ones back to disk or to a database one at a time (or in small batches) and then you can delete all current uses of memory and read it in again, but this time asking read.table to read in certain columns as integers, or even smaller integers, and the text column as character. If lots of rows are being skipped, you now have a smaller memory footprint for additional operations that presumably follow. Good luck. -----Original Message----- From: Val <valkremk at gmail.com> To: David Carlson <dcarlson at tamu.edu> Cc: r-help at R-project.org (r-help at r-project.org) <r-help at r-project.org> Sent: Sat, Jan 29, 2022 9:32 pm Subject: Re: [R] Row exclude Thank you David for your help. I just have one question on this. What is the purpose of? using the "unique" function on this? ? (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ]) I got the same result without using it. ? ? ? (dat2 <- dat1[-(c(BadName, BadAge, BadWeight)), ]) My concern is when I am applying this for the large data set the "unique" function may consume resources(time? and memory). Thank you. de.html and provide commented, minimal, self-contained, reproducible code.
It is possible that there would be errors on the same row for different columns. This does not happen in your example. If row 4 was "John6, 3BC, 175X" then row 4 would be included 3 times, but we only need to remove it once. Removing the duplicates is not necessary since R would not get confused, but length(unique(c(BadName, BadAge, BadWeight)) indicates how many lines are being removed. David On Sat, Jan 29, 2022 at 8:32 PM Val <valkremk at gmail.com> wrote:> Thank you David for your help. I just have one question on this. What is > the purpose of using the "unique" function on this? (dat2 <- > dat1[-unique(c(BadName, BadAge, BadWeight)), ]) I got the same result > without using it. ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > ZjQcmQRYFpfptBannerEnd > Thank you David for your help. > > I just have one question on this. What is the purpose of using the > "unique" function on this? > (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ]) > > I got the same result without using it. > (dat2 <- dat1[-(c(BadName, BadAge, BadWeight)), ]) > > My concern is when I am applying this for the large data set the "unique" > function may consume resources(time and memory). > > Thank you. > > On Sat, Jan 29, 2022 at 12:30 AM David Carlson <dcarlson at tamu.edu> wrote: > >> Given that you know which columns should be numeric and which should be >> character, finding characters in numeric columns or numbers in character >> columns is not difficult. Your data frame consists of three character >> columns so you can use regular expressions as Bert mentioned. First you >> should strip the whitespace out of your data: >> >> dat1 <-read.table(text="Name, Age, Weight >> Alex, 20, 13X >> Bob, 25, 142 >> Carol, 24, 120 >> John, 3BC, 175 >> Katy, 35, 160 >> Jack3, 34, 140",sep=",", header=TRUE, stringsAsFactors=FALSE, >> strip.white=TRUE) >> >> Now check to see if all of the fields are character as expected. >> >> sapply(dat1, typeof) >> # Name Age Weight >> # "character" "character" "character" >> >> Now identify character variables containing numbers and numeric variables >> containing characters: >> >> BadName <- which(grepl("[[:digit:]]", dat1$Name)) >> BadAge <- which(grepl("[[:alpha:]]", dat1$Age)) >> BadWeight <- which(grepl("[[:alpha:]]", dat1$Weight)) >> >> Next remove those rows: >> >> (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ]) >> # Name Age Weight >> # 2 Bob 25 142 >> # 3 Carol 24 120 >> # 5 Katy 35 160 >> >> You still need to convert Age and Weight to numeric, e.g. dat2$Age <- >> as.numeric(dat2$Age). >> >> David Carlson >> >> >> On Fri, Jan 28, 2022 at 11:59 PM Bert Gunter <bgunter.4567 at gmail.com> >> wrote: >> >>> As character 'polluted' entries will cause a column to be read in (via >>> read.table and relatives) as factor or character data, this sounds like a >>> job for regular expressions. If you are not familiar with this subject, >>> time to learn. And, yes, ZjQcmQRYFpfptBannerStart >>> This Message Is From an External Sender >>> This message came from outside your organization. >>> ZjQcmQRYFpfptBannerEnd >>> >>> As character 'polluted' entries will cause a column to be read in (via >>> read.table and relatives) as factor or character data, this sounds like a >>> job for regular expressions. If you are not familiar with this subject, >>> time to learn. And, yes, some heavy lifting will be required. >>> See ?regexp for a start maybe? Or the stringr package? >>> >>> Cheers, >>> Bert >>> >>> >>> >>> >>> On Fri, Jan 28, 2022, 7:08 PM Val <valkremk at gmail.com> wrote: >>> >>> > Hi All, >>> > >>> > I want to remove rows that contain a character string in an integer >>> > column or a digit in a character column. >>> > >>> > Sample data >>> > >>> > dat1 <-read.table(text="Name, Age, Weight >>> > Alex, 20, 13X >>> > Bob, 25, 142 >>> > Carol, 24, 120 >>> > John, 3BC, 175 >>> > Katy, 35, 160 >>> > Jack3, 34, 140",sep=",",header=TRUE,stringsAsFactors=F) >>> > >>> > If the Age/Weight column contains any character(s) then remove >>> > if the Name column contains an digit then remove that row >>> > Desired output >>> > >>> > Name Age weight >>> > 1 Bob 25 142 >>> > 2 Carol 24 120 >>> > 3 Katy 35 160 >>> > >>> > Thank you, >>> > >>> > ______________________________________________ >>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>> > https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-help__;!!KwNVnqRv!QW1WPKY5eSNT7sMW28dnAKV7IXWvIc4UwOwUHkJgJ8uuGUrIAXvRjZWVXhZB_0c$ >>> > PLEASE do read the posting guide >>> > https://urldefense.com/v3/__http://www.R-project.org/posting-guide.html__;!!KwNVnqRv!QW1WPKY5eSNT7sMW28dnAKV7IXWvIc4UwOwUHkJgJ8uuGUrIAXvRjZWVRmZSfcI$ >>> > and provide commented, minimal, self-contained, reproducible code. >>> > >>> >>> [[alternative HTML version deleted]] >>> >>> ______________________________________________R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, seehttps://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-help__;!!KwNVnqRv!QW1WPKY5eSNT7sMW28dnAKV7IXWvIc4UwOwUHkJgJ8uuGUrIAXvRjZWVXhZB_0c$ >>> PLEASE do read the posting guide https://urldefense.com/v3/__http://www.R-project.org/posting-guide.html__;!!KwNVnqRv!QW1WPKY5eSNT7sMW28dnAKV7IXWvIc4UwOwUHkJgJ8uuGUrIAXvRjZWVRmZSfcI$ >>> and provide commented, minimal, self-contained, reproducible code. >>> >>>[[alternative HTML version deleted]]