Hi all,

I have a big data set and want to remove rows conditionally. In my data file each person was recorded for several weeks. Somehow, during the recording periods, their last name was misreported. For each person, the last name should be the same; otherwise, that person should be removed from the data. For example, in the following data set, Alex was found to have two last names:

Alex West
Alex Joseph

Alex should be removed from the data. If this happens, I want to remove all rows with Alex. Here is my data set:

df <- read.table(header=TRUE, text='first week last
Alex 1 West
Bob 1 John
Cory 1 Jack
Cory 2 Jack
Bob 2 John
Bob 3 John
Alex 2 Joseph
Alex 3 West
Alex 4 West ')

Desired output:

  first week last
1   Bob    1 John
2   Bob    2 John
3   Bob    3 John
4  Cory    1 Jack
5  Cory    2 Jack

Thank you in advance

Basic stuff!

Either subscripting or ?subset.

There are many good R tutorials on the web. You should spend some (more?) time with some.

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Sat, Feb 11, 2017 at 9:02 PM, Val <valkremk at gmail.com> wrote:
> Hi all,
> I have a big data set and want to remove rows conditionally.
> [...]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Hi Val,

The by() function could be used here. With the dataframe dfr:

# split the data by first name and check for more than one last name for each first name
res <- by(dfr, dfr['first'], function(x) length(unique(x$last)) > 1)
# make the result more easily manipulated
res <- as.table(res)
res
# first
#  Alex   Bob  Cory
#  TRUE FALSE FALSE

# then use this result to subset the data
nw.dfr <- dfr[!dfr$first %in% names(res[res]) , ]
# sort if needed
nw.dfr[order(nw.dfr$first) , ]

  first week last
2   Bob    1 John
5   Bob    2 John
6   Bob    3 John
3  Cory    1 Jack
4  Cory    2 Jack

Philip

On 12/02/17 18:36, Bert Gunter wrote:
> Basic stuff!
>
> Either subscripting or ?subset.
>
> There are many good R tutorials on the web. You should spend some
> (more?) time with some.

Uh, Bert, perhaps I'm being obtuse (a common occurrence) but it doesn't seem basic to me. The only way that I can see how to go at it is via a for loop:

    rdln <- function(X) {
        # Remove discordant last names.
        ok <- logical(nrow(X))
        for(nm in unique(X$first)) {
            xxx <- unique(X$last[X$first==nm])
            if(length(xxx)==1) ok[X$first==nm] <- TRUE
        }
        Y <- X[ok,]
        Y <- Y[order(Y$first),]
        rownames(Y) <- 1:nrow(Y)
        Y
    }

Calling the toy data frame "melvin" rather than "df" (since "df" is the name of the built-in F density function, it is bad form to use it as the name of another object) I get:

    > rdln(melvin)
      first week last
    1   Bob    1 John
    2   Bob    2 John
    3   Bob    3 John
    4  Cory    1 Jack
    5  Cory    2 Jack

which is the desired output.

If there is a "basic stuff" way to do this I'd like to see it. Perhaps I will then be toadally embarrassed, but they say that this is good for one.

cheers,

Rolf

--
Technical Editor ANZJS
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276

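A loop-free version of the same idea, using only tapply() and subscripting, might look like the sketch below. It is an illustration only, not code posted in this thread: tapply() counts the distinct last names per first name, and that count is then looked up by name for every row. The data frame name melvin follows the message above; n_last, keep and result are names made up for this sketch.

```r
# Toy data from the original post, named "melvin" as in the message above.
melvin <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
first week last
Alex 1 West
Bob 1 John
Cory 1 Jack
Cory 2 Jack
Bob 2 John
Bob 3 John
Alex 2 Joseph
Alex 3 West
Alex 4 West
")

# Number of distinct last names for each first name (named by first name).
n_last <- tapply(melvin$last, melvin$first, function(x) length(unique(x)))

# Look the count up for every row by name; keep rows whose count is 1.
keep <- as.vector(n_last[melvin$first]) == 1
result <- melvin[keep, ]

# Sort by first name and renumber the rows to match the desired output.
result <- result[order(result$first), ]
rownames(result) <- NULL
result
```

Indexing n_last by melvin$first stretches the per-person counts back to one value per row, which is the step the for loop performs one name at a time.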
The "by" function aggregates and returns a result with generally fewer rows than the original data. Since you are looking to index the rows in the original data set, the "ave" function is better suited because it always returns a vector that is just as long as the input vector:

# I usually work with character data rather than factors if I plan
# to modify the data (e.g. removing rows)
DF <- read.table( text =
'first week last
Alex 1 West
Bob 1 John
Cory 1 Jack
Cory 2 Jack
Bob 2 John
Bob 3 John
Alex 2 Joseph
Alex 3 West
Alex 4 West
', header = TRUE, as.is = TRUE )

err <- ave( DF$last
          , DF[ , "first", drop = FALSE]
          , FUN = function( lst ) {
              length( unique( lst ) )
            }
          )
result <- DF[ "1" == err, ]
result

Notice that the ave function returns a vector of the same type as was given to it, so even though the function returns a numeric, the err vector is character.

If you wanted to be able to examine more than one other column in determining the keep/reject decision, you could do:

err2 <- ave( seq_along( DF$first )
           , DF[ , "first", drop = FALSE]
           , FUN = function( n ) {
               length( unique( DF[ n, "last" ] ) )
             }
           )
result2 <- DF[ 1 == err2, ]
result2

and then you would have the option to re-use the "n" index to look at other columns as well.

Finally, here is a dplyr solution:

library(dplyr)
result3 <- (   DF
           %>% group_by( first ) # like a prep for ave or by
           %>% mutate( err = length( unique( last ) ) ) # similar to ave
           %>% filter( 1 == err ) # drop the rows with too many last names
           %>% select( -err ) # drop the temporary column
           %>% as.data.frame # convert back to a plain-jane data frame
           )
result3

which uses a small set of verbs in a pipeline of functions to go from input to result in one pass.

If your data set is really big (running out of memory big) then you might want to investigate the data.table or sqlite packages, either of which can be combined with dplyr to get a standardized syntax for managing larger amounts of data.
However, most people actually aren't running out of memory so in most cases the extra horsepower isn't actually needed.

On Sun, 12 Feb 2017, P Tennant wrote:

> Hi Val,
>
> The by() function could be used here. With the dataframe dfr:
> [...]

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>     Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
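
The remark above that ave() returns a vector of the same type as its input can be checked on a tiny example. This is only a sketch: the vectors x and g and the helper count_unique are made up for illustration, with x standing in for DF$last and g for DF$first.

```r
# Hypothetical toy vectors: x plays the role of DF$last, g of DF$first.
x <- c("West", "John", "Joseph", "John")
g <- c("Alex", "Bob",  "Alex",   "Bob")

# Helper (illustrative name): number of distinct values in a vector.
count_unique <- function(v) length(unique(v))

# Character input: the integer counts are coerced back to character.
err_chr <- ave(x, g, FUN = count_unique)
class(err_chr)   # "character"

# Integer input (row positions): the counts stay integer.
err_int <- ave(seq_along(x), g, FUN = function(i) count_unique(x[i]))
class(err_int)   # "integer"
```

This is why the first version compares err against "1" while the second compares err2 against 1.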