Your question mystifies me, since it looks to me like you already know the answer.
--
Sent from my phone. Please excuse my brevity.

On February 12, 2017 3:30:49 PM PST, Val <valkremk at gmail.com> wrote:
>Hi Jeff and all,
>How do I get the number of unique first names in the two data sets?
>
>for the first one,
>result2 <- DF[ 1 == err2, ]
>length(unique(result2$first))
>
>On Sun, Feb 12, 2017 at 12:42 AM, Jeff Newmiller
><jdnewmil at dcn.davis.ca.us> wrote:
>> The "by" function aggregates and returns a result with generally fewer
>> rows than the original data. Since you are looking to index the rows in
>> the original data set, the "ave" function is better suited because it
>> always returns a vector that is just as long as the input vector:
>>
>> # I usually work with character data rather than factors if I plan
>> # to modify the data (e.g. removing rows)
>> DF <- read.table( text =
>> 'first week last
>> Alex 1 West
>> Bob 1 John
>> Cory 1 Jack
>> Cory 2 Jack
>> Bob 2 John
>> Bob 3 John
>> Alex 2 Joseph
>> Alex 3 West
>> Alex 4 West
>> ', header = TRUE, as.is = TRUE )
>>
>> err <- ave( DF$last
>>           , DF[ , "first", drop = FALSE]
>>           , FUN = function( lst ) {
>>               length( unique( lst ) )
>>             }
>>           )
>> result <- DF[ "1" == err, ]
>> result
>>
>> Notice that the ave function returns a vector of the same type as was
>> given to it, so even though the function returns a numeric the err
>> vector is character.
>>
>> If you wanted to be able to examine more than one other column in
>> determining the keep/reject decision, you could do:
>>
>> err2 <- ave( seq_along( DF$first )
>>            , DF[ , "first", drop = FALSE]
>>            , FUN = function( n ) {
>>                length( unique( DF[ n, "last" ] ) )
>>              }
>>            )
>> result2 <- DF[ 1 == err2, ]
>> result2
>>
>> and then you would have the option to re-use the "n" index to look at
>> other columns as well.
>>
>> Finally, here is a dplyr solution:
>>
>> library(dplyr)
>> result3 <- ( DF
>>            %>% group_by( first ) # like a prep for ave or by
>>            %>% mutate( err = length( unique( last ) ) ) # similar to ave
>>            %>% filter( 1 == err ) # drop the rows with too many last names
>>            %>% select( -err ) # drop the temporary column
>>            %>% as.data.frame # convert back to a plain-jane data frame
>>            )
>> result3
>>
>> which uses a small set of verbs in a pipeline of functions to go from
>> input to result in one pass.
>>
>> If your data set is really big (running out of memory big) then you
>> might want to investigate the data.table or sqlite packages, either of
>> which can be combined with dplyr to get a standardized syntax for
>> managing larger amounts of data. However, most people actually aren't
>> running out of memory so in most cases the extra horsepower isn't
>> actually needed.
>>
>> On Sun, 12 Feb 2017, P Tennant wrote:
>>
>>> Hi Val,
>>>
>>> The by() function could be used here. With the dataframe dfr:
>>>
>>> # split the data by first name and check for more than one last name
>>> # for each first name
>>> res <- by(dfr, dfr['first'], function(x) length(unique(x$last)) > 1)
>>> # make the result more easily manipulated
>>> res <- as.table(res)
>>> res
>>> # first
>>> #  Alex   Bob  Cory
>>> #  TRUE FALSE FALSE
>>>
>>> # then use this result to subset the data
>>> nw.dfr <- dfr[!dfr$first %in% names(res[res]) , ]
>>> # sort if needed
>>> nw.dfr[order(nw.dfr$first) , ]
>>>
>>>   first week last
>>> 2   Bob    1 John
>>> 5   Bob    2 John
>>> 6   Bob    3 John
>>> 3  Cory    1 Jack
>>> 4  Cory    2 Jack
>>>
>>> Philip
>>>
>>> On 12/02/2017 4:02 PM, Val wrote:
>>>>
>>>> Hi all,
>>>> I have a big data set and want to remove rows conditionally.
>>>> In my data file each person was recorded for several weeks. Somehow
>>>> during the recording periods, their last name was misreported. For
>>>> each person, the last name should be the same. Otherwise remove from
>>>> the data. For example, in the following data set, Alex was found to
>>>> have two last names.
>>>>
>>>> Alex West
>>>> Alex Joseph
>>>>
>>>> Alex should be removed from the data. If this happens then I want to
>>>> remove all rows with Alex. Here is my data set
>>>>
>>>> df <- read.table(header=TRUE, text='first week last
>>>> Alex 1 West
>>>> Bob 1 John
>>>> Cory 1 Jack
>>>> Cory 2 Jack
>>>> Bob 2 John
>>>> Bob 3 John
>>>> Alex 2 Joseph
>>>> Alex 3 West
>>>> Alex 4 West ')
>>>>
>>>> Desired output
>>>>
>>>>   first week last
>>>> 1   Bob    1 John
>>>> 2   Bob    2 John
>>>> 3   Bob    3 John
>>>> 4  Cory    1 Jack
>>>> 5  Cory    2 Jack
>>>>
>>>> Thank you in advance
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>
>> ---------------------------------------------------------------------------
>> Jeff Newmiller                        The     .....       .....  Go Live...
>> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>                                       Live:   OO#.. Dead: OO#..  Playing
>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>> ---------------------------------------------------------------------------
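
The remark in the quoted reply about ave() keeping the type of its first argument can be checked with a small sketch. It assumes the same example data as the thread; the names err_chr and err_int are only for illustration, and the grouping is passed as DF$first rather than as a one-column data frame, which should be equivalent here.

```r
# Example data from the thread, read as character columns (as.is = TRUE)
DF <- read.table(text = "first week last
Alex 1 West
Bob 1 John
Cory 1 Jack
Cory 2 Jack
Bob 2 John
Bob 3 John
Alex 2 Joseph
Alex 3 West
Alex 4 West
", header = TRUE, as.is = TRUE)

# First argument is character, so the counts come back as character
err_chr <- ave(DF$last, DF$first,
               FUN = function(lst) length(unique(lst)))

# First argument is an integer index, so the counts stay integer
err_int <- ave(seq_along(DF$first), DF$first,
               FUN = function(n) length(unique(DF$last[n])))

class(err_chr)
class(err_int)
```

This is why the first version compares against "1" and the index-based version compares against 1.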
Sorry Jeff, I did not finish my email. I accidentally touched the send button.
My question was that when I used this one

length(unique(result2$first))

vs

dim(result2[!duplicated(result2[,c('first')]),])[1]

I did get different results, but now I have found the problem.

Thank you!

On Sun, Feb 12, 2017 at 6:31 PM, Jeff Newmiller
<jdnewmil at dcn.davis.ca.us> wrote:
> Your question mystifies me, since it looks to me like you already know the answer.
> --
> Sent from my phone. Please excuse my brevity.
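
For reference, the two counting expressions in the message above should normally agree. The following is a minimal sketch on the example data from the thread, not a diagnosis of whatever caused the difference Val saw; the names n_unique and n_dedup are made up for the illustration.

```r
# Example data from the thread
DF <- read.table(text = "first week last
Alex 1 West
Bob 1 John
Cory 1 Jack
Cory 2 Jack
Bob 2 John
Bob 3 John
Alex 2 Joseph
Alex 3 West
Alex 4 West
", header = TRUE, as.is = TRUE)

# Number of distinct last names per first name, one value per row
err2 <- ave(seq_along(DF$first), DF$first,
            FUN = function(n) length(unique(DF[n, "last"])))
result2 <- DF[1 == err2, ]

# Count distinct first names directly
n_unique <- length(unique(result2$first))

# Count rows left after dropping repeated first names
n_dedup <- dim(result2[!duplicated(result2[, c("first")]), ])[1]

n_unique
n_dedup
```

nrow() on the de-duplicated data frame would give the same number as dim(...)[1].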
Hi Jeff and All,

When I examined the excluded data, i.e., first names with different last
names, I noticed that some last names were not recorded. For instance,
I modified the data as follows

DF <- read.table( text =
'first week last
Alex 1 West
Bob 1 John
Cory 1 Jack
Cory 2 -
Bob 2 John
Bob 3 John
Alex 2 Joseph
Alex 3 West
Alex 4 West
', header = TRUE, as.is = TRUE )

err2 <- ave( seq_along( DF$first )
           , DF[ , "first", drop = FALSE]
           , FUN = function( n ) {
               length( unique( DF[ n, "last" ] ) )
             }
           )
result2 <- DF[ 1 == err2, ]
result2

  first week last
2   Bob    1 John
5   Bob    2 John
6   Bob    3 John

However, I want to keep Cory's record. It is assumed that a last name
that was not recorded is the same as the recorded one.

The final output should be

first week last
Bob 1 John
Bob 2 John
Bob 3 John
Cory 1 Jack
Cory 2 -

Thank you again!

On Sun, Feb 12, 2017 at 7:28 PM, Val <valkremk at gmail.com> wrote:
> Sorry Jeff, I did not finish my email. I accidentally touched the send button.
> My question was that when I used this one
> length(unique(result2$first))
> vs
> dim(result2[!duplicated(result2[,c('first')]),])[1]
>
> I did get different results, but now I have found the problem.
>
> Thank you!
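
One possible way to get the output asked for above is to leave the "-" entries out when counting distinct last names. This is only a sketch under two assumptions that are not stated in the thread: that "-" is the only marker for a missing last name, and that a person whose last name was never recorded at all should be kept (hence the <= 1 test). The names n_last and kept are made up for the illustration.

```r
# Modified example data from the message above, with one unrecorded last name
DF <- read.table(text = "first week last
Alex 1 West
Bob 1 John
Cory 1 Jack
Cory 2 -
Bob 2 John
Bob 3 John
Alex 2 Joseph
Alex 3 West
Alex 4 West
", header = TRUE, as.is = TRUE)

# Count distinct recorded last names per first name, ignoring "-"
n_last <- ave(seq_along(DF$first), DF$first,
              FUN = function(n) {
                last <- DF[n, "last"]
                length(unique(last[last != "-"]))
              })

# Keep people with at most one recorded last name, then sort for display
kept <- DF[n_last <= 1, ]
kept <- kept[order(kept$first, kept$week), ]
kept
```

Reading the file with na.strings = "-" and counting the non-NA names would be a variation on the same idea, at the cost of showing NA instead of "-" in the result.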