Your question mystifies me, since it looks to me like you already know the answer.
--
Sent from my phone. Please excuse my brevity.

On February 12, 2017 3:30:49 PM PST, Val <valkremk at gmail.com> wrote:
>Hi Jeff and all,
> How do I get the number of unique first names in the two data sets?
>
>for the first one,
>result2 <- DF[ 1 == err2, ]
>length(unique(result2$first))
>
>On Sun, Feb 12, 2017 at 12:42 AM, Jeff Newmiller
><jdnewmil at dcn.davis.ca.us> wrote:
>> The "by" function aggregates and returns a result with generally fewer rows
>> than the original data. Since you are looking to index the rows in the
>> original data set, the "ave" function is better suited because it always
>> returns a vector that is just as long as the input vector:
>>
>> # I usually work with character data rather than factors if I plan
>> # to modify the data (e.g. removing rows)
>> DF <- read.table( text =
>> 'first week last
>> Alex 1 West
>> Bob 1 John
>> Cory 1 Jack
>> Cory 2 Jack
>> Bob 2 John
>> Bob 3 John
>> Alex 2 Joseph
>> Alex 3 West
>> Alex 4 West
>> ', header = TRUE, as.is = TRUE )
>>
>> err <- ave( DF$last
>>           , DF[ , "first", drop = FALSE]
>>           , FUN = function( lst ) {
>>               length( unique( lst ) )
>>             }
>>           )
>> result <- DF[ "1" == err, ]
>> result
>>
>> Notice that the ave function returns a vector of the same type as was given
>> to it, so even though the function returns a numeric the err
>> vector is character.
>>
>> If you wanted to be able to examine more than one other column in
>> determining the keep/reject decision, you could do:
>>
>> err2 <- ave( seq_along( DF$first )
>>            , DF[ , "first", drop = FALSE]
>>            , FUN = function( n ) {
>>                length( unique( DF[ n, "last" ] ) )
>>              }
>>            )
>> result2 <- DF[ 1 == err2, ]
>> result2
>>
>> and then you would have the option to re-use the "n" index to look at other
>> columns as well.
>>
>> Finally, here is a dplyr solution:
>>
>> library(dplyr)
>> result3 <- ( DF
>>     %>% group_by( first ) # like a prep for ave or by
>>     %>% mutate( err = length( unique( last ) ) ) # similar to ave
>>     %>% filter( 1 == err ) # drop the rows with too many last names
>>     %>% select( -err ) # drop the temporary column
>>     %>% as.data.frame # convert back to a plain-jane data frame
>>     )
>> result3
>>
>> which uses a small set of verbs in a pipeline of functions to go from input
>> to result in one pass.
>>
>> If your data set is really big (running out of memory big) then you might
>> want to investigate the data.table or sqlite packages, either of which can
>> be combined with dplyr to get a standardized syntax for managing larger
>> amounts of data. However, most people actually aren't running out of memory
>> so in most cases the extra horsepower isn't actually needed.
>>
>>
>> On Sun, 12 Feb 2017, P Tennant wrote:
>>
>>> Hi Val,
>>>
>>> The by() function could be used here. With the dataframe dfr:
>>>
>>> # split the data by first name and check for more than one last name for
>>> # each first name
>>> res <- by(dfr, dfr['first'], function(x) length(unique(x$last)) > 1)
>>> # make the result more easily manipulated
>>> res <- as.table(res)
>>> res
>>> # first
>>> #  Alex   Bob  Cory
>>> #  TRUE FALSE FALSE
>>>
>>> # then use this result to subset the data
>>> nw.dfr <- dfr[!dfr$first %in% names(res[res]) , ]
>>> # sort if needed
>>> nw.dfr[order(nw.dfr$first) , ]
>>>
>>>   first week last
>>> 2   Bob    1 John
>>> 5   Bob    2 John
>>> 6   Bob    3 John
>>> 3  Cory    1 Jack
>>> 4  Cory    2 Jack
>>>
>>>
>>> Philip
>>>
>>> On 12/02/2017 4:02 PM, Val wrote:
>>>>
>>>> Hi all,
>>>> I have a big data set and want to remove rows conditionally.
>>>> In my data file each person were recorded for several weeks. Somehow
>>>> during the recording periods, their last name was misreported. For
>>>> each person, the last name should be the same. Otherwise remove from
>>>> the data. Example, in the following data set, Alex was found to have
>>>> two last names .
>>>>
>>>> Alex West
>>>> Alex Joseph
>>>>
>>>> Alex should be removed from the data. if this happens then I want
>>>> remove all rows with Alex. Here is my data set
>>>>
>>>> df<- read.table(header=TRUE, text='first week last
>>>> Alex 1 West
>>>> Bob 1 John
>>>> Cory 1 Jack
>>>> Cory 2 Jack
>>>> Bob 2 John
>>>> Bob 3 John
>>>> Alex 2 Joseph
>>>> Alex 3 West
>>>> Alex 4 West ')
>>>>
>>>> Desired output
>>>>
>>>>   first week last
>>>> 1   Bob    1 John
>>>> 2   Bob    2 John
>>>> 3   Bob    3 John
>>>> 4  Cory    1 Jack
>>>> 5  Cory    2 Jack
>>>>
>>>> Thank you in advance
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>> ---------------------------------------------------------------------------
>> Jeff Newmiller                        The     .....       .....  Go Live...
>> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>                                       Live:   OO#.. Dead: OO#..  Playing
>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>> ---------------------------------------------------------------------------
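
The remark in the quoted reply that ave() hands back a vector of the same type as its input can be checked with a short sketch. The four-row data frame and the helper name count_unique are made up here for illustration and are not from the thread. The grouping column is also passed as a plain vector rather than a one-column data frame, which should behave the same way.

```r
# Toy data, shortened from the thread's example (illustrative only)
DF <- data.frame(
  first = c("Alex", "Bob", "Cory", "Alex"),
  week  = c(1, 1, 1, 2),
  last  = c("West", "John", "Jack", "Joseph"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of distinct values in a vector
count_unique <- function(x) length(unique(x))

# Character input: the per-group counts are written back into a
# character vector, so they come out as strings
err_chr <- ave(DF$last, DF$first, FUN = count_unique)

# Integer row index as input: the counts stay integer, and the index
# "n" can be reused to look at any column of DF
err_int <- ave(seq_along(DF$first), DF$first,
               FUN = function(n) count_unique(DF$last[n]))

class(err_chr)
class(err_int)

# Keep only people with exactly one distinct last name
DF[err_int == 1, ]
```

This is why the first version in the reply compares against "1" while the second compares against 1.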
Sorry Jeff, I did not finish my email. I accidentally touched the send button.
My question was about the difference between using this one

length(unique(result2$first))

vs

dim(result2[!duplicated(result2[,c('first')]),])[1]

I did get different results, but I have now found the problem.
Thank you!
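
For reference, the two counting expressions above would be expected to agree on the same data frame. The message does not say what caused the discrepancy, so the sketch below only illustrates that agreement on a made-up result2; the names n_unique, n_not_dup and n_rows are invented for the example.

```r
# Made-up stand-in for result2 (illustrative only)
result2 <- data.frame(
  first = c("Bob", "Cory", "Cory", "Bob", "Bob"),
  week  = c(1, 1, 2, 2, 3),
  stringsAsFactors = FALSE
)

# Count distinct first names directly
n_unique <- length(unique(result2$first))

# Count rows left after dropping repeated first names
n_not_dup <- dim(result2[!duplicated(result2[, c("first")]), ])[1]

# Same idea with nrow(); drop = FALSE keeps a data frame even if
# result2 had only a single column
n_rows <- nrow(result2[!duplicated(result2$first), , drop = FALSE])

c(n_unique, n_not_dup, n_rows)
```

If the numbers differ in practice, one thing to check is whether both expressions were run on the same object and the same column.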
On Sun, Feb 12, 2017 at 6:31 PM, Jeff Newmiller
<jdnewmil at dcn.davis.ca.us> wrote:
> Your question mystifies me, since it looks to me like you already know the answer.
Hi Jeff and All,
When I examined the excluded data, i.e., first names with
different last names, I noticed that some last names were not
recorded.
For instance, I modified the data as follows:
DF <- read.table( text = 'first week last
Alex 1 West
Bob 1 John
Cory 1 Jack
Cory 2 -
Bob 2 John
Bob 3 John
Alex 2 Joseph
Alex 3 West
Alex 4 West
', header = TRUE, as.is = TRUE )
err2 <- ave( seq_along( DF$first )
           , DF[ , "first", drop = FALSE]
           , FUN = function( n ) {
               length( unique( DF[ n, "last" ] ) )
             }
           )
result2 <- DF[ 1 == err2, ]
result2
  first week last
2   Bob    1 John
5   Bob    2 John
6   Bob    3 John
However, I want to keep Cory's records. It is assumed that a last name
that was not recorded is the same as that person's recorded last name.
The final output should be
first week last
Bob 1 John
Bob 2 John
Bob 3 John
Cory 1 Jack
Cory 2 -
Thank you again!
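
One possible way to get that output, sketched here under the assumption that "-" is the only marker for a last name that was not recorded, is to leave the marker out before counting distinct last names. The names err3 and kept are invented for this example. The comparison uses <= 1 so that a person whose last name was never recorded (count 0) is also kept, which is a guess about the intended rule.

```r
DF <- read.table(text = "first week last
Alex 1 West
Bob 1 John
Cory 1 Jack
Cory 2 -
Bob 2 John
Bob 3 John
Alex 2 Joseph
Alex 3 West
Alex 4 West
", header = TRUE, as.is = TRUE)

# Count distinct last names per first name, ignoring the "-" marker
err3 <- ave(seq_along(DF$first), DF$first, FUN = function(n) {
  lst <- DF$last[n]
  length(unique(lst[lst != "-"]))
})

# Keep people with at most one distinct recorded last name
kept <- DF[err3 <= 1, ]

# Sort by first name, then week, to match the desired layout
kept <- kept[order(kept$first, kept$week), ]
kept
```

If missing last names can also appear as NA or as empty strings, the filter inside the function would need to exclude those as well.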
On Sun, Feb 12, 2017 at 7:28 PM, Val <valkremk at gmail.com> wrote:
> Sorry Jeff, I did not finish my email. I accidentally touched the send button.