thr3ads.net - R help - [R] remove [Feb 2017]

Jeff Newmiller

2017-Feb-13 00:31 UTC

[R] remove

Your question mystifies me, since it looks to me like you already know the
answer.
-- 
Sent from my phone. Please excuse my brevity.

On February 12, 2017 3:30:49 PM PST, Val <valkremk at gmail.com> wrote:
>Hi Jeff and all,
> How do I get the number of unique first names in the two data sets?
>
>for the first one,
>result2 <- DF[ 1 == err2, ]
>length(unique(result2$first))
>
>
>
>
>On Sun, Feb 12, 2017 at 12:42 AM, Jeff Newmiller
><jdnewmil at dcn.davis.ca.us> wrote:
>> The "by" function aggregates and returns a result with generally fewer
>> rows than the original data. Since you are looking to index the rows in
>> the original data set, the "ave" function is better suited because it
>> always returns a vector that is just as long as the input vector:
>>
>> # I usually work with character data rather than factors if I plan
>> # to modify the data (e.g. removing rows)
>> DF <- read.table( text =
>> 'first  week last
>> Alex    1  West
>> Bob     1  John
>> Cory    1  Jack
>> Cory    2  Jack
>> Bob     2  John
>> Bob     3  John
>> Alex    2  Joseph
>> Alex    3  West
>> Alex    4  West
>> ', header = TRUE, as.is = TRUE )
>>
>> err <- ave( DF$last
>>           , DF[ , "first", drop = FALSE]
>>           , FUN = function( lst ) {
>>               length( unique( lst ) )
>>             }
>>           )
>> result <- DF[ "1" == err, ]
>> result
>>
>> Notice that the ave function returns a vector of the same type as was
>> given to it, so even though the function returns a numeric the err
>> vector is character.
>>
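A minimal sketch of the type behaviour described above, using the same example data. The names `err_chr` and `err_int` are illustrative only and do not appear in the thread; the point is simply that the class of the result follows the class of the first argument to ave():

```r
# Same example data as in the quoted message.
DF <- read.table( text =
'first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2  Jack
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West
', header = TRUE, as.is = TRUE )

# Character vector in, character vector out: the counts are stored as strings.
err_chr <- ave( DF$last
              , DF[ , "first", drop = FALSE ]
              , FUN = function( lst ) length( unique( lst ) )
              )
class( err_chr )

# Integer index vector in, integer vector out: the counts stay numeric.
err_int <- ave( seq_along( DF$last )
              , DF[ , "first", drop = FALSE ]
              , FUN = function( n ) length( unique( DF[ n, "last" ] ) )
              )
class( err_int )
```

This is why the first version compares against "1" while the index-based version can compare against 1.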
>> If you wanted to be able to examine more than one other column in
>> determining the keep/reject decision, you could do:
>>
>> err2 <- ave( seq_along( DF$first )
>>            , DF[ , "first", drop = FALSE]
>>            , FUN = function( n ) {
>>               length( unique( DF[ n, "last" ] ) )
>>              }
>>            )
>> result2 <- DF[ 1 == err2, ]
>> result2
>>
>> and then you would have the option to re-use the "n" index to look at
>> other columns as well.
>>
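As a rough illustration of re-using the "n" index for more than one column, the sketch below assumes a hypothetical extra column `city` that is not part of the data in this thread. The names `DF2`, `err_multi` and `result_multi` are likewise made up for the example. A person is kept only if the combination of `last` and `city` is the same in every one of their rows:

```r
# Hypothetical data: "city" is an invented column used only for illustration.
DF2 <- data.frame( first = c( "Alex", "Bob", "Bob", "Cory", "Cory" )
                 , last  = c( "West", "John", "John", "Jack", "Jack" )
                 , city  = c( "Ames", "Reno", "Reno", "Lima", "Oslo" )
                 , stringsAsFactors = FALSE )

# For each first name, count the distinct (last, city) combinations
# found in the rows indexed by n.
err_multi <- ave( seq_along( DF2$first )
                , DF2[ , "first", drop = FALSE ]
                , FUN = function( n ) {
                    nrow( unique( DF2[ n, c( "last", "city" ) ] ) )
                  }
                )

# Keep only the people with exactly one combination.
result_multi <- DF2[ 1 == err_multi, ]
result_multi
```

Here Cory is dropped because the city differs between rows, even though the last name is consistent.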
>> Finally, here is a dplyr solution:
>>
>> library(dplyr)
>> result3 <- (   DF
>>            %>% group_by( first ) # like a prep for ave or by
>>            %>% mutate( err = length( unique( last ) ) ) # similar to ave
>>            %>% filter( 1 == err ) # drop the rows with too many last names
>>            %>% select( -err ) # drop the temporary column
>>            %>% as.data.frame # convert back to a plain-jane data frame
>>            )
>> result3
>>
>> which uses a small set of verbs in a pipeline of functions to go from
>> input to result in one pass.
>>
>> If your data set is really big (running out of memory big) then you
>> might want to investigate the data.table or sqlite packages, either of
>> which can be combined with dplyr to get a standardized syntax for
>> managing larger amounts of data. However, most people actually aren't
>> running out of memory so in most cases the extra horsepower isn't
>> actually needed.
>>
>>
>> On Sun, 12 Feb 2017, P Tennant wrote:
>>
>>> Hi Val,
>>>
>>> The by() function could be used here. With the dataframe dfr:
>>>
>>> # split the data by first name and check for more than one last name
>>> # for each first name
>>> res <- by(dfr, dfr['first'], function(x) length(unique(x$last)) > 1)
>>> # make the result more easily manipulated
>>> res <- as.table(res)
>>> res
>>> # first
>>> # Alex   Bob  Cory
>>> # TRUE FALSE FALSE
>>>
>>> # then use this result to subset the data
>>> nw.dfr <- dfr[!dfr$first %in% names(res[res]) , ]
>>> # sort if needed
>>> nw.dfr[order(nw.dfr$first) , ]
>>>
>>>  first week last
>>> 2   Bob    1 John
>>> 5   Bob    2 John
>>> 6   Bob    3 John
>>> 3  Cory    1 Jack
>>> 4  Cory    2 Jack
>>>
>>>
>>> Philip
>>>
>>> On 12/02/2017 4:02 PM, Val wrote:
>>>>
>>>> Hi all,
>>>> I have a big data set and want to remove rows conditionally.
>>>> In my data file each person was recorded for several weeks. Somehow,
>>>> during the recording periods, their last name was misreported. For
>>>> each person, the last name should be the same; otherwise the person
>>>> should be removed from the data. For example, in the following data
>>>> set, Alex was found to have two last names.
>>>>
>>>> Alex   West
>>>> Alex   Joseph
>>>>
>>>> Alex should be removed from the data. If this happens, I want to
>>>> remove all rows with Alex. Here is my data set:
>>>>
>>>> df<- read.table(header=TRUE, text='first  week last
>>>> Alex    1  West
>>>> Bob     1  John
>>>> Cory    1  Jack
>>>> Cory    2  Jack
>>>> Bob     2  John
>>>> Bob     3  John
>>>> Alex    2  Joseph
>>>> Alex    3  West
>>>> Alex    4  West ')
>>>>
>>>> Desired output
>>>>
>>>>   first week last
>>>> 1   Bob    1 John
>>>> 2   Bob    2 John
>>>> 3   Bob    3 John
>>>> 4  Cory    1 Jack
>>>> 5  Cory    2 Jack
>>>>
>>>> Thank you in advance
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>>
>>
>> ---------------------------------------------------------------------------
>> Jeff Newmiller                        The     .....       .....  Go Live...
>> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>                                       Live:   OO#.. Dead: OO#..  Playing
>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>> ---------------------------------------------------------------------------

Val

2017-Feb-13 01:28 UTC


[R] remove

Sorry Jeff, I did not finish my email. I accidentally touched the send button.
My question was about the different results I got when I used this one
length(unique(result2$first))
     vs
dim(result2[!duplicated(result2[,c('first')]),])[1]

I did get different results, but now I have found the problem.

Thank you!
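
For reference, the two expressions compared above would normally be expected to give the same count on an ordinary data frame. A small sketch with made-up rows follows. The data frame is illustrative and is not the poster's actual data, and the names `n_unique` and `n_dedup` are assumptions introduced here:

```r
# Illustrative stand-in for result2; not the data from the original problem.
result2 <- data.frame( first = c( "Bob", "Cory", "Cory", "Bob", "Bob" )
                     , week  = c( 1, 1, 2, 2, 3 )
                     , last  = c( "John", "Jack", "Jack", "John", "John" )
                     , stringsAsFactors = FALSE )

# Count the distinct first names directly.
n_unique <- length( unique( result2$first ) )

# Keep the first row for each first name, then count the rows.
n_dedup <- dim( result2[ !duplicated( result2[ , c( 'first' ) ] ), ] )[ 1 ]

n_unique == n_dedup
```

Note that the `[1]` has to follow `dim(...)` directly so that it picks out the row count.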









Val

2017-Feb-13 04:18 UTC


[R] remove

Hi Jeff and All,

When I examined the excluded data, i.e., first names with
different last names, I noticed that some last names were not
recorded. For instance, I modified the data as follows:
DF <- read.table( text =
'first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2     -
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West
', header = TRUE, as.is = TRUE )


err2 <- ave( seq_along( DF$first )
           , DF[ , "first", drop = FALSE]
           , FUN = function( n ) {
              length( unique( DF[ n, "last" ] ) )
             }
           )
result2 <- DF[ 1 == err2, ]
result2

first week last
2   Bob    1 John
5   Bob    2 John
6   Bob    3 John

However, I want to keep Cory's record. It is assumed that a last name that
was not recorded is the same as the person's other last name.

The final output should be

first week last
   Bob    1 John
   Bob    2 John
   Bob    3 John
  Cory    1  Jack
  Cory    2   -

Thank you again!
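
One possible way to get the output requested above is to ignore the "-" entries when counting distinct last names. This is only a sketch built on the ave() approach from earlier in the thread. It assumes that "-" always means "not recorded", and the names `n_last`, `kept` and `kept_sorted` are introduced here for illustration:

```r
DF <- read.table( text =
'first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2     -
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West
', header = TRUE, as.is = TRUE )

# Count distinct last names per first name, skipping the "-" placeholder.
n_last <- ave( seq_along( DF$first )
             , DF[ , "first", drop = FALSE ]
             , FUN = function( n ) {
                 lst <- DF[ n, "last" ]
                 # assumption: "-" marks a last name that was not recorded
                 length( unique( lst[ "-" != lst ] ) )
               }
             )

# Keep people with at most one recorded last name.
kept <- DF[ n_last <= 1, ]
kept_sorted <- kept[ order( kept$first, kept$week ), ]
kept_sorted
```

Using `<= 1` rather than `== 1` also keeps a person whose last name was never recorded at all; use `1 == n_last` instead if such people should be dropped.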

