thr3ads.net - R help - [R] remove [Feb 2017]

Jeff Newmiller

2017-Feb-12 06:42 UTC

[R] remove

The "by" function aggregates and returns a result with generally fewer
rows than the original data. Since you are looking to index the rows in 
the original data set, the "ave" function is better suited because it 
always returns a vector that is just as long as the input vector:

# I usually work with character data rather than factors if I plan
# to modify the data (e.g. removing rows)
DF <- read.table( text =
'first  week last
Alex    1  West
Bob     1  John
Cory    1  Jack
Cory    2  Jack
Bob     2  John
Bob     3  John
Alex    2  Joseph
Alex    3  West
Alex    4  West
', header = TRUE, as.is = TRUE )

err <- ave( DF$last
           , DF[ , "first", drop = FALSE]
           , FUN = function( lst ) {
               length( unique( lst ) )
             }
           )
result <- DF[ "1" == err, ]
result

Notice that the ave function returns a vector of the same type as was 
given to it, so even though FUN returns a number, the err vector is 
character (which is why it is compared with "1" above).
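
A small sketch of the type behaviour described above, using made-up toy
vectors rather than the DF from this post. It assumes nothing beyond base R:
character input to ave gives a character result, while an integer index
gives an integer result.

```r
# Toy vectors (made up for illustration)
first <- c("Alex", "Bob", "Alex", "Bob")
last  <- c("West", "John", "Joseph", "John")

# Character in, character out: the counts come back as strings
n_chr <- ave(last, first, FUN = function(x) length(unique(x)))
print(class(n_chr))

# Integer in, integer out: pass row positions and look up `last` inside FUN
n_int <- ave(seq_along(last), first,
             FUN = function(i) length(unique(last[i])))
print(class(n_int))

# Either form can drive the row filter; here the integer one is used
keep <- n_int == 1
print(first[keep])
```

If the character form is used, comparing against "1" or wrapping the result
in as.integer() both work.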

If you wanted to be able to examine more than one other column in 
determining the keep/reject decision, you could do:

err2 <- ave( seq_along( DF$first )
            , DF[ , "first", drop = FALSE]
            , FUN = function( n ) {
               length( unique( DF[ n, "last" ] ) )
              }
            )
result2 <- DF[ 1 == err2, ]
result2

and then you would have the option to re-use the "n" index to look at 
other columns as well.
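
As a rough illustration of re-using the "n" index, here is a sketch that
checks two columns at once. The city column and its values are hypothetical
(the original data has no such column); they are only there to show the
pattern.

```r
# Hypothetical data: `city` is an invented second column to check
DF2 <- data.frame(
  first = c("Alex", "Bob",  "Cory", "Cory", "Bob",  "Alex"),
  last  = c("West", "John", "Jack", "Jack", "John", "Joseph"),
  city  = c("Rome", "Oslo", "Lima", "Lima", "Bonn", "Rome"),
  stringsAsFactors = FALSE
)

# n holds the row positions of one first name at a time, so any number of
# columns can be inspected inside FUN
ok <- ave(seq_len(nrow(DF2)), DF2$first, FUN = function(n) {
  length(unique(DF2$last[n])) == 1 && length(unique(DF2$city[n])) == 1
})

# ave keeps the integer type of its first argument, so TRUE/FALSE become 1/0
kept <- DF2[ok == 1, ]
print(kept)
```

Here Alex fails the last-name check and Bob fails the (invented) city check,
so only the Cory rows remain.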

Finally, here is a dplyr solution:

library(dplyr)
result3 <- (   DF
            %>% group_by( first ) # like a prep for ave or by
            %>% mutate( err = length( unique( last ) ) ) # similar to ave
            %>% filter( 1 == err ) # drop the rows with too many last names
            %>% select( -err ) # drop the temporary column
            %>% as.data.frame # convert back to a plain-jane data frame
            )
result3

which uses a small set of verbs in a pipeline of functions to go from 
input to result in one pass.

If your data set is really big (running out of memory big) then you might 
want to investigate the data.table or sqlite packages, either of which can 
be combined with dplyr to get a standardized syntax for managing larger 
amounts of data. However, most people actually aren't running out of 
memory so in most cases the extra horsepower isn't actually needed.
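
For reference, a minimal sketch of what the same filter might look like in
data.table directly (not through dplyr). This is an assumption about how one
could write it, not something tested on large data, and the block is skipped
if the data.table package is not installed.

```r
# Runs only when data.table is available
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  DT <- data.table(
    first = c("Alex", "Bob", "Cory", "Cory", "Bob", "Bob",
              "Alex", "Alex", "Alex"),
    week  = c(1L, 1L, 1L, 2L, 2L, 3L, 2L, 3L, 4L),
    last  = c("West", "John", "Jack", "Jack", "John", "John",
              "Joseph", "West", "West")
  )

  # For each first name, return the group's rows (.SD) only when it has
  # exactly one distinct last name; uniqueN() counts distinct values
  kept_dt <- DT[, if (uniqueN(last) == 1L) .SD, by = "first"]
  print(kept_dt)
}
```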

On Sun, 12 Feb 2017, P Tennant wrote:
> Hi Val,
>
> The by() function could be used here. With the dataframe dfr:
>
> # split the data by first name and check for more than one last name for
> # each first name
> res <- by(dfr, dfr['first'], function(x) length(unique(x$last)) > 1)
> # make the result more easily manipulated
> res <- as.table(res)
> res
> # first
> # Alex   Bob  Cory
> # TRUE FALSE FALSE
>
> # then use this result to subset the data
> nw.dfr <- dfr[!dfr$first %in% names(res[res]) , ]
> # sort if needed
> nw.dfr[order(nw.dfr$first) , ]
>
>  first week last
> 2   Bob    1 John
> 5   Bob    2 John
> 6   Bob    3 John
> 3  Cory    1 Jack
> 4  Cory    2 Jack
>
>
> Philip
>
> On 12/02/2017 4:02 PM, Val wrote:
>> Hi all,
>> I have a big data set and want to remove rows conditionally.
>> In my data file each person was recorded for several weeks. Somehow
>> during the recording periods, their last name was misreported. For
>> each person, the last name should be the same; otherwise the person
>> should be removed from the data. For example, in the following data set,
>> Alex was found to have two last names.
>> 
>> Alex   West
>> Alex   Joseph
>> 
>> Alex should be removed from the data. If this happens, then I want to
>> remove all rows with Alex. Here is my data set:
>> 
>> df<- read.table(header=TRUE, text='first  week last
>> Alex    1  West
>> Bob     1  John
>> Cory    1  Jack
>> Cory    2  Jack
>> Bob     2  John
>> Bob     3  John
>> Alex    2  Joseph
>> Alex    3  West
>> Alex    4  West ')
>> 
>> Desired output
>>
>>   first week last
>> 1   Bob    1 John
>> 2   Bob    2 John
>> 3   Bob    3 John
>> 4  Cory    1 Jack
>> 5  Cory    2 Jack
>> 
>> Thank you in advance
>> 
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide 
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k

P Tennant

2017-Feb-12 07:19 UTC


[R] remove

Hi Jeff,

Why do you say ave() is better suited *because* it always returns a 
vector that is just as long as the input vector? Is it because that 
feature (of equal length), allows match() to be avoided, and as a 
result, the subsequent subsetting is faster with very large datasets?

Thanks, Philip



Val

2017-Feb-12 15:12 UTC


[R] remove

Jeff, Rolf and Philip,
Thank you very much for your suggestions.

Jeff, you suggested considering data.table if the data is big.
My data is "big": it is more than 200M records, and I will see if this
package works.

Thank you again.



Jeff Newmiller

2017-Feb-12 15:18 UTC


[R] remove

Exactly. Sort of like the optimisation of using which.max instead of max
followed by which, though ideally the only intermediate vector would be the
logical vector that says keep or don't keep.
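
To make the contrast concrete, here is a minimal base R sketch of the two
routes on the example data. It only shows that both give the same
keep/don't-keep vector; it makes no timing claim, and tapply is used here as
a stand-in for by (an assumption made for brevity).

```r
first <- c("Alex", "Bob", "Cory", "Cory", "Bob", "Bob",
           "Alex", "Alex", "Alex")
last  <- c("West", "John", "Jack", "Jack", "John", "John",
           "Joseph", "West", "West")

# Aggregate route: one value per first name, then map the flagged names back
# onto the rows with %in% (which is built on match())
n_by_group <- tapply(last, first, function(x) length(unique(x)))
bad_names  <- names(n_by_group)[n_by_group > 1]
keep_via_match <- !(first %in% bad_names)

# ave route: the per-group value is already spread over the rows, so the
# logical index follows from a single comparison
n_by_row <- ave(seq_along(last), first,
                FUN = function(i) length(unique(last[i])))
keep_via_ave <- n_by_row == 1

print(identical(keep_via_match, keep_via_ave))
```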
-- 
Sent from my phone. Please excuse my brevity.


Val

2017-Feb-12 23:30 UTC


[R] remove

Hi Jeff and all,
How do I get the number of unique first names in the two data sets?

For the first one:
result2 <- DF[ 1 == err2, ]
length(unique(result2$first))
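
A self-contained sketch of the same count applied to both subsets. It
assumes "the two data sets" means the rows that were kept and the rows that
were removed; the names kept and removed are made up here.

```r
DF <- read.table(text = "first week last
Alex 1 West
Bob 1 John
Cory 1 Jack
Cory 2 Jack
Bob 2 John
Bob 3 John
Alex 2 Joseph
Alex 3 West
Alex 4 West", header = TRUE, as.is = TRUE)

# Number of distinct last names per first name, spread over the rows
err2 <- ave(seq_along(DF$first), DF$first,
            FUN = function(n) length(unique(DF$last[n])))

kept    <- DF[err2 == 1, ]   # consistent last name
removed <- DF[err2 != 1, ]   # more than one last name

n_kept    <- length(unique(kept$first))
n_removed <- length(unique(removed$first))
print(c(kept = n_kept, removed = n_removed))
```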





Jeff Newmiller

2017-Feb-13 00:31 UTC


[R] remove

Your question mystifies me, since it looks to me like you already know the
answer.
-- 
Sent from my phone. Please excuse my brevity.

