thr3ads.net - R help - [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows [Dec 2024]

Bert Gunter

2024-Dec-01 16:30 UTC

[R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows

Rui:
"Of these two, diff is faster. But of all the solutions posted so far,
Ben Bolker's is the fastest."

But the explicit version of diff is still considerably faster:
> D <- c(rep(1,10),rep(2,6),rep(3,2))
> microbenchmark(c(1L,diff(D)), times = 1000L)
Unit: microseconds
           expr   min    lq    mean median    uq    max neval
 c(1L, diff(D)) 3.075 3.198 3.34396   3.28 3.362 29.684  1000
> microbenchmark( as.integer(!duplicated(D)), times =1000L)
Unit: microseconds
                       expr   min    lq     mean median   uq  max neval
 as.integer(!duplicated(D)) 1.476 1.558 1.644264  1.599 1.64 16.4  1000
> microbenchmark( D - c(0L, D[-length(D)]), times = 1000L)
Unit: nanoseconds  ## note that unit is nanoseconds not microseconds
                     expr min  lq    mean median  uq  max neval
 D - c(0L, D[-length(D)]) 369 410 489.335    492 533 9840  1000
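
The three expressions timed above give the same flag on this particular vector, but that seems to depend on the IDs being sorted, starting at 1, and increasing by exactly 1. The following is only a sketch of that point: the vector G and the variable names are made up for illustration and are not from the thread.

```r
# Example ID vector from the thread: sorted, consecutive, starting at 1
D <- c(rep(1, 10), rep(2, 6), rep(3, 2))

first_dup  <- as.integer(!duplicated(D))   # 1 at the first occurrence of each value
first_diff <- c(1L, diff(D))               # 1 wherever the value increases by exactly 1
first_lag  <- D - c(0L, D[-length(D)])     # hand-written lagged difference

# All three agree on this input
stopifnot(all(first_dup == first_diff), all(first_dup == first_lag))

# Hypothetical IDs with a gap (assumption, not from the thread)
G <- c(1, 1, 5, 5)
gap_dup <- as.integer(!duplicated(G))      # 1 0 1 0
gap_lag <- G - c(0L, G[-length(G)])        # 1 0 4 0: the size of the jump, not a 0/1 flag

# For sorted IDs, comparing the difference with zero restores a 0/1 flag
gap_flag <- as.integer(c(TRUE, diff(G) != 0))   # 1 0 1 0
```

So the difference-based versions are fast, but on IDs with gaps they would need the extra comparison shown in the last line, while the duplicated() version does not rely on the spacing of the IDs.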

Cheers,
Bert

On Sat, Nov 30, 2024 at 11:05 PM Rui Barradas <ruipbarradas at sapo.pt> wrote:
>
> At 02:27 on 01/12/2024, Sorkin, John wrote:
> > Dear R help folks,
> >
> > First, my apologies for sending several related questions to the list
> > server. I am trying to learn how to manipulate data in R . . . and am having
> > difficulty getting my program to work. I greatly appreciate the help and support
> > list members give!
> >
> > I am trying to write a program that will run through a data frame
> > organized by ID and for the first line of each new group of data lines that has
> > the same ID create a new variable first that will be 1 for the first line of the
> > group and 0 for all other lines.
> >
> > e.g. if my original data is
> >   olddata
> >     ID date
> >      1     1
> >      1     1
> >      1     2
> >      1     2
> >      1     3
> >      1     3
> >      1     4
> >      1     4
> >      1     5
> >      1     5
> >      2     5
> >      2     5
> >      2     5
> >      2     6
> >      2     6
> >      2     6
> >      3   10
> >      3   10
> >
> > the new data will be
> > newdata
> >     ID date  first
> >      1     1       1
> >      1     1       0
> >      1     2       0
> >      1     2       0
> >      1     3       0
> >      1     3       0
> >      1     4       0
> >      1     4       0
> >      1     5       0
> >      1     5       0
> >      2     5       1
> >      2     5       0
> >      2     5       0
> >      2     6       0
> >      2     6       0
> >      2     6       0
> >      3   10       1
> >      3   10       0
> >
> > When I run the program below, I receive the following error:
> > Error in df[, "ID"] : incorrect number of dimensions
> >
> > My code:
> > # Create data.frame
> > ID <- c(rep(1,10),rep(2,6),rep(3,2))
> > date <- c(rep(1,2),rep(2,2),rep(3,2),rep(4,2),rep(5,2),
> >            rep(5,3),rep(6,3),rep(10,2))
> > olddata <- data.frame(ID=ID,date=date)
> > class(olddata)
> > cat("This is the original data frame","\n")
> > print(olddata)
> >
> > # This function is supposed to identify the first row
> > # within each level of ID and, for the first row, set
> > # the variable first to 1, and for all rows other than
> > # the first row set first to 0.
> > mydoit <- function(df){
> >    value <- ifelse (first(df[,"ID"]),1,0)
> >    cat("value=",value,"\n")
> >    df[,"first"] <- value
> > }
> > newdata <- aggregate(olddata,list(olddata[,"ID"]),mydoit)
> >
> > Thank you,
> > John
> >
> >
> > John David Sorkin M.D., Ph.D.
> > Professor of Medicine, University of Maryland School of Medicine;
> > Associate Director for Biostatistics and Informatics, Baltimore VA
> > Medical Center Geriatrics Research, Education, and Clinical Center;
> > PI Biostatistics and Informatics Core, University of Maryland School
> > of Medicine Claude D. Pepper Older Americans Independence Center;
> > Senior Statistician University of Maryland Center for Vascular
> > Research;
> >
> > Division of Gerontology and Paliative Care,
> > 10 North Greene Street
> > GRECC (BT/18/GR)
> > Baltimore, MD 21201-1524
> > Cell phone 443-418-5382
> >
> >
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > https://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> Hello,
>
> And here are two other solutions.
>
>
> olddata$first <- with(olddata, ave(seq_along(ID), ID, FUN = \(x) x == x[1L]))
>
> olddata$first <- c(1L, diff(olddata$ID))
>
>
> Of these two, diff is faster. But of all the solutions posted so far,
> Ben Bolker's is the fastest. And it can be made a little faster if
> as.integer substitutes for as.numeric.
> And dplyr::mutate now has a .by argument, which avoids the explicit call
> to group_by, with a performance gain.
>
>
> library(microbenchmark)
> library(dplyr)
>
> mb <- microbenchmark(
>    ave = with(olddata, ave(seq_along(ID), ID, FUN = \(x) x == x[1L])),
>    dup_num = as.numeric(! duplicated(olddata$ID)),
>    dup_int = as.integer(! duplicated(olddata$ID)),
>    diff = c(1L, diff(olddata$ID)),
>    dplyr_grp = olddata %>% group_by(ID) %>% mutate(first = as.integer(row_number() == 1)),
>    dplyr = olddata %>% mutate(first = as.integer(row_number() == 1), .by = ID)
> )
> print(mb, order = "median")
>
>
>
> However, note that dplyr operates on entire data frames and therefore is
> expected to be slower when tested against instructions that process one
> column only.
>
>
> Hope this helps,
>
> Rui Barradas
>
>

Sorkin, John

2024-Dec-02 04:18 UTC

Dear Colleagues,

I am grateful to all of you for helping me with my question, how to write R code
that will identify the first row of each ID within a data frame, create a
variable first=1 for the first row and first=0 for all repeats of the ID.

WOW!!!
I just saw Boris Steipe's answer to my question:
olddata$first <- as.numeric(! duplicated(olddata$ID))
The solution is elegant, short, easy to understand, and it uses base R! All
important characteristics of a good solution, at least for me. While I want to
learn solutions using packages that extend base R, I believe that a good
programmer learns how to do something using the base language and once that is
learned, explores ways to solve a programming problem using advanced packages.
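
For readers following along, here is a small sketch of how the one-liner above behaves on the example data, and of why the original aggregate() attempt stopped with "incorrect number of dimensions". As far as the base R documentation describes it, aggregate() calls FUN on one column of one group at a time, so FUN receives a plain vector and not a data frame. The anonymous function below is a simplified stand-in for the original mydoit(), not the original code.

```r
# Rebuild the example data from the original question
ID <- c(rep(1, 10), rep(2, 6), rep(3, 2))
date <- c(rep(1, 2), rep(2, 2), rep(3, 2), rep(4, 2), rep(5, 2),
          rep(5, 3), rep(6, 3), rep(10, 2))
olddata <- data.frame(ID = ID, date = date)

# Inside aggregate(), df is a numeric vector, so df[, "ID"] cannot work.
# tryCatch() captures the error message so that the script keeps running.
msg <- tryCatch(
  aggregate(olddata, list(olddata$ID), function(df) df[, "ID"]),
  error = function(e) conditionMessage(e)
)
print(msg)

# The solution quoted above works on the whole ID column at once:
# duplicated() is FALSE at the first occurrence of each ID and TRUE afterwards
olddata$first <- as.numeric(!duplicated(olddata$ID))
print(olddata)
```

Because duplicated() looks at the whole column, the rows do not even need to be sorted by ID for the flag to mark the first occurrence of each value.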

Each and every one of you (I hope I did not miss anyone in my list of email
addresses) took the time to read my emails and respond to me. Your collective
help is invaluable, and I am in your collective debt.

Many, many thanks,
John


avi.e.gross at gmail.com

2024-Dec-02 05:39 UTC

John,

Thanks for enlightening us so we better understand.

I won't argue with your wish to learn to do things in base R first. I
started that way, myself, and found lots of the commands not particularly easy
to fit into a single worldview. Many functions I read about were promptly
forgotten, especially those without great documentation and not enough examples
of real world usage.

This is why some packages that came later are important as they generally try to
come up with a somewhat consistent set of tools that often are also faster and
more flexible. There is often a set of reasons various packages are created in
the first place to meet real needs. And, I note that some may be subtle.
Original R was often inconsistent in the order of command arguments while the
dplyr and other tidyverse commands try as much as possible to make the first
argument be the one normally passed through a pipeline. R fairly recently added
a native pipe operator that may be faster than the magrittr pipe but in some
ways makes some functionality harder. The rest of R has not really been changed
to make using commands in pipelines easy.

You seem to have also looked at data.table and given you may have large amounts
of data, it may be designed in ways that might also be beneficial.

But as I do not want to relearn lots of R functions I never use, I will bow out
from further discussion as what I would offer these days would probably not be
what you want.

My personal opinion is that proper use of R can actually be far easier and more
flexible than what you had with proprietary software, which may largely consist of
frequently used canned reports.

I do want to point out a few things to consider.

When you go grouping, you may want to consider grouping (as well as sorting) by
multiple variables. You mention a variable with about 500 possibilities and
then another variable with an ID number but did not say the ID number was unique
across them all.
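
To illustrate the point about grouping by more than one variable, here is a minimal base R sketch. The data frame visits and its site column are hypothetical and stand in for the unnamed variable mentioned above. duplicated() also accepts a data frame, in which case it compares whole rows, so passing a two-column subset flags the first row of each combination.

```r
# Hypothetical data: the same ID value is reused at two different sites
visits <- data.frame(
  site = c("A", "A", "A", "B", "B", "B"),
  ID   = c(1, 1, 2, 1, 1, 2),
  date = c(1, 2, 1, 3, 4, 5)
)

# Flag on ID alone: ID 1 at site B is treated as a repeat of ID 1 at site A
visits$first_by_id <- as.integer(!duplicated(visits$ID))

# Flag on the site/ID combination: each pair gets its own first row
visits$first_by_site_id <- as.integer(!duplicated(visits[c("site", "ID")]))

print(visits)
```

If IDs are only unique within a site, the two columns differ, which is itself a quick way to find out whether the ID is unique across the whole data set.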

And, I want to note you may want to also look into testing the sanity of your
data. That is a wide area too. Things like duplicates, for example.

I do not know how many steps you can handle but there are sometimes designs that
make an algorithm work differently.

Consider your request to find the first row in each grouping and add a column
with a 1, and 0 for all others. If that is what you need, fine.

But, what if instead you just added a row number. Some rows would have a 1, and
some may have a 2, 3, or 4.

When you want to do something to just the rows with a 1, you can filter out a
subset of the data easily enough or apply a command only to those rows. But if
you want to test if any entry has more than 4 rows, this could allow you to
detect an error. Other ideas might be possible if that is how the data was
saved.
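
The row-number idea can be sketched in base R with ave(), the same function used earlier in the thread. The column name row_in_id and the limit max_rows are assumptions for illustration only.

```r
# ID column from the original question
ID <- c(rep(1, 10), rep(2, 6), rep(3, 2))
olddata <- data.frame(ID = ID)

# Running row number within each ID: 1, 2, 3, ... restarting at each new ID
olddata$row_in_id <- ave(seq_along(olddata$ID), olddata$ID, FUN = seq_along)

# The 0/1 flag can still be derived from it when needed
olddata$first <- as.integer(olddata$row_in_id == 1L)

# Sanity check: which IDs have more rows than an assumed maximum of 4?
max_rows <- 4L
too_many <- unique(olddata$ID[olddata$row_in_id > max_rows])
print(too_many)
```

Here IDs 1 and 2 are reported because they have 10 and 6 rows, while ID 3 has only 2.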

And, if it really is a 0/1 choice, fine, but consider the advantages or
disadvantages of what you save in the new column. Storing a numeric or an int
can take up space when storing a Boolean or TRUE/FALSE is what you need. R gives
you lots of flexibility which perhaps you did not have to think about before.
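
On the storage question, a quick way to compare the options is object.size(). In R, logical and integer vectors both use 4 bytes per element and doubles use 8, so a logical column is no smaller than an integer one, though both take about half the space of a numeric column. The vector length below is arbitrary and the variable names are only for illustration.

```r
n <- 1e6

flag_lgl <- rep(c(TRUE, FALSE), length.out = n)   # logical
flag_int <- as.integer(flag_lgl)                  # integer 0/1
flag_dbl <- as.numeric(flag_lgl)                  # double 0/1

# Compare memory use of the three representations
print(object.size(flag_lgl))
print(object.size(flag_int))
print(object.size(flag_dbl))

# A logical column can be used directly for counting and subsetting
n_first <- sum(flag_lgl)
```

The practical argument for a logical column is therefore mostly readability and direct use in subsetting, not memory.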

All I know is that so much of what you want to do is easily enough done with a
pipeline or two in dplyr. But this is your task and you choose what makes sense.
It specializes in group analysis and generates reports and so on. It may not be
how you think.

