Rui Barradas
2024-Dec-01 07:05 UTC
[R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows
At 02:27 on 01/12/2024, Sorkin, John wrote:
> Dear R help folks,
>
> First my apologies for sending several related questions to the list server. I am trying to learn how to manipulate data in R . . . and am having difficulty getting my program to work. I greatly appreciate the help and support list members give!
>
> I am trying to write a program that will run through a data frame organized by ID and, for the first line of each new group of data lines that has the same ID, create a new variable first that will be 1 for the first line of the group and 0 for all other lines.
>
> e.g. if my original data is
> olddata
> ID date
> 1 1
> 1 1
> 1 2
> 1 2
> 1 3
> 1 3
> 1 4
> 1 4
> 1 5
> 1 5
> 2 5
> 2 5
> 2 5
> 2 6
> 2 6
> 2 6
> 3 10
> 3 10
>
> the new data will be
> newdata
> ID date first
> 1 1 1
> 1 1 0
> 1 2 0
> 1 2 0
> 1 3 0
> 1 3 0
> 1 4 0
> 1 4 0
> 1 5 0
> 1 5 0
> 2 5 1
> 2 5 0
> 2 5 0
> 2 6 0
> 2 6 0
> 2 6 0
> 3 10 1
> 3 10 0
>
> When I run the program below, I receive the following error:
> Error in df[, "ID"] : incorrect number of dimensions
>
> My code:
> # Create data.frame
> ID <- c(rep(1,10),rep(2,6),rep(3,2))
> date <- c(rep(1,2),rep(2,2),rep(3,2),rep(4,2),rep(5,2),
>           rep(5,3),rep(6,3),rep(10,2))
> olddata <- data.frame(ID=ID,date=date)
> class(olddata)
> cat("This is the original data frame","\n")
> print(olddata)
>
> # This function is supposed to identify the first row
> # within each level of ID and, for the first row, set
> # the variable first to 1, and for all rows other than
> # the first row set first to 0.
> mydoit <- function(df){
>   value <- ifelse (first(df[,"ID"]),1,0)
>   cat("value=",value,"\n")
>   df[,"first"] <- value
> }
> newdata <- aggregate(olddata,list(olddata[,"ID"]),mydoit)
>
> Thank you,
> John
>
>
> John David Sorkin M.D., Ph.D.
> Professor of Medicine, University of Maryland School of Medicine;
> Associate Director for Biostatistics and Informatics, Baltimore VA Medical Center Geriatrics Research, Education, and Clinical Center;
> PI Biostatistics and Informatics Core, University of Maryland School of Medicine Claude D. Pepper Older Americans Independence Center;
> Senior Statistician University of Maryland Center for Vascular Research;
>
> Division of Gerontology and Palliative Care,
> 10 North Greene Street
> GRECC (BT/18/GR)
> Baltimore, MD 21201-1524
> Cell phone 443-418-5382

Hello,

And here are two other solutions.

olddata$first <- with(olddata, ave(seq_along(ID), ID, FUN = \(x) x == x[1L]))
olddata$first <- c(1L, diff(olddata$ID))

Of these two, diff is faster. But of all the solutions posted so far, Ben Bolker's is the fastest. And it can be made a little faster if as.integer substitutes for as.numeric.
And dplyr::mutate now has a .by argument, which avoids the explicit call to group_by, with a performance gain.

library(dplyr)           # provides %>%, group_by, mutate and row_number
library(microbenchmark)

mb <- microbenchmark(
  ave = with(olddata, ave(seq_along(ID), ID, FUN = \(x) x == x[1L])),
  dup_num = as.numeric(! duplicated(olddata$ID)),
  dup_int = as.integer(! duplicated(olddata$ID)),
  diff = c(1L, diff(olddata$ID)),
  dplyr_grp = olddata %>% group_by(ID) %>% mutate(first = as.integer(row_number() == 1)),
  dplyr = olddata %>% mutate(first = as.integer(row_number() == 1), .by = ID)
)
print(mb, order = "median")

However, note that dplyr operates on entire data.frames and is therefore expected to be slower when tested against instructions that process one column only.
Hope this helps,

Rui Barradas
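As a quick sanity check of the base R one-liners in the message above, here is a minimal sketch that rebuilds the example data from the original post and confirms that the ave, duplicated, and diff approaches produce the same flag. The names first_ave, first_dup, first_diff, and agree are illustrative and do not come from the thread; the anonymous function is written as function(x) rather than \(x) only so that the sketch also runs on R versions before 4.1.

```r
# Example data from the original question
ID   <- c(rep(1, 10), rep(2, 6), rep(3, 2))
date <- c(rep(1, 2), rep(2, 2), rep(3, 2), rep(4, 2), rep(5, 2),
          rep(5, 3), rep(6, 3), rep(10, 2))
olddata <- data.frame(ID = ID, date = date)

# Three of the base R approaches discussed in the thread
first_ave  <- with(olddata, ave(seq_along(ID), ID, FUN = function(x) x == x[1L]))
first_dup  <- as.integer(!duplicated(olddata$ID))
first_diff <- c(1L, diff(olddata$ID))

# all.equal() tolerates the integer/double storage difference between results
agree <- isTRUE(all.equal(first_ave, first_dup)) &&
  isTRUE(all.equal(first_dup, first_diff))
print(agree)

# Attach the flag and show only the rows marked as first within their ID
olddata$first <- first_dup
print(olddata[olddata$first == 1L, ])
```

On this data the three results agree because the IDs are sorted and each new ID is exactly one larger than the previous one; the diff version depends on that property, while the duplicated version does not.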
Bert Gunter
2024-Dec-01 16:30 UTC
[R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows
Rui:
"Of these two, diff is faster. But of all the solutions posted so far, Ben Bolker's is the fastest."

But the explicit version of diff is still considerably faster:

> D <- c(rep(1,10),rep(2,6),rep(3,2))
> microbenchmark(c(1L,diff(D)), times = 1000L)
Unit: microseconds
           expr   min    lq    mean median    uq    max neval
 c(1L, diff(D)) 3.075 3.198 3.34396   3.28 3.362 29.684  1000
> microbenchmark( as.integer(!duplicated(D)), times =1000L)
Unit: microseconds
                       expr   min    lq     mean median   uq  max neval
 as.integer(!duplicated(D)) 1.476 1.558 1.644264  1.599 1.64 16.4  1000
> microbenchmark( D - c(0L, D[-length(D)]), times = 1000L)
Unit: nanoseconds ## note that unit is nanoseconds not microseconds
                     expr min  lq    mean median  uq  max neval
 D - c(0L, D[-length(D)]) 369 410 489.335    492 533 9840  1000

Cheers,
Bert
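One caveat worth illustrating: both c(1L, diff(D)) and the explicit D - c(0L, D[-length(D)]) return a 0/1 flag on the thread's data only because the IDs are sorted and step up by exactly 1 (and, for the explicit version, start at 1). The sketch below uses a hypothetical vector D2, not taken from the thread, to show what happens otherwise, and a neighbour comparison that keeps the same vectorised style. The names D2, explicit_diff, first_cmp, and first_dup are assumptions made for illustration.

```r
# Hypothetical IDs: sorted, but not increasing by exactly 1 between groups
D2 <- c(rep(10, 3), rep(25, 2), rep(40, 4))

# The explicit difference now returns the size of each jump, not a 0/1 flag
explicit_diff <- D2 - c(0L, D2[-length(D2)])
print(explicit_diff)

# Comparing each element with its predecessor yields a 0/1 flag instead;
# the leading TRUE marks the very first row as the start of a group
first_cmp <- as.integer(c(TRUE, D2[-1L] != D2[-length(D2)]))
print(first_cmp)

# For sorted IDs this matches the duplicated() approach from the thread
first_dup <- as.integer(!duplicated(D2))
print(identical(first_cmp, first_dup))
```

If the IDs are not sorted, the neighbour comparison marks the start of every run rather than the first occurrence of each ID, so duplicated remains the safer choice in that case.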