Bert Gunter
2024-Dec-01 16:30 UTC
[R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows
Rui:

"Of these two, diff is faster. But of all the solutions posted so far,
Ben Bolker's is the fastest."

But the explicit version of diff is still considerably faster:

> D <- c(rep(1,10),rep(2,6),rep(3,2))
> microbenchmark(c(1L,diff(D)), times = 1000L)
Unit: microseconds
           expr   min    lq    mean median    uq    max neval
 c(1L, diff(D)) 3.075 3.198 3.34396   3.28 3.362 29.684  1000
> microbenchmark( as.integer(!duplicated(D)), times =1000L)
Unit: microseconds
                       expr   min    lq     mean median   uq  max neval
 as.integer(!duplicated(D)) 1.476 1.558 1.644264  1.599 1.64 16.4  1000
> microbenchmark( D - c(0L, D[-length(D)]), times = 1000L)
Unit: nanoseconds  ## note that unit is nanoseconds not microseconds
                     expr min  lq    mean median  uq  max neval
 D - c(0L, D[-length(D)]) 369 410 489.335    492 533 9840  1000

Cheers,
Bert

On Sat, Nov 30, 2024 at 11:05 PM Rui Barradas <ruipbarradas at sapo.pt> wrote:
>
> At 02:27 on 01/12/2024, Sorkin, John wrote:
> > Dear R help folks,
> >
> > First, my apologies for sending several related questions to the list server. I am trying to learn how to manipulate data in R . . . and am having difficulty getting my program to work. I greatly appreciate the help and support list members give!
> >
> > I am trying to write a program that will run through a data frame organized by ID and, for the first line of each new group of data lines that has the same ID, create a new variable first that will be 1 for the first line of the group and 0 for all other lines.
> >
> > e.g. if my original data is
> > olddata
> > ID date
> >  1    1
> >  1    1
> >  1    2
> >  1    2
> >  1    3
> >  1    3
> >  1    4
> >  1    4
> >  1    5
> >  1    5
> >  2    5
> >  2    5
> >  2    5
> >  2    6
> >  2    6
> >  2    6
> >  3   10
> >  3   10
> >
> > the new data will be
> > newdata
> > ID date first
> >  1    1     1
> >  1    1     0
> >  1    2     0
> >  1    2     0
> >  1    3     0
> >  1    3     0
> >  1    4     0
> >  1    4     0
> >  1    5     0
> >  1    5     0
> >  2    5     1
> >  2    5     0
> >  2    5     0
> >  2    6     0
> >  2    6     0
> >  2    6     0
> >  3   10     1
> >  3   10     0
> >
> > When I run the program below, I receive the following error:
> > Error in df[, "ID"] : incorrect number of dimensions
> >
> > My code:
> > # Create data.frame
> > ID <- c(rep(1,10),rep(2,6),rep(3,2))
> > date <- c(rep(1,2),rep(2,2),rep(3,2),rep(4,2),rep(5,2),
> >           rep(5,3),rep(6,3),rep(10,2))
> > olddata <- data.frame(ID=ID,date=date)
> > class(olddata)
> > cat("This is the original data frame","\n")
> > print(olddata)
> >
> > # This function is supposed to identify the first row
> > # within each level of ID and, for the first row, set
> > # the variable first to 1, and for all rows other than
> > # the first row set first to 0.
> > mydoit <- function(df){
> >   value <- ifelse (first(df[,"ID"]),1,0)
> >   cat("value=",value,"\n")
> >   df[,"first"] <- value
> > }
> > newdata <- aggregate(olddata,list(olddata[,"ID"]),mydoit)
> >
> > Thank you,
> > John
> >
> > John David Sorkin M.D., Ph.D.
> > Professor of Medicine, University of Maryland School of Medicine;
> > Associate Director for Biostatistics and Informatics, Baltimore VA Medical Center Geriatrics Research, Education, and Clinical Center;
> > PI Biostatistics and Informatics Core, University of Maryland School of Medicine Claude D. Pepper Older Americans Independence Center;
> > Senior Statistician University of Maryland Center for Vascular Research;
> >
> > Division of Gerontology and Palliative Care,
> > 10 North Greene Street
> > GRECC (BT/18/GR)
> > Baltimore, MD 21201-1524
> > Cell phone 443-418-5382
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
> Hello,
>
> And here are two other solutions.
>
> olddata$first <- with(olddata, ave(seq_along(ID), ID, FUN = \(x) x == x[1L]))
>
> olddata$first <- c(1L, diff(olddata$ID))
>
> Of these two, diff is faster. But of all the solutions posted so far,
> Ben Bolker's is the fastest. And it can be made a little faster if
> as.integer substitutes for as.numeric.
> And dplyr::mutate now has a .by argument, which avoids the explicit call
> to group_by, with a performance gain.
>
> library(dplyr)  # needed for %>%, group_by() and mutate()
> library(microbenchmark)
>
> mb <- microbenchmark(
>   ave = with(olddata, ave(seq_along(ID), ID, FUN = \(x) x == x[1L])),
>   dup_num = as.numeric(! duplicated(olddata$ID)),
>   dup_int = as.integer(! duplicated(olddata$ID)),
>   diff = c(1L, diff(olddata$ID)),
>   dplyr_grp = olddata %>% group_by(ID) %>% mutate(first = as.integer(row_number() == 1)),
>   dplyr = olddata %>% mutate(first = as.integer(row_number() == 1), .by = ID)
> )
> print(mb, order = "median")
>
> However, note that dplyr operates on entire data.frames and therefore is
> expected to be slower when tested against instructions that process one
> column only.
>
> Hope this helps,
>
> Rui Barradas
>
> --
> This e-mail has been checked for viruses by AVG antivirus software.
> www.avg.com
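A note on the benchmark above: both c(1L, diff(D)) and D - c(0L, D[-length(D)]) return 0/1 only because the example IDs start at 1 and go up by exactly 1 from one group to the next. The sketch below, which assumes nothing beyond base R, compares the shifted subtraction with a neighbour comparison that does not depend on the ID values. The helper names first_by_shift and first_by_compare and the vector G are made up for illustration and do not come from the thread.

```r
# Example IDs from the thread.
D <- c(rep(1, 10), rep(2, 6), rep(3, 2))

# Shifted subtraction, as benchmarked above. It yields 0/1 only when
# the IDs start at 1 and increase by exactly 1 between groups.
first_by_shift <- function(id) id - c(0L, id[-length(id)])

# Neighbour comparison: flags every position where the ID differs from
# the previous one, whatever the actual ID values are.
first_by_compare <- function(id) {
  as.integer(c(TRUE, id[-1L] != id[-length(id)]))
}

# On the thread's data all approaches agree.
stopifnot(all(first_by_shift(D) == first_by_compare(D)))
stopifnot(identical(first_by_compare(D), as.integer(!duplicated(D))))

# Hypothetical IDs with gaps: the subtraction no longer gives 0/1.
G <- c(10, 10, 25, 25, 25, 40)
first_by_shift(G)    # 10  0 15  0  0 15
first_by_compare(G)  #  1  0  1  0  0  1
```

The neighbour comparison marks the start of every run, so if an ID reappears later in unsorted data it is flagged again, whereas !duplicated() flags only the first occurrence overall. Which behaviour is wanted depends on whether the data frame is really sorted by ID.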
Sorkin, John
2024-Dec-02 04:18 UTC
[R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows
Dear Colleagues,

I am grateful to all of you for helping me with my question: how to write R code that will identify the first row of each ID within a data frame and create a variable first=1 for the first row and first=0 for all repeats of the ID.

WOW!!! I just saw Boris Steipe's answer to my question:

olddata$first <- as.numeric(! duplicated(olddata$ID))

The solution is elegant, short, easy to understand, and it uses base R! All important characteristics of a good solution, at least for me. While I want to learn solutions using packages that extend base R, I believe that a good programmer learns how to do something using the base language and, once that is learned, explores ways to solve a programming problem using advanced packages.

Each and every one of you (I hope I did not miss anyone in my list of email addresses) took the time to read my emails and respond to me. Your collective help is invaluable, and I am in your collective debt.

Many, many thanks,
John

John David Sorkin M.D., Ph.D.
Professor of Medicine, University of Maryland School of Medicine;
Associate Director for Biostatistics and Informatics, Baltimore VA Medical Center Geriatrics Research, Education, and Clinical Center;
PI Biostatistics and Informatics Core, University of Maryland School of Medicine Claude D. Pepper Older Americans Independence Center;
Senior Statistician University of Maryland Center for Vascular Research;

Division of Gerontology and Palliative Care,
10 North Greene Street
GRECC (BT/18/GR)
Baltimore, MD 21201-1524
Cell phone 443-418-5382
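For readers who want to run the one-liner end to end, here is a minimal sketch that rebuilds the example data from the original question and applies it. It uses as.integer() rather than as.numeric(), as suggested earlier in the thread. The name newdata is taken from the question, and nothing beyond base R is assumed.

```r
# Rebuild the example data from the original question.
ID   <- c(rep(1, 10), rep(2, 6), rep(3, 2))
date <- c(rep(1, 2), rep(2, 2), rep(3, 2), rep(4, 2), rep(5, 2),
          rep(5, 3), rep(6, 3), rep(10, 2))
olddata <- data.frame(ID = ID, date = date)

# duplicated() is FALSE the first time a value appears and TRUE for
# every later occurrence, so negating it marks the first row of each ID.
newdata <- olddata
newdata$first <- as.integer(!duplicated(newdata$ID))

# The flagged rows are the first row of each ID: rows 1, 11 and 17.
newdata[newdata$first == 1L, ]
```

As for the error in the original attempt: aggregate() passes FUN one column of each group at a time as a plain vector, not a data frame, so df[, "ID"] inside mydoit has no second dimension to index.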
avi.e.gross at gmail.com
2024-Dec-02 05:39 UTC
[R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows
John,

Thanks for enlightening us so we better understand.

I won't argue with your wish to learn to do things in base R first. I started that way myself, and found lots of the commands not particularly easy to fit into a single worldview. Many functions I read about were promptly forgotten, especially those without great documentation and without enough examples of real-world usage.

This is why some packages that came later are important, as they generally try to come up with a somewhat consistent set of tools that often are also faster and more flexible. There is often a set of reasons various packages are created in the first place to meet real needs. And, I note that some may be subtle. Original R was often inconsistent in the order of command arguments, while dplyr and other tidyverse commands try as much as possible to make the first argument be the one normally passed through a pipeline. R fairly recently added a native pipe operator that may be faster than the magrittr pipe but in some ways makes some functionality harder. The rest of R has not really been changed to make using commands in pipelines easy.

You seem to have also looked at data.table, and given you may have large amounts of data, it may be designed in ways that might also be beneficial.

But as I do not want to relearn lots of R functions I never use, I will bow out from further discussion, as what I would offer these days would probably not be what you want. My personal opinion is that proper use of R can actually be far easier and more flexible than what you had with the proprietary software, which may largely consist of canned reports.

I do want to point out a few things to consider. When you go grouping, you may want to consider grouping (as well as sorting) by multiple variables. You mention a variable with about 500 possibilities and then another variable with an ID number, but did not say the ID number was unique across them all.

And, I want to note you may want to also look into testing the sanity of your data. That is a wide area too. Things like duplicates, for example.

I do not know how many steps you can handle, but there are sometimes designs that make an algorithm work differently. Consider your request to find the first row in each grouping and add a column with a 1, and 0 for all others. If that is what you need, fine. But what if instead you just added a row number? Some rows would have a 1, and some may have a 2, 3, or 4. When you wanted to do something to just the rows with a 1, you could filter out a subset of the data easily enough or apply a command only to those rows. But if you want to test if any entry has more than 4 rows, this could allow you to detect an error. Other ideas might be possible if that is how the data was saved.

And, if it really is a 0/1 choice, fine, but consider the advantages or disadvantages of what you save in the new column. Storing a numeric or an int can take up space when storing a Boolean or TRUE/FALSE is what you need. R gives you lots of flexibility which perhaps you did not have to think about before.

All I know is that so much of what you want to do is easily enough done with a pipeline or two in dplyr. But this is your task and you choose what makes sense. It specializes in group analysis and generates reports and so on. It may not be how you think.
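The suggestions above (group by more than one variable, store a within-group row number instead of a 0/1 flag, and use it as a sanity check) can be sketched in base R as follows. This is only an illustration under stated assumptions: the data frame dat, its column site, and the "more than 4 rows" threshold are hypothetical stand-ins, since the real variables were not posted.

```r
# Hypothetical data: `site` plays the role of the ~500-level variable,
# and ID is NOT unique across sites.
dat <- data.frame(
  site = c("A", "A", "A", "B", "B", "B", "B", "B"),
  ID   = c(1, 1, 2, 1, 1, 1, 1, 1)
)

# One grouping key per site/ID combination; drop = TRUE discards
# combinations that never occur in the data.
key <- interaction(dat$site, dat$ID, drop = TRUE)

# Row number within each group instead of a 0/1 flag.
dat$row_in_group <- ave(seq_along(key), key, FUN = seq_along)

# The first-row flag is recoverable from the row number; kept as logical.
dat$first <- dat$row_in_group == 1L

# Same flag straight from duplicated() on both grouping columns.
stopifnot(identical(dat$first, !duplicated(dat[c("site", "ID")])))

# Sanity check: which site/ID groups have more than 4 rows?
too_big <- unique(dat[dat$row_in_group > 4L, c("site", "ID")])
too_big  # site B, ID 1
```

On storage: in R a logical vector takes 4 bytes per element, the same as an integer vector, while a double (what as.numeric() returns) takes 8. So TRUE/FALSE or as.integer() saves space relative to as.numeric(), but logical does not save space relative to integer.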