Ravi Teja
2015-Sep-19 21:09 UTC
[R] what is the effective method to apply the below logic for ~1.2 million records in R
Hi,

I am trying to apply the logic below to generate the flag_1 column on a data set consisting of ~1.2 million records in R.

Code:

    for(i in 1: nrows)
    {
      if(A$customer[i]==A$customer[i+1])
      {
        if(is.na(A$Time_Diff[i]))
          A$flag_1[i] <- 1
        else if (A$Time_Diff[i] > 12)
          A$flag_1[i] <- 1
        else
          A$flag_1[i] <- A$flag_1[i-1]+1
      }
      else
      {
        if(is.na(A$Time_Diff[i]))
          A$flag_1[i] <- 1
        else if (A$Time_Diff[i] > 12)
          A$flag_1[i] <- 1
        else
          A$flag_1[i] <- A$flag_1[i-1]+1
      }
    }

The resulting dataset should look like:

    Customer  Time_diff  flag_1
    1         NA         1
    1         10         2
    1         8          3
    1         15         1
    1         9          2
    1         10         3
    2         NA         1
    2         2          2
    2         5          3

The above logic will take approximately 60 hours to generate the flag_1 column on a dataset consisting of ~1.2 million records. Is there a more efficient way to implement this logic in R?

Appreciate your help.

Thanks,
Ravi

[[alternative HTML version deleted]]
David Winsemius
2015-Sep-20 02:25 UTC
[R] what is the effective method to apply the below logic for ~1.2 million records in R
On Sep 19, 2015, at 2:09 PM, Ravi Teja wrote:

> Hi,
>
> I am trying to apply the below logic to generate flag_1 column on a data
> set consisting of ~1.2 million records in R.
>
> Code :
>
> for(i in 1: nrows)
> {
>   if(A$customer[i]==A$customer[i+1])
>   {
>     if(is.na(A$Time_Diff[i]))
>       A$flag_1[i] <- 1
>     else if (A$Time_Diff[i] > 12)
>       A$flag_1[i] <- 1
>     else
>       A$flag_1[i] <- A$flag_1[i-1]+1
>   }
>   else
>   {
>     if(is.na(A$Time_Diff[i]))
>       A$flag_1[i] <- 1
>     else if (A$Time_Diff[i] > 12)
>       A$flag_1[i] <- 1
>     else
>       A$flag_1[i] <- A$flag_1[i-1]+1
>   }
> }

The inner logic of the consequent and the alternative appears identical. Vectorized approaches would surely be faster. You should post some code that matches the data. In R, customer is not the same as Customer, and Time_diff is not Time_Diff, and my patience for this code review has expired.

Post the output from the following, and do include code to create `nrows`:

    dput( head (A, 20) )

> Resultant dataset should look like
>
> Customer Time_diff flag_1
> 1 NA 1
> 1 10 2
> 1 8 3
> 1 15 1
> 1 9 2
> 1 10 3
> 2 NA 1
> 2 2 2
> 2 5 3
>
> The above logic will take approximately 60 hours to generate the flag_1
> column on a dataset consisting of ~1.2 million records. Is there any
> effective way in R to implement this logic in R ?
>
> Appreciate your help.
>
> Thanks,
> Ravi
>
> [[alternative HTML version deleted]]

AND R-help is a plain text only mailing list.

David Winsemius
Alameda, CA, USA
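A reproducible example of the kind David asks for might look like the sketch below. It assumes the lower-case/upper-case spellings used in the loop (`customer`, `Time_Diff`) are the real column names, and it rebuilds the nine sample rows by hand, so the data frame is illustrative rather than the poster's actual data. It also shows one further problem with the posted loop that is easy to check: on the last row, `A$customer[i+1]` reads past the end of the column.

```r
# Sample rows from the original post, with column names spelled as in the
# loop (an assumption; the posted table used Customer / Time_diff).
A <- data.frame(
  customer  = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  Time_Diff = c(NA, 10, 8, 15, 9, 10, NA, 2, 5)
)
nrows <- nrow(A)

# The structure David asked to see.
dput(head(A, 20))

# Indexing one past the end of a vector returns NA, so the comparison made
# by the loop on its final iteration is NA, and `if (NA)` stops with an error.
last_cmp <- A$customer[nrows] == A$customer[nrows + 1]
is.na(last_cmp)
```

Since both branches of the customer comparison do the same thing, dropping that comparison would remove the out-of-range read without changing the result of the posted logic.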
Ista Zahn
2015-Sep-20 02:48 UTC
[R] what is the effective method to apply the below logic for ~1.2 million records in R
This assumes that the data are sorted by customer, and that only the first value of Time_Diff is missing for each customer (and that the first value is always missing for each customer). If those assumptions hold you can do something like

    A <- read.table(text = "customer Time_Diff flag_1
    1 NA 1
    1 10 2
    1 8 3
    1 15 1
    1 9 2
    1 10 3
    2 NA 1
    2 2 2
    2 5 3",
    header = TRUE)

    A$flag_1 <- NULL

    library(data.table)

    A <- as.data.table(A)
    A[ , g15 := cumsum(c(0, ifelse(is.na(diff(Time_Diff > 12)), 0,
                                   diff(Time_Diff > 12) > 0)))]
    ## I'm not proud of the previous line, probably there is a cleaner way
    A[ , flag_1 := 1:.N, by = c("customer", "g15")]
    A[ , g15 := NULL]

Best,
Ista
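To see what the `g15` line in Ista's reply is doing, it can help to compute the intermediate vectors one at a time. The sketch below does this in base R on the sample data; the names `over`, `step`, `rise`, `td2` and `g15_2` are introduced here for illustration, and `ave()` is used as a base R stand-in for the data.table grouping step, not as Ista's method. The last two lines check a case with two successive values above 12, where the difference-based run id never increases.

```r
# Sample data from the post; column names follow the read.table call above.
customer  <- c(1, 1, 1, 1, 1, 1, 2, 2, 2)
Time_Diff <- c(NA, 10, 8, 15, 9, 10, NA, 2, 5)

over <- Time_Diff > 12                     # NA wherever Time_Diff is missing
step <- diff(over)                         # +1 where a row crosses above 12
rise <- ifelse(is.na(step), 0, step > 0)   # treat NA steps as "no new run"
g15  <- cumsum(c(0, rise))                 # run id: 0 0 0 1 1 1 1 1 1

# Base R stand-in for `flag_1 := 1:.N, by = c("customer", "g15")`:
# count rows within each (customer, run id) combination.
flag_1 <- ave(seq_along(g15), customer, g15, FUN = seq_along)

# Two successive values above 12: no upward step between them, so the
# run id stays at 0 for every row and the counter would not restart.
td2   <- c(NA, 15, 20, 5)
g15_2 <- cumsum(c(0, ifelse(is.na(diff(td2 > 12)), 0, diff(td2 > 12) > 0)))
```

On the sample data `flag_1` matches the expected column from the original post, which is consistent with the stated assumptions; the `td2` case suggests the approach depends on values above 12 never being adjacent.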
Jim Lemon
2015-Sep-20 03:31 UTC
[R] what is the effective method to apply the below logic for ~1.2 million records in R
Hi Ravi,
Try this:

    current_customer<-0
    for(row in 1:dim(A)[1]) {
     if(current_customer == A$Customer[row]) {
      if(A$Time_Diff[row] > 12) A$flag_1[row]<-1
      else A$flag_1[row]<-A$flag_1[row-1]+1
     }
     else {
      current_customer<-A$Customer[row]
      A$flag_1[row]<-1
     }
    }

Jim
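A loop of the kind Jim suggests can usually be made cheaper by working on plain vectors and assigning the result to the data frame once, since modifying a data frame column on every iteration tends to be slow. The following is a sketch under assumptions, not Jim's code: `flag_by_loop` is a hypothetical helper name, it compares each row with the previous row instead of tracking `current_customer`, and it treats a missing `Time_Diff` as a restart so that an `NA` in the middle of a customer's rows does not stop the loop.

```r
# Hypothetical helper: row-by-row counter on plain vectors.
flag_by_loop <- function(customer, time_diff) {
  n <- length(customer)
  flag <- integer(n)                      # preallocate the result
  for (row in seq_len(n)) {
    same_customer <- row > 1L && customer[row] == customer[row - 1L]
    restart <- is.na(time_diff[row]) || time_diff[row] > 12
    if (same_customer && !restart) {
      flag[row] <- flag[row - 1L] + 1L    # continue the current run
    } else {
      flag[row] <- 1L                     # new customer, NA, or gap > 12
    }
  }
  flag
}

# Sample data from the original post.
customer  <- c(1, 1, 1, 1, 1, 1, 2, 2, 2)
Time_Diff <- c(NA, 10, 8, 15, 9, 10, NA, 2, 5)
flag_1 <- flag_by_loop(customer, Time_Diff)
```

With a data frame `A`, the call would be `A$flag_1 <- flag_by_loop(A$customer, A$Time_Diff)`, again assuming those are the actual column names.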
Ista Zahn
2015-Sep-20 12:42 UTC
[R] what is the effective method to apply the below logic for ~1.2 million records in R
Hi Ravi,

Did you try fixing the problem? What did you try and what went wrong? The answer is probably

    A <- as.data.table(A)
    A[ , g15 := cumsum(ifelse(is.na(Time_Diff > 12), 0, Time_Diff > 12))]
    A[ , flag_1 := 1:.N, by = c("customer", "g15")]
    A[ , g15 := NULL]

but you would have learned more if you had at least tried getting there yourself.

Best,
Ista

On Sun, Sep 20, 2015 at 6:19 AM, Ravi Teja <raviteja2504 at gmail.com> wrote:
> Hi Ista.
>
> Thanks a ton for the response and your assumptions were right.
>
> If the Time_Diff is missing then flag_1 value should be 1
> if the Time_Diff is > 12 then flag_1 value should be 1
> if the Time_Diff is < 12 the flag_1 value should be (if the current row is i
> then flag_1 value should be (flag_1[i-1] + 1) )
>
> When I tried to apply the logic you had shared, the results are deviating
> from the expected results.
>
> I think the logic you had shared will not function if there are two
> successive rows with Time_Diff values > 12
>
> I have attached a sample of my original data set and the expected flag_1
> column to this mail.
>
> Please help in tweaking your code to generate the attached result.
>
> Awaiting your reply
>
> Thanks,
> Ravi
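For readers without data.table, the same idea (mark every row where the counter restarts, then count rows since the most recent restart) can be sketched in base R with `cummax()`. This is an alternative formulation, not the code from the thread: `flag_runs` is a hypothetical name, the data are assumed to be sorted by customer with no missing customer ids, and a missing `Time_Diff` is treated as a restart wherever it occurs.

```r
# Hypothetical helper: fully vectorized run counter.
flag_runs <- function(customer, time_diff) {
  n <- length(time_diff)
  if (n == 0L) return(integer(0))
  # TRUE on the first row of each customer (data assumed sorted by customer).
  new_customer <- c(TRUE, customer[-1] != customer[-n])
  # A run restarts on a new customer, a missing value, or a gap above 12.
  # TRUE | NA is TRUE, so the NA from `time_diff > 12` is absorbed.
  start <- new_customer | is.na(time_diff) | time_diff > 12
  idx <- seq_len(n)
  # Row index of the most recent restart, carried forward.
  last_start <- cummax(idx * start)
  idx - last_start + 1L
}

# Sample data from the original post.
customer  <- c(1, 1, 1, 1, 1, 1, 2, 2, 2)
Time_Diff <- c(NA, 10, 8, 15, 9, 10, NA, 2, 5)
flag_1 <- flag_runs(customer, Time_Diff)

# Case raised in the follow-up: two successive values above 12.
flag_2 <- flag_runs(c(1, 1, 1, 1), c(NA, 15, 20, 5))
```

Because every step is a single pass over the vectors, this form should scale to the ~1.2 million rows mentioned in the original post without an explicit loop, though that has not been timed here.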