thr3ads.net - R help - [R] what is the effective method to apply the below logic for ~1.2 million records in R [Sep 2015]

Ravi Teja

2015-Sep-19 21:09 UTC

[R] what is the effective method to apply the below logic for ~1.2 million records in R

Hi,

I am trying to apply the below logic to generate flag_1 column on a data
set consisting of ~1.2 million records in R.

Code :

for(i in 1: nrows)
  {
              if(A$customer[i]==A$customer[i+1])
                {

                  if(is.na(A$Time_Diff[i]))
                     A$flag_1[i] <- 1
                     else if (A$Time_Diff[i] > 12)
                     A$flag_1[i] <- 1
                     else
                     A$flag_1[i] <- A$flag_1[i-1]+1

               }

            else
            {

              if(is.na(A$Time_Diff[i]))
                     A$flag_1[i] <- 1
                     else if (A$Time_Diff[i] > 12)
                     A$flag_1[i] <- 1
                     else
                     A$flag_1[i] <- A$flag_1[i-1]+1

               }
}


Resultant dataset should look like

Customer   Time_diff    flag_1
1                   NA           1
1                   10             2
1                    8              3
1                    15            1
1                    9               2
1                    10              3
2                     NA            1
2                      2               2
2                      5               3

The above logic will take approximately 60 hours to generate the flag_1
column on a dataset consisting of ~1.2 million records. Is there a more
efficient way to implement this logic in R?

Appreciate your help.

Thanks,
Ravi


David Winsemius

2015-Sep-20 02:25 UTC

[R] what is the effective method to apply the below logic for ~1.2 million records in R

On Sep 19, 2015, at 2:09 PM, Ravi Teja wrote:
> Hi,
> 
> I am trying to apply the below logic to generate flag_1 column on a data
> set consisting of ~1.2 million records in R.
> 
> Code :
> 
> for(i in 1: nrows)
>  {
>              if(A$customer[i]==A$customer[i+1])
>                {
> 
>                  if(is.na(A$Time_Diff[i]))
>                     A$flag_1[i] <- 1
>                     else if (A$Time_Diff[i] > 12)
>                     A$flag_1[i] <- 1
>                     else
>                     A$flag_1[i] <- A$flag_1[i-1]+1
> 
>               }
> 
>            else
>            {
> 
>              if(is.na(A$Time_Diff[i]))
>                     A$flag_1[i] <- 1
>                     else if (A$Time_Diff[i] > 12)
>                     A$flag_1[i] <- 1
>                     else
>                     A$flag_1[i] <- A$flag_1[i-1]+1
> 
>               }
> }
The inner logic of the consequent and the alternative appears identical.
Vectorized approaches would surely be faster. You should post some code that
matches the data. In R, customer is not the same as Customer, and Time_diff is
not Time_Diff, and my patience for this code review has expired.

Post the output from the following, and do include code to create `nrows`:

 dput( head (A, 20) )

> 
> Resultant dataset should look like
> 
> Customer   Time_diff    flag_1
> 1                   NA           1
> 1                   10             2
> 1                    8              3
> 1                    15            1
> 1                    9               2
> 1                    10              3
> 2                     NA            1
> 2                      2               2
> 2                      5               3
> 
> The above logic will take approximately 60 hours to generate the flag_1
> column on a dataset consisting of ~1.2 million records. Is there any
> effective way in R to implement this logic in R ?
> 
> Appreciate your help.
> 
> Thanks,
> Ravi
> 
> 	[[alternative HTML version deleted]]
AND R-help is a plain text only mailing list.
David Winsemius
Alameda, CA, USA
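
The request above can be illustrated with a minimal sketch. The data frame below is a hypothetical reconstruction of the table in the question (the real data were never posted), using the column names from the question's code rather than from its table. It shows how `nrows` might be defined and what `dput()` is for.

```r
# Hypothetical reconstruction of the posted table; the real data set
# was not shared on the list.
A <- data.frame(
  customer  = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  Time_Diff = c(NA, 10, 8, 15, 9, 10, NA, 2, 5)
)

# The loop in the question uses `nrows` without ever defining it.
nrows <- nrow(A)

# dput() prints R code that recreates the object, so a reader can paste
# the data straight into their own session.
dput(head(A, 20))
```

Pasting the `dput()` output into a message lets others reproduce the exact column names and types, which is what the mismatch between `customer` and `Customer` made impossible here.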

Ista Zahn

2015-Sep-20 02:48 UTC

[R] what is the effective method to apply the below logic for ~1.2 million records in R

This assumes that the data are sorted by customer, and that only the
first value of Time_Diff is missing for each customer (and that the
first value is always missing for each customer). If those assumptions
hold you can do something like

A <- read.table(text = "customer   Time_Diff    flag_1
1                   NA           1
1                   10           2
1                    8           3
1                   15           1
1                    9           2
1                   10           3
2                   NA           1
2                    2           2
2                    5           3",
header = TRUE)

A$flag_1 <- NULL

library(data.table)

A <- as.data.table(A)
A[ , g15 := cumsum(c(0, ifelse(is.na(diff(Time_Diff > 12)), 0,
diff(Time_Diff > 12) > 0)))]
## I'm not proud of the previous line, probably there is a cleaner way
A[ , flag_1 := 1:.N, by = c("customer", "g15")]
A[ , g15 := NULL]

Best,
Ista
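
The assumptions stated at the top of this message can be checked before relying on the approach. The following is a base R sketch only, run on a hypothetical copy of the sample data; the variable names `sorted_by_customer`, `first_row` and `na_only_on_first` are made up for illustration.

```r
# Hypothetical copy of the sample data from the message above.
A <- data.frame(
  customer  = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  Time_Diff = c(NA, 10, 8, 15, 9, 10, NA, 2, 5)
)

# Assumption 1: the data are sorted by customer.
sorted_by_customer <- !is.unsorted(A$customer)

# Mark the first row of each customer (duplicated() is FALSE there).
first_row <- !duplicated(A$customer)

# Assumption 2: Time_Diff is missing on exactly those rows and nowhere else.
na_only_on_first <- identical(is.na(A$Time_Diff), first_row)

# Stop early if either assumption fails.
stopifnot(sorted_by_customer, na_only_on_first)
```

If either check fails on the full data set, the grouping trick is not guaranteed to give the intended counter.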


Jim Lemon

2015-Sep-20 03:31 UTC

[R] what is the effective method to apply the below logic for ~1.2 million records in R

Hi Ravi,
Try this:

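# column names follow the code in the question (customer, Time_Diff)
current_customer<-0
for(row in 1:nrow(A)) {
 if(current_customer == A$customer[row]) {
  # same customer: restart the count when Time_Diff is missing or > 12
  if(is.na(A$Time_Diff[row]) || A$Time_Diff[row] > 12) A$flag_1[row]<-1
  else A$flag_1[row]<-A$flag_1[row-1]+1
 }
 else {
  # first row of a new customer
  current_customer<-A$customer[row]
  A$flag_1[row]<-1
 }
}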

Jim
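
A loop of this shape is often slow mainly because it assigns into a data frame column one element at a time. A possible variation, sketched here on hypothetical sample data, pulls the columns out as plain vectors, fills a preallocated result vector, and writes it back once. The names `customer`, `time_diff`, `flag` and `restart` are illustrative, and whether this is fast enough for ~1.2 million rows has not been measured here.

```r
# Hypothetical sample data using the column names from the question's code.
A <- data.frame(
  customer  = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  Time_Diff = c(NA, 10, 8, 15, 9, 10, NA, 2, 5)
)

# Work on plain vectors instead of indexing the data frame inside the loop.
customer  <- A$customer
time_diff <- A$Time_Diff
n         <- nrow(A)
flag      <- integer(n)   # preallocated result

for (i in seq_len(n)) {
  # A new customer starts at row 1 or whenever the id changes.
  new_customer <- i == 1L || customer[i] != customer[i - 1L]
  # The counter also restarts on a missing or large Time_Diff.
  restart <- is.na(time_diff[i]) || time_diff[i] > 12
  flag[i] <- if (new_customer || restart) 1L else flag[i - 1L] + 1L
}

# Assign the finished vector to the data frame in a single step.
A$flag_1 <- flag
```

The logic is unchanged from the loop above; only the place where the intermediate values live is different.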


Ista Zahn

2015-Sep-20 12:42 UTC

[R] what is the effective method to apply the below logic for ~1.2 million records in R

Hi Ravi,

Did you try fixing the problem? What did you try and what went wrong?

The answer is probably

A <- as.data.table(A)
A[ , g15 := cumsum(ifelse(is.na(Time_Diff > 12), 0, Time_Diff > 12))]
A[ , flag_1 := 1:.N, by = c("customer", "g15")]
A[ , g15 := NULL]

but you would have learned more if you had at least tried getting
there yourself.

Best,
Ista
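
For readers without data.table, the same grouping idea can be sketched in base R. This is an illustration rather than the code from the message: the sample data are hypothetical and deliberately contain two successive rows with Time_Diff > 12, and the helper names `restart` and `g` are made up. As above, it assumes the data are sorted by customer and that Time_Diff is missing only on each customer's first row.

```r
# Hypothetical data with two successive rows above 12 (rows 4 and 5).
A <- data.frame(
  customer  = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  Time_Diff = c(NA, 10, 8, 15, 20, 10, NA, 2, 5)
)

# A row restarts the counter when Time_Diff exceeds 12. NA counts as
# "no restart" because a customer's first row already opens a new group.
restart <- !is.na(A$Time_Diff) & A$Time_Diff > 12

# Each restart increases the running group id by one.
g <- cumsum(restart)

# Number the rows within every (customer, group id) combination.
A$flag_1 <- ave(seq_along(g), A$customer, g, FUN = seq_along)

print(A)
```

Because every row above 12 opens its own group, consecutive large values each get flag_1 equal to 1 instead of being merged into one run.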

On Sun, Sep 20, 2015 at 6:19 AM, Ravi Teja <raviteja2504 at gmail.com> wrote:
> Hi Ista.
>
> Thanks a ton for the response and your assumptions were right.
>
> If the Time_Diff is missing then the flag_1 value should be 1
> If the Time_Diff is > 12 then the flag_1 value should be 1
> If the Time_Diff is <= 12 then the flag_1 value should be flag_1[i-1] + 1,
> where i is the current row
>
> When I tried to apply the logic you had shared, the results are deviating
> from the expected results.
>
> I think the logic you had shared will not function if there are two
> successive rows with Time_Diff values > 12
>
> I have attached a sample of my original data set and the expected flag_1
> column to this mail.
>
> Please help in tweaking your code to generate the attached result.
>
> Awaiting your reply
>
> Thanks,
> Ravi
