thr3ads.net - R help - [R] Programming R to avoid loops [Apr 2015]

Brant Inman

2015-Apr-18 04:14 UTC

[R] Programming R to avoid loops

I have two large data frames with the following structure:
> df1
  id       date test1.result
1  a 2009-08-28      1
2  a 2009-09-16      1
3  b 2008-08-06      0
4  c 2012-02-02      1
5  c 2010-08-03      1
6  c 2012-08-02      0
> df2
  id       date test2.result
1  a 2011-02-03      1
2  b 2011-09-27      0
3  b 2011-09-01      1
4  c 2009-07-16      0
5  c 2009-04-15      0
6  c 2010-08-10      1

I need to match items in df2 to those in df1 with specific matching criteria. I
have written a looped matching algorithm that works, but it is very slow with my
large datasets. I am requesting help on making a version of this code that is
faster and "vectorized", so to speak.

My algorithm is currently something like this code. It works but is damn slow.

findTestPairs <- function(test1, id1, date1, test2, id2, date2, predays=-30,
                          lagdays=30){
  # Function to find, within subjects, two tests that occur within a timeframe
  #
  # test1 = the reference test result for which matching second tests are sought
  # test2 = the second test result
  # date1 = the date of test1
  # date2 = the date of test2
  # id1   = unique identifier for subject undergoing test 1
  # id2   = unique identifier for subject undergoing test 2
  # predays  = maximum number of days prior to test1 date that test2 date might
  #            occur
  # lagdays  = maximum number of days after test1 date that test2 date might
  #            occur

  result <- data.frame(matrix(ncol=5, nrow=length(test1)))
  colnames(result) <- c('id','test1','date','test2count','test2lag.result')
  result$id    <- id1
  result$test1 <- test1
  result$date  <- date1

  for(i in seq_along(test1)){
    l <- 0    # Counter of test2 results that match test1 within lag interval
    m <- NA   # Indicator of positive test2 within lag interval

    for(j in seq_along(test2)){
      if(id1[i] == id2[j]){               # STEP1: Match IDs
        interval <- date2[j] - date1[i]
        intmatch <- ifelse(interval >= predays && interval <= lagdays, 1, 0)

        if(intmatch == 1){                # STEP2: Does test2 fall within lag interval?
          l <- l+1                        # If test2 within lag interval, count it

          if(test2[j] == 1) {             # STEP3: Is test 2 positive?
            m <- 1                        # If test2 is positive, set indicator to 1
          } else {
            m <- 0
          }
        }
      }
    }
    result$test2count[i] <- l
    result$test2lag.result[i] <- m
  }
  return(result)
}

I would appreciate help on building a faster matching algorithm. I am pretty
certain that R functions can be used to do this but I do not have a good grasp
of how to make it work.

Brant Inman
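
A possible intermediate step, short of full vectorization, is to keep the loop
over the test1 rows but replace the inner loop over test2 with logical indexing.
The following is only a minimal sketch under stated assumptions: the name
findTestPairsVec is hypothetical, the dates are assumed to be Date objects, and
the indicator is set to 1 if any test2 in the window is positive (the nested
loop above keeps the value from the last matching test2 instead).

```r
# Example data from the post, with dates stored as Date objects
df1 <- data.frame(
  id = c("a", "a", "b", "c", "c", "c"),
  date = as.Date(c("2009-08-28", "2009-09-16", "2008-08-06",
                   "2012-02-02", "2010-08-03", "2012-08-02")),
  test1.result = c(1, 1, 0, 1, 1, 0),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  id = c("a", "b", "b", "c", "c", "c"),
  date = as.Date(c("2011-02-03", "2011-09-27", "2011-09-01",
                   "2009-07-16", "2009-04-15", "2010-08-10")),
  test2.result = c(1, 0, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one pass over test1, no inner loop over test2
findTestPairsVec <- function(test1, id1, date1, test2, id2, date2,
                             predays = -30, lagdays = 30) {
  id1 <- as.character(id1)   # avoid factor-level mismatches when comparing ids
  id2 <- as.character(id2)
  n <- length(test1)
  count <- integer(n)
  positive <- rep(NA_integer_, n)
  for (i in seq_len(n)) {
    lag <- as.numeric(date2 - date1[i])            # days from test1 to each test2
    hit <- id2 == id1[i] & lag >= predays & lag <= lagdays
    count[i] <- sum(hit)
    if (count[i] > 0) {
      # Assumption: "positive" means any test2 in the window equals 1
      positive[i] <- as.integer(any(test2[hit] == 1))
    }
  }
  data.frame(id = id1, test1 = test1, date = date1,
             test2count = count, test2lag.result = positive)
}

res <- findTestPairsVec(df1$test1.result, df1$id, df1$date,
                        df2$test2.result, df2$id, df2$date)
```

Each iteration still scans all of test2, so this only removes the interpreted
inner loop; it does not change the overall amount of comparison work.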

Jim Lemon

2015-Apr-18 07:24 UTC

Hi Brant,
I'm a bit confused about which data frame is the one to match to, but
the following, while still including loops, should run much faster
than the above as it only matches dates within id matches.

df1<-read.table(text="id date test1.result
  a 2009-08-28      1
  a 2009-09-16      1
  b 2008-08-06      0
  c 2012-02-02      1
  c 2010-08-03      1
  c 2012-08-02      0",header=TRUE)
df2<-read.table(text="id date test2.result
  a 2011-02-03      1
  b 2011-09-27      0
  b 2011-09-01      1
  c 2009-07-16      0
  c 2009-04-15      0
  c 2010-08-10      1",header=TRUE)

bi.match<-function(x1,x2,maxdaydiff=30) {
 # convert the character strings to dates (may not be necessary)
 x1$dates<-as.Date(x1$date,"%Y-%m-%d")
 x2$dates<-as.Date(x2$date,"%Y-%m-%d")
 # initialize the l and m variables
 x1$l<-x1$m<-0
 # get all the id codes
 ids<-unique(x2$id)
 # step through the id codes
 for(id1 in ids) {
  x1ind<-which(x1$id == id1)
  x2ind<-which(x2$id == id1)
  # seq_along() skips ids that occur in x2 but not in x1
  for(id2 in seq_along(x1ind)) {
   # get the indices of the x2 dates that are within maxdaydiff days
   # of this x1 date
   diffok<-which(abs(x1$dates[x1ind[id2]]-x2$dates[x2ind])<=maxdaydiff)
   # set the date diff match indicator to 1
   x1$l[x1ind[id2]]<-length(diffok) > 0
   # set the positive test indicator to 1
   x1$m[x1ind[id2]]<-any(x2$test2.result[x2ind[diffok]] > 0)
  }
 }
 return(x1)
}

bi.match(df1,df2)

Jim
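
The same per-id matching can also be written without explicit loops by joining
the two data frames on id and filtering the resulting pairs by date difference.
This is a sketch only, assuming Date columns and base R; the function name
matchByMerge and the output column names are hypothetical.

```r
# Example data from the thread, with dates stored as Date objects
df1 <- data.frame(
  id = c("a", "a", "b", "c", "c", "c"),
  date = as.Date(c("2009-08-28", "2009-09-16", "2008-08-06",
                   "2012-02-02", "2010-08-03", "2012-08-02")),
  test1.result = c(1, 1, 0, 1, 1, 0),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  id = c("a", "b", "b", "c", "c", "c"),
  date = as.Date(c("2011-02-03", "2011-09-27", "2011-09-01",
                   "2009-07-16", "2009-04-15", "2010-08-10")),
  test2.result = c(1, 0, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: join on id, then keep pairs inside the date window
matchByMerge <- function(df1, df2, predays = -30, lagdays = 30) {
  df1$row <- seq_len(nrow(df1))                 # remember the original df1 row
  pairs <- merge(df1, df2, by = "id", suffixes = c(".1", ".2"))
  lag <- as.numeric(pairs$date.2 - pairs$date.1)  # days from test1 to test2
  inwin <- pairs[lag >= predays & lag <= lagdays, ]
  # tabulate() counts how often each df1 row number appears among the pairs
  df1$test2count <- tabulate(inwin$row, nbins = nrow(df1))
  df1$test2positive <- tabulate(inwin$row[inwin$test2.result == 1],
                                nbins = nrow(df1))
  df1$row <- NULL
  df1
}

res <- matchByMerge(df1, df2)
```

The join materializes every within-id pair of tests before filtering, so memory
use grows with the product of the test counts per subject; that may be a
problem when individual subjects have many tests.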



Charles C. Berry

2015-Apr-18 17:48 UTC

On Sat, 18 Apr 2015, Brant Inman wrote:
> I need to match items in df2 to those in df1 with specific matching 
> criteria. I have written a looped matching algorithm that works, but it 
> is very slow with my large datasets. I am requesting help on making a 
> version of this code that is faster and "vectorized" so to speak.

As I see in your posted code, you match ids exactly, match dates according to a
range, and count the number of positive test results in the second
data.frame.

For this, the countOverlaps() function of the GenomicRanges package will 
do the trick with suitably defined GRanges objects. Something like:

library(GenomicRanges)

date1 <- as.integer( as.Date( df1$date, "%Y-%m-%d" ))
date2 <- as.integer( as.Date( df2$date, "%Y-%m-%d" ))

lagdays <- 30L
predays <- -30L

gr1 <- GRanges(seqnames=df1$id,
                IRanges(start=date1,width=1),
                strand="*")

# test2 falls in [date1+predays, date1+lagdays] exactly when
# date1 falls in [date2-lagdays, date2-predays]
gr2 <- GRanges(seqnames=df2$id,
                IRanges(start=date2-lagdays,end=date2-predays),
                strand="*")[ df2$test2.result==1,]

df1$test2.count <- countOverlaps(gr1,gr2)


For the example data.frames (as rendered by Jim Lemon's code), this yields
> df1
  id       date test1.result test2.count
1  a 2009-08-28            1           0
2  a 2009-09-16            1           0
3  b 2008-08-06            0           0
4  c 2012-02-02            1           0
5  c 2010-08-03            1           1
6  c 2012-08-02            0           0

The GenomicRanges package is at

http://www.bioconductor.org/packages/release/bioc/html/GenomicRanges.html

where you will find installation instructions and links to vignettes.

HTH,

Chuck
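
Where installing a Bioconductor package is not an option, the same count of
positive test2 results per test1 row can be approximated in base R by sorting
the test2 dates within each id and using findInterval() to count how many fall
inside each window. This is a minimal sketch, not an equivalent of
countOverlaps() in general; the function name countInWindow is hypothetical.

```r
# Example data from the thread
df1 <- data.frame(
  id = c("a", "a", "b", "c", "c", "c"),
  date = c("2009-08-28", "2009-09-16", "2008-08-06",
           "2012-02-02", "2010-08-03", "2012-08-02"),
  test1.result = c(1, 1, 0, 1, 1, 0),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  id = c("a", "b", "b", "c", "c", "c"),
  date = c("2011-02-03", "2011-09-27", "2011-09-01",
           "2009-07-16", "2009-04-15", "2010-08-10"),
  test2.result = c(1, 0, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each (id1, date1), count the date2 values with the
# same id that satisfy predays <= date2 - date1 <= lagdays
countInWindow <- function(id1, date1, id2, date2,
                          predays = -30L, lagdays = 30L) {
  d1 <- as.integer(as.Date(date1))     # whole days since 1970-01-01
  d2 <- as.integer(as.Date(date2))
  id1 <- as.character(id1)
  d2_by_id <- split(d2, as.character(id2))
  counts <- integer(length(d1))
  for (k in unique(id1)) {             # loops over subjects, not over tests
    s <- d2_by_id[[k]]
    if (length(s) == 0L) next          # this subject has no test2 at all
    s <- sort(s)
    rows <- which(id1 == k)
    upper <- findInterval(d1[rows] + lagdays, s)       # dates <= window end
    lower <- findInterval(d1[rows] + predays - 1L, s)  # dates < window start
    counts[rows] <- upper - lower
  }
  counts
}

# Count only the positive test2 results, as in the GRanges example
pos <- df2$test2.result == 1
df1$test2.count <- countInWindow(df1$id, df1$date, df2$id[pos], df2$date[pos])
```

Because findInterval() performs a binary search on the sorted dates, the work
per subject scales roughly with the number of tests times the logarithm of that
number, rather than with every pairwise comparison.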
