Gaius Augustus
2016-Jan-30 17:50 UTC
[R] Efficient way to create new column based on comparison with another dataframe
I'll look into the Intervals idea. The data.table code posted might not work, because I don't believe it would keep the rows in the correct order if the chromosomes are interspersed. However, it did make me think about assigning the arm directly, based on the values in each row of Chr.Arms.

Something like:

library(data.table)
mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1, Position = c(3000, 6000, 1000), key = "Chr")
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001), End = c(5000, 10000), key = "Chr")

# loop over the few arms and label the matching SNP rows in place
for(i in seq_len(nrow(Chr.Arms))){
  cur.row <- Chr.Arms[i, ]
  mapfile[Chr == cur.row$Chr & Position >= cur.row$Start & Position <= cur.row$End,
          Arm := cur.row$Arm]
}

This might take out the need for the intermediate table/vector. Not sure yet if it'll work, but we'll see. I'm interested to know if anyone else has any ideas, too.

Thanks,
Gaius

On Fri, Jan 29, 2016 at 11:34 PM, Ulrik Stervbo <ulrik.stervbo at gmail.com> wrote:

> Hi Gaius,
>
> Could you use data.table and loop over the small Chr.Arms?
>
> library(data.table)
> mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1, Position = c(3000, 6000, 1000), key = "Chr")
> Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001), End = c(5000, 10000), key = "Chr")
>
> Arms <- data.table()
> for(i in 1:nrow(Chr.Arms)){
>   cur.row <- Chr.Arms[i, ]
>   Arm <- mapfile[ Position >= cur.row$Start & Position <= cur.row$End]
>   Arm <- Arm[ , Arm:=cur.row$Arm][]
>   Arms <- rbind(Arms, Arm)
> }
>
> # Or use plyr to loop over each possible arm
> library(plyr)
> Arms <- ddply(Chr.Arms, .variables = "Arm", function(cur.row, mapfile){
>   mapfile <- mapfile[ Position >= cur.row$Start & Position <= cur.row$End]
>   mapfile <- mapfile[ , Arm:=cur.row$Arm][]
>   return(mapfile)
> }, mapfile = mapfile)
>
> I have just started to use data.table and I have the feeling the code
> above can be greatly improved - maybe the loop can be dropped entirely?
>
> Hope this helps
> Ulrik
>
> On Sat, 30 Jan 2016 at 03:29 Gaius Augustus <gaiusjaugustus at gmail.com> wrote:
>
>> I have two dataframes. One has chromosome arm information, and the other
>> has SNP position information. I am trying to assign each SNP an arm
>> identity. I'd like to create this new column by comparing each SNP to the
>> reference file.
>>
>> *1) Mapfile (has millions of rows)*
>>
>> Name  Chr  Position
>> S1    1    3000
>> S2    1    6000
>> S3    1    1000
>>
>> *2) Chr.Arms file (has 39 rows)*
>>
>> Chr  Arm  Start  End
>> 1    p    0      5000
>> 1    q    5001   10000
>>
>> *R Script that works, but slow:*
>>
>> Arms <- c()
>> for (line in 1:nrow(Mapfile)){
>>   Arms[line] <- Chr.Arms$Arm[ Mapfile$Chr[line] == Chr.Arms$Chr &
>>     Mapfile$Position[line] > Chr.Arms$Start &
>>     Mapfile$Position[line] < Chr.Arms$End]
>> }
>> Mapfile$Arm <- Arms
>>
>> *Output Table:*
>>
>> Name  Chr  Position  Arm
>> S1    1    3000      p
>> S2    1    6000      q
>> S3    1    1000      p
>>
>> In words: I want each line to look up its location (1. find the right Chr,
>> 2. find the line where START < POSITION < END), then get the ARM
>> information and place it in a new column.
>>
>> This R script works, but surely there is a more time/processing efficient
>> way to do it.
>>
>> Thanks in advance for any help,
>> Gaius
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
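The "Intervals idea" mentioned above is not shown in this part of the thread, so the following is only a guess at its spirit: a minimal base R sketch that uses findInterval() to look up the arm for all SNPs of one chromosome at once. The helper name assign_arms and the second chromosome in the toy data are made up for illustration; the sketch assumes that the arms of a chromosome do not overlap and that the Arm column is character, not factor.

```r
# Toy data in the shape described in the question; chromosome 2 is added
# here (an assumption) to show that interspersed rows keep their order.
Mapfile <- data.frame(Name = c("S1", "S2", "S3", "S4"),
                      Chr = c(1, 1, 2, 1),
                      Position = c(3000, 6000, 1500, 1000),
                      stringsAsFactors = FALSE)
Chr.Arms <- data.frame(Chr = c(1, 1, 2, 2),
                       Arm = c("p", "q", "p", "q"),
                       Start = c(0, 5001, 0, 2001),
                       End = c(5000, 10000, 2000, 8000),
                       stringsAsFactors = FALSE)

# Hypothetical helper: returns one arm label per row of `map`, in row order.
assign_arms <- function(map, arms) {
  out <- rep(NA_character_, nrow(map))
  for (chr in unique(arms$Chr)) {
    a <- arms[arms$Chr == chr, ]
    a <- a[order(a$Start), ]                  # findInterval() needs sorted breaks
    rows <- which(map$Chr == chr)
    idx <- findInterval(map$Position[rows], a$Start)
    idx[idx == 0] <- NA                       # position lies before the first Start
    hit <- !is.na(idx) & map$Position[rows] <= a$End[idx]
    out[rows[hit]] <- a$Arm[idx[hit]]
  }
  out
}

Mapfile$Arm <- assign_arms(Mapfile, Chr.Arms)
print(Mapfile)
```

Because the result is written back by row index, the rows of Mapfile are never reordered. Positions that fall in no arm are left as NA.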
Gaius Augustus
2016-Jan-30 18:48 UTC
[R] Efficient way to create new column based on comparison with another dataframe
I'll look into the Intervals idea. The data.table code posted might not work, because I don't believe it would keep the rows in the correct order if the chromosomes are interspersed. However, it did make me think about assigning the arm directly, based on the values in each row of Chr.Arms.

*SOLUTION*

mapfile <- data.frame(Name = c("S1", "S2", "S3"), Chr = 1, Position = c(3000, 6000, 1000), key = "Chr")
Chr.Arms <- data.frame(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001), End = c(5000, 10000), key = "Chr")

# loop over the few arms; each pass labels every matching SNP row at once
for(i in 1:nrow(Chr.Arms)){
  cur.row <- Chr.Arms[i, ]
  mapfile$Arm[ mapfile$Chr == cur.row$Chr &
               mapfile$Position > cur.row$Start &
               mapfile$Position <= cur.row$End] <- as.character(cur.row$Arm)
}

This took out the need for the intermediate table/vector. It worked for me, and was VERY fast: it took <5 minutes on a dataframe with 35 million rows.

Thanks for the help,
Gaius
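One detail of the solution loop is worth checking on real data: it tests Position > Start and Position <= End, so a SNP sitting exactly on an arm's Start (for example 5001 in the toy table) matches neither arm. The sketch below is an illustration only; the helper label_arms, its closed_start switch, and the extra SNP S4 are assumptions added here, not part of the posted solution.

```r
# Toy data from the thread plus one SNP (S4, an assumption) placed exactly
# on the Start of the q arm.
mapfile <- data.frame(Name = c("S1", "S2", "S3", "S4"), Chr = 1,
                      Position = c(3000, 6000, 1000, 5001),
                      stringsAsFactors = FALSE)
Chr.Arms <- data.frame(Chr = 1, Arm = c("p", "q"),
                       Start = c(0, 5001), End = c(5000, 10000),
                       stringsAsFactors = FALSE)

# Hypothetical helper: same loop as the solution, with a switch for
# whether the Start boundary is included.
label_arms <- function(map, arms, closed_start = TRUE) {
  map$Arm <- NA_character_                 # unmatched rows stay NA
  for (i in seq_len(nrow(arms))) {
    cur <- arms[i, ]
    after_start <- if (closed_start) map$Position >= cur$Start else map$Position > cur$Start
    hit <- map$Chr == cur$Chr & after_start & map$Position <= cur$End
    map$Arm[hit] <- cur$Arm
  }
  map
}

res_open   <- label_arms(mapfile, Chr.Arms, closed_start = FALSE)  # as posted
res_closed <- label_arms(mapfile, Chr.Arms, closed_start = TRUE)

sum(is.na(res_open$Arm))    # 1: S4 at 5001 is left without an arm
sum(is.na(res_closed$Arm))  # 0
```

Counting the NA values after the loop is a cheap way to confirm that every SNP was assigned, whichever boundary convention the reference file uses.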
Dénes Tóth
2016-Jan-31 09:17 UTC
[R] Efficient way to create new column based on comparison with another dataframe
Hi,

I have not followed this thread from the beginning, but have you tried the foverlaps() function from the data.table package?

Something along the lines of:

---
library(data.table)

# create the tables (use as.data.table() or setDT() if you
# start with a data.frame)
mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1, Position = c(3000, 6000, 1000))
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001), End = c(5000, 10000))

# add a dummy variable to be able to define Position as an interval
mapfile[, Position2 := Position]

# add keys
setkey(mapfile, Chr, Position, Position2)
setkey(Chr.Arms, Chr, Start, End)

# use data.table::foverlaps (see ?foverlaps)
mapfile <- foverlaps(mapfile, Chr.Arms, type = "within")

# remove the dummy variable
mapfile[, Position2 := NULL]

# recreate original order
setorder(mapfile, Chr, Name)
---

BTW, there is a typo in your *SOLUTION*. key is an argument of data.table(); data.frame() would just create an extra column named "key". I guess you wanted to write
data.table(Name = c("S1", "S2", "S3"), Chr = 1, Position = c(3000, 6000, 1000), key = "Chr")
instead of
data.frame(Name = c("S1", "S2", "S3"), Chr = 1, Position = c(3000, 6000, 1000), key = "Chr").

HTH,
Denes
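As a possible alternative to foverlaps(), the same lookup can be written as a non-equi update join. This is a sketch under assumptions, not something proposed in the thread: it needs a data.table version that supports non-equi joins in on= (added after this thread was written, in 1.9.8 as far as I recall), and it assumes the arms of a chromosome do not overlap.

```r
library(data.table)

mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000))
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"),
                       Start = c(0, 5001), End = c(5000, 10000))

# For every row of Chr.Arms, find the mapfile rows on the same Chr whose
# Position lies in [Start, End], and write that arm into a new Arm column.
# The update is done by reference, so mapfile keeps its original row order.
mapfile[Chr.Arms, on = .(Chr, Position >= Start, Position <= End), Arm := i.Arm]

print(mapfile)
```

No dummy Position2 column, keys, or re-sorting step are needed in this form; rows that fall in no arm simply keep NA in Arm.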