Gaius Augustus
2016-Jan-29 18:52 UTC
[R] Efficient way to create new column based on comparison with another dataframe
I have two dataframes. One has chromosome arm information, and the other has SNP position information. I am trying to assign each SNP an arm identity. I'd like to create this new column based on comparing it to the reference file.

*1) Mapfile (has millions of rows)*

Name  Chr  Position
S1    1    3000
S2    1    6000
S3    1    1000

*2) Chr.Arms file (has 39 rows)*

Chr  Arm  Start  End
1    p    0      5000
1    q    5001   10000

*R script that works, but is slow:*

    Arms <- c()
    for (line in 1:nrow(Mapfile)) {
      # Pick the arm whose chromosome matches and whose range contains the SNP
      Arms[line] <- Chr.Arms$Arm[Mapfile$Chr[line] == Chr.Arms$Chr &
                                 Mapfile$Position[line] > Chr.Arms$Start &
                                 Mapfile$Position[line] < Chr.Arms$End]
    }
    Mapfile$Arm <- Arms

*Output Table:*

Name  Chr  Position  Arm
S1    1    3000      p
S2    1    6000      q
S3    1    1000      p

In words: for each line, I want to look up the location (1) find the right Chr, 2) find the line where START < POSITION < END), then get the ARM information and place it in a new column.

This R script works, but surely there is a more time/processing efficient way to do it.

Thanks in advance for any help,
Gaius
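The two lookup steps described in words above (match on Chr, then keep the row whose range contains the position) can be sketched in base R without a per-SNP loop. This is only one possible reading of the question, not a tested solution for the real data: it treats Start and End as inclusive (the original script uses strict inequalities), the RowId helper column is a hypothetical addition used to restore the input order, and SNPs that fall in no arm are dropped from the result. The intermediate table holds one row per SNP per arm on the same chromosome, so memory use may matter with millions of rows.

```r
Mapfile <- data.frame(Name = c("S1", "S2", "S3"),
                      Chr = 1,
                      Position = c(3000, 6000, 1000),
                      stringsAsFactors = FALSE)
Chr.Arms <- data.frame(Chr = 1,
                       Arm = c("p", "q"),
                       Start = c(0, 5001),
                       End = c(5000, 10000),
                       stringsAsFactors = FALSE)

# Hypothetical helper column: remembers the original row order,
# because merge() does not promise to keep it.
Mapfile$RowId <- seq_len(nrow(Mapfile))

# Step 1: pair every SNP with every arm on the same chromosome.
candidates <- merge(Mapfile, Chr.Arms, by = "Chr")

# Step 2: keep only the pairs where the position lies inside the arm
# (inclusive bounds are an assumption, see the note above).
inside <- candidates$Position >= candidates$Start &
          candidates$Position <= candidates$End
annotated <- candidates[inside, c("RowId", "Name", "Chr", "Position", "Arm")]

# Restore the input order and drop the helper column.
annotated <- annotated[order(annotated$RowId), ]
annotated$RowId <- NULL
rownames(annotated) <- NULL
```

Here `annotated` plays the role of the output table in the question.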
Ulrik Stervbo
2016-Jan-30 06:34 UTC
[R] Efficient way to create new column based on comparison with another dataframe
Hi Gaius,

Could you use data.table and loop over the small Chr.Arms?

    library(data.table)
    mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
                          Position = c(3000, 6000, 1000), key = "Chr")
    Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001),
                           End = c(5000, 10000), key = "Chr")

    Arms <- data.table()
    for (i in 1:nrow(Chr.Arms)) {
      cur.row <- Chr.Arms[i, ]
      # All SNPs on this chromosome that fall inside the current arm
      Arm <- mapfile[Chr == cur.row$Chr &
                     Position >= cur.row$Start & Position <= cur.row$End]
      Arm <- Arm[, Arm := cur.row$Arm][]
      Arms <- rbind(Arms, Arm)
    }

    # Or use plyr to loop over each possible arm
    library(plyr)
    Arms <- ddply(Chr.Arms, .variables = c("Chr", "Arm"), function(cur.row, mapfile) {
      mapfile <- mapfile[Chr == cur.row$Chr &
                         Position >= cur.row$Start & Position <= cur.row$End]
      mapfile <- mapfile[, Arm := cur.row$Arm][]
      return(mapfile)
    }, mapfile = mapfile)

I have just started to use data.table, and I have the feeling the code above can be greatly improved. Maybe the loop can be dropped entirely?

Hope this helps
Ulrik
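On the question of whether the loop can be dropped entirely: one possibility is a data.table non-equi join that updates mapfile by reference. The sketch below assumes data.table 1.9.8 or later (non-equi joins are not available in earlier versions) and inclusive Start/End bounds; it is an illustration on the toy data, not something benchmarked on millions of rows.

```r
library(data.table)

mapfile <- data.table(Name = c("S1", "S2", "S3"),
                      Chr = 1,
                      Position = c(3000, 6000, 1000))
Chr.Arms <- data.table(Chr = 1,
                       Arm = c("p", "q"),
                       Start = c(0, 5001),
                       End = c(5000, 10000))

# For each arm row, find the SNPs with the same Chr whose Position lies in
# [Start, End], and write that arm's label into a new Arm column.
# i.Arm refers to the Arm column of Chr.Arms (the table in the i position).
mapfile[Chr.Arms,
        on = .(Chr, Position >= Start, Position <= End),
        Arm := i.Arm]
```

Because the update happens in place, the rows of mapfile keep their original order, and SNPs that match no arm are left as NA.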
Gaius Augustus
2016-Jan-30 17:50 UTC
[R] Efficient way to create new column based on comparison with another dataframe
I'll look into the Intervals idea.

The data.table code posted might not work (because I don't believe it would put the rows in the correct order if the chromosomes are interspersed). However, it did make me think about possibly assigning based on values. Something like:

    library(data.table)
    mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
                          Position = c(3000, 6000, 1000), key = "Chr")
    Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001),
                           End = c(5000, 10000), key = "Chr")

    for (i in 1:nrow(Chr.Arms)) {
      cur.row <- Chr.Arms[i, ]
      # Assign the arm label in place to every matching SNP
      # (inclusive bounds, so a position equal to End is not skipped)
      mapfile[Chr == cur.row$Chr &
              Position >= cur.row$Start & Position <= cur.row$End,
              Arm := cur.row$Arm]
    }

This might take out the need for the intermediate table/vector. Not sure yet if it'll work, but we'll see.

I'm interested to know if anyone else has any ideas, too.

Thanks,
Gaius
Hervé Pagès
2016-Feb-01 22:06 UTC
[R] Efficient way to create new column based on comparison with another dataframe
Hi Gaius,

On 01/29/2016 10:52 AM, Gaius Augustus wrote:
> This R script works, but surely there is a more time/processing efficient
> way to do it.

You could use the GenomicRanges package for this:

1) Turn 'Mapfile' and 'Chr.Arms' into GRanges objects:

    library(GenomicRanges)
    # Each SNP becomes a width-1 range (start and end are both 'Position')
    query <- makeGRangesFromDataFrame(Mapfile, start.field="Position",
                                      end.field="Position")
    subject <- makeGRangesFromDataFrame(Chr.Arms)

2) Call findOverlaps() on them:

    # One arm index per SNP (NA if the SNP overlaps no arm)
    Mapfile2Chr.Arms <- findOverlaps(query, subject, select="arbitrary")

3) Use the result of findOverlaps() to create the column to add to 'Mapfile':

    Mapfile$Arm <- Chr.Arms$Arm[Mapfile2Chr.Arms]
    Mapfile
    #   Name Chr Position Arm
    # 1   S1   1     3000   p
    # 2   S2   1     6000   q
    # 3   S3   1     1000   p

Should be very fast.

Note that GenomicRanges is a Bioconductor package:

  http://bioconductor.org/packages/GenomicRanges

Make sure you follow the Installation instructions on that page.

Cheers,
H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319
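For readers who cannot install Bioconductor, the same kind of interval lookup can be approximated in base R with findInterval(). This is a different technique from findOverlaps(), offered only as a rough sketch: it assumes the arms on each chromosome do not overlap, treats Start and End as inclusive, and loops over chromosomes rather than over SNPs. The fourth SNP (S4) is a hypothetical extra row, added to show that positions outside every arm are left as NA.

```r
Mapfile <- data.frame(Name = c("S1", "S2", "S3", "S4"),   # S4 is hypothetical
                      Chr = 1,
                      Position = c(3000, 6000, 1000, 20000),
                      stringsAsFactors = FALSE)
Chr.Arms <- data.frame(Chr = 1,
                       Arm = c("p", "q"),
                       Start = c(0, 5001),
                       End = c(5000, 10000),
                       stringsAsFactors = FALSE)

Mapfile$Arm <- NA_character_
for (chr in unique(Chr.Arms$Chr)) {
  # Arms of this chromosome, sorted by Start as findInterval() requires
  arms <- Chr.Arms[Chr.Arms$Chr == chr, ]
  arms <- arms[order(arms$Start), ]

  # SNPs on this chromosome
  rows <- which(Mapfile$Chr == chr)

  # Index of the last arm whose Start is <= Position (0 if there is none)
  idx <- findInterval(Mapfile$Position[rows], arms$Start)
  idx[idx == 0] <- NA

  # Keep only SNPs that also lie at or before that arm's End
  ok <- !is.na(idx) & Mapfile$Position[rows] <= arms$End[idx]
  Mapfile$Arm[rows[ok]] <- arms$Arm[idx[ok]]
}
```

The loop runs once per chromosome, so the per-SNP work is done by vectorised calls, and the rows of Mapfile stay in their original order.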