Gaius Augustus
2016-Jan-29 18:52 UTC
[R] Efficient way to create new column based on comparison with another dataframe
I have two dataframes. One has chromosome arm information, and the other
has SNP position information. I am trying to assign each SNP an arm
identity. I'd like to create this new column based on comparing it to the
reference file.
*1) Mapfile (has millions of rows)*
Name Chr Position
S1 1 3000
S2 1 6000
S3 1 1000
*2) Chr.Arms file (has 39 rows)*
Chr Arm Start End
1 p 0 5000
1 q 5001 10000
*R Script that works, but slow:*
Arms <- c()
for (line in 1:nrow(Mapfile)){
Arms[line] <- Chr.Arms$Arm[ Mapfile$Chr[line] == Chr.Arms$Chr &
Mapfile$Position[line] > Chr.Arms$Start & Mapfile$Position[line] <
Chr.Arms$End]}
}
Mapfile$Arm <- Arms
*Output Table:*
Name Chr Position Arm
S1 1 3000 p
S2 1 6000 q
S3 1 1000 p
In words: I want each line to look up the location ( 1) find the right Chr,
2) find the line where the START < POSITION < END), then get the ARM
information and place it in a new column.
This R script works, but surely there is a more time/processing efficient
way to do it.
Thanks in advance for any help,
Gaius
[[alternative HTML version deleted]]
Ulrik Stervbo
2016-Jan-30 06:34 UTC
[R] Efficient way to create new column based on comparison with another dataframe
Hi Gaius,
Could you use data.table and loop over the small Chr.arms?
library(data.table)
mapfile <- data.table(Name = c("S1", "S2",
"S3"), Chr = 1, Position c(3000, 6000, 1000), key = "Chr")
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"), Start
= c(0, 5001), End
= c(5000, 10000), key = "Chr")
Arms <- data.table()
for(i in 1:nrow(Chr.Arms)){
cur.row <- Chr.Arms[i, ]
Arm <- mapfile[ Position >= cur.row$Start & Position <=
cur.row$End]
Arm <- Arm[ , Arm:=cur.row$Arm][]
Arms <- rbind(Arms, Arm)
}
# Or use plyr to loop over each possible arm
library(plyr)
Arms <- ddply(Chr.Arms, .variables = "Arm", function(cur.row,
mapfile){
mapfile <- mapfile[ Position >= cur.row$Start & Position <=
cur.row$End]
mapfile <- mapfile[ , Arm:=cur.row$Arm][]
return(mapfile)
}, mapfile = mapfile)
I have just started to use the data.table and I have the feeling the code
above can be greatly improved - maybe the loop can be dropped entirely?
Hope this helps
Ulrik
On Sat, 30 Jan 2016 at 03:29 Gaius Augustus <gaiusjaugustus at gmail.com>
wrote:
> I have two dataframes. One has chromosome arm information, and the other
> has SNP position information. I am trying to assign each SNP an arm
> identity. I'd like to create this new column based on comparing it to
the
> reference file.
>
> *1) Mapfile (has millions of rows)*
>
> Name Chr Position
> S1 1 3000
> S2 1 6000
> S3 1 1000
>
> *2) Chr.Arms file (has 39 rows)*
>
> Chr Arm Start End
> 1 p 0 5000
> 1 q 5001 10000
>
>
> *R Script that works, but slow:*
> Arms <- c()
> for (line in 1:nrow(Mapfile)){
> Arms[line] <- Chr.Arms$Arm[ Mapfile$Chr[line] == Chr.Arms$Chr
&
> Mapfile$Position[line] > Chr.Arms$Start & Mapfile$Position[line]
<
> Chr.Arms$End]}
> }
> Mapfile$Arm <- Arms
>
>
> *Output Table:*
>
> Name Chr Position Arm
> S1 1 3000 p
> S2 1 6000 q
> S3 1 1000 p
>
>
> In words: I want each line to look up the location ( 1) find the right Chr,
> 2) find the line where the START < POSITION < END), then get the ARM
> information and place it in a new column.
>
> This R script works, but surely there is a more time/processing efficient
> way to do it.
>
> Thanks in advance for any help,
> Gaius
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
[[alternative HTML version deleted]]
Gaius Augustus
2016-Jan-30 17:50 UTC
[R] Efficient way to create new column based on comparison with another dataframe
I'll look into the Intervals idea. The data.table code posted might not
work (because I don't believe it would put the rows in the correct order if
the chromosomes are interspersed), however, it did make me think about
possibly assigning based on values...
Something like:
mapfile <- data.table(Name = c("S1", "S2",
"S3"), Chr = 1, Position c(3000, 6000, 1000), key = "Chr")
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"), Start
= c(0, 5001), End
= c(5000, 10000), key = "Chr")
for(i in 1:nrow(Chr.Arms)){
cur.row <- Chr.Arms[i, ]
mapfile[ Chr == cur.row$Chr & Position >= cur.row$Start & Position
<cur.row$End] <- Chr.Arms$Arm
}
This might take out the need for the intermediate table/vector. Not sure
yet if it'll work, but we'll see. I'm interested to know if anyone
else
has any ideas, too.
Thanks,
Gaius
On Fri, Jan 29, 2016 at 11:34 PM, Ulrik Stervbo <ulrik.stervbo at
gmail.com>
wrote:
> Hi Gaius,
>
> Could you use data.table and loop over the small Chr.arms?
>
> library(data.table)
> mapfile <- data.table(Name = c("S1", "S2",
"S3"), Chr = 1, Position > c(3000, 6000, 1000), key =
"Chr")
> Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"),
Start = c(0, 5001), End
> = c(5000, 10000), key = "Chr")
>
> Arms <- data.table()
> for(i in 1:nrow(Chr.Arms)){
> cur.row <- Chr.Arms[i, ]
> Arm <- mapfile[ Position >= cur.row$Start & Position <=
cur.row$End]
> Arm <- Arm[ , Arm:=cur.row$Arm][]
> Arms <- rbind(Arms, Arm)
> }
>
> # Or use plyr to loop over each possible arm
> library(plyr)
> Arms <- ddply(Chr.Arms, .variables = "Arm", function(cur.row,
mapfile){
> mapfile <- mapfile[ Position >= cur.row$Start & Position <=
cur.row$End]
> mapfile <- mapfile[ , Arm:=cur.row$Arm][]
> return(mapfile)
> }, mapfile = mapfile)
>
> I have just started to use the data.table and I have the feeling the code
> above can be greatly improved - maybe the loop can be dropped entirely?
>
> Hope this helps
> Ulrik
>
> On Sat, 30 Jan 2016 at 03:29 Gaius Augustus <gaiusjaugustus at
gmail.com>
> wrote:
>
>> I have two dataframes. One has chromosome arm information, and the
other
>> has SNP position information. I am trying to assign each SNP an arm
>> identity. I'd like to create this new column based on comparing it
to the
>> reference file.
>>
>> *1) Mapfile (has millions of rows)*
>>
>> Name Chr Position
>> S1 1 3000
>> S2 1 6000
>> S3 1 1000
>>
>> *2) Chr.Arms file (has 39 rows)*
>>
>> Chr Arm Start End
>> 1 p 0 5000
>> 1 q 5001 10000
>>
>>
>> *R Script that works, but slow:*
>> Arms <- c()
>> for (line in 1:nrow(Mapfile)){
>> Arms[line] <- Chr.Arms$Arm[ Mapfile$Chr[line] == Chr.Arms$Chr
&
>> Mapfile$Position[line] > Chr.Arms$Start &
Mapfile$Position[line] <
>> Chr.Arms$End]}
>> }
>> Mapfile$Arm <- Arms
>>
>>
>> *Output Table:*
>>
>> Name Chr Position Arm
>> S1 1 3000 p
>> S2 1 6000 q
>> S3 1 1000 p
>>
>>
>> In words: I want each line to look up the location ( 1) find the right
>> Chr,
>> 2) find the line where the START < POSITION < END), then get the
ARM
>> information and place it in a new column.
>>
>> This R script works, but surely there is a more time/processing
efficient
>> way to do it.
>>
>> Thanks in advance for any help,
>> Gaius
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
[[alternative HTML version deleted]]
Hervé Pagès
2016-Feb-01 22:06 UTC
[R] Efficient way to create new column based on comparison with another dataframe
Hi Gaius, On 01/29/2016 10:52 AM, Gaius Augustus wrote:> I have two dataframes. One has chromosome arm information, and the other > has SNP position information. I am trying to assign each SNP an arm > identity. I'd like to create this new column based on comparing it to the > reference file. > > *1) Mapfile (has millions of rows)* > > Name Chr Position > S1 1 3000 > S2 1 6000 > S3 1 1000 > > *2) Chr.Arms file (has 39 rows)* > > Chr Arm Start End > 1 p 0 5000 > 1 q 5001 10000 > > > *R Script that works, but slow:* > Arms <- c() > for (line in 1:nrow(Mapfile)){ > Arms[line] <- Chr.Arms$Arm[ Mapfile$Chr[line] == Chr.Arms$Chr & > Mapfile$Position[line] > Chr.Arms$Start & Mapfile$Position[line] < > Chr.Arms$End]} > } > Mapfile$Arm <- Arms > > > *Output Table:* > > Name Chr Position Arm > S1 1 3000 p > S2 1 6000 q > S3 1 1000 p > > > In words: I want each line to look up the location ( 1) find the right Chr, > 2) find the line where the START < POSITION < END), then get the ARM > information and place it in a new column. > > This R script works, but surely there is a more time/processing efficient > way to do it.You could use the GenomicRanges package for this: 1) Turn 'Mapfile' and 'Chr.Arms' into GRanges objects: library(GenomicRanges) query <- makeGRangesFromDataFrame(Mapfile, start.field="Position", end.field="Position") subject <- makeGRangesFromDataFrame(Chr.Arms) 2) Call findOverlaps() on them: Mapfile2Chr.Arms <- findOverlaps(query, subject, select="arbitrary") 3) Use the result of findOverlaps() to create the column to add to 'Mapfile': Mapfile$Arm <- Chr.Arms$Arm[Mapfile2Chr.Arms] Mapfile # Name Chr Position Arm # 1 S1 1 3000 p # 2 S2 1 6000 q # 3 S3 1 1000 p Should be very fast. Note that GenomicRanges is a Bioconductor package: http://bioconductor.org/packages/GenomicRanges Make sure you follow the Installation instructions on that page. Cheers, H.> > Thanks in advance for any help, > Gaius > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Herv? Pag?s Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fredhutch.org Phone: (206) 667-5791 Fax: (206) 667-1319