thr3ads.net - R help - [R] Efficient way to create new column based on comparison with another dataframe [Jan 2016]

Gaius Augustus

2016-Jan-29 18:52 UTC

[R] Efficient way to create new column based on comparison with another dataframe

I have two dataframes. One has chromosome arm information, and the other
has SNP position information. I am trying to assign each SNP an arm
identity.  I'd like to create this new column based on comparing it to the
reference file.

*1) Mapfile (has millions of rows)*

Name    Chr   Position
S1      1      3000
S2      1      6000
S3      1      1000

*2) Chr.Arms file (has 39 rows)*

Chr    Arm    Start   End
1      p      0       5000
1      q      5001    10000


*R Script that works, but slow:*
Arms <- c()
for (line in 1:nrow(Mapfile)) {
  # Select the arm on the same chromosome with Start < Position < End
  Arms[line] <- Chr.Arms$Arm[Mapfile$Chr[line] == Chr.Arms$Chr &
                             Mapfile$Position[line] > Chr.Arms$Start &
                             Mapfile$Position[line] < Chr.Arms$End]
}
Mapfile$Arm <- Arms


*Output Table:*

Name   Chr   Position   Arm
S1      1     3000      p
S2      1     6000      q
S3      1     1000      p


In words: for each line I want to look up the location (1. find the right
Chr; 2. find the row where START < POSITION < END), then take the ARM
value and place it in a new column.

This R script works, but surely there is a more time/processing efficient
way to do it.
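
The lookup described in words above could be vectorised per chromosome with base R's findInterval(). This is only a sketch: assign_arms is a hypothetical helper name, the chromosome 2 rows are invented for illustration, and Start and End are treated as inclusive here, unlike the strict comparisons in the loop.

```r
# Sample tables; the chromosome 2 rows are made up for illustration
Mapfile <- data.frame(Name = c("S1", "S2", "S3", "S4"),
                      Chr = c(1, 2, 1, 1),
                      Position = c(3000, 3000, 6000, 1000),
                      stringsAsFactors = FALSE)
Chr.Arms <- data.frame(Chr = c(1, 1, 2, 2),
                       Arm = c("p", "q", "p", "q"),
                       Start = c(0, 5001, 0, 2001),
                       End = c(5000, 10000, 2000, 8000),
                       stringsAsFactors = FALSE)

# Hypothetical helper: returns one arm label per row of 'map' (NA if no arm matches)
assign_arms <- function(map, arms) {
  out <- rep(NA_character_, nrow(map))
  for (chr in unique(arms$Chr)) {
    a <- arms[arms$Chr == chr, ]
    a <- a[order(a$Start), ]            # findInterval() needs sorted break points
    rows <- which(map$Chr == chr)
    pos <- map$Position[rows]
    idx <- findInterval(pos, a$Start)   # index of the last Start <= position, 0 if none
    ok <- idx > 0
    ok[ok] <- pos[ok] <= a$End[idx[ok]] # also require position <= End of that arm
    out[rows[ok]] <- a$Arm[idx[ok]]
  }
  out
}

Mapfile$Arm <- assign_arms(Mapfile, Chr.Arms)
print(Mapfile)
```

The loop runs once per chromosome rather than once per SNP, and the row order of Mapfile is left untouched.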

Thanks in advance for any help,
Gaius

Ulrik Stervbo

2016-Jan-30 06:34 UTC

Hi Gaius,

Could you use data.table and loop over the small Chr.Arms?

library(data.table)
mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000), key = "Chr")
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001),
                       End = c(5000, 10000), key = "Chr")

Arms <- data.table()
for(i in 1:nrow(Chr.Arms)){
  cur.row <- Chr.Arms[i, ]
  # Rows of mapfile on this chromosome that fall inside the current arm
  Arm <- mapfile[Chr == cur.row$Chr & Position >= cur.row$Start &
                 Position <= cur.row$End]
  Arm <- Arm[ , Arm:=cur.row$Arm][]
  Arms <- rbind(Arms, Arm)
}

# Or use plyr to loop over each possible arm (one piece per Chr/Arm pair)
library(plyr)
Arms <- ddply(Chr.Arms, .variables = c("Chr", "Arm"), function(cur.row, mapfile){
  mapfile <- mapfile[Chr == cur.row$Chr & Position >= cur.row$Start &
                     Position <= cur.row$End]
  mapfile <- mapfile[ , Arm:=cur.row$Arm][]
  return(mapfile)
}, mapfile = mapfile)

I have only just started using data.table, and I have the feeling the code
above can be greatly improved - maybe the loop can be dropped entirely?
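
On dropping the loop: one option is an update join, assuming a data.table release that supports non-equi joins (1.9.8 or later, which is newer than this thread). The sketch below is not code from the thread, and the chromosome 2 rows are made up for illustration.

```r
library(data.table)

# Sample data; the chromosome 2 rows are invented to show the Chr match
mapfile <- data.table(Name = c("S1", "S2", "S3", "S4"),
                      Chr = c(1, 1, 1, 2),
                      Position = c(3000, 6000, 1000, 3000))
Chr.Arms <- data.table(Chr = c(1, 1, 2, 2),
                       Arm = c("p", "q", "p", "q"),
                       Start = c(0, 5001, 0, 2001),
                       End = c(5000, 10000, 2000, 8000))

# Non-equi update join: for every arm, match the mapfile rows on the same
# chromosome with Start <= Position <= End and set Arm by reference.
# i.Arm refers to the Arm column of Chr.Arms.
mapfile[Chr.Arms, on = .(Chr, Position >= Start, Position <= End),
        Arm := i.Arm]
print(mapfile)
```

Because the assignment happens by reference on mapfile itself, no intermediate table is built and the original row order is kept.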

Hope this helps
Ulrik

Gaius Augustus

2016-Jan-30 17:50 UTC

I'll look into the Intervals idea.  The data.table code posted might not
work (because I don't believe it would put the rows in the correct order if
the chromosomes are interspersed). However, it did make me think about
possibly assigning based on values...

Something like:
library(data.table)
mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000), key = "Chr")
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001),
                       End = c(5000, 10000), key = "Chr")

for(i in 1:nrow(Chr.Arms)){
  cur.row <- Chr.Arms[i, ]
  # Assign the arm in place to the matching rows of mapfile
  mapfile[Chr == cur.row$Chr & Position >= cur.row$Start &
          Position < cur.row$End, Arm := cur.row$Arm]
}

This might take out the need for the intermediate table/vector.  Not sure
yet if it'll work, but we'll see.  I'm interested to know if anyone else
has any ideas, too.
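
For plain data frames, the same assign-by-condition idea might look like the base R sketch below. It is an illustration rather than tested production code: the chromosome 2 rows are invented, the chromosomes are deliberately interspersed, and End is treated as inclusive.

```r
# Sample tables with interspersed chromosomes; chromosome 2 rows are made up
Mapfile <- data.frame(Name = c("S1", "S2", "S3", "S4"),
                      Chr = c(1, 2, 1, 1),
                      Position = c(3000, 3000, 6000, 1000),
                      stringsAsFactors = FALSE)
Chr.Arms <- data.frame(Chr = c(1, 1, 2, 2),
                       Arm = c("p", "q", "p", "q"),
                       Start = c(0, 5001, 0, 2001),
                       End = c(5000, 10000, 2000, 8000),
                       stringsAsFactors = FALSE)

# Start with NA so SNPs outside every arm are easy to spot afterwards
Mapfile$Arm <- NA_character_

# Loop over the few arm rows; each comparison is vectorised over all SNPs
for (i in seq_len(nrow(Chr.Arms))) {
  hit <- Mapfile$Chr == Chr.Arms$Chr[i] &
    Mapfile$Position >= Chr.Arms$Start[i] &
    Mapfile$Position <= Chr.Arms$End[i]
  Mapfile$Arm[hit] <- Chr.Arms$Arm[i]
}
print(Mapfile)
```

Only the Arm column is written, so the rows stay in their original order even when chromosomes are mixed.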

Thanks,
Gaius

Hervé Pagès

2016-Feb-01 22:06 UTC

Hi Gaius,

You could use the GenomicRanges package for this:

1) Turn 'Mapfile' and 'Chr.Arms' into GRanges objects:

   library(GenomicRanges)
   query <- makeGRangesFromDataFrame(Mapfile, start.field="Position",
                                     end.field="Position")
   subject <- makeGRangesFromDataFrame(Chr.Arms)

2) Call findOverlaps() on them:

   Mapfile2Chr.Arms <- findOverlaps(query, subject, select="arbitrary")

3) Use the result of findOverlaps() to create the column to add to
   'Mapfile':

   Mapfile$Arm <- Chr.Arms$Arm[Mapfile2Chr.Arms]
   Mapfile
   #   Name Chr Position Arm
   # 1   S1   1     3000   p
   # 2   S2   1     6000   q
   # 3   S3   1     1000   p

Should be very fast.

Note that GenomicRanges is a Bioconductor package:

   http://bioconductor.org/packages/GenomicRanges

Make sure you follow the Installation instructions on that page.

Cheers,
H.
-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
