thr3ads.net - R help - [R] Efficient way to create new column based on comparison with another dataframe [Jan 2016]

Gaius Augustus

2016-Jan-30 17:50 UTC

[R] Efficient way to create new column based on comparison with another dataframe

I'll look into the Intervals idea.  The data.table code posted might not
work, because I don't believe it would put the rows in the correct order if
the chromosomes are interspersed.  However, it did make me think about
possibly assigning based on values...

Something like:
library(data.table)
mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000), key = "Chr")
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001),
                       End = c(5000, 10000), key = "Chr")

for(i in 1:nrow(Chr.Arms)){
  cur.row <- Chr.Arms[i, ]
  # assign the arm by reference to the rows that fall inside this interval
  mapfile[Chr == cur.row$Chr & Position >= cur.row$Start &
          Position <= cur.row$End, Arm := cur.row$Arm]
}

This might take out the need for the intermediate table/vector.  Not sure
yet if it'll work, but we'll see.  I'm interested to know if anyone else
has any ideas, too.

Thanks,
Gaius
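
The row-order concern can be illustrated with a small base R sketch. The toy values below (a second chromosome and its arm boundaries) are made up for illustration and are not from the thread. Stacking one subset per arm returns the rows in Chr.Arms order rather than in the original mapfile order, so the result has to be re-aligned afterwards (here by Name, assuming names are unique):

```r
# Toy data with hypothetical values; chromosomes are interspersed on purpose.
mapfile <- data.frame(Name = c("S1", "S2", "S3", "S4"),
                      Chr = c(1, 2, 1, 2),
                      Position = c(3000, 6000, 6000, 1000),
                      stringsAsFactors = FALSE)
Chr.Arms <- data.frame(Chr = c(1, 1, 2, 2),
                       Arm = c("p", "q", "p", "q"),
                       Start = c(0, 5001, 0, 4001),
                       End = c(5000, 10000, 4000, 9000),
                       stringsAsFactors = FALSE)

# Subset-and-stack: one piece per arm, glued together with rbind().
pieces <- lapply(seq_len(nrow(Chr.Arms)), function(i) {
  arm <- Chr.Arms[i, ]
  hit <- mapfile[mapfile$Chr == arm$Chr &
                 mapfile$Position >= arm$Start &
                 mapfile$Position <= arm$End, ]
  hit$Arm <- rep(arm$Arm, nrow(hit))
  hit
})
stacked <- do.call(rbind, pieces)
stacked$Name    # rows follow the order of Chr.Arms, not of mapfile

# Re-align to the original order (assumes Name is unique).
restored <- stacked[match(mapfile$Name, stacked$Name), ]
restored$Arm
```

Assigning into the existing table by logical index avoids the re-alignment step, because no rows are ever moved.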

On Fri, Jan 29, 2016 at 11:34 PM, Ulrik Stervbo <ulrik.stervbo at gmail.com> wrote:
> Hi Gaius,
>
> Could you use data.table and loop over the small Chr.arms?
>
> library(data.table)
> mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
>                       Position = c(3000, 6000, 1000), key = "Chr")
> Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001),
>                        End = c(5000, 10000), key = "Chr")
>
> Arms <- data.table()
> for(i in 1:nrow(Chr.Arms)){
>   cur.row <- Chr.Arms[i, ]
>   Arm <- mapfile[ Position >= cur.row$Start & Position <= cur.row$End]
>   Arm <- Arm[ , Arm:=cur.row$Arm][]
>   Arms <- rbind(Arms, Arm)
> }
>
> # Or use plyr to loop over each possible arm
> library(plyr)
> Arms <- ddply(Chr.Arms, .variables = "Arm", function(cur.row, mapfile){
>   mapfile <- mapfile[ Position >= cur.row$Start & Position <= cur.row$End]
>   mapfile <- mapfile[ , Arm:=cur.row$Arm][]
>   return(mapfile)
> }, mapfile = mapfile)
>
> I have just started to use the data.table and I have the feeling the code
> above can be greatly improved - maybe the loop can be dropped entirely?
>
> Hope this helps
> Ulrik
>
> On Sat, 30 Jan 2016 at 03:29 Gaius Augustus <gaiusjaugustus at gmail.com>
> wrote:
>
>> I have two dataframes. One has chromosome arm information, and the other
>> has SNP position information. I am trying to assign each SNP an arm
>> identity.  I'd like to create this new column based on comparing it to the
>> reference file.
>>
>> *1) Mapfile (has millions of rows)*
>>
>> Name    Chr   Position
>> S1      1      3000
>> S2      1      6000
>> S3      1      1000
>>
>> *2) Chr.Arms   file (has 39 rows)*
>>
>> Chr    Arm    Start   End
>> 1      p      0       5000
>> 1      q      5001    10000
>>
>>
>> *R Script that works, but slow:*
>> Arms  <- c()
>> for (line in 1:nrow(Mapfile)){
>>       Arms[line] <- Chr.Arms$Arm[ Mapfile$Chr[line] == Chr.Arms$Chr &
>>                                   Mapfile$Position[line] > Chr.Arms$Start &
>>                                   Mapfile$Position[line] < Chr.Arms$End]
>> }
>> Mapfile$Arm <- Arms
>>
>>
>> *Output Table:*
>>
>> Name   Chr   Position   Arm
>> S1      1     3000      p
>> S2      1     6000      q
>> S3      1     1000      p
>>
>>
>> In words: I want each line to look up the location ( 1) find the right Chr,
>> 2) find the line where the START < POSITION < END), then get the ARM
>> information and place it in a new column.
>>
>> This R script works, but surely there is a more time/processing efficient
>> way to do it.
>>
>> Thanks in advance for any help,
>> Gaius
>>
>>         [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>

Gaius Augustus

2016-Jan-30 18:48 UTC

[R] Efficient way to create new column based on comparison with another dataframe

I'll look into the Intervals idea.  The data.table code posted might not
work, because I don't believe it would put the rows in the correct order if
the chromosomes are interspersed.  However, it did make me think about
possibly assigning based on values...

*SOLUTION*
mapfile <- data.frame(Name = c("S1", "S2", "S3"), Chr = 1,
                      Position = c(3000, 6000, 1000), key = "Chr")
Chr.Arms <- data.frame(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001),
                       End = c(5000, 10000), key = "Chr")

for(i in 1:nrow(Chr.Arms)){
  cur.row <- Chr.Arms[i, ]
  mapfile$Arm[ mapfile$Chr == cur.row$Chr & mapfile$Position >= cur.row$Start &
               mapfile$Position <= cur.row$End] <- cur.row$Arm
}

This took out the need for the intermediate table/vector.  This worked for
me, and was VERY fast.  Took <5 minutes on a dataframe with 35 million rows.

Thanks for the help,
Gaius
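
As a minimal sketch of the same idea in reusable form, the loop can be wrapped in a function that also makes unmatched positions explicit. The helper name assign_arms() and the toy values (a second chromosome, and a position outside every interval) are assumptions for illustration, not part of the original post:

```r
# Hypothetical helper wrapping the loop from the post; the name
# assign_arms() and the toy values below are illustrative only.
assign_arms <- function(snps, arms) {
  snps$Arm <- NA_character_   # SNPs that match no interval stay NA
  for (i in seq_len(nrow(arms))) {
    arm <- arms[i, ]
    hit <- snps$Chr == arm$Chr &
      snps$Position >= arm$Start &
      snps$Position <= arm$End
    # as.character() guards against Arm being stored as a factor
    snps$Arm[hit] <- as.character(arm$Arm)
  }
  snps
}

mapfile <- data.frame(Name = c("S1", "S2", "S3", "S4"),
                      Chr = c(1, 2, 1, 2),
                      Position = c(3000, 6000, 20000, 1000),
                      stringsAsFactors = FALSE)
Chr.Arms <- data.frame(Chr = c(1, 1, 2, 2),
                       Arm = c("p", "q", "p", "q"),
                       Start = c(0, 5001, 0, 4001),
                       End = c(5000, 10000, 4000, 9000),
                       stringsAsFactors = FALSE)

result <- assign_arms(mapfile, Chr.Arms)
result

# Rows that fell outside every arm interval (here S3) can be inspected with:
result[is.na(result$Arm), ]
```

Each pass is one vectorised comparison over the whole SNP table, so the work scales with the number of arms (39 in the thread) rather than with the number of SNPs, and the rows never change order.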


Dénes Tóth

2016-Jan-31 09:17 UTC

[R] Efficient way to create new column based on comparison with another dataframe

Hi,

I have not followed this thread from the beginning, but have you tried 
the foverlaps() function from the data.table package?

Something along the lines of:

---
library(data.table)

# create the tables (use as.data.table() or setDT() if you
# start with a data.frame)
mapfile <- data.table(Name = c("S1", "S2", "S3"), Chr = 1,
                       Position = c(3000, 6000, 1000))
Chr.Arms <- data.table(Chr = 1, Arm = c("p", "q"),
                        Start = c(0, 5001), End = c(5000, 10000))

# add a dummy variable to be able to define Position as an interval
mapfile[, Position2 := Position]

# add keys
setkey(mapfile, Chr, Position, Position2)
setkey(Chr.Arms, Chr, Start, End)

# use data.table::foverlaps (see ?foverlaps)
mapfile <- foverlaps(mapfile, Chr.Arms, type = "within")

# remove the dummy variable
mapfile[, Position2 := NULL]

# recreate original order (in this example the rows were sorted by Name)
setorder(mapfile, Chr, Name)

---

BTW, there is a typo in your *SOLUTION*. I guess you wanted to write
data.table(Name = c("S1", "S2", "S3"), Chr = 1, Position = c(3000, 6000,
1000), key = "Chr") instead of data.frame(Name = c("S1", "S2", "S3"),
Chr = 1, Position = c(3000, 6000, 1000), key = "Chr").

HTH,
   Denes
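
For readers on a newer data.table, the lookup can probably also be written as a non-equi update join, which drops both the loop and the dummy interval column. This is a sketch under assumptions: it relies on the non-equi `on =` syntax, which to my knowledge arrived in data.table 1.9.8 (after this thread), and the second chromosome and its boundaries are made-up toy values:

```r
library(data.table)

# Toy data with hypothetical values; chromosomes are interspersed on purpose.
mapfile <- data.table(Name = c("S1", "S2", "S3", "S4"),
                      Chr = c(1, 2, 1, 2),
                      Position = c(3000, 6000, 6000, 1000))
Chr.Arms <- data.table(Chr = c(1, 1, 2, 2),
                       Arm = c("p", "q", "p", "q"),
                       Start = c(0, 5001, 0, 4001),
                       End = c(5000, 10000, 4000, 9000))

# For each arm (row of Chr.Arms), find the mapfile rows on the same Chr whose
# Position lies in [Start, End] and write that arm into them by reference.
mapfile[Chr.Arms, on = .(Chr, Position >= Start, Position <= End),
        Arm := i.Arm]

mapfile
```

Because the update happens by reference, mapfile keeps its original row order and no setorder() step is needed afterwards. SNPs that fall outside every interval are left as NA.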



