thr3ads.net - R help - [R] loop in a data.table [Mar 2013]

If this information is useful, please help other people find it:
Share via:

Camilo Mora

2013-Mar-13 23:25 UTC

[R] loop in a data.table

Hi everyone,

I have a data.table called "data" with many columns which I want to  
group by column1 using data.table, given how fast it is.

The problem with looping a data.table is that data.table does not like  
quotations  to define the column names (e.g. "col2" instead of col2).
I found a way around which is to use get("col2"), which works fine but
the processing time multiples by 20.

So if I use:

data[,sum(col2),by=(key)]

entering the column names by hand, the operation is done in 1 sec. but  
if in the contrary I use:

data[,sum(get("col2")),by=(key)]

using a loop to put the column names, the same operation takes 20 sec.  
I cannot use the former code because I have 100000 files to process  
but the later will simply take months to complete. Is there any  
alternative to the function "get" or any other way in which data.table
con recognize the names of the columns?.

Thanks,
Camilo




Camilo Mora, Ph.D.
Department of Geography, University of Hawaii
Currently available in Colombia
Phone:   Country code: 57
          Provider code: 313
          Phone 776 2282
          From the USA or Canada you have to dial 011 57 313 776 2282
http://www.soc.hawaii.edu/mora/

Steve Lianoglou

2013-Mar-14 01:28 UTC

head link

[R] loop in a data.table

Hi,

On Wed, Mar 13, 2013 at 7:25 PM, Camilo Mora <cmora at dal.ca>
wrote:> Hi everyone,
>
> I have a data.table called "data" with many columns which I want
to group by
> column1 using data.table, given how fast it is.
>
> The problem with looping a data.table is that data.table does not like
> quotations  to define the column names (e.g. "col2" instead of
col2). I
> found a way around which is to use get("col2"), which works fine
but the
> processing time multiples by 20.
>
> So if I use:
>
> data[,sum(col2),by=(key)]
>
> entering the column names by hand, the operation is done in 1 sec. but if
in
> the contrary I use:
>
> data[,sum(get("col2")),by=(key)]
>
> using a loop to put the column names, the same operation takes 20 sec. I
> cannot use the former code because I have 100000 files to process but the
> later will simply take months to complete. Is there any alternative to the
> function "get" or any other way in which data.table con recognize
the names
> of the columns?.
I'm still not sure what you're trying to do. Could you maybe create an
example that's a bit closer to you real data and the stuff you want to
do on it?

Are all the columns of the same type?
Are you just summing columns?

If you post code into an email that reconstructions a small version of
your data.table (maybe 5-10 columns and one or two groups) it'd be
more clear for me.

Thanks,
-steve
-- 
Steve Lianoglou
Defender of The Thesis
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

Camilo Mora

2013-Mar-14 03:27 UTC

head link

[R] loop in a data.table

I would like to clarify my previous email about using data.table.

imagine the following data.frame called "data":

a     b       c      d     e
1     12     15     65     6
1     65     85     36     5
2     69     84     35     8
2     45     78     65     8

I want to aggregate the rows of columns b:d by the rows of column a.  
the aggregation is sum(col[b:d]/sum(col[e]).
For this I am using a data.table with a loop of the form:

##########################################

ColNames<-colnames(data)   #gets the names of the columns

x=ncol(data)-1    #number of columns to process minus the last column.

data<-data.table(data)     #converts to data.table


for (z in 2:x)  #I start the loop in the second column and finish in column d
{
outputdata<-data[, sum(get(ColNames[z]))/sum(e), by="a"]
}
############################################


this works fine but the function "get" slowdown the aggregation of the
rows by about 20 times. I wonder if there is an alternative fucntion  
to "get" or an alternative way to aggregate all columns at once. I am
reading into the function .SD but have not yet figure out how to put  
more than one operation in the function.

right now I have:
###############
outputdata=data[, lapply(.SD, sum), by="a", .SDcols=2:x]

##############
this later code aggregates all columns at once but only by summing.  
eventually I need to divide the sum of each column by the sum of  
column e as well.

ANy help will be greatly appreciate.

Thanks,

Camilo






Camilo Mora, Ph.D.
Department of Geography, University of Hawaii
Currently available in Colombia
Phone:   Country code: 57
          Provider code: 313
          Phone 776 2282
          From the USA or Canada you have to dial 011 57 313 776 2282
http://www.soc.hawaii.edu/mora/

arun

2013-Mar-14 04:44 UTC

head link

[R] loop in a data.table

Hi,

May be this helps:

dat1<- read.table(text="
a??? b????? c????? d??? e
1??? 12??? 15??? 65??? 6
1??? 65??? 85??? 36??? 5
2??? 69??? 84??? 35??? 8
2??? 45??? 78??? 65??? 8
",sep="",header=TRUE)
library(data.table)
?dat2<- data.table(dat1)
?dat2[,head(sapply(.SD,sum)/sapply(.SD,sum)[4],-1),by="a"]
#?? a??????? V1
#1: 1? 7.000000
#2: 1? 9.090909
#3: 1? 9.181818
#4: 2? 7.125000
#5: 2 10.125000
#6: 2? 6.250000


outputdat<-list()
?ColNames<-colnames(dat2)
?x<- ncol(dat2)-1
?ColNames<-colnames(dat2)
?x<- ncol(dat2)-1
?for(z in 2:x)
?{
?outputdat[[z]]<-dat2[,sum(get(ColNames[z]))/sum(e),by="a"]
?}


do.call(rbind,outputdat)
#?? a??????? V1
#1: 1? 7.000000
#2: 2? 7.125000
#3: 1? 9.090909
#4: 2 10.125000
#5: 1? 9.181818
#6: 2? 6.250000
A.K.

----- Original Message -----
From: Camilo Mora <cmora at DAL.CA>
To: r-help at r-project.org
Cc: 
Sent: Wednesday, March 13, 2013 11:27 PM
Subject: [R] loop in a data.table

I would like to clarify my previous email about using data.table.

imagine the following data.frame called "data":

a? ?  b? ? ?  c? ? ? d? ?  e
1? ?  12? ?  15? ?  65? ?  6
1? ?  65? ?  85? ?  36? ?  5
2? ?  69? ?  84? ?  35? ?  8
2? ?  45? ?  78? ?  65? ?  8

I want to aggregate the rows of columns b:d by the rows of column a. the
aggregation is sum(col[b:d]/sum(col[e]).
For this I am using a data.table with a loop of the form:

##########################################

ColNames<-colnames(data)?  #gets the names of the columns

x=ncol(data)-1? ? #number of columns to process minus the last column.

data<-data.table(data)? ?  #converts to data.table


for (z in 2:x)? #I start the loop in the second column and finish in column d
{
outputdata<-data[, sum(get(ColNames[z]))/sum(e), by="a"]
}
############################################


this works fine but the function "get" slowdown the aggregation of the
rows by about 20 times. I wonder if there is an alternative fucntion to
"get" or an alternative way to aggregate all columns at once. I am
reading into the function .SD but have not yet figure out how to put more than
one operation in the function.

right now I have:
###############
outputdata=data[, lapply(.SD, sum), by="a", .SDcols=2:x]

##############
this later code aggregates all columns at once but only by summing. eventually I
need to divide the sum of each column by the sum of column e as well.

ANy help will be greatly appreciate.

Thanks,

Camilo






Camilo Mora, Ph.D.
Department of Geography, University of Hawaii
Currently available in Colombia
Phone:?  Country code: 57
? ? ? ?  Provider code: 313
? ? ? ?  Phone 776 2282
? ? ? ?  From the USA or Canada you have to dial 011 57 313 776 2282
http://www.soc.hawaii.edu/mora/

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Seemingly Similar Threads

Search for more possibly parallel threads

R help - Mar 2013 - loop in a data.table

[R] loop in a data.table

[R] loop in a data.table

[R] loop in a data.table

[R] loop in a data.table

Seemingly Similar Threads