Hi everyone,
I have a data.table called "data" with many columns which I want to
group by column1 using data.table, given how fast it is.
The problem with looping a data.table is that data.table does not like
quotations to define the column names (e.g. "col2" instead of col2).
I found a way around which is to use get("col2"), which works fine but
the processing time multiples by 20.
So if I use:
data[,sum(col2),by=(key)]
entering the column names by hand, the operation is done in 1 sec. but
if in the contrary I use:
data[,sum(get("col2")),by=(key)]
using a loop to put the column names, the same operation takes 20 sec.
I cannot use the former code because I have 100000 files to process
but the later will simply take months to complete. Is there any
alternative to the function "get" or any other way in which data.table
con recognize the names of the columns?.
Thanks,
Camilo
Camilo Mora, Ph.D.
Department of Geography, University of Hawaii
Currently available in Colombia
Phone: Country code: 57
Provider code: 313
Phone 776 2282
From the USA or Canada you have to dial 011 57 313 776 2282
http://www.soc.hawaii.edu/mora/
Hi, On Wed, Mar 13, 2013 at 7:25 PM, Camilo Mora <cmora at dal.ca> wrote:> Hi everyone, > > I have a data.table called "data" with many columns which I want to group by > column1 using data.table, given how fast it is. > > The problem with looping a data.table is that data.table does not like > quotations to define the column names (e.g. "col2" instead of col2). I > found a way around which is to use get("col2"), which works fine but the > processing time multiples by 20. > > So if I use: > > data[,sum(col2),by=(key)] > > entering the column names by hand, the operation is done in 1 sec. but if in > the contrary I use: > > data[,sum(get("col2")),by=(key)] > > using a loop to put the column names, the same operation takes 20 sec. I > cannot use the former code because I have 100000 files to process but the > later will simply take months to complete. Is there any alternative to the > function "get" or any other way in which data.table con recognize the names > of the columns?.I'm still not sure what you're trying to do. Could you maybe create an example that's a bit closer to you real data and the stuff you want to do on it? Are all the columns of the same type? Are you just summing columns? If you post code into an email that reconstructions a small version of your data.table (maybe 5-10 columns and one or two groups) it'd be more clear for me. Thanks, -steve -- Steve Lianoglou Defender of The Thesis | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
I would like to clarify my previous email about using data.table.
imagine the following data.frame called "data":
a b c d e
1 12 15 65 6
1 65 85 36 5
2 69 84 35 8
2 45 78 65 8
I want to aggregate the rows of columns b:d by the rows of column a.
the aggregation is sum(col[b:d]/sum(col[e]).
For this I am using a data.table with a loop of the form:
##########################################
ColNames<-colnames(data) #gets the names of the columns
x=ncol(data)-1 #number of columns to process minus the last column.
data<-data.table(data) #converts to data.table
for (z in 2:x) #I start the loop in the second column and finish in column d
{
outputdata<-data[, sum(get(ColNames[z]))/sum(e), by="a"]
}
############################################
this works fine but the function "get" slowdown the aggregation of the
rows by about 20 times. I wonder if there is an alternative fucntion
to "get" or an alternative way to aggregate all columns at once. I am
reading into the function .SD but have not yet figure out how to put
more than one operation in the function.
right now I have:
###############
outputdata=data[, lapply(.SD, sum), by="a", .SDcols=2:x]
##############
this later code aggregates all columns at once but only by summing.
eventually I need to divide the sum of each column by the sum of
column e as well.
ANy help will be greatly appreciate.
Thanks,
Camilo
Camilo Mora, Ph.D.
Department of Geography, University of Hawaii
Currently available in Colombia
Phone: Country code: 57
Provider code: 313
Phone 776 2282
From the USA or Canada you have to dial 011 57 313 776 2282
http://www.soc.hawaii.edu/mora/
Hi,
May be this helps:
dat1<- read.table(text="
a??? b????? c????? d??? e
1??? 12??? 15??? 65??? 6
1??? 65??? 85??? 36??? 5
2??? 69??? 84??? 35??? 8
2??? 45??? 78??? 65??? 8
",sep="",header=TRUE)
library(data.table)
?dat2<- data.table(dat1)
?dat2[,head(sapply(.SD,sum)/sapply(.SD,sum)[4],-1),by="a"]
#?? a??????? V1
#1: 1? 7.000000
#2: 1? 9.090909
#3: 1? 9.181818
#4: 2? 7.125000
#5: 2 10.125000
#6: 2? 6.250000
outputdat<-list()
?ColNames<-colnames(dat2)
?x<- ncol(dat2)-1
?ColNames<-colnames(dat2)
?x<- ncol(dat2)-1
?for(z in 2:x)
?{
?outputdat[[z]]<-dat2[,sum(get(ColNames[z]))/sum(e),by="a"]
?}
do.call(rbind,outputdat)
#?? a??????? V1
#1: 1? 7.000000
#2: 2? 7.125000
#3: 1? 9.090909
#4: 2 10.125000
#5: 1? 9.181818
#6: 2? 6.250000
A.K.
----- Original Message -----
From: Camilo Mora <cmora at DAL.CA>
To: r-help at r-project.org
Cc:
Sent: Wednesday, March 13, 2013 11:27 PM
Subject: [R] loop in a data.table
I would like to clarify my previous email about using data.table.
imagine the following data.frame called "data":
a? ? b? ? ? c? ? ? d? ? e
1? ? 12? ? 15? ? 65? ? 6
1? ? 65? ? 85? ? 36? ? 5
2? ? 69? ? 84? ? 35? ? 8
2? ? 45? ? 78? ? 65? ? 8
I want to aggregate the rows of columns b:d by the rows of column a. the
aggregation is sum(col[b:d]/sum(col[e]).
For this I am using a data.table with a loop of the form:
##########################################
ColNames<-colnames(data)? #gets the names of the columns
x=ncol(data)-1? ? #number of columns to process minus the last column.
data<-data.table(data)? ? #converts to data.table
for (z in 2:x)? #I start the loop in the second column and finish in column d
{
outputdata<-data[, sum(get(ColNames[z]))/sum(e), by="a"]
}
############################################
this works fine but the function "get" slowdown the aggregation of the
rows by about 20 times. I wonder if there is an alternative fucntion to
"get" or an alternative way to aggregate all columns at once. I am
reading into the function .SD but have not yet figure out how to put more than
one operation in the function.
right now I have:
###############
outputdata=data[, lapply(.SD, sum), by="a", .SDcols=2:x]
##############
this later code aggregates all columns at once but only by summing. eventually I
need to divide the sum of each column by the sum of column e as well.
ANy help will be greatly appreciate.
Thanks,
Camilo
Camilo Mora, Ph.D.
Department of Geography, University of Hawaii
Currently available in Colombia
Phone:? Country code: 57
? ? ? ? Provider code: 313
? ? ? ? Phone 776 2282
? ? ? ? From the USA or Canada you have to dial 011 57 313 776 2282
http://www.soc.hawaii.edu/mora/
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Apparently Analagous Threads
- conditional Dataframe filling
- First value in a row
- replace zeros for NA in a column based on values of another column
- why variations in accuracy between R to ARCGIS for the same point reprojection?
- Error with custom function in apply: Error in FUN(newX[, i], ...) : unused argument(s) (newX[, i])