Hi everyone, I have a data.table called "data" with many columns which I want to group by column1 using data.table, given how fast it is. The problem with looping a data.table is that data.table does not like quotations to define the column names (e.g. "col2" instead of col2). I found a way around which is to use get("col2"), which works fine but the processing time multiples by 20. So if I use: data[,sum(col2),by=(key)] entering the column names by hand, the operation is done in 1 sec. but if in the contrary I use: data[,sum(get("col2")),by=(key)] using a loop to put the column names, the same operation takes 20 sec. I cannot use the former code because I have 100000 files to process but the later will simply take months to complete. Is there any alternative to the function "get" or any other way in which data.table con recognize the names of the columns?. Thanks, Camilo Camilo Mora, Ph.D. Department of Geography, University of Hawaii Currently available in Colombia Phone: Country code: 57 Provider code: 313 Phone 776 2282 From the USA or Canada you have to dial 011 57 313 776 2282 http://www.soc.hawaii.edu/mora/
Hi, On Wed, Mar 13, 2013 at 7:25 PM, Camilo Mora <cmora at dal.ca> wrote:> Hi everyone, > > I have a data.table called "data" with many columns which I want to group by > column1 using data.table, given how fast it is. > > The problem with looping a data.table is that data.table does not like > quotations to define the column names (e.g. "col2" instead of col2). I > found a way around which is to use get("col2"), which works fine but the > processing time multiples by 20. > > So if I use: > > data[,sum(col2),by=(key)] > > entering the column names by hand, the operation is done in 1 sec. but if in > the contrary I use: > > data[,sum(get("col2")),by=(key)] > > using a loop to put the column names, the same operation takes 20 sec. I > cannot use the former code because I have 100000 files to process but the > later will simply take months to complete. Is there any alternative to the > function "get" or any other way in which data.table con recognize the names > of the columns?.I'm still not sure what you're trying to do. Could you maybe create an example that's a bit closer to you real data and the stuff you want to do on it? Are all the columns of the same type? Are you just summing columns? If you post code into an email that reconstructions a small version of your data.table (maybe 5-10 columns and one or two groups) it'd be more clear for me. Thanks, -steve -- Steve Lianoglou Defender of The Thesis | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
I would like to clarify my previous email about using data.table. imagine the following data.frame called "data": a b c d e 1 12 15 65 6 1 65 85 36 5 2 69 84 35 8 2 45 78 65 8 I want to aggregate the rows of columns b:d by the rows of column a. the aggregation is sum(col[b:d]/sum(col[e]). For this I am using a data.table with a loop of the form: ########################################## ColNames<-colnames(data) #gets the names of the columns x=ncol(data)-1 #number of columns to process minus the last column. data<-data.table(data) #converts to data.table for (z in 2:x) #I start the loop in the second column and finish in column d { outputdata<-data[, sum(get(ColNames[z]))/sum(e), by="a"] } ############################################ this works fine but the function "get" slowdown the aggregation of the rows by about 20 times. I wonder if there is an alternative fucntion to "get" or an alternative way to aggregate all columns at once. I am reading into the function .SD but have not yet figure out how to put more than one operation in the function. right now I have: ############### outputdata=data[, lapply(.SD, sum), by="a", .SDcols=2:x] ############## this later code aggregates all columns at once but only by summing. eventually I need to divide the sum of each column by the sum of column e as well. ANy help will be greatly appreciate. Thanks, Camilo Camilo Mora, Ph.D. Department of Geography, University of Hawaii Currently available in Colombia Phone: Country code: 57 Provider code: 313 Phone 776 2282 From the USA or Canada you have to dial 011 57 313 776 2282 http://www.soc.hawaii.edu/mora/
Hi, May be this helps: dat1<- read.table(text=" a??? b????? c????? d??? e 1??? 12??? 15??? 65??? 6 1??? 65??? 85??? 36??? 5 2??? 69??? 84??? 35??? 8 2??? 45??? 78??? 65??? 8 ",sep="",header=TRUE) library(data.table) ?dat2<- data.table(dat1) ?dat2[,head(sapply(.SD,sum)/sapply(.SD,sum)[4],-1),by="a"] #?? a??????? V1 #1: 1? 7.000000 #2: 1? 9.090909 #3: 1? 9.181818 #4: 2? 7.125000 #5: 2 10.125000 #6: 2? 6.250000 outputdat<-list() ?ColNames<-colnames(dat2) ?x<- ncol(dat2)-1 ?ColNames<-colnames(dat2) ?x<- ncol(dat2)-1 ?for(z in 2:x) ?{ ?outputdat[[z]]<-dat2[,sum(get(ColNames[z]))/sum(e),by="a"] ?} do.call(rbind,outputdat) #?? a??????? V1 #1: 1? 7.000000 #2: 2? 7.125000 #3: 1? 9.090909 #4: 2 10.125000 #5: 1? 9.181818 #6: 2? 6.250000 A.K. ----- Original Message ----- From: Camilo Mora <cmora at DAL.CA> To: r-help at r-project.org Cc: Sent: Wednesday, March 13, 2013 11:27 PM Subject: [R] loop in a data.table I would like to clarify my previous email about using data.table. imagine the following data.frame called "data": a? ? b? ? ? c? ? ? d? ? e 1? ? 12? ? 15? ? 65? ? 6 1? ? 65? ? 85? ? 36? ? 5 2? ? 69? ? 84? ? 35? ? 8 2? ? 45? ? 78? ? 65? ? 8 I want to aggregate the rows of columns b:d by the rows of column a. the aggregation is sum(col[b:d]/sum(col[e]). For this I am using a data.table with a loop of the form: ########################################## ColNames<-colnames(data)? #gets the names of the columns x=ncol(data)-1? ? #number of columns to process minus the last column. data<-data.table(data)? ? #converts to data.table for (z in 2:x)? #I start the loop in the second column and finish in column d { outputdata<-data[, sum(get(ColNames[z]))/sum(e), by="a"] } ############################################ this works fine but the function "get" slowdown the aggregation of the rows by about 20 times. I wonder if there is an alternative fucntion to "get" or an alternative way to aggregate all columns at once. I am reading into the function .SD but have not yet figure out how to put more than one operation in the function. right now I have: ############### outputdata=data[, lapply(.SD, sum), by="a", .SDcols=2:x] ############## this later code aggregates all columns at once but only by summing. eventually I need to divide the sum of each column by the sum of column e as well. ANy help will be greatly appreciate. Thanks, Camilo Camilo Mora, Ph.D. Department of Geography, University of Hawaii Currently available in Colombia Phone:? Country code: 57 ? ? ? ? Provider code: 313 ? ? ? ? Phone 776 2282 ? ? ? ? From the USA or Canada you have to dial 011 57 313 776 2282 http://www.soc.hawaii.edu/mora/ ______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Possibly Parallel Threads
- conditional Dataframe filling
- First value in a row
- replace zeros for NA in a column based on values of another column
- why variations in accuracy between R to ARCGIS for the same point reprojection?
- Error with custom function in apply: Error in FUN(newX[, i], ...) : unused argument(s) (newX[, i])