mtb954@gmail.com
2006-Feb-21 00:41 UTC
[R] How to sum values across multiple variables using a wildcard?
I have a dataframe called "data" with 5 records (in rows), each of which has been scored on each of many variables (in columns).

Five of the variables are named var1, var2, var3, var4, var5 using headers. The other variables are named using other conventions.

I can create a new variable called var6 with the value 15 for each record with this code:

    > var6=var1+var2+var3+var4+var5

but this is tedious for my real dataset with dozens of variables. I would rather use a wildcard to add up all the variables that begin with "var", like this pseudocode:

    > var6=sum(var*)

Any suggestions for implementing this in R?

Thanks!
Mark
Gabor Grothendieck
2006-Feb-21 01:01 UTC
[R] How to sum values across multiple variables using a wildcard?
See: ?rowSums
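A minimal sketch of what the rowSums() pointer leads to, assuming a toy data frame whose values are invented here so that every record sums to 15, as in the question. The column called "other" and the use of startsWith() to pick columns by name prefix are illustrative choices, not something from the thread.

```r
# Toy data (hypothetical values): five "var" columns plus one unrelated column
data <- data.frame(var1 = rep(1, 5), var2 = rep(2, 5), var3 = rep(3, 5),
                   var4 = rep(4, 5), var5 = rep(5, 5), other = 101:105)

# Logical index of the columns whose names start with "var"
is_var <- startsWith(names(data), "var")

# rowSums() adds across the selected columns, giving one total per record
data$var6 <- rowSums(data[, is_var, drop = FALSE])
```

Because the selection is computed before var6 is created, the new column is not included in its own total.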
Simon Blomberg
2006-Feb-21 01:09 UTC
[R] How to sum values across multiple variables using a wildcard?
    data <- data.frame(var1=c(1,2,3), var2=c(3,4,5), var3=c(4,5,6), foo = c(100,200,300))

    # sum across the columns with "var" in their name, one total per row
    rowSums(data[, grep("var", names(data))])
     1  2  3
     8 11 14

--
Simon Blomberg, B.Sc.(Hons.), Ph.D, M.App.Stat.
Centre for Resource and Environmental Studies
The Australian National University
Canberra ACT 0200 Australia
T: +61 2 6125 7800 email: Simon.Blomberg_at_anu.edu.au
F: +61 2 6125 0757
CRICOS Provider # 00120C
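One edge case with the grep() indexing above is worth a short sketch: if the pattern happens to match a single column, data[, idx] drops to a plain vector and rowSums() refuses it. Adding drop = FALSE keeps the result a data frame. The pattern "foo" below is only a convenient way to force a one-column match in the same example data.

```r
data <- data.frame(var1 = c(1, 2, 3), var2 = c(3, 4, 5), var3 = c(4, 5, 6),
                   foo = c(100, 200, 300))

# A pattern that matches exactly one column
one_col <- grep("foo", names(data))

# Without drop = FALSE the subset is a numeric vector, and rowSums() fails
res_bad <- try(rowSums(data[, one_col]), silent = TRUE)

# With drop = FALSE the subset stays a one-column data frame
res_ok <- rowSums(data[, one_col, drop = FALSE])
```

Using drop = FALSE routinely costs nothing when several columns match and avoids the failure when only one does.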
Marc Schwartz
2006-Feb-21 01:18 UTC
[R] How to sum values across multiple variables using a wildcard?
Here is one approach using grep(). Given a data frame called MyDF with the following structure:

    > str(MyDF)
    `data.frame':   10 obs. of  20 variables:
     $ other4 : num  -0.869 0.376 -2.022 0.619 -0.129 ...
     $ var8   : num  -0.380 1.428 -1.075 -0.796 -0.588 ...
     $ var4   : num  -0.0850 -0.7335 -0.5019 -1.1633 -0.0197 ...
     $ other9 : num  0.0210 -0.6455 0.0289 1.2405 -1.3359 ...
     $ var10  : num  0.647 -0.798 0.180 1.135 -0.258 ...
     $ other2 : num  0.1332 -0.2227 0.0423 0.6881 2.0304 ...
     $ other10: num  0.811 2.166 0.569 0.302 0.669 ...
     $ var1   : num  -0.774 -1.812 -1.230 -0.969 0.245 ...
     $ var2   : num  -0.0538 0.3712 0.8222 -0.8025 -0.6914 ...
     $ other6 : num  0.871 0.291 2.079 1.098 1.025 ...
     $ other1 : num  -0.5130 0.1358 0.8744 0.0997 1.7458 ...
     $ var9   : num  0.664 -0.456 0.415 2.090 -0.283 ...
     $ other3 : num  -0.425 -0.283 0.706 -1.879 -0.828 ...
     $ other7 : num  0.100 0.177 0.570 -0.631 -1.009 ...
     $ var3   : num  1.446 -0.862 0.184 1.077 0.146 ...
     $ var5   : num  0.402 -0.498 -0.906 0.641 1.690 ...
     $ var6   : num  0.892 -0.242 0.561 0.530 -0.291 ...
     $ other5 : num  -1.210 0.815 -1.284 -0.152 0.329 ...
     $ other8 : num  -0.265 -1.278 1.152 0.232 -1.189 ...
     $ var7   : num  -0.616 -0.994 -0.263 1.626 -1.372 ...

Note that the column names are either var* or other*.

Using grep() we get the indices of the column names in which "other" begins a word and is followed by at least one more character:

    > grep("\\bother.", names(MyDF))
     [1]  1  4  6  7 10 11 13 14 18 19

See ?regexp for more information. Note that I use "\\b" to begin the search at the starting word boundary and then "." to require that there be a following character. Thus, this would not match "other" on its own, nor a name such as "Thisotherone" in which "other" is embedded in a longer word.

You can then use the following to subset the data frame MyDF and sum the rows for the requested columns:

    > rowSums(MyDF[, grep("\\bother.", names(MyDF))])
            1         2         3         4         5         6         7
    -1.344893  1.531417  2.715234  1.616971  1.307379  4.655568  4.638446
            8         9        10
    -2.640485 -2.226270 -2.158248

You could use grep("other", names(MyDF)), but this would also match "other" if it appears anywhere in the name. For example:

    > grep("other", "Thisotherone")
    [1] 1

It just depends upon your naming schema and how strict you need to be in the search.

HTH,
Marc Schwartz
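The same idea applied to the "var" columns from the original question can be sketched as follows. The small data frame, the mixed-case name Var2, the column covar and the result column name total are all made up for illustration. The "^" anchor and ignore.case = TRUE are one possible way to be strict about where the prefix sits while tolerating "var" versus "Var".

```r
# Hypothetical data frame with mixed-case names and one near miss ("covar")
MyDF <- data.frame(var1 = 1:3, Var2 = 4:6, covar = 7:9, other1 = 10:12)

# "^var" requires the name to start with "var"; ignore.case also accepts "Var"
idx <- grep("^var", names(MyDF), ignore.case = TRUE)

# Store the row totals under a name that the pattern itself will not match
MyDF$total <- rowSums(MyDF[, idx, drop = FALSE])
```

Here idx picks up var1 and Var2 but not covar, which a plain grep("var", ...) would also have selected.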
maneesh deshpande
2006-Feb-22 02:44 UTC
[R] Ranking within factor subgroups
Hi,

I have a dataframe, x, of the following form:

    Date      Symbol   A   B   C
    20041201  ABC     10  12  15
    20041201  DEF      9   5   4
    ...
    20050101  ABC      5   3   1
    20050101  GHM     12   4   2
    ....

Here A, B, C are properties of a set of symbols recorded for a given date. I want to decile the symbols for each date and property, and create another set of columns "bucketA", "bucketB", "bucketC" containing the decile rank for each symbol. The following non-vectorized code does what I want:

    bucket <- function(data, nBuckets) {
        q <- quantile(data, seq(0, 1, length.out = nBuckets + 1), na.rm = TRUE)
        q[1] <- q[1] - 0.1  # need to do this to ensure there are no extra NAs
        cut(data, q, include.lowest = TRUE, labels = FALSE)
    }

    calcDeciles <- function(x, colNames) {
        nBuckets <- 10
        dates <- unique(x$Date)
        for (date in dates) {
            iVec <- x$Date == date        # rows belonging to this date
            xx <- x[iVec, ]
            for (colName in colNames) {
                data <- xx[, colName]
                bColName <- paste("bucket", colName, sep = "")
                x[iVec, bColName] <- bucket(data, nBuckets)
            }
        }
        x
    }

    x <- calcDeciles(x, c("A", "B", "C"))

I was wondering if it is possible to vectorize the above function to make it more efficient. I tried

    rlist <- tapply(x$A, x$Date, bucket, nBuckets = 10)

but I am not sure how to assign the contents of "rlist" to their appropriate slots in the original dataframe.

Thanks,
Maneesh
Adaikalavan Ramasamy
2006-Feb-22 03:44 UTC
[R] Ranking within factor subgroups
It might help to give a simple reproducible example in the future. For example

    df <- cbind.data.frame( date=rep( 1:5, each=100 ), A=rpois(500, 100),
                            B=rpois(500, 50), C=rpois(500, 30) )

might generate something like

        date   A  B  C
    1      1  93 51 32
    2      1  95 51 30
    3      1 102 59 28
    4      1 105 52 32
    5      1 105 53 26
    6      1  99 59 37
    ...    . ... .. ..
    495    5 100 57 19
    496    5  96 47 44
    497    5 111 56 35
    498    5 105 49 23
    499    5 105 61 30
    500    5  92 53 32

Here is my proposed solution. Could you check it against your existing functions to see whether the results agree?

    decile.fn <- function(x, nbreaks=10){
      # break points at the deciles of x
      br <- quantile( x, seq(0, 1, length.out=nbreaks+1), na.rm=TRUE )
      br[1] <- -Inf   # so that the minimum falls into the first bucket
      return( cut(x, br, labels=FALSE) )
    }

    # for each column, compute the deciles within each date
    out <- apply( df[ ,c("A", "B", "C")], 2,
                  function(v) unlist( tapply( v, df$date, decile.fn ) ) )

    rownames(out) <- rownames(df)
    out <- cbind(date=df$date, out)

Regards,
Adai
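As a hedged alternative to the tapply()/unlist() approach, base R's ave() applies a function within groups and returns the results in the positions of the original rows, so no reordering step is needed. The sketch below assumes continuous test data (rnorm() rather than the rpois() example above, to avoid tied quantile breaks, which make cut() fail), an arbitrary seed, and a simplified bucket() with a default of 10 buckets. None of this is the posters' own code.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
df <- data.frame(date = rep(1:5, each = 100),
                 A = rnorm(500), B = rnorm(500), C = rnorm(500))
df <- df[sample(nrow(df)), ]  # shuffle rows so dates are not contiguous

# Decile rank of each value within the vector v
bucket <- function(v, nBuckets = 10) {
  q <- quantile(v, seq(0, 1, length.out = nBuckets + 1), na.rm = TRUE)
  cut(v, q, include.lowest = TRUE, labels = FALSE)
}

# ave() splits each column by date, applies bucket(), and puts every
# result back in the row it came from
for (colName in c("A", "B", "C")) {
  df[[paste0("bucket", colName)]] <- ave(df[[colName]], df$date, FUN = bucket)
}
```

With 100 distinct values per date and 10 buckets, each date ends up with 10 rows in every bucket, which is a quick sanity check on the result.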
maneesh deshpande
(reply to Adaikalavan Ramasamy's message of Wed, 22 Feb 2006 03:44:45 +0000)
[R] Ranking within factor subgroups
Hi Adai,

I think your solution only works if the rows of the data frame are ordered by "date", and ordered in the same way that the levels of factor(df$date) are sorted?

It turns out (as I implied in my question) that my data is indeed organized in this manner, so my current problem is solved. In the general case, I suppose, one could always order the data frame by date before proceeding?

Thanks,
Maneesh
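The ordering concern can be seen in a tiny sketch. The four-row data frame is invented for illustration, and rank() stands in for the decile function to keep the numbers easy to follow. tapply() returns one result per date level, so unlist() yields the rows of the first date, then the second, and so on. Writing those values back through order(df$date) is one way to restore the original row positions.

```r
# Hypothetical data with dates deliberately out of order
df <- data.frame(date = c(2, 1, 2, 1), A = c(10, 30, 20, 40))

# Grouped results come back in date order: both date-1 rows, then both date-2 rows
by_group <- unlist(tapply(df$A, df$date, rank), use.names = FALSE)

# order(df$date) lists the original row numbers in that same date order,
# so assigning through it puts each value back on its own row
out <- numeric(nrow(df))
out[order(df$date)] <- by_group
```

In this example by_group is 1, 2, 1, 2, which does not line up with the rows of df, whereas out is 1, 1, 2, 2: within each date the smaller A gets rank 1. This relies on order() keeping tied dates in their original relative order, which matches how tapply() splits the groups.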