thr3ads.net - R help - [R] How to sum values across multiple variables using a wildcard? [Feb 2006]

mtb954@gmail.com

2006-Feb-21 00:41 UTC

[R] How to sum values across multiple variables using a wildcard?

I have a dataframe called "data" with 5 records (in rows) each of
which has been scored on each of many variables (in columns).

Five of the variables are named var1, var2, var3, var4, var5 using
headers. The other variables are named using other conventions.

I can create a new variable called var6 with the value 15 for each
record with this code:
> var6=var1+var2+var3+var4+var5
but this is tedious for my real dataset with dozens of variables. I
would rather use a wildcard to add up all the variables that begin
with "var", like this pseudocode:
> var6=sum(var*)
Any suggestions for implementing this in R? Thanks! Mark

Gabor Grothendieck

2006-Feb-21 01:01 UTC


See:

?rowSums




Simon Blomberg

2006-Feb-21 01:09 UTC


data <- data.frame(var1=c(1,2,3), var2=c(3,4,5), var3=c(4,5,6),
                   foo=c(100,200,300))
# for each row, sum the columns that have "var" in their name
rowSums(data[, grep("var", names(data))])

 1  2  3
 8 11 14
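
A variation on the above, as a minimal sketch: if the pattern should match only names that start with "var" (as in the original question), the regular expression can be anchored, and the row sums can be stored as a new column. The data frame and the column name "total" below are made up for illustration.

```r
# Hypothetical data: five var* columns plus one unrelated column
data <- data.frame(var1 = 1:3, var2 = 2:4, var3 = 3:5,
                   var4 = 4:6, var5 = 5:7, covar = c(10, 20, 30))

# "^var" requires the name to start with "var";
# "[0-9]+$" requires digits from there to the end of the name
cols <- grep("^var[0-9]+$", names(data))

# drop = FALSE keeps a data frame even if only one column matches
data$total <- rowSums(data[, cols, drop = FALSE])

data$total   # row totals are 15, 20 and 25; "covar" is not included
```

Storing the result under a name that does not match the pattern (here "total") means that rerunning the code will not add the total into itself.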




-- 
Simon Blomberg, B.Sc.(Hons.), Ph.D, M.App.Stat.
Centre for Resource and Environmental Studies
The Australian National University
Canberra ACT 0200
Australia
T: +61 2 6125 7800 email: Simon.Blomberg_at_anu.edu.au
F: +61 2 6125 0757
CRICOS Provider # 00120C

Marc Schwartz

2006-Feb-21 01:18 UTC

Here is one approach using grep().

Given a data frame called MyDF with the following structure:
> str(MyDF)
`data.frame':	10 obs. of  20 variables:
 $ other4 : num  -0.869  0.376 -2.022  0.619 -0.129 ...
 $ var8   : num  -0.380  1.428 -1.075 -0.796 -0.588 ...
 $ var4   : num  -0.0850 -0.7335 -0.5019 -1.1633 -0.0197 ...
 $ other9 : num   0.0210 -0.6455  0.0289  1.2405 -1.3359 ...
 $ var10  : num   0.647 -0.798  0.180  1.135 -0.258 ...
 $ other2 : num   0.1332 -0.2227  0.0423  0.6881  2.0304 ...
 $ other10: num  0.811 2.166 0.569 0.302 0.669 ...
 $ var1   : num  -0.774 -1.812 -1.230 -0.969  0.245 ...
 $ var2   : num  -0.0538  0.3712  0.8222 -0.8025 -0.6914 ...
 $ other6 : num  0.871 0.291 2.079 1.098 1.025 ...
 $ other1 : num  -0.5130  0.1358  0.8744  0.0997  1.7458 ...
 $ var9   : num   0.664 -0.456  0.415  2.090 -0.283 ...
 $ other3 : num  -0.425 -0.283  0.706 -1.879 -0.828 ...
 $ other7 : num   0.100  0.177  0.570 -0.631 -1.009 ...
 $ var3   : num   1.446 -0.862  0.184  1.077  0.146 ...
 $ var5   : num   0.402 -0.498 -0.906  0.641  1.690 ...
 $ var6   : num   0.892 -0.242  0.561  0.530 -0.291 ...
 $ other5 : num  -1.210  0.815 -1.284 -0.152  0.329 ...
 $ other8 : num  -0.265 -1.278  1.152  0.232 -1.189 ...
 $ var7   : num  -0.616 -0.994 -0.263  1.626 -1.372 ...


Note that the column names are either var* or other*.

Using grep() we get the indices of the column names that contain
"other" plus at least one following character, where the "other"
begins the word:
> grep("\\bother.", names(MyDF))
 [1]  1  4  6  7 10 11 13 14 18 19

See ?regexp for more information. Note that I use "\\b" to begin the
search at the starting word boundary and then "." to require that
there be a following character. Thus, this would not match "another1"
(no word boundary before "other") or a bare "other" (nothing after it).

You can then use the following to subset the data frame MyDF and sum the
rows for the requested columns:
> rowSums(MyDF[, grep("\\bother.", names(MyDF))])
        1         2         3         4         5         6         7 
-1.344893  1.531417  2.715234  1.616971  1.307379  4.655568  4.638446 
        8         9        10 
-2.640485 -2.226270 -2.158248 


You could use grep("other", names(MyDF)), but this would also get
"other" if it appears anywhere in the name. For example:
> grep("other", "Thisotherone")
[1] 1

It just depends upon your naming schema and how strict you need to be in
the search.

HTH,

Marc Schwartz
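
To make the difference between the patterns concrete, here is a small sketch that runs three variants against a made-up vector of names. The names are hypothetical and chosen only to show where the patterns disagree; "^other" is an additional variant, not one used in the reply above.

```r
# Hypothetical column names chosen to show where the patterns differ
nms <- c("other1", "other", "another1", "Thisotherone", "var1")

anywhere  <- grep("other", nms)       # "other" anywhere in the name
wordstart <- grep("\\bother.", nms)   # "other" at a word start, then one more character
prefix    <- grep("^other", nms)      # name starts with "other"

anywhere    # 1 2 3 4
wordstart   # 1
prefix      # 1 2
```

Which pattern is appropriate depends on the naming scheme: "^other" is the simplest choice when every wanted column begins with the same prefix.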

maneesh deshpande

2006-Feb-22 02:44 UTC


[R] Ranking within factor subgroups

Hi,

I have a dataframe, x of the following form:

Date       Symbol   A   B   C
20041201   ABC     10  12  15
20041201   DEF      9   5   4
...
20050101   ABC      5   3   1
20050101   GHM     12   4   2
...

Here A, B, and C are properties of a set of symbols recorded for a given date.
I want to decile the symbols for each date and property and
create another set of columns "bucketA", "bucketB", "bucketC" containing the
decile rank for each symbol. The following non-vectorized code does what I want:

bucket <- function(data, nBuckets) {
    # break points at the quantiles that split data into nBuckets groups
    q <- quantile(data, seq(0, 1, length.out=nBuckets+1), na.rm=TRUE)
    q[1] <- q[1] - 0.1 # need to do this to ensure there are no extra NAs
    cut(data, q, include.lowest=TRUE, labels=FALSE)
}

calcDeciles <- function(x, colNames) {
    nBuckets <- 10
    dates <- unique(x$Date)
    for (date in dates) {
        iVec <- x$Date == date        # rows belonging to this date
        xx <- x[iVec,]
        for (colName in colNames) {
            data <- xx[,colName]
            bColName <- paste("bucket", colName, sep="")
            # write the decile ranks back into the matching rows
            x[iVec,bColName] <- bucket(data, nBuckets)
        }
    }
    x
}

x <- calcDeciles(x, c("A","B","C"))


I was wondering if it is possible to vectorize the above function to make it
more efficient. I tried,
rlist <- tapply(x$A, x$Date, bucket, nBuckets=10)
but I am not sure how to assign the contents of "rlist" to their appropriate
slots in the original dataframe.

Thanks,

Maneesh

Adaikalavan Ramasamy

2006-Feb-22 03:44 UTC


It might help to give a simple reproducible example in the future. For
example

 df <- cbind.data.frame( date=rep( 1:5, each=100 ), A=rpois(500, 100),
                         B=rpois(500, 50), C=rpois(500, 30) )

might generate something like

	    date   A  B  C
	  1    1  93 51 32
	  2    1  95 51 30
	  3    1 102 59 28
	  4    1 105 52 32
	  5    1 105 53 26
	  6    1  99 59 37
	...    . ... .. ..
	495    5 100 57 19
	496    5  96 47 44
	497    5 111 56 35
	498    5 105 49 23
	499    5 105 61 30
	500    5  92 53 32

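Here is my proposed solution. Can you double-check it against your existing
functions to see whether the results agree?

   decile.fn <- function(x, nbreaks=10){
     # break points at the 0%, 10%, ..., 100% quantiles of x
     br     <- quantile( x, seq(0, 1, length.out=nbreaks+1), na.rm=TRUE )
     br[1]  <- -Inf   # so that the minimum falls into the first interval
     return( cut(x, br, labels=FALSE) )
   }

   # for each column, compute deciles within each date; unlist() returns
   # them grouped by date, so this assumes df is sorted by date
   out <- apply( df[ ,c("A", "B", "C")], 2,
                 function(v) unlist( tapply( v, df$date, decile.fn ) ) )

   rownames(out) <- rownames(df)
   out <- cbind(date=df$date, out)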

Regards, Adai




maneesh deshpande

2006-Feb-23 03:45 UTC


Hi Adai,

I think your solution only works if the rows of the data frame are ordered
by "date" and the ordering is the same as that used to order the levels of
factor(df$date)?
It turns out (as I implied in my question) my data is indeed organized in
this manner, so my current problem is solved.
In the general case, I suppose, one could always order the data frame by
date before proceeding?

Thanks,

Maneesh
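
On the ordering question, one possible alternative that avoids sorting is base R's ave(), which applies a function within each group and returns the results in the original row positions. The following is only a sketch under assumptions: the data are simulated, and the bucket() helper here is a simplified variant (it has a default of 10 buckets and relies on include.lowest instead of shifting the first break), not the code posted in this thread.

```r
# Simplified decile helper (an assumption, not the thread's original):
# rank values into n buckets, 1 = lowest
bucket <- function(v, n = 10) {
  q <- quantile(v, seq(0, 1, length.out = n + 1), na.rm = TRUE)
  cut(v, q, include.lowest = TRUE, labels = FALSE)
}

set.seed(1)
# Hypothetical data, deliberately NOT sorted by Date
x <- data.frame(Date = sample(rep(c(20041201, 20050101), each = 50)),
                A = rnorm(100), B = rnorm(100))

for (col in c("A", "B")) {
  # ave() runs bucket() within each Date group and puts each result
  # back in the row it came from, so no sorting is needed
  x[[paste0("bucket", col)]] <- ave(x[[col]], x$Date, FUN = bucket)
}

head(x)
```

Note that cut() stops with an error when the quantile breaks are not unique, which can happen with heavily tied data, so real data may need extra handling.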

