Friends
I can't quite find a direct answer to this question from the lists, so here
goes:
I have several dataframes, 200+ columns 2000+ rows. I wish to script some
operations to perform on some of the variables (columns) in the data frames not
knowing what the column number is, hence have to refer by name. I have variable
names in a text file "varlist". So, something like this:
for (i in 1:length(varlist)){
j<-varlist[i]
v<-mean(Dataset[[j]])
print(v)
}
Now, if I force it
j<-"Var1"
v<-mean(Dataset[[j]])
print(v)
then it works, but not if I read the varlist as above.
Looking at "j" I get:
> print(j)
    V1
1 Var1
Hence there is a lot of other stuff read into "j" that confuses
"mean". I can't figure out how to just get the value of the
variable and nothing else. I've tried space separated, comma separated, tab
separated lists and all give the same error. I've tried get(), parse()... no
go.
Any suggestions?
Thanks a lot
Jon
Soli Deo Gloria
Jon Erik Ween, MD, MS
Scientist, Kunin-Lunenfeld Applied Research Unit
Director, Stroke Clinic, Brain Health Clinic, Baycrest Centre
Assistant Professor, Dept. of Medicine, Div. of Neurology
University of Toronto Faculty of Medicine
Kimel Family Building, 6th Floor, Room 644
Baycrest Centre
3560 Bathurst Street
Toronto, Ontario M6A 2E1
Canada
Phone: 416-785-2500 x3648
Fax: 416-785-2484
Email: jween@klaru-baycrest.on.ca
On Feb 24, 2010, at 8:18 PM, Jon Erik Ween wrote:
> Friends
>
> I can't quite find a direct answer to this question from the lists,
> so here goes:
>
> I have several dataframes, 200+ columns 2000+ rows. I wish to script
> some operations to perform on some of the variables (columns) in the
> data frames not knowing what the column number is, hence have to
> refer by name. I have variable names in a text file "varlist". So,
> something like this:
>
> for (i in 1:length(varlist)){
> j<-varlist[i]
> v<-mean(Dataset[[j]])
> print(v)
> }

Without a data example this is untested guesswork and may fail if
varlist is not a vector or something that can be coerced to a vector.
Might help to look at str(varlist), ... anyway ...

for (i in varlist){
v<-mean(Dataset[ , i])
print(v)
}

If it doesn't work, then pay close attention to the error messages.

--
David

> Now, if I force it
>
> j<-"Var1"
> v<-mean(Dataset[[j]])
> print(v)
>
> then it works, but not if i read the varlist as above.
>
> Looking at "j" I get:
>
>> print(j)
>     V1
> 1 Var1

Right. You are supplying a list as an index.

> Hence there is a lot of other stuff read into "j" that confuses
> "mean".

I think it is confusing "[[", but it's really more guesswork because
you did not supply a reproducible example.

> I can't figure out how to just get the value of the variable and
> nothing else. I've tried space separated, comma separated, tab
> separated lists and all give the same error. I've tried get(),
> parse()... no go.
>
> Any suggestions?
>
> Thanks a lot
>
> Jon
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
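The printed value of "j" above (a one-row table with a column called V1) is what you would see if the name file had been read with read.table(), which returns a data frame rather than a character vector. That is an assumption, since the original post does not show how "varlist" was read. A minimal sketch under that assumption, using a made-up Dataset and a temporary file in place of the real ones:

```r
# Hypothetical stand-ins for the real data frame and the name file.
Dataset <- data.frame(Var1 = c(1, 2, 3), Var2 = c(10, 20, 30), Var3 = c(0, 0, 0))
tmp <- tempfile(fileext = ".txt")
writeLines("Var1 Var2", tmp)          # space-separated names on one line

# read.table() gives a 1-row data frame with columns V1, V2, ...
varlist_df <- read.table(tmp, stringsAsFactors = FALSE)
str(varlist_df)
print(varlist_df[1])                  # still a data frame, not a string

# Either flatten the data frame, or read the names with scan() directly.
varlist <- unlist(varlist_df, use.names = FALSE)
varlist_scan <- scan(tmp, what = "character", quiet = TRUE)

# Now each element is a plain character string and [[ ]] works.
for (j in varlist) {
  print(mean(Dataset[[j]]))
}
```

Checking str(varlist) before the loop, as suggested above, shows immediately whether you have a character vector or something else.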
Friends
First, thanks to all for great feed-back. Open-source rocks! I have a workable
solution to my question, attached below in case it might be of any use to
anyone. I'm sure there are more elegant ways of doing this, so any further
feedback is welcome!
Things I've learned (for other noobs like me to learn from):
1) dataset[[j]] seems equivalent to dataset$var if j<-var, though quotes can
mess you up, hence j<-noquote(varlist[i]) in the script. (It also makes a
difference that the variables in varlist be stored as a space-separated string;
tab- or line-break-separated lists don't seem to work, though a different method
might handle that.)
2) Loops will abort if they encounter an error (like ROCR encountering a
prediction that is singular). Error handling can be built in, but is a little
tricky. I duplicated the method in a function to test it and advance the loop
on failure. (You can suppress error messages if you like.)
3) Some stats methods don't have NA handling built into them (e.g.
"prediction" in ROCR chokes if there are empty cells in the variables),
hence it seems a good idea to strip these out before starting. The subsetting
with na.omit does this.
4) You reference pieces (slots) of results (S3/S4 objects) by using object@slot.
Hence, you pull out the AUC value in ROCR-"performance" by
perf@y.values in the script. You can see what slots are in an object by simply
listing the object contents at the command line: >object.
Thanks again for all the help!
Jon
...code
################################################################################
## R script for automating stats crunching in large datasets                  ##
## Needs space separated list of variable names matching dataset column names ##
## You have to tinker with the code to customize for your application         ##
##                                                                            ##
## Jon Erik Ween MD, MSc, 26 Feb 2010                                         ##
################################################################################
library(ROCR) # Load stats package to use if not standard

# Function to trap errors. Should be the same as the real procedure below.
# Defined before the loop so that it exists the first time the loop calls it.
aucval<-function(i,j){
  pred<-prediction(incas[[j]],incas$Issue_class)
  # Don't assign real results here; assignments inside a function are local
  # and are not passed back
  perf<-performance(pred,"auc")
}

# Read variable list
varslist<-scan("/Users/jween/Desktop/INCASvars.txt",what="character")
# Initialize results array, one type of stat at a time for now
results<-as.data.frame(array(NA,c(3,length(varslist))))

# Loop through the variables you want to process. Determined by varslist
for (i in seq_along(varslist)){
  j<-noquote(varslist[i])
  vars<-c(varslist[i],"Issue_class") # Variables to be analyzed
  # Have to subset to get rid of NA values causing ROCR to choke
  temp<-na.omit(incas[vars])
  # Record how many cases the analysis is based on.
  # Need to figure out how to calc cases/controls
  n<-nrow(temp)
  #.table<-table(temp$SubjClass) # Maybe for later figure out cases/controls
  results[1,i]<-j # Name particular results column
  results[2,i]<-n # Number of subjects in analysis
  # Error handling in case the procedure fails, so the loop can continue.
  # Suppress annoying error messages
  test<-try(aucval(i,j),silent=TRUE)
  if(inherits(test,"try-error")) next # Run procedure only if OK, otherwise skip
  pred<-prediction(incas[[j]],incas$Issue_class) # Procedure
  perf<-performance(pred,"auc")
  results[3,i]<-as.numeric(perf@y.values) # Enter result into appropriate row
}
# Write results to table
write.table(results,"/Users/jween/Desktop/IncasRres_Issue_class.csv",sep=",",
  col.names=FALSE,row.names=FALSE)
rm(aucval,i,n,temp,vars,results,test,pred,perf,j,varslist) # Clean up
...end
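The "test first, then run again" pattern in the script can probably be collapsed into a single call that returns either the statistic or NA. A minimal sketch, not the script's actual method: safe_stat(), picky_mean() and incas_demo are hypothetical names made up for illustration, and a plain mean stands in for the ROCR calls so the example runs without that package.

```r
# Hypothetical helper: apply stat_fun to one column, return NA if it fails.
safe_stat <- function(data, var, stat_fun) {
  tryCatch(stat_fun(data[[var]]), error = function(e) NA_real_)
}

# Stand-in for a procedure that can fail on some columns.
picky_mean <- function(x) {
  if (!is.numeric(x)) stop("not numeric")
  mean(x, na.rm = TRUE)
}

# Made-up data: one usable column, one that makes picky_mean() fail.
incas_demo <- data.frame(a = c(1, 2, NA),
                         b = c("x", "y", "z"),
                         stringsAsFactors = FALSE)

vars <- c("a", "b")
res <- vapply(vars, function(v) safe_stat(incas_demo, v, picky_mean), numeric(1))
print(res)
```

Because the value comes back from the function, the procedure only has to be written once, and failed variables simply show up as NA in the result vector.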
On 2010-02-24, at 9:19 PM, Dennis Murphy wrote:
> Hi:
>
> The plyr package may come in handy here, as it allows you to create functions
> based on the variables (and their names) in the data frame. Here's a simple,
> cooked-up example that shows a couple of ways to handle this class of problem:
>
> (1) Create three simple data frames with the same set of variables,
> coincidentally in the same order, although that shouldn't really matter
> since we're referencing by name rather than position:
>
> library(plyr)
> a <- data.frame(x = sample(1:50, 10, replace = TRUE),
> y = rpois(10, 30),
> z = rnorm(10, 15, 5))
> b <- data.frame(x = sample(1:50, 10, replace = TRUE),
> y = rpois(10, 30),
> z = rnorm(10, 15, 5))
> d <- data.frame(x = sample(1:50, 10, replace = TRUE),
> y = rpois(10, 30),
> z = rnorm(10, 15, 5))
>
> (2) rbind the three data frames and assign an indicator to differentiate the
> individual data frames:
>
> dd <- rbind(a, b, d)
> dd$df <- rep(letters[c(1, 2, 4)], each = 10)
>
> (3) Use the ddply() function: .(df) refers to the grouping variable, summarise
> indicates that we want to compute a groupwise summary, and the
> remaining code defines the desired summaries (by variable name).
>
> ddply(dd, .(df), summarise, avgx = mean(x), avgz = mean(z))
>   df avgx     avgz
> 1  a 28.3 17.27372
> 2  b 28.0 14.32962
> 3  d 20.3 13.26147
>
> (4) If we create a list of data frames instead, we can accomplish the same
> task by using ldply() [list to data frame as the first two characters] instead.
> Since we have a list as input, there's no need for a group indicator as the
> list components comprise the 'groups'.
>
> > l <- list(a, b, d)
> > ldply(l, summarise, avgx = mean(x), avgz = mean(z))
>   avgx     avgz
> 1 28.3 17.27372
> 2 28.0 14.32962
> 3 20.3 13.26147
>
> These represent two ways that you can produce summaries by variable name
> for multiple data frames. The rbind construct works if all of the data frames
> have the same variables in the same order; if not, the list approach in (4) is
> better. To see this,
>
> e <- data.frame(y = rpois(10, 30), z = rnorm(10, 15, 5),
> x = sample(1:50, 10, replace =TRUE))
> l <- list(a, b, d, e)
> ldply(l, summarise, avgx = mean(x), avgz = mean(z))
>   avgx     avgz
> 1 28.3 17.27372
> 2 28.0 14.32962
> 3 20.3 13.26147
> 4 29.9 13.64617
>
> plyr is not the only package you could use for this. The doBy package with
> function summaryBy() would also work, and you could also use the aggregate()
> function. The advantage of plyr and doBy is that the code is a bit tighter and
> easier to understand.
>
>
>
> On Wed, Feb 24, 2010 at 5:18 PM, Jon Erik Ween <jween@klaru-baycrest.on.ca> wrote:
> Friends
>
> I can't quite find a direct answer to this question from the lists, so here goes:
>
> I have several dataframes, 200+ columns 2000+ rows. I wish to script some
> operations to perform on some of the variables (columns) in the data frames not
> knowing what the column number is, hence have to refer by name. I have variable
> names in a text file "varlist". So, something like this:
>
> for (i in 1:length(varlist)){
> j<-varlist[i]
> v<-mean(Dataset[[j]])
> print(v)
> }
>
> When you think of writing code like this, you should think "apply family". R
> performs vectorized operations, and you'll become more efficient when you start
> thinking about how to vectorize rather than how to loop...
>
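The "apply family" advice can be illustrated in base R without plyr. A minimal sketch, assuming the variable names are already a plain character vector; Dataset, varlist and frames below are made-up stand-ins, not the poster's data:

```r
# Made-up data frame and list of column names to summarise.
Dataset <- data.frame(Var1 = c(1, 2, 3, 4),
                      Var2 = c(10, 20, 30, 40),
                      Var3 = c(5, 5, 5, 5))
varlist <- c("Var1", "Var3")

# Select the columns by name, then apply mean() to each one.
col_means <- sapply(Dataset[varlist], mean)
print(col_means)

# The same idea over several data frames held in a named list.
frames <- list(a = Dataset, b = Dataset * 2)
means_by_frame <- sapply(frames, function(d) sapply(d[varlist], mean))
print(means_by_frame)
```

The result of the second call is a matrix with one row per variable and one column per data frame, which replaces both the explicit loop and the separate results table.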
Friends
Seems I've run into another snag. More of the nitty-gritty r-details I
don't understand.
So, as I mentioned below, dataset[[var_sub]] seems to be understood well by the
functions I previously used and I was able to run my loop successfully with the
[[var_sub]] as a variable-substitution method. However, now I want to do the
same with TukeyHSD, and this function does not play nice with this kind of
syntax. So if I do
fac<-as.factor(dataset$factor)
res<-aov(dataset$var~dataset$factor)
tuk<-TukeyHSD(res,"fac")
things work fine. But if I try (similar to the script below which worked for
ROCR functions):
fac<-as.factor(dataset$factor)
var_sub<-noquote("var")
res<-aov(dataset[[var_sub]]~dataset$factor)
tuk<-TukeyHSD(res,"fac")
TukeyHSD craps out with an error, even though "res" is identical in
both cases, apart from the formula syntax.
So, TukeyHSD seems to be picky about syntax. Is there any other way I can do
variable substitution (so I can read variable names from my list) and get this
loop to work for TukeyHSD?
Thanks
Jon
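One way that may work for the TukeyHSD case is to build the formula from the variable name and pass the data frame through the data argument, so that the model terms carry plain column names that TukeyHSD can match. This is a sketch under assumptions, not a confirmed fix for the error above (the error message was not posted): the data frame, the grouping column "grp" and the names in varlist are all made up for illustration.

```r
set.seed(42)
# Made-up data: one grouping factor and two response variables.
dataset <- data.frame(grp  = factor(rep(c("a", "b", "c"), each = 10)),
                      var1 = rnorm(30),
                      var2 = rnorm(30))
varlist <- c("var1", "var2")

tukey_results <- lapply(varlist, function(v) {
  fml <- reformulate("grp", response = v)   # builds e.g. var1 ~ grp
  fit <- aov(fml, data = dataset)
  TukeyHSD(fit, "grp")                      # term name matches the formula
})
names(tukey_results) <- varlist

print(tukey_results[["var1"]])
```

With a formula such as dataset$var ~ dataset$factor the term is literally named "dataset$factor", so asking TukeyHSD for "fac" has nothing to match; using column names plus data= avoids that.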
> Friends
>
> First, thanks to all for great feed-back. Open-source rocks! I have a workable
> solution to my question, attached below in case it might be of any use to
> anyone. I'm sure there are more elegant ways of doing this, so any further
> feedback is welcome!
>
> Things I've learned (for other noobs like me to learn from):
>
> 1) dataset[[j]] seems equivalent to dataset$var if j<-var, though quotes can
> mess you up, hence j<-noquote(varlist[i]) in the script (it also makes a
> difference that variables in varlist be stored as a space-separated string.
> tab- or line-break-separated lists don't seem to work, though a different
> method might handle that)

dataset[["var"]] is "equivalent" to dataset$var given var does not
contain any special characters. Otherwise j == "var" has to be TRUE.

> 2) Loops will abort if they encounter an error (like ROCR encountering a
> prediction that is singular). Error handling can be built in, but is a little
> tricky. I reduplicated the method with a function to test and advance the loop
> on failure. You can suppress error messages if you like)

Not tricky, just use try().

> 3) Some stats methods don't have NA handling built into them (eg:
> "prediction" in ROCR chokes if there are empty cells in the variables)
> hence it seems a good idea to strip these out before starting. The subsetting
> with na.omit does this

... given you know what you are doing (and omitting).

> 4) You reference pieces (slots) of results (S3/S4 objects) by using object@slot.

The @ operator is defined for slots of *S4* classes.

Best,
Uwe Ligges

> Hence, you pull out the auc value in ROCR-"performance" by
> perf@y.values in the script. you can see what slots are in an object by simply
> listing the object contents at the command line >object.
>
> Thanks again for all the help!
>
> Jon
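The S3/S4 distinction raised in the reply can be checked directly. A minimal sketch, using a made-up S4 class ("PerfDemo", a hypothetical stand-in shaped loosely like a ROCR performance object) next to an ordinary lm fit, which is S3:

```r
library(methods)

# Hypothetical S4 class with two slots, for illustration only.
setClass("PerfDemo", representation(y.name = "character", y.values = "list"))
perf_demo <- new("PerfDemo", y.name = "auc", y.values = list(0.87))

isS4(perf_demo)                  # TRUE: slots are reached with @
slotNames(perf_demo)             # lists the slot names
auc_demo <- as.numeric(perf_demo@y.values)
print(auc_demo)

# An S3 object is a list with a class attribute: use $ or [[ ]], not @.
fit <- lm(dist ~ speed, data = cars)
isS4(fit)                        # FALSE
names(fit)                       # components available via fit$...
print(fit$coefficients)
```

isS4() and slotNames() are a more reliable way to find out what can be extracted than reading the printed object, since print methods often hide slots or components.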