Greetings -

While I've used R a fair bit for basic statistical machinations, I've not used it for data manipulation. I've used SAS for 20+ years (and SAS really shines in data handling). So, I've started the process of trying to figure out 'how to do in R what I can do in my sleep in SAS', specifically with respect to data manipulation. So, these are decidedly 'newbie' level questions.

So, starting very simple. I created a tiny example CSV file, which I call test.csv:

  Loc,cost
  A,1
  C,3
  D,2
  F,3
  H,4
  K,3
  M,8

Now, all I want to do is read it in, and derive a new variable which is a Z-transform of 'cost'. Here is what I've tried so far:

  > prices <- read.csv("c:/documents and settings/user/desktop/test.csv",header=TRUE,sep=",",na.strings=".");
  > print(prices$cost);

So far, so good (being able to pull in the data is a good thing).

Now, while I'm sure there are lots of ways to do what I want, I'm going to brute force it, by calculating the column mean and column SD for 'cost', generating the Z-transformed value, and then adding it to the dataframe. However, here is where I'm having problems. After about an hour of searching, I realized I need to use an 'apply' function to apply a function (say, mean) to a column in a dataframe. But I can't seem to get it to work successfully (and this is the gist of the question).

If I try

  > result <- sapply(prices['cost'],MARGIN=2,FUN=mean,na.rm=TRUE);
  > print(result);

it works perfectly.

But if I simply change FUN=mean to FUN=sd, not so successful. If I try

  > result <- sapply(prices['cost'],MARGIN=2,FUN=sd,na.rm=TRUE);
  > print(result);

it throws the following error:

  Error in FUN(X[[1L]], ...) : unused argument(s) (MARGIN = 2)

Further, if I try

  > result <- sapply(prices$cost,MARGIN=2,FUN=mean,na.rm=TRUE);
  > print(result);

it prints 8 values corresponding to the value of each element of the data set, meaning it's treating prices$cost as a row vector. Which makes no sense to me. What do I have to do to use prices$cost as the first argument in the sapply call? If I can't, why not? is.vector(prices$cost) shows TRUE, so why can't I take the mean over the vector?

At any rate, I'll start from here. Being able to apply functions to column(s) of a dataframe seems pretty fundamental, so I'd like to start by understanding the basics.

Thanks in advance.

On May 20, 2010, at 5:42 PM, egc wrote:

> [...]
> If I try
>
>> result <- sapply(prices['cost'],MARGIN=2,FUN=mean,na.rm=TRUE);
>> print(result);

I suspect you are missing the easy way to do this:

  mean( prices['cost'] )

> Works perfectly.
>
> But, if I simply change FUN=mean to FUN=sd, not so successful:
>
> If I try
>
>> result <- sapply(prices['cost'],MARGIN=2,FUN=sd,na.rm=TRUE);
>> print(result);

Try:

  result <- sd(prices['cost'])

R functions often expect to work on vectors without an explicit loop or apply function.

> Throws the following error:
>
> Error in FUN(X[[1L]], ...) : unused argument(s) (MARGIN = 2)
>
> Further, If I try
>
>> result <- sapply(prices$cost,MARGIN=2,FUN=mean,na.rm=TRUE);
>> print(result);
>
> it prints 8 values corresponding to the value of each element of the
> data set - meaning, its treating prices$cost as a row vector. Which
> makes no sense to me. What do I have to do to use prices$cost as the
> first argument in the sapply call?

Not use sapply. "sapply" generally will be used to produce a vector or list as a result. If you only want a scalar, then it's not the right tool.

> [...]
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
West Hartford, CT
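
To make the point about sapply() concrete, here is a minimal sketch rather than a definitive recipe. It assumes the test.csv data is typed in as an inline data frame instead of being read from disk, and the names per_element, col_mean, col_sd, numeric_cols and col_means are made up for illustration.

```r
# Inline copy of the test.csv data from the original post (assumption:
# typed in by hand so the example does not depend on a file path)
prices <- data.frame(
  Loc  = c("A", "C", "D", "F", "H", "K", "M"),
  cost = c(1, 3, 2, 3, 4, 3, 8)
)

# sapply() visits each element of a vector in turn, so mean() only ever
# sees one value at a time and simply hands it back
per_element <- sapply(prices$cost, mean)

# Calling the function on the whole vector gives the single summary value
col_mean <- mean(prices$cost, na.rm = TRUE)
col_sd   <- sd(prices$cost, na.rm = TRUE)

# sapply() is more useful when summarising several columns at once:
# a data frame is a list of columns, so FUN is called once per column
numeric_cols <- prices[sapply(prices, is.numeric)]
col_means    <- sapply(numeric_cols, mean, na.rm = TRUE)

print(per_element)
print(col_mean)
print(col_sd)
print(col_means)
```

With only one numeric column the last call returns a single named value, but the same line would cover every numeric column of a wider data frame.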

On May 20, 2010, at 4:42 PM, egc wrote:

> [...]
> If I try
>
>> result <- sapply(prices['cost'],MARGIN=2,FUN=sd,na.rm=TRUE);
>> print(result);
>
> Throws the following error:
>
> Error in FUN(X[[1L]], ...) : unused argument(s) (MARGIN = 2)
> [...]

First, welcome to R.

Second, you are using the argument 'MARGIN', which is actually used in the apply() function, not in sapply(). Hence the error messages and, arguably, the unpredictable behavior.

One of the key concepts with R, as opposed to SAS, is that in R you take a 'holistic' view of objects, not an element-by-element view. So for many operations, R's functions are 'vectorized', which means that they can operate on an entire object (e.g. a column in a data frame) with a single function call.

So in this case:

  > mean(prices$cost)
  [1] 3.428571

  > sd(prices$cost)
  [1] 2.225395

gets you what you want.

There is also more than one way of accessing the data. For example:

  > mean(prices[, "cost"])
  [1] 3.428571

  > mean(prices[["cost"]])
  [1] 3.428571

and

  > mean(prices["cost"])
      cost
  3.428571

Note that in the last example, the result is 'named'. Each of these has to do with the structure of a data frame, which is covered in the manuals and help files, for example:

  ?Extract

and the 'See Also' links on that page.

There is no need to loop over each element in the column using one of the *apply() functions.

If you have not, I would recommend reading An Introduction to R, which is available via the main R web site in the Manuals section and is also installed with R on your computer.

Additionally, an excellent resource for folks coming from SAS to R is available at:

  http://RforSASandSPSSusers.com/

The authors have provided a terrific review of how one performs common operations in R that you are already comfortable doing in SAS.

HTH,

Marc Schwartz
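
The MARGIN point and the different ways of extracting a column can be sketched as follows. This is an illustration under assumptions: the data frame is typed inline rather than read from test.csv, and the names by_dollar, by_bracket, by_double, still_df, cost_matrix, sd_by_column and sd_error are hypothetical. The sketch pulls the column out as a vector before summarising it because, as far as I know, recent R releases no longer compute mean() or sd() directly on a one-column data frame such as prices["cost"].

```r
# Inline copy of the test.csv data (assumption: typed in by hand)
prices <- data.frame(
  Loc  = c("A", "C", "D", "F", "H", "K", "M"),
  cost = c(1, 3, 2, 3, 4, 3, 8)
)

# Three ways to pull the column out as a plain numeric vector
by_dollar  <- prices$cost
by_bracket <- prices[, "cost"]
by_double  <- prices[["cost"]]

# Single brackets without a comma keep the data frame wrapper
still_df <- prices["cost"]

# MARGIN belongs to apply(), which works on matrices and arrays;
# MARGIN = 2 means "call FUN once per column"
cost_matrix  <- as.matrix(prices["cost"])
sd_by_column <- apply(cost_matrix, MARGIN = 2, FUN = sd)

# sapply() passes unknown arguments on to FUN. mean() has a '...'
# argument and silently absorbs MARGIN, but sd() does not, so the
# same call with FUN = sd fails. The message is captured, not raised.
sd_error <- tryCatch(
  sapply(prices["cost"], MARGIN = 2, FUN = sd),
  error = function(e) conditionMessage(e)
)

print(is.data.frame(still_df))
print(sd_by_column)
print(sd_error)
```

The tryCatch() wrapper is only there so the sketch runs to completion; in an interactive session the failing call would stop with the error quoted in the original post.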

Hi:

To illustrate the idea of vectorization that the previous posters raised, here's a quick example of finding the z-scores that you requested:

  # Define a vectorized function to do the standardization - the argument
  # x below is a vector. We'll keep it simple and ignore the possibility of
  # missing values and other complications...
  std <- function(x) (x - mean(x))/sd(x)

  # Create a new column in the original data frame for the z-scores,
  # where df is the name of your data frame...
  df <- transform(df, zscore = std(df[, 'cost']))
  df
    Loc cost     zscore
  1   A    1 -1.0912993
  2   C    3 -0.1925822
  3   D    2 -0.6419407
  4   F    3 -0.1925822
  5   H    4  0.2567763
  6   K    3 -0.1925822
  7   M    8  2.0542104

transform() is a function used to add one or more columns to an existing data frame, usually by computing them from its existing columns. Since a data frame can be indexed by its rows and columns, the comma before 'cost' signifies that we are to choose the column of df named cost, and all rows. (BTW, indexing is a very powerful feature of R that can be used to great advantage in data processing.)

Also notice how the std() function takes advantage of the vector property of its input argument by computing the mean and standard deviation in-line and mapping the results to each element of the vector through the function definition. It also implicitly applies the 'recycling rule', since mean(x) and sd(x) are scalars that we are mapping to vectors.

I find this more intuitive than the 'SAS way'. It takes three lines to read in the data, define the standardization function, apply it and attach it to the data frame. How many lines of SAS code would this take?

HTH,
Dennis
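
The std() function above deliberately ignores missing values. A possible variant that tolerates them is sketched below. The na.rm argument on std() and the name z_check are assumptions added for illustration, the data is typed inline rather than read from test.csv, and base R's scale() is used only as a cross-check on the hand-written formula.

```r
# Inline copy of the test.csv data (assumption: typed in by hand)
prices <- data.frame(
  Loc  = c("A", "C", "D", "F", "H", "K", "M"),
  cost = c(1, 3, 2, 3, 4, 3, 8)
)

# Variant of std() with a hypothetical na.rm switch. mean() and sd()
# each return a single number, which R recycles across every element
# of x in the subtraction and the division.
std <- function(x, na.rm = TRUE) {
  (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
}

# transform() evaluates 'cost' inside the data frame, so the column
# can be named directly
prices <- transform(prices, zscore = std(cost))

# Cross-check against base R's scale(), which centres on the mean and
# divides by the standard deviation by default
z_check <- as.numeric(scale(prices$cost))

print(prices)
print(all.equal(prices$zscore, z_check))
```

With na.rm = TRUE, any NA in cost stays NA in zscore while the remaining rows are standardized using the non-missing values.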