thr3ads.net - R help - [R] Using split() several times in a row? [Mar 2007]

Sergey Goriatchev

2007-Mar-30 15:18 UTC

[R] Using split() several times in a row?

Hi, fellow R users.

I have a question about sapply and split combination.

I have a big dataframe (40000 observations, 21 variables). The first
variable (a factor) is "date", in the format "8.29.97"; the data are
monthly. The second variable (also a factor) has levels 1 to 6
(fractiles 1 to 5, plus code 6 for missing values). The other 19
variables are numeric.
For each month I have several hundred observations of the 19 numeric
variables and the 1 factor.

I am normalizing the numeric variables by dividing val1 by val2, where:

val1: (for each month, for each numeric variable) difference between
mean of ith numeric variable in fractile 1, and mean of ith numeric
variable in fractile 5.

val2: (for each month, for each numeric variable) standard deviation
for ith numeric variable.

Basically, as far as I understand, I need to use split() function several times.
To calculate val1 I need to use split() twice - first to split by
month and then split by fractile. Is this even possible to do (since
after first application of split() I get a list)??

Is there a smart way to perform this normalization computation?

My knowledge of R is not so advanced, but I need to know an efficient
way to perform calculations of this kind.

Would really appreciate some help from experienced R users!

Regards,
S

-- 
Laziness is nothing more than the habit of resting before you get tired.
- Jules Renard (writer)

Experience is one thing you can't get for nothing.
- Oscar Wilde (writer)

When you are finished changing, you're finished.
- Benjamin Franklin (Diplomat)

Stephen Tucker

2007-Mar-31 01:41 UTC


[R] Using split() several times in a row?

Hi Sergey,

I believe the code below should get you close to what you want.

For dates, I usually store them as "POSIXct" classes in data frames, but
according to Gabor Grothendieck and Thomas Petzoldt's R Help Desk article
<http://cran.r-project.org/doc/Rnews/Rnews_2004-1.pdf>, I should probably be
using "chron" dates and times...

Nonetheless, POSIXct classes are what I know, so I can show you how to get the
month out of your column (replace "8.29.97" with your variable). You can do
the following:

month = format(strptime("8.29.97",format="%m.%d.%y"),format="%m")

Or,
month = as.data.frame(strsplit("8.29.97","\\."))[1,]
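
Applied to a whole column rather than a single string, the same idea might look like the sketch below. The `dates` vector is made-up sample data standing in for the real "date" column, and the year-month key `ym` is an extra suggestion, not something from the original post: if the data cover more than one year, grouping on the month number alone would pool, say, August 1997 with August 1998.

```r
# Hypothetical sample of date strings in the "m.d.yy" format from the question
dates <- c("8.29.97", "9.30.97", "10.31.97")

# Option 1: parse as dates, then format the month as a two-digit string
month1 <- format(as.Date(dates, format = "%m.%d.%y"), "%m")

# Option 2: keep only the text before the first dot (no zero padding)
month2 <- sub("\\..*$", "", dates)

# A year-month key keeps the same calendar month of different years apart
ym <- format(as.Date(dates, format = "%m.%d.%y"), "%Y-%m")
```

If the column is a factor, as in the question, wrapping it in `as.character()` before parsing avoids surprises.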

In any case, here is some code in which I follow a series of function
definitions and applications (which effectively includes successive
application of split() and lapply()).

Best regards,

ST

# define data (I just made this up)
df <- data.frame(month=as.character(rep(1:3,each=30)),
                 fac=factor(rep(1:2,each=15)),
                 data1=round(runif(90),2),
                 data2=round(runif(90),2))

# define functions to split the data and another
# to get statistics
doSplits <- function(df) {
  unlist(lapply(split(df,df$month),function(x) split(x,x$fac)),
         recursive=FALSE)
}
getStats <- function(x,f) {
  return(as.data.frame(lapply(x[unlist(lapply(x,mode))=="numeric" &
                                unlist(lapply(x,class))!="factor"],f)))
}
# create a matrix of data, means, and standard deviations
listMatrix <- cbind(Data=doSplits(df),
           Means=lapply(doSplits(df),getStats,mean),
           SDs=lapply(doSplits(df),getStats,sd))

# function to subtract means and divide by standard deviations
transformData <- function(x) {
  newdata <- x$Data
  matchedNames <- match(names(x$Means),names(x$Data))
  newdata[matchedNames] <-
    sweep(sweep(data.matrix(x$Data[matchedNames]),2,unlist(x$Means),"-"),
          2,unlist(x$SDs),"/")
  return(newdata)
}
# apply to data
newDF <- lapply(as.data.frame(t(listMatrix)),transformData)

# Define Fold function (return x explicitly: in current versions of R
# a for loop itself evaluates to NULL)
Fold <- function(f, x, L) { for(e in L) x <- f(x, e); x }
# Apply this to the data
finalData <- Fold(rbind,vector(),newDF)
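
The code above standardizes each month-by-factor group (it subtracts the group mean and divides by the group standard deviation). The quantity described in the original question is slightly different: per month and per numeric variable, the difference between the fractile 1 mean and the fractile 5 mean, divided by a standard deviation. A minimal sketch of that calculation follows. The data frame `dat`, its column names, and the helper `spread_over_sd` are all made up for illustration, and val2 is assumed to be the standard deviation over all of a month's observations (the question does not say whether fractile 6 should be excluded).

```r
set.seed(1)  # made-up data, seeded only so the sketch is reproducible
dat <- data.frame(month    = rep(c("1997-08", "1997-09"), each = 60),
                  fractile = factor(rep(rep(1:6, each = 10), times = 2)),
                  v1       = runif(120),
                  v2       = rnorm(120))
num_cols <- c("v1", "v2")  # names of the numeric columns to normalize

# split() accepts a list of factors, so both splits can be done in one call;
# the result is a flat list with names such as "1997-08.1"
groups <- split(dat, list(dat$month, dat$fractile))

# For one month's rows and each numeric column:
# (mean in fractile 1 - mean in fractile 5) / sd over the whole month
spread_over_sd <- function(d) {
  sapply(d[num_cols], function(v) {
    val1 <- mean(v[d$fractile == "1"], na.rm = TRUE) -
            mean(v[d$fractile == "5"], na.rm = TRUE)
    val2 <- sd(v, na.rm = TRUE)  # assumption: sd over all rows of the month
    val1 / val2
  })
}

# A single split() by month is enough here, because the fractile
# subsetting happens inside spread_over_sd()
result <- t(sapply(split(dat, dat$month), spread_over_sd))
```

Here `result` is a matrix with one row per month and one column per numeric variable. With 19 numeric columns, `num_cols` could be built from `sapply(dat, is.numeric)` instead of being typed out.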






