thr3ads.net - R help - [R] Difficult subset challenge [Dec 2011]

Noah Silverman

2011-Dec-10 21:44 UTC

[R] Difficult subset challenge

Hi,

I'm having difficulty coming up with a good way to subset some data to
generate statistics.

My data frame has multiple observations by group.

Here is an overly-simplified toy example of the data:
=========================
code    v1      v2
G1      1.2     2.3
G1      0       2.4
G1      1.4     3.4
G2      2.9     2.3
G2      4.3     4.4
etc.
=========================
I want to normalize the data *by group* for certain variables. But I want to
ignore 0 values when calculating the mean and standard deviation.

What I *want* to do is something like this:
======================
for (code in unique (d$code) ){
    mu <- mean( d[which(d[d$code==code,v1] !=0 ), v1] )
    sig <- sd( d[which(d[d$code==code,v1] !=0 ), v1] )
    d[which(d[d$code==code,v1] !=0 ), cname] <- (d[which(d[d$code==code,v1] !=0 ), v1] - mu) / sig
}
======================
My goal, if it isn't apparent, is to replace values with their normalized
values. (But the statistics used for normalization are calculated skipping zero
values.)

This doesn't work because the indices returned by which() are relative to the
subset (1, 2, 3, etc.), not to the rows of the full data frame.
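
To make the indexing problem concrete, here is a minimal sketch using the toy data above. The names rel and keep are only for illustration; the point is that which() applied to a subset counts positions within that subset, while a logical mask built over all rows of d keeps the original row positions.

```r
## toy data from the example above
d <- data.frame(code = c("G1", "G1", "G1", "G2", "G2"),
                v1   = c(1.2, 0, 1.4, 2.9, 4.3),
                v2   = c(2.3, 2.4, 3.4, 2.3, 4.4))

## which() on a subset returns positions *within that subset*
rel <- which(d[d$code == "G2", "v1"] != 0)
rel          # 1 2, although the G2 rows are rows 4 and 5 of d

## a logical mask over every row of d keeps the original positions
keep <- d$code == "G2" & d$v1 != 0
which(keep)  # 4 5
```

Indexing d with keep (or which(keep)) therefore selects the intended rows, whereas indexing d with rel would pick up rows 1 and 2, which belong to G1.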


Suggestions?



--
Noah Silverman
UCLA Department of Statistics
8208 Math Sciences Building
Los Angeles, CA 90095

Joshua Wiley

2011-Dec-10 22:21 UTC

[R] Difficult subset challenge

Hi Noah,

I am unclear if the 0s should be standardized or not---I am assuming
since you want them excluded from the calculation of the mean and SD,
you do not want  (0 - M) / sigma.  If that is the case, here is an
example:


## read in your data
## FYI: providing via dput() would be easier next time
d <- read.table(textConnection("
code    v1      v2
G1              1.2     2.3
G1              0       2.4
G1              1.4     3.4
G2              2.9     2.3
G2              4.3     4.4"), header = TRUE)
closeAllConnections()

## temporary data as a matrix
tmp <- as.matrix(d[-1])
## index 0s and set to missing
tmp[index.0 <- which(tmp == 0, arr.ind = TRUE)] <- NA
## scale by column and d$code and pull back to matrix
tmp <- do.call("rbind", by(tmp, d$code, scale))
## NAs back to 0s
tmp[index.0] <- 0
d[, 2:3] <- tmp
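
One thing to keep in mind with the code above: by() followed by rbind() returns the rows grouped by code, so assigning the result back to d relies on d already being sorted by code. As a hedged alternative, the sketch below uses ave(), which keeps the original row order. The helper scale_nonzero and the column name v1_z are made up for this illustration and are not part of the original reply.

```r
## toy data from the thread, deliberately interleaved so that the rows
## are NOT sorted by code
d <- data.frame(code = c("G1", "G2", "G1", "G2", "G1"),
                v1   = c(1.2, 2.9, 0, 4.3, 1.4),
                v2   = c(2.3, 2.3, 2.4, 4.4, 3.4))

## hypothetical helper: standardize the non-zero values of x using their
## own mean and SD; zeros (and NAs) are left untouched
scale_nonzero <- function(x) {
  nz <- !is.na(x) & x != 0
  x[nz] <- (x[nz] - mean(x[nz])) / sd(x[nz])
  x
}

## ave() applies the helper within each level of d$code and returns the
## result in the original row order
d$v1_z <- ave(d$v1, d$code, FUN = scale_nonzero)
```

Note that a group with only one non-zero value will come back as NA for that value, because sd() of a single number is NA.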

If you want the zeros standardized, it will take a bit of a different
approach.  The other issue that could come up here is speed, but that
can be very dataset-dependent (e.g., what is most efficient for a few
levels of code may not be what is most efficient for many columns).
That said, it would not take much work to create a parallelized version
of what by() is doing, and scale() is already vectorized, so it works
pretty darn fast assuming you pass it a matrix.
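
For the case where the zeros should be standardized too, one possible sketch follows. It assumes that "standardized" means using the mean and SD computed from the non-zero values of each group and applying them to every value in that group, zeros included. The helper scale_by_nonzero and the result object z are hypothetical names for this illustration.

```r
## toy data from the thread
d <- data.frame(code = c("G1", "G1", "G1", "G2", "G2"),
                v1   = c(1.2, 0, 1.4, 2.9, 4.3),
                v2   = c(2.3, 2.4, 3.4, 2.3, 4.4))

## hypothetical helper: mean and SD come from the non-zero values only,
## but every value (including the zeros) is transformed
scale_by_nonzero <- function(x) {
  nz <- !is.na(x) & x != 0
  (x - mean(x[nz])) / sd(x[nz])
}

## apply the helper to each numeric column, within each level of d$code
vars <- c("v1", "v2")
z <- d
z[vars] <- lapply(d[vars], function(col) ave(col, d$code, FUN = scale_by_nonzero))
```

With this definition, the zero in G1 ends up far below the other values, since it is measured against a mean and SD that ignore it.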

Cheers,

Josh

-- 
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/
