thr3ads.net - R help - [R] Combining 2 columns into 1 column many times in a very large dataset [Feb 2010]


Sherri Rose

2010-Feb-28 00:07 UTC

[R] Combining 2 columns into 1 column many times in a very large dataset

*Combining  2 columns into 1 column many times in a very large dataset*

The solutions I am working on are clumsy and, even if I can get them to
work, will not be fast. The true dataset is ~1500 x 45000, so they need to
be efficient. I've searched the R help files and the archives for this list
and have some possibly workable solutions for 2) and 3), but not for my
question 1). However, I include 2) and 3) in case anyone has
recommendations that would be efficient.

Here is a toy example of the data structure:
n = 10
pop = data.frame(status = rbinom(n, 1, .42), sex = rbinom(n, 1, .5),
age = round(rnorm(n, mean=40, sd=10)), disType = rbinom(n, 1, .2),
rs123=c(1,3,1,3,3,1,1,1,3,1), rs123.1=rep(1, n),
rs157=c(2,4,2,2,2,4,4,4,2,2),
rs157.1=c(4,4,4,2,4,4,4,4,2,2),  rs132=c(4,4,4,4,4,4,4,4,2,2),
rs132.1=c(4,4,4,4,4,4,4,4,4,4))

Thus, there are a few columns of basic demographic info and then the rest of
the columns are biallelic SNP info.  Ex: rs123 is allele 1 of rs123 and
rs123.1 is the second allele of rs123.

1) I need to merge all the biallelic SNP data that is currently in 2 columns
into 1 column, so, for example: rs123 and rs123.1 into one column (but
within the dataset):
11
31
11
31
31
11
11
11
31
11
2) I need to identify the least frequent SNP value (in the above example it
is 31).
3) I need to replace the least frequent SNP value with 1 and the other(s)
with 0.

Thank you for any assistance,
-S.R.


Phil Spector

2010-Feb-28 02:13 UTC

[R] Combining 2 columns into 1 column many times in a very large dataset

Sherri -
    Here's one way:

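nms = c('rs123','rs157','rs132')
# For one SNP: paste the two allele columns together, find the least
# frequent combined value, and code it as 1 (everything else as 0)
lowf = function(one,two){
      both = paste(pop[[one]],pop[[two]],sep='')
      tt = table(both)
      # which.min returns only the first minimum, so ties are not handled
      lowfreq = names(tt)[which.min(tt)]
      ifelse(both == lowfreq,1,0)
}
# Apply to each (first allele, second allele) pair of columns
res = mapply(lowf,nms,paste(nms,'.1',sep=''),SIMPLIFY=FALSE)
names(res) = paste(names(res),'_new',sep='')
pop = data.frame(pop,res)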
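
With ~45000 columns, nms would presumably not be typed by hand. A minimal
sketch of deriving it from the column names, assuming every second-allele
column ends in '.1' and its partner has the same name without that suffix
(the small data frame here is made up for illustration):

```r
# Made-up toy data following the naming convention of the example
pop <- data.frame(
  age     = c(40, 35, 52, 47),
  rs123   = c(1, 3, 1, 1),
  rs123.1 = c(1, 1, 1, 1),
  rs157   = c(2, 4, 2, 2),
  rs157.1 = c(4, 4, 4, 2)
)

# Assumption: second-allele columns are exactly those ending in ".1"
second <- grep("\\.1$", names(pop), value = TRUE)
nms    <- sub("\\.1$", "", second)

# Keep only pairs whose first-allele column is actually present
keep   <- nms %in% names(pop)
nms    <- nms[keep]
second <- second[keep]
```

If a demographic column also happened to end in '.1' and have an unsuffixed
partner, it would be picked up as well, so the pattern may need tightening
(for example to names starting with 'rs').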

It doesn't deal with the case of ties with regard to the frequency
of the SNP value, but it should be easy to modify if that's 
an issue.
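
One possible modification for ties, as a sketch only: flag every combined
value that shares the minimum count, rather than just the first one. The
name lowf_ties, the data argument, and the rs999 columns are hypothetical
and not part of the original data.

```r
# rs123 is from the toy example; rs999 is a made-up SNP whose two rarest
# combined values ("31" and "33") are tied at 2 occurrences each
pop <- data.frame(
  rs123   = c(1, 3, 1, 3, 3, 1, 1, 1, 3, 1),
  rs123.1 = rep(1, 10),
  rs999   = c(1, 3, 1, 3, 3, 1, 1, 1, 3, 1),
  rs999.1 = c(1, 1, 1, 3, 3, 1, 1, 1, 1, 1)
)

lowf_ties <- function(one, two, data) {
  both <- paste(data[[one]], data[[two]], sep = "")
  tt <- table(both)
  # all combined values whose count equals the minimum, not only the first
  lowfreq <- names(tt)[tt == min(tt)]
  as.integer(both %in% lowfreq)
}

res_123 <- lowf_ties("rs123", "rs123.1", pop)  # no tie: same as which.min
res_999 <- lowf_ties("rs999", "rs999.1", pop)  # tie: both rare values get 1
```

Whether tied values should all be coded 1 is a judgment call. When a SNP
has only two combined values with equal counts, this rule codes every row
as 1, so such cases may need separate handling.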

 					- Phil Spector
 					 Statistical Computing Facility
 					 Department of Statistics
 					 UC Berkeley
 					 spector at stat.berkeley.edu



