thr3ads.net - R help - [R] Getting SNPS from PLINK to R [Jun 2011]

David Duffy

2011-Jun-21 13:53 UTC

[R] Getting SNPS from PLINK to R

The snpMatrix package is quite nice for this (see read.plink()).

Mike Miller

2011-Jun-23 01:20 UTC

[R] several messages

On Mon, 20 Jun 2011, Jim Silverton wrote:
> I am using plink on a large SNP dataset with a .map and .ped file. I want 
> to get some sort of file, say a list of all the SNPs that plink is saying 
> that I have. Any ideas on how to do this?

All the SNPs you have are listed in the .map file.  An easy way to get the 
data into R, if there isn't too much, is to do this:

plink --file whatever --out whatever --recodeA

That will make a file called whatever.raw, single space delimited, 
consisting of minor allele counts (0, 1, 2, NA) that you can bring into R 
like this:

data <- read.table("whatever.raw", sep=" ", header=TRUE)
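
As a minimal sketch of the steps above (not part of the original reply), the following R code reads the SNP list from a .map file and the allele counts from a .raw file. It builds tiny stand-in files in a temporary directory so that it runs without PLINK; the SNP names, sample IDs, and genotypes are made up. It assumes the standard four-column .map layout (chromosome, SNP ID, genetic distance, base-pair position) and the six leading pedigree columns that --recodeA writes before the genotype columns.

```r
# Toy stand-ins for whatever.map and whatever.raw (hypothetical contents).
map_file <- tempfile(fileext = ".map")
raw_file <- tempfile(fileext = ".raw")
writeLines(c("1 rs1001 0 10583",
             "1 rs1002 0 10611",
             "2 rs2001 0 20345"), map_file)
writeLines(c("FID IID PAT MAT SEX PHENOTYPE rs1001_A rs1002_G rs2001_T",
             "F1 I1 0 0 1 -9 0 1 2",
             "F2 I2 0 0 2 -9 1 NA 0",
             "F3 I3 0 0 1 -9 2 0 1"), raw_file)

# The .map file has no header: chromosome, SNP ID, genetic distance, position.
map <- read.table(map_file, header = FALSE,
                  col.names = c("chr", "snp", "cm", "bp"),
                  stringsAsFactors = FALSE)
snp_ids <- map$snp          # the list of SNPs PLINK knows about

# The .raw file is space delimited with a header row; NA marks missing calls.
raw <- read.table(raw_file, sep = " ", header = TRUE,
                  stringsAsFactors = FALSE)

# Drop the six pedigree columns and keep the allele counts as integers.
geno <- as.matrix(raw[, -(1:6)])
storage.mode(geno) <- "integer"
rownames(geno) <- raw$IID
```

Here geno is a samples-by-SNPs integer matrix, which takes half the memory of the default double storage and is a convenient starting point for per-SNP summaries.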

If you have tons of data, you'll want to work with the compact binary 
format (four genotypes per byte):

plink --file whatever --out whatever --make-bed

Then see David Duffy's reply.  However, I'm not sure if R can work with 
the compact format in memory.  It might expand those genotypes (minor 
allele counts) from two-bit integers to double-precision floats.  What 
does read.plink() create in memory?
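
To make the "four genotypes per byte" point concrete, here is a small base-R sketch, added for illustration, that unpacks one byte of a .bed file into four genotypes. It assumes the PLINK 1 binary encoding as I understand it, with two bits per genotype read from the low-order end of the byte: 00 = homozygous for the first allele (A1), 01 = missing, 10 = heterozygous, 11 = homozygous for the second allele. The function name unpack_bed_byte is hypothetical, and it deliberately returns ordinary R integers, which is exactly the kind of expansion discussed above.

```r
# Unpack one .bed byte into four genotypes, counted as copies of allele A1.
# Assumes PLINK 1 binary encoding; unpack_bed_byte is a hypothetical helper.
unpack_bed_byte <- function(byte) {
  b <- as.integer(byte)
  # Extract the four two-bit codes, lowest-order pair first.
  codes <- (b %/% c(1L, 4L, 16L, 64L)) %% 4L
  # 00 -> 2 copies of A1, 01 -> missing, 10 -> 1 copy, 11 -> 0 copies.
  c(2L, NA_integer_, 1L, 0L)[codes + 1L]
}

# 0xDC is binary 11 01 11 00; read from the right, the codes are 00, 11, 01, 11.
unpack_bed_byte(as.raw(0xDC))
```

A full reader would also skip the three-byte header at the start of the file and discard the padding genotypes at the end of each SNP's block, which this sketch does not attempt.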

There is another package I've been meaning to look at that is supposed to 
help with the memory management problem for large genotype files:

http://cran.r-project.org/web/packages/ff/

I haven't used it yet, but I am hopeful.  Maybe David Duffy or someone 
else here will know more about it.

If you have a lot of data, also consider chopping the data into pieces 
before loading it into R.  That's what we do.  With a 100 core system, I 
break the data into 100 files (I use the GNU/Linux "split" command and a 
few other tricks) and have all 100 cores run at once to analyze the data.
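
The approach described here splits the files themselves and runs separate jobs. As a related but different illustration, here is a sketch of the same divide-and-analyze idea done inside R on an in-memory matrix: SNP columns are divided into chunks and each chunk is summarized independently. The genotype matrix is simulated, the chunk count of 4 is arbitrary, and the per-chunk summary (an allele frequency) is only a placeholder for a real analysis.

```r
set.seed(1)
n_samples <- 50L
n_snps <- 20L
# Simulated allele counts (0, 1, 2); a real analysis would load PLINK output.
geno <- matrix(sample(0:2, n_samples * n_snps, replace = TRUE),
               nrow = n_samples,
               dimnames = list(NULL, paste0("snp", seq_len(n_snps))))

# Assign each SNP column to one of n_chunks roughly equal groups.
n_chunks <- 4L
chunk_id <- cut(seq_len(n_snps), breaks = n_chunks, labels = FALSE)
chunks <- split(seq_len(n_snps), chunk_id)

# Placeholder analysis: frequency of the counted allele for each SNP.
analyze_chunk <- function(cols) colMeans(geno[, cols, drop = FALSE]) / 2

# lapply runs the chunks one after another; parallel::mclapply has the same
# calling pattern and can spread them over cores on Unix-like systems.
results <- lapply(chunks, analyze_chunk)
freq <- unlist(unname(results))
```

Because each chunk is processed independently, the same pattern works whether the chunks are column ranges in memory or separate files on disk.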

When I work with genotype data as allele counts using Octave, I store the 
data, both in files and in memory, as unsigned 8-bit integers, using 3 as 
the missing value.  That's still inefficient compared to the PLINK system, 
but it is way better than using doubles.
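
The same idea carries over to R, which has a one-byte raw type. The sketch below is an illustration rather than the Octave code referred to above: it stores simulated allele counts as raw bytes with 3 as the missing code, compares the memory footprint with the default double storage, and converts back to integers with NA when values are needed for arithmetic. The helper names to_raw_geno and from_raw_geno are hypothetical.

```r
# Encode allele counts (0, 1, 2, NA) as one byte each, using 3 for missing.
to_raw_geno <- function(x) {
  x[is.na(x)] <- 3
  as.raw(x)
}

# Decode back to integers, restoring NA for the missing code.
from_raw_geno <- function(r) {
  x <- as.integer(r)
  x[x == 3L] <- NA_integer_
  x
}

set.seed(42)
counts <- sample(c(0, 1, 2, NA), 1e5, replace = TRUE)  # doubles by default
packed <- to_raw_geno(counts)

object.size(counts)  # roughly 8 bytes per genotype
object.size(packed)  # roughly 1 byte per genotype
```

Note that as.raw returns a plain vector, so a genotype matrix would need its dimensions restored after encoding.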

Best,
Mike

--
Michael B. Miller, Ph.D.
Minnesota Center for Twin and Family Research
Department of Psychology
University of Minnesota
