On Mon, 20 Jun 2011, Jim Silverton wrote:
> I am using plink on a large SNP dataset with a .map and .ped file. I want
> to get some sort of file, say a list of all the SNPs that plink is saying
> that I have. Any ideas on how to do this?
All the SNPs you have are listed in the .map file. An easy way to get the
data into R, if there isn't too much, is to do this:
plink --file whatever --out whatever --recodeA
That will make a file called whatever.raw, single-space delimited, with
six leading columns (FID, IID, PAT, MAT, SEX, PHENOTYPE) followed by minor
allele counts (0, 1, 2, NA), one column per SNP, that you can bring into R
like this:
data <- read.table("whatever.raw", sep=" ", header=TRUE)
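Since the .map file already lists every SNP, a plain list of SNP IDs can be
pulled out of it from the shell. The following is a minimal sketch, assuming
the standard four-column .map layout (chromosome, SNP ID, genetic distance,
base-pair position). The file names and SNP IDs are made up for illustration.

```shell
# Build a tiny example .map file (hypothetical SNPs; the columns are
# chromosome, SNP ID, genetic distance, base-pair position).
printf '1\trs0001\t0\t1000\n1\trs0002\t0\t2000\n2\trs0003\t0\t1500\n' > example.map

# Column 2 holds the SNP identifiers; write them out one per line.
awk '{ print $2 }' example.map > example.snplist

# Show the list and count the SNPs.
cat example.snplist
wc -l < example.snplist
```

With a real dataset, the same awk command would be pointed at whatever.map
instead of example.map.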
If you have tons of data, you'll want to work with the compact binary
format (four genotypes per byte):
plink --file whatever --out whatever --make-bed
Then see David Duffy's reply. However, I'm not sure if R can work with
the compact format in memory. It might expand those genotypes (minor
allele counts) from two-bit integers to double-precision floats. What
does read.plink() create in memory?
There is another package I've been meaning to look at that is supposed to
help with the memory management problem for large genotype files:
http://cran.r-project.org/web/packages/ff/
I haven't used it yet, but I am hopeful. Maybe David Duffy or someone
else here will know more about it.
If you have a lot of data, also consider chopping the data into pieces
before loading it into R. That's what we do. With a 100 core system, I
break the data into 100 files (I use the GNU/Linux "split" command and a
few other tricks) and have all 100 cores run at once to analyze the data.
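As a rough illustration of the chopping step, here is one possible sketch
that splits a list of SNP IDs into roughly equal pieces with "split". It is
not necessarily the procedure described above (the "other tricks" are not
spelled out), and the file names, the SNP IDs, and the choice of 4 chunks
are assumptions made for the example.

```shell
# Hypothetical list of 10 SNP IDs, one per line (rs1 .. rs10).
seq 1 10 | sed 's/^/rs/' > all.snplist

n_chunks=4
total=$(wc -l < all.snplist)
# Ceiling division, so that no more than n_chunks files are produced.
per_chunk=$(( (total + n_chunks - 1) / n_chunks ))

# Remove leftovers from earlier runs, then split into chunk_aa, chunk_ab, ...
rm -f chunk_*
split -l "$per_chunk" all.snplist chunk_

ls chunk_*

# Each piece could then be handed to PLINK's --extract option, for example
# (not run here, since it needs PLINK and a real dataset):
#   plink --file whatever --extract chunk_aa --recodeA --out whatever_aa
```

Splitting by SNP list rather than cutting the .raw file directly keeps every
individual in every piece, so each core sees complete columns for its SNPs.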
When I work with genotype data as allele counts using Octave, I store the
data, both in files and in memory, as unsigned 8-bit integers, using 3 as
the missing value. That's still inefficient compared to the PLINK system,
but it is way better than using doubles.
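To put numbers on that comparison, here is a small back-of-the-envelope
calculation. The dataset dimensions are hypothetical, and the packed figure
ignores the small file header and per-SNP padding of the real binary format.

```shell
# Hypothetical dataset dimensions (assumptions for illustration only).
n_ind=1000
n_snp=500000
n_geno=$(( n_ind * n_snp ))

bytes_double=$(( n_geno * 8 ))   # 8 bytes per double-precision float
bytes_uint8=$(( n_geno ))        # 1 byte per genotype as an unsigned 8-bit integer
bytes_packed=$(( n_geno / 4 ))   # 4 genotypes per byte in the compact format

echo "doubles: $bytes_double bytes"
echo "uint8:   $bytes_uint8 bytes"
echo "packed:  $bytes_packed bytes"
```

For these dimensions that works out to about 4 GB as doubles, 0.5 GB as
8-bit integers, and 0.125 GB in the packed two-bit form.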
Best,
Mike
--
Michael B. Miller, Ph.D.
Minnesota Center for Twin and Family Research
Department of Psychology
University of Minnesota