Bottom Line Up Front: How does one reshape genetic data from long to wide?
I currently have a lot of data. About 180 individuals (some
probands/patients, some parents, rare siblings) and SNP data from 6000 loci
on each. The standard formats seem to be something along the lines of Famid,
pid, fatid, motid, affected, sex, locus1Allele1, locus1Allele2,
locus2Allele1, locus2Allele2, etc.
In other words, one human, one row. If there were multiple loci then the
variables would continue to be heaped up on the right. I shall refer to this
kind of orientation as "wide".
Given how big my dataset is, it is easier to manage the data in the database
in the "long" format. In this format I have a pedigree table and, from it, a
one-to-many relationship with the SNP data. The SNP table has fields:
uniqueHumanID, Allele1, Allele2, locus
That makes for an incredibly long table.
Data is stored in a Sybase database that I communicate with through ODBC
using Microsoft Access. The RODBC package then reads the queries that I have
created in Microsoft Access. The only reason for Microsoft Access is that I
have had well over a decade's worth of experience using it at an
intermediate level.
With the magic of SQL I can mix and match these tables. But creating the
table that is 180 rows long and about 12010 variables wide is daunting.
Essentially, the 6000 SNPs represent each human having 12000 repeated
measures (6000 SNPs times 2 alleles).
I presume I would be able to use the reshape function in R:
"Reshape Grouped Data
Description
This function reshapes a data frame between 'wide' format with repeated
measurements in separate columns of the same record and 'long' format with
the repeated measurements in separate records. "
BUT BEFORE I launch into this.
Is there a way that either the Warnes package (Genetics) or the David
Clayton package can handle the data in the long form?
If not, do any of the packages reshape the data in a way that is pedigree and
genotype aware? The general R reshape function is not predesigned to be
friendly to genetic data.
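
For reference, the long-to-wide step described above can be sketched with base R's reshape(). This is only a minimal illustration, not a pedigree-aware method: the field names uniqueHumanID, locus, Allele1 and Allele2 come from the message, while the toy values, the locus names (rs1, rs2) and the pedigree rows are invented for the example.

```r
# Toy long-format SNP table using the field names from the message.
# All values are invented for illustration.
snp <- data.frame(
  uniqueHumanID = rep(c("H1", "H2", "H3"), each = 2),
  locus         = rep(c("rs1", "rs2"), times = 3),
  Allele1       = c("A", "C", "A", "T", "G", "C"),
  Allele2       = c("G", "C", "A", "T", "G", "T"),
  stringsAsFactors = FALSE
)

# Toy pedigree table (hypothetical trio: H1 is the child of H2 and H3).
ped <- data.frame(
  uniqueHumanID = c("H1", "H2", "H3"),
  Famid    = 1,
  fatid    = c("H2", NA, NA),
  motid    = c("H3", NA, NA),
  affected = c(2, 1, 1),
  sex      = c(1, 1, 2),
  stringsAsFactors = FALSE
)

# One row per human; each locus contributes two columns,
# named Allele1.<locus> and Allele2.<locus>.
wide <- reshape(snp,
                idvar     = "uniqueHumanID",
                timevar   = "locus",
                v.names   = c("Allele1", "Allele2"),
                direction = "wide")

# Put the pedigree columns on the left of the genotype columns.
ped_wide <- merge(ped, wide, by = "uniqueHumanID")
```

With 180 individuals and 6000 loci the same call would produce the 12000 allele columns mentioned above; whether that is fast enough on the real data is not something this sketch demonstrates.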
Farrel Buchinsky, MD --- Mobile (412) 779-1073
Pediatric Otolaryngologist
Allegheny General Hospital
Pittsburgh, PA 
**********************************************************************
This email and any files transmitted with it are confidentia...{{dropped}}
Farrel Buchinsky <fbuchins at wpahs.org> wrote:
> Bottom Line Up Front: How does one reshape genetic data from long to wide?

I avoid both your "long" and "wide" formats because they are awkward and
inefficient for large data sets. The "long" format wastes a huge amount of
space with redundant column values, and manipulating a data frame with 12000
columns is not much fun either.

Instead, I pack genotypes into strings: usually one string per SNP. So in
your example, I'd have a small 180-row table with a few columns of data about
the samples, and a small 6000-row SNP table with one column of packed
genotypes. The sample table is organized so the row order matches the
positions of the sample genotypes in the packed genotype strings. If I want
genotypes for a particular SNP, I unpack them with strsplit(), tack them onto
the sample table as a new column, and discard when I'm done.

I store genotype data in our database this way as well. We also have a "long"
format table, but I avoid it whenever possible because the packed format is
so much more convenient. The time saved pulling data out of the database in
this format dwarfs the time spent parsing out the genotype strings.

-- David Hinds
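
The packed-string layout described above might look something like the following. This is a sketch under stated assumptions, not David Hinds's actual code: the space separator, the "A/G" genotype coding, the column names and the helper unpack() are all hypothetical.

```r
# Sample table: the row order defines each sample's position in the strings.
samples <- data.frame(id       = c("H1", "H2", "H3"),
                      affected = c(2, 1, 1),
                      stringsAsFactors = FALSE)

# SNP table: one packed string per SNP. The space separator and the
# "A/G" style coding are assumptions made for this example.
snps <- data.frame(marker = c("rs1", "rs2"),
                   packed = c("A/G A/A G/G", "C/C T/T C/T"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: unpack one string into one genotype per sample.
unpack <- function(packed) strsplit(packed, " ", fixed = TRUE)[[1]]

# Tack the genotypes for a single SNP onto a copy of the sample table;
# the copy can simply be discarded afterwards.
with_geno <- samples
with_geno$genotype <- unpack(snps$packed[snps$marker == "rs2"])
```

The design relies entirely on the sample table and the strings staying in the same order, so any reordering of samples has to be applied to both.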
1) I know how to post to the list. You simply send an e-mail to
R-help at stat.math.ethz.ch. But how do you read the items and respond to them?
I usually read the items at
http://tolstoy.newcastle.edu.au/R/help/06/04/index.html#end and then have to
jump through some hoops to answer back.
Is there any way to access this group through Google Groups or through
Outlook Express's user group feature?
2) Storing all the SNP data as a string seems quite clever and a space-saving
way of doing it. However, if you were to analyze a whole chromosome at a
time you would still be creating one almighty big table albeit only
temporarily. Do you use R to run TDT analyses? If so, how are you setting up
your data frames and then what commands do you issue to analyze what is in
your dataframes?
I used David Clayton's Stata add-on a few months ago and was able to get it
to run through the analysis. I ran just one locus. Technically, the analysis
seemed to run OK but I want to run all the loci one after the other. 
Currently I have my data such that I can access it from R through an ODBC
connection to Microsoft Access which in turn has an ODBC connection to the
Sybase database. Whether I go through strings or not, I still need to find a
way that I can assemble it so that a program can systematically run a TDT
analysis on all the loci. I can see how strings help me in my storing of the
data but that is already a fait accompli. Can you explain to me how it would
help me with sequential analysis of each locus? Do you have any history
files so that I can see what you were doing?
Farrel Buchinsky, MD
Pediatric Otolaryngologist
Allegheny General Hospital
Pittsburgh, PA 
Farrel Buchinsky <fbuchins at wpahs.org> wrote:
>
> 2) Storing all the SNP data as a string seems quite clever and a space-saving
> way of doing it. However, if you were to analyze a whole chromosome at a
> time you would still be creating one almighty big table albeit only
> temporarily. Do you use R to run TDT analyses? If so, how are you setting up
> your data frames and then what commands do you issue to analyze what is in
> your dataframes?
>
> I used David Clayton's Stata add-on a few months ago and was able to get it
> to run through the analysis. I ran just one locus. Technically, the analysis
> seemed to run OK but I want to run all the loci one after the other.
>
> Currently I have my data such that I can access it from R through an ODBC
> connection to Microsoft Access which in turn has an ODBC connection to the
> Sybase database. Whether I go through strings or not, I still need to find a
> way that I can assemble it so that a program can systematically run a TDT
> analysis on all the loci. I can see how strings help me in my storing of the
> data but that is already a fait accompli. Can you explain to me how it would
> help me with sequential analysis of each locus? Do you have any history
> files so that I can see what you were doing?
>

At some stage, you will need to have a "wide" format pedigree file, at least
as wide as the haplotypes you are interested in.
Using the idea of storage as strings (which would be especially good for
SNPs), you would just do something like:

    cbind(phenotypes, unlist(strsplit(snp, " ")))

or

    ex <- function(x) matrix(unlist(strsplit(x, " ")), ncol=length(x))
    cbind(phenotypes, ex(snps[chosen.snps]))

Here is a fragment using reshape from long to wide:

    extract.snp <- function(snplist, snpdata) {
      # names and positions of snps from one file
      snps <- read.delim(snplist, sep="\t", header=FALSE,
                         colClasses=c("character", "numeric"))
      names(snps) <- c("Names", "Pos")
      # sort the snps by position
      snps <- snps[order(snps[,2]),]
      # snp data in long format
      x <- read.delim(snpdata, sep="\t", colClasses="character", header=FALSE)
      names(x) <- c("id", "marker", "genotype")
      ped <- reshape(x, v.names="genotype", timevar="marker",
                     idvar="id", direction="wide")
      # strip the "genotype." prefix that reshape adds to each column
      names(ped) <- sub("^genotype\\.", "", names(ped))
      # keep id plus the listed snps that are present, in position order
      rpos <- match(snps$Names, names(ped))
      ped <- ped[,c(1, rpos[!is.na(rpos)])]
      ped
    }

| David Duffy (MBBS PhD)                                         ,-_|\
| email: davidD at qimr.edu.au  ph: INT+61+7+3362-0217 fax: -0101  /     *
| Epidemiology Unit, Queensland Institute of Medical Research   \_,-._/
| 300 Herston Rd, Brisbane, Queensland 4029, Australia  GPG 4D0B994A v
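
On the question of running an analysis on every locus in turn, one possible arrangement, building on the ex() idea above, is to unpack the strings into a matrix and apply a per-locus function to each column. This is only a sketch: the data are invented, and per_locus() is a hypothetical placeholder that merely counts genotypes. It is not a TDT implementation, and a real TDT routine would also need the pedigree information.

```r
# Packed genotypes, one string per SNP, named by marker (invented data).
snps <- c(rs1 = "A/G A/A G/G A/G", rs2 = "C/C T/T C/T C/C")

# Unpack to a character matrix: rows are samples, columns are SNPs.
ex <- function(x) {
  m <- matrix(unlist(strsplit(x, " ", fixed = TRUE)), ncol = length(x))
  colnames(m) <- names(x)
  m
}
geno <- ex(snps)

# Placeholder per-locus analysis: genotype counts only.
# A real TDT function would be substituted here.
per_locus <- function(g) table(g)

# Run the analysis on each locus, one after the other.
results <- lapply(colnames(geno), function(snp) per_locus(geno[, snp]))
names(results) <- colnames(geno)
```

Because the loop only ever touches one column at a time, the same pattern would also work by unpacking a single SNP per iteration instead of building the whole matrix first.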
You might want to double check the list archives
https://www.stat.math.ethz.ch/pipermail/r-help/
to see if your posts got through or not, just in case it's just some problem
in displaying your own posts.

On 4/29/06, Farrel Buchinsky <fbuchins at wpahs.org> wrote:
> Gabor Grothendieck <ggrothendieck <at> gmail.com> writes:
>
> > http://news.gmane.org/gmane.comp.lang.r.general
> > or one of these:
> > http://dir.gmane.org/gmane.comp.lang.r.general
> >
>
> Yes but when I hit "Post this article" it sends something to gMane (I think)
> but not to R-help at stat.math.ethz.ch
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>