Bottom Line Up Front: How does one reshape genetic data from long to wide?

I currently have a lot of data: about 180 individuals (some probands/patients, some parents, rare siblings) and SNP data from 6000 loci on each. The standard formats seem to be something along the lines of

Famid, pid, fatid, motid, affected, sex, locus1Allele1, locus1Allele2, locus2Allele1, locus2Allele2, etc.

In other words, one human, one row. If there were multiple loci then the variables would continue to be heaped up on the right. This kind of orientation shall be referred to as "wide".

Given how big my dataset is, it is easier to manage the data in the database in the "long" format. In this format I have a pedigree table and, from it, a one-to-many relationship with the SNP data. The SNP table has fields:

uniqueHumanID, Allele1, Allele2, locus

That makes for an incredibly long table.

Data is stored in a Sybase database that I communicate with through ODBC using Microsoft Access. The RODBC package then reads the queries that I have created in Microsoft Access. The only reason for Microsoft Access is that I have had well over a decade's worth of experience using it at an intermediate level. With the magic of SQL I can mix and match these tables. But creating the table that is 180 rows long and about 12010 variables wide is daunting. Essentially the 6000 SNPs represent each human having 12000 repeated measures (6000 SNPs times 2 alleles).

I presume I would be able to use the reshape function in R:

"Reshape Grouped Data
Description
This function reshapes a data frame between 'wide' format with repeated measurements in separate columns of the same record and 'long' format with the repeated measurements in separate records."

BUT BEFORE I launch into this: is there a way that either the Warnes package (Genetics) or the David Clayton package can handle the data in the long form? If not, do any of the packages reshape the data in a way that is pedigree and genotype aware?
The general R reshape function is not predesigned to be friendly to genetic data.

Farrel Buchinsky, MD
Mobile (412) 779-1073
Pediatric Otolaryngologist
Allegheny General Hospital
Pittsburgh, PA
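For reference, the long-to-wide step described in the question can be sketched with base R's reshape() on a toy table. This is only a minimal illustration: the three individuals, two loci, allele values and the small pedigree table are invented, and only the field names (uniqueHumanID, Allele1, Allele2, locus, Famid, pid, fatid, motid, affected, sex) come from the message above.

```r
# Toy "long" SNP table: one row per individual per locus (invented data).
snp_long <- data.frame(
  uniqueHumanID = rep(c("h1", "h2", "h3"), each = 2),
  locus         = rep(c("rs1", "rs2"), times = 3),
  Allele1       = c("A", "C", "A", "C", "G", "T"),
  Allele2       = c("G", "C", "A", "T", "G", "T"),
  stringsAsFactors = FALSE
)

# Long -> wide: one row per individual, two allele columns per locus.
snp_wide <- reshape(snp_long,
                    idvar     = "uniqueHumanID",
                    timevar   = "locus",
                    v.names   = c("Allele1", "Allele2"),
                    direction = "wide")

# Toy pedigree table (invented values), joined on the individual id.
ped <- data.frame(
  Famid = 1, pid = c("h1", "h2", "h3"),
  fatid = c(NA, NA, "h1"), motid = c(NA, NA, "h2"),
  affected = c(0, 0, 1), sex = c(1, 2, 1),
  stringsAsFactors = FALSE
)
wide <- merge(ped, snp_wide, by.x = "pid", by.y = "uniqueHumanID")
print(names(wide))
```

reshape() names the new columns as v.names plus the locus label (for example Allele1.rs1), so with 6000 loci the same call would produce the roughly 12000 allele columns mentioned above. Whether that is fast enough on the full table is not something this sketch shows.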
Farrel Buchinsky <fbuchins at wpahs.org> wrote:
> Bottom Line Up Front: How does one reshape genetic data from long to wide?

I avoid both your "long" and "wide" formats because they are awkward and inefficient for large data sets. The "long" format wastes a huge amount of space with redundant column values, and manipulating a data frame with 12000 columns is not much fun either.

Instead, I pack genotypes into strings: usually one string per SNP. So in your example, I'd have a small 180-row table with a few columns of data about the samples, and a small 6000-row SNP table with one column of packed genotypes. The sample table is organized so the row order matches the positions of the sample genotypes in the packed genotype strings. If I want genotypes for a particular SNP, I unpack them with strsplit(), tack them onto the sample table as a new column, and discard when I'm done.

I store genotype data in our database this way as well. We also have a "long" format table, but I avoid it whenever possible because the packed format is so much more convenient. The time saved pulling data out of the database in this format dwarfs the time spent parsing out the genotype strings.

-- 
David Hinds
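A minimal sketch of the packed-string idea described above, on invented data. The message does not say which separator or genotype encoding is used, so the single space and the "A/G" style codes here are assumptions, as are all object and function names (samples, snp_long, pack_snp, unpack_snp).

```r
# Small sample table; its row order defines the order inside each packed string.
samples <- data.frame(pid = c("h1", "h2", "h3"),
                      affected = c(1, 0, 0),
                      stringsAsFactors = FALSE)

# Toy "long" genotype table (invented data).
snp_long <- data.frame(
  pid      = rep(c("h1", "h2", "h3"), each = 2),
  locus    = rep(c("rs1", "rs2"), times = 3),
  genotype = c("A/G", "C/C", "A/A", "C/T", "G/G", "T/T"),
  stringsAsFactors = FALSE
)

# Pack: one space-separated string per SNP, in sample-table row order.
pack_snp <- function(d) {
  paste(d$genotype[match(samples$pid, d$pid)], collapse = " ")
}
packed <- sapply(split(snp_long, snp_long$locus), pack_snp)

# Unpack one SNP with strsplit() and attach it as a temporary column.
unpack_snp <- function(s) strsplit(s, " ", fixed = TRUE)[[1]]
with_rs1 <- cbind(samples, rs1 = unpack_snp(packed[["rs1"]]))
print(packed)
print(with_rs1)
```

The packed vector has one element per SNP, so for the data set in question it would hold 6000 strings of 180 genotypes each, and only the SNPs under analysis are ever expanded into columns.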
1) I know how to post to the list. You simply send an e-mail to R-help at stat.math.ethz.ch. But how do you read the items and respond to them? I usually read the items at http://tolstoy.newcastle.edu.au/R/help/06/04/index.html#end and then have to jump through some hoops to answer back. Is there any way to access this group through Google Groups or through Outlook Express's newsgroup feature?

2) Storing all the SNP data as a string seems quite clever and a space-saving way of doing it. However, if you were to analyze a whole chromosome at a time you would still be creating one almighty big table, albeit only temporarily. Do you use R to run TDT analyses? If so, how are you setting up your data frames, and what commands do you then issue to analyze what is in them?

I used David Clayton's Stata add-on a few months ago and was able to get it to run through the analysis. I ran just one locus. Technically, the analysis seemed to run OK, but I want to run all the loci one after the other.

Currently I have my data such that I can access it from R through an ODBC connection to Microsoft Access, which in turn has an ODBC connection to the Sybase database. Whether I go through strings or not, I still need to find a way to assemble it so that a program can systematically run a TDT analysis on all the loci. I can see how strings help me in storing the data, but that is already a fait accompli. Can you explain to me how it would help me with sequential analysis of each locus? Do you have any history files so that I can see what you were doing?

Farrel Buchinsky, MD
Farrel Buchinsky <fbuchins at wpahs.org> wrote:
>
> 2) Storing all the SNP data as a string seems quite clever and a space-saving
> way of doing it. However, if you were to analyze a whole chromosome at a
> time you would still be creating one almighty big table albeit only
> temporarily. Do you use R to run TDT analyses? If so, how are you setting up
> your data frames and then what commands do you issue to analyze what is in
> your dataframes?
>
> I used David Clayton's Stata add on a few months ago and was able to get it
> to run through the analysis. I ran just one locus. Technically, the analysis
> seemed to run OK but I want to run all the loci one after the other.
>
> Currently I have my data such that I can access it from R through an ODBC
> connection to Microsoft Access which in turn has an ODBC connection to the
> Sybase database. Whether I go through strings or not, I still need to find a
> way that I can assemble it so that a program can systematically run a TDT
> analysis on all the loci. I can see how strings help me in my storing of the
> data but that is already a fait accompli. Can you explain to me how it would
> help me with sequential analysis of each locus? Do you have any history
> files so that I can see what you were doing?
>

At some stage, you will need to have a "wide" format pedigree file, at least as wide as the haplotypes you are interested in.
Using the idea of storage as strings (which would be especially good for SNPs), you would just do something like:

    cbind(phenotypes, unlist(strsplit(snp, " ")))

or

    ex <- function(x) matrix(unlist(strsplit(x, " ")), ncol=length(x))
    cbind(phenotypes, ex(snps[chosen.snps]))

Here is a fragment using reshape from long to wide:

    extract.snp <- function(snplist, snpdata) {
      # names and positions of snps from one file, sorted by position
      snps <- read.delim(snplist, sep="\t", header=FALSE,
                         colClasses=c("character", "numeric"))
      names(snps) <- c("Names", "Pos")
      snps <- snps[order(snps[,2]),]
      # snp data in long format: one row per individual per marker
      x <- read.delim(snpdata, sep="\t", colClasses="character", header=FALSE)
      names(x) <- c("id", "marker", "genotype")
      ped <- reshape(x, v.names="genotype", timevar="marker",
                     idvar="id", direction="wide")
      # strip the "genotype." prefix that reshape adds to each marker column
      names(ped) <- sub("^genotype\\.", "", names(ped))
      # keep the id column plus the requested snps, in map order
      rpos <- match(snps$Names, names(ped))
      ped <- ped[, c(1, rpos[!is.na(rpos)])]
      ped
    }

| David Duffy (MBBS PhD)                                         ,-_|\
| email: davidD at qimr.edu.au  ph: INT+61+7+3362-0217 fax: -0101  /     *
| Epidemiology Unit, Queensland Institute of Medical Research   \_,-._/
| 300 Herston Rd, Brisbane, Queensland 4029, Australia  GPG 4D0B994A v
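As an illustration of the "one locus after another" loop asked about earlier in the thread, here is a minimal sketch that unpacks each SNP string in turn and computes the classical biallelic TDT statistic, (b - c)^2 / (b + c), from parent-child trios. It is not the method of the Genetics package, of David Clayton's software, or of either respondent. The 0/1/2 genotype coding (copies of one allele), the toy data and every object name are assumptions made for the example.

```r
# Sample table in packed-string order: three trios (invented data).
samples <- data.frame(
  pid      = c("f1", "m1", "c1", "f2", "m2", "c2", "f3", "m3", "c3"),
  fatid    = c(NA, NA, "f1", NA, NA, "f2", NA, NA, "f3"),
  motid    = c(NA, NA, "m1", NA, NA, "m2", NA, NA, "m3"),
  affected = c(0, 0, 1, 0, 0, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# One packed string per SNP; each code is the number of copies of allele "A".
snps <- c(rs1 = "1 1 2 1 2 2 0 1 1",
          rs2 = "2 2 2 1 0 0 1 1 1")

# TDT for one SNP: count transmissions of "A" from heterozygous parents.
tdt_one_snp <- function(father, mother, child) {
  n_het    <- (father == 1) + (mother == 1)   # heterozygous parents per trio
  from_hom <- (father == 2) + (mother == 2)   # "A" alleles passed on by AA parents
  tr_a     <- child - from_hom                # "A" passed on by heterozygous parents
  ok <- !is.na(tr_a) & tr_a >= 0 & tr_a <= n_het  # drop missing data, Mendelian errors
  tr    <- sum(tr_a[ok])
  untr  <- sum(n_het[ok]) - tr
  chisq <- if (tr + untr > 0) (tr - untr)^2 / (tr + untr) else NA_real_
  c(tr = tr, untr = untr, chisq = chisq,
    p = pchisq(chisq, df = 1, lower.tail = FALSE))
}

# Row positions of affected children and their parents in the sample table.
kids <- which(samples$affected == 1)
fat  <- match(samples$fatid[kids], samples$pid)
mot  <- match(samples$motid[kids], samples$pid)

# Loop over all loci: unpack, analyse, keep only the summary row.
results <- t(sapply(snps, function(s) {
  g <- as.integer(strsplit(s, " ", fixed = TRUE)[[1]])
  tdt_one_snp(g[fat], g[mot], g[kids])
}))
print(results)
```

Only one unpacked genotype vector exists at any time, so no 12000-column table is ever built. Families with several affected children, missing parents, or multi-allelic markers would need more care than this sketch takes.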
You might want to double-check the list archives

https://www.stat.math.ethz.ch/pipermail/r-help/

to see whether your posts got through, just in case it is only a problem in displaying your own posts.

On 4/29/06, Farrel Buchinsky <fbuchins at wpahs.org> wrote:
> Gabor Grothendieck <ggrothendieck <at> gmail.com> writes:
>
> > http://news.gmane.org/gmane.comp.lang.r.general
> > or one of these:
> > http://dir.gmane.org/gmane.comp.lang.r.general
>
> Yes, but when I hit "Post this article" it sends something to gMane (I think)
> but not to R-help at stat.math.ethz.ch
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html