Dear all,

I have some very big data files that look something like this:

id          chr  pos       ihh1       ihh2       xpehh
rs5748748   22   15795572  0.0230222  0.0268394  -0.153413
rs5748755   22   15806401  0.0186084  0.0268672  -0.367296
rs2385785   22   15807037  0.0198204  0.0186616   0.0602451
rs1981707   22   15809384  0.0299685  0.0176768   0.527892
rs1981708   22   15809434  0.0305465  0.0187227   0.489512
rs11914222  22   15810040  0.0307183  0.0172399   0.577633
rs4819923   22   15813210  0.02707    0.0159736   0.527491
rs5994105   22   15813888  0.025202   0.0141296   0.578651
rs5748760   22   15814084  0.0242894  0.0146486   0.505691
rs2385786   22   15816846  0.0173057  0.0107816   0.473199
rs1990483   22   15817310  0.0176641  0.0130525   0.302555
rs5994110   22   15821524  0.0178411  0.0129001   0.324267
rs17733785  22   15822154  0.0201797  0.0182093   0.102746
rs7287116   22   15823131  0.0201993  0.0179028   0.12069
rs5748765   22   15825502  0.0193195  0.0176513   0.090302

I'm trying to extract the maximum and minimum xpehh (last column) values within a sliding window (non-overlapping), of width 10000 (calculated relative to pos (third column)). However, as you can tell from the brief excerpt here, although all possible intervals will probably be covered by at least one data point, the number of data points will be variable (incidentally, if anyone knows of a way to obtain this number, that would be lovely), as will the spacing between them. Furthermore, values of chr (second column) will range from 1 to 22, and values of pos will be overlapping across them; I want to evaluate the window separately for each value of chr.

I've looked at the help and FAQ on sliding windows, but I'm a relative newcomer to R and cannot find a way to do what I need to do. Everything I've managed to unearth so far seems geared towards smoother time series. Any help on this problem would be vastly appreciated.

Thanks,
Irene

-- 
Irene Gallego Romero
Leverhulme Centre for Human Evolutionary Studies
University of Cambridge
Fitzwilliam St
Cambridge
CB2 1QH
UK

The window you describe is not one I would call sliding: the intervals are regular, with an irregular number of events within each window. One way would be to use the result of trunc(pos/10000) as a factor with tapply. (Related functions are floor() and round(), but your pos values appear to be positive, so there should not be problems with how they work across 0.)

After creating a data frame, dta, try something like:

> tapply(dta$xpehh, as.factor(trunc(dta$pos/10000)), min)
     1579      1580      1581      1582
-0.153413 -0.367296  0.302555  0.090302

-- 
David Winsemius

On Mar 30, 2009, at 9:01 AM, Irene Gallego Romero wrote:

> I'm trying to extract the maximum and minimum xpehh (last column)
> values within a sliding window (non overlapping), of width 10000
> (calculated relative to pos (third column)).
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
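
The tapply() suggestion above can be extended to the other parts of the question. This is a minimal sketch, assuming the data have been read into a data frame called dta with the column names from the original post. The names win, grp, win_min, win_max and win_n are made up for illustration. Passing a list of two grouping vectors to tapply() keeps the chromosomes separate, and using length as the function gives the number of data points per window.

```r
# The 15 rows from the original post, read into a data frame.
dta <- read.table(header = TRUE, text = "
id chr pos ihh1 ihh2 xpehh
rs5748748 22 15795572 0.0230222 0.0268394 -0.153413
rs5748755 22 15806401 0.0186084 0.0268672 -0.367296
rs2385785 22 15807037 0.0198204 0.0186616 0.0602451
rs1981707 22 15809384 0.0299685 0.0176768 0.527892
rs1981708 22 15809434 0.0305465 0.0187227 0.489512
rs11914222 22 15810040 0.0307183 0.0172399 0.577633
rs4819923 22 15813210 0.02707 0.0159736 0.527491
rs5994105 22 15813888 0.025202 0.0141296 0.578651
rs5748760 22 15814084 0.0242894 0.0146486 0.505691
rs2385786 22 15816846 0.0173057 0.0107816 0.473199
rs1990483 22 15817310 0.0176641 0.0130525 0.302555
rs5994110 22 15821524 0.0178411 0.0129001 0.324267
rs17733785 22 15822154 0.0201797 0.0182093 0.102746
rs7287116 22 15823131 0.0201993 0.0179028 0.12069
rs5748765 22 15825502 0.0193195 0.0176513 0.090302
")

# Window index: 1579 means positions 15790000 to 15799999, and so on.
win <- trunc(dta$pos / 10000)

# Group by chromosome and window together, so that equal pos values on
# different chromosomes never end up in the same window.
grp <- list(chr = dta$chr, win = win)

win_min <- tapply(dta$xpehh, grp, min)     # minimum xpehh per chr/window
win_max <- tapply(dta$xpehh, grp, max)     # maximum xpehh per chr/window
win_n   <- tapply(dta$xpehh, grp, length)  # number of data points per chr/window

print(win_min)
print(win_max)
print(win_n)
```

Each result is a matrix with one row per chromosome and one column per window index. With all 22 chromosomes this matrix will contain NA wherever a chromosome has no data in a given window, so a long-format summary may be more convenient for very big files.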

On Mon, Mar 30, 2009 at 6:01 AM, Irene Gallego Romero <ig247@cam.ac.uk> wrote:

> I'm trying to extract the maximum and minimum xpehh (last column) values
> within a sliding window (non overlapping), of width 10000 (calculated
> relative to pos (third column)). However, as you can tell from the brief
> excerpt here, although all possible intervals will probably be covered by at
> least one data point, the number of data points will be variable
> (incidentally, if anyone knows of a way to obtain this number, that would be
> lovely), as will the spacing between them. Furthermore, values of chr
> (second column) will range from 1 to 22, and values of pos will be
> overlapping across them; I want to evaluate the window separately for each
> value of chr.

The IRanges package from the Bioconductor project attempts to solve problems like these. For example, to count the number of overlapping intervals at a given position in the chromosome, you would use the coverage() function. The RangedData class is designed to store data like yours and rdapply() makes it easy to perform operations one chromosome at a time.

That said, I don't think it has any easy way to solve your problem of calculating quantiles. That's a feature that needs to be added to the package. I could imagine something like (with the development version) calling disjointBins() to separate the ranges into bins where there is no overlap, then converting each bin into an Rle, and then using pmin/pmax on the Rle objects in series to get your answer.

Anyway, you probably want to check out IRanges.

Michael
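
For readers who want the "one chromosome at a time" idea without installing Bioconductor, here is a rough base-R sketch. It does not use IRanges or any of its functions. It swaps in split() and lapply() instead, and summarise_windows() is a hypothetical helper written only for this illustration. The data are the first eight rows of the excerpt from the original post.

```r
# First eight rows of the excerpt from the original post.
dta <- read.table(header = TRUE, text = "
id chr pos ihh1 ihh2 xpehh
rs5748748 22 15795572 0.0230222 0.0268394 -0.153413
rs5748755 22 15806401 0.0186084 0.0268672 -0.367296
rs2385785 22 15807037 0.0198204 0.0186616 0.0602451
rs1981707 22 15809384 0.0299685 0.0176768 0.527892
rs1981708 22 15809434 0.0305465 0.0187227 0.489512
rs11914222 22 15810040 0.0307183 0.0172399 0.577633
rs4819923 22 15813210 0.02707 0.0159736 0.527491
rs5994105 22 15813888 0.025202 0.0141296 0.578651
")

# Hypothetical helper: summarise one chromosome's rows in fixed,
# non-overlapping windows of the given width.
summarise_windows <- function(d, width = 10000) {
  win <- d$pos %/% width                  # window index for every row
  data.frame(
    chr   = d$chr[1],
    start = sort(unique(win)) * width,    # first position covered by each window
    n     = as.vector(tapply(d$xpehh, win, length)),
    min   = as.vector(tapply(d$xpehh, win, min)),
    max   = as.vector(tapply(d$xpehh, win, max))
  )
}

# Split the rows by chromosome, summarise each piece, and stack the results.
res <- do.call(rbind, lapply(split(dta, dta$chr), summarise_windows))
print(res)
```

The result has one row per chromosome and non-empty window, which avoids the mostly empty chromosome-by-window matrix that a two-factor tapply() would produce on the full data. Windows that contain no data points simply do not appear in the output.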