Hi:

I have two large files (over 300K lines).

File 1:

Name  X
UK    199
UK    230
UK    139
......
UAE   194
UAE   94

File 2:

Name  X    Y
UK    140  180
UK    195  240
UK    304  340
....

I want to take each X in File 1 and check whether it falls within the range X to Y of File 2, and print only those lines of File 1 that fall within a File 2 range.

How can this be done in R?

Thanks,
Adrian
On Mar 13, 2010, at 10:14 PM, Adrian Johnson wrote:

> Hi:
>
> I have two large files (over 300K lines).
>
> File 1:
>
> Name  X
> UK    199
> UK    230
> UK    139
> ......
> UAE   194
> UAE   94
>
> File 2:
>
> Name  X    Y
> UK    140  180
> UK    195  240
> UK    304  340
> ....

I haven't figured out what you are expecting. I cannot tell whether you want to test a) all file1$X within values of "Name", b) all file1$X across all values, c) a specific line in file1, or d) file1$X against file2 on a line-by-line basis. The following implements the last of those four and has the downside of generating warnings:

    > file1[file2[, "X"] < file1[, "X"] & file1[, "X"] < file2[, "Y"], ]
      Name   X
    2   UK 230
    Warning messages:
    1: In file2[, "X"] < file1[, "X"] :
      longer object length is not a multiple of shorter object length
    2: In file1[, "X"] < file2[, "Y"] :
      longer object length is not a multiple of shorter object length

> I want to select X of File 1 and search if it falls in range of X and
> Y of File 2 and Print only those lines of File 1 that are in range of
> File 2 X and Y
>
> How can it be done in R?
>
> thanks
> Adrian
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
West Hartford, CT
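To see where the warnings above come from, it may help to spell out what the line-by-line comparison actually pairs up. The following is only a sketch on the toy data from the question; the names lo, hi and paired are made up for illustration. Because file2 has 3 rows and file1 has 5, R recycles the file2 columns, so rows 4 and 5 of file1 end up compared against rows 1 and 2 of file2.

```r
# Toy data from the thread
file1 <- data.frame(Name = c("UK", "UK", "UK", "UAE", "UAE"),
                    X = c(199, 230, 139, 194, 94),
                    stringsAsFactors = FALSE)
file2 <- data.frame(Name = c("UK", "UK", "UK"),
                    X = c(140, 195, 304),
                    Y = c(180, 240, 340),
                    stringsAsFactors = FALSE)

# Make the implicit recycling explicit (hypothetical helper names):
# file2's bounds are repeated until they are as long as file1
lo <- rep(file2$X, length.out = nrow(file1))
hi <- rep(file2$Y, length.out = nrow(file1))

# Each file1 row next to the single file2 range it was compared with
paired <- data.frame(file1, lo = lo, hi = hi,
                     hit = lo < file1$X & file1$X < hi)
print(paired)
```

Each X is tested against exactly one range rather than against all of them, which is why only row 2 is returned.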
Try this:

    > file1 <- read.table(textConnection("Name X
    + UK 199
    + UK 230
    + UK 139
    + UAE 194
    + UAE 94"), header=TRUE, as.is=TRUE)
    > file2 <- read.table(textConnection("Name X Y
    + UK 140 180
    + UK 195 240
    + UK 304 340"), header=TRUE, as.is=TRUE)
    > closeAllConnections()
    > # initialize 'match' to FALSE
    > file1$match <- FALSE
    > # loop through the rows of file2 making the test (assuming it is the shorter file)
    > for (i in seq(nrow(file2))){
    +     file1$match <- file1$match | ((file1$X >= file2$X[i]) & (file1$X < file2$Y[i]))
    + }
    >
    > file1
      Name   X match
    1   UK 199  TRUE
    2   UK 230  TRUE
    3   UK 139 FALSE
    4  UAE 194 FALSE
    5  UAE  94 FALSE

On Sat, Mar 13, 2010 at 10:14 PM, Adrian Johnson <oriolebaltimore@gmail.com> wrote:

> I want to select X of File 1 and search if it falls in range of X and
> Y of File 2 and Print only those lines of File 1 that are in range of
> File 2 X and Y
>
> How can it be done in R?

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
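If a range should only count for rows with the same Name (reading (a) in David's reply), the loop above can be narrowed with one extra condition. This is a sketch under that assumption, not something the original poster confirmed; the variable hit is introduced here for illustration. It keeps the same half-open test (X >= lower bound, X < upper bound) as the loop above.

```r
# Toy data from the thread
file1 <- data.frame(Name = c("UK", "UK", "UK", "UAE", "UAE"),
                    X = c(199, 230, 139, 194, 94),
                    stringsAsFactors = FALSE)
file2 <- data.frame(Name = c("UK", "UK", "UK"),
                    X = c(140, 195, 304),
                    Y = c(180, 240, 340),
                    stringsAsFactors = FALSE)

file1$match <- FALSE
# seq_len() is safe even if file2 has zero rows
for (i in seq_len(nrow(file2))) {
  # assumption: a range only applies to file1 rows with the same Name
  hit <- file1$Name == file2$Name[i] &
         file1$X >= file2$X[i] &
         file1$X <  file2$Y[i]
  file1$match <- file1$match | hit
}

# print only the matching lines of file1, as the question asks
print(file1[file1$match, c("Name", "X")])
```

On the toy data this selects the same two UK rows as the loop above, since no UAE value happens to fall inside a UK range. With 300K rows in file2 the loop runs 300K vectorized passes over file1, so it may be slow at that size.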
On Sat, 13 Mar 2010, Adrian Johnson wrote:

> I want to select X of File 1 and search if it falls in range of X and
> Y of File 2 and Print only those lines of File 1 that are in range of
> File 2 X and Y

I'd probably use findOverlaps() in the IRanges Bioconductor package. If you want to do the UK search apart from the UAE search and so on, the RangedData objects provided by IRanges are a nice, clean way to go.

Something like:

    library(IRanges)
    file1 <- read.table("File1", header=TRUE)
    file2 <- read.table("File2", header=TRUE)
    # one width-1 range per X in file1, grouped by Name
    file1.rl <- RangedData(IRanges(start=file1$X, width=1), space=file1$Name)
    # one range running from X to Y per row of file2, grouped by Name
    file2.rl <- RangedData(IRanges(start=file2$X, end=file2$Y), space=file2$Name)
    find.1.in.2 <- as.matrix(findOverlaps(file1.rl, file2.rl))
    new.rl <- cbind(file1.rl[find.1.in.2[,1], ],
                    file2X = start(file2.rl)[find.1.in.2[,2]],
                    file2Y = end(file2.rl)[find.1.in.2[,2]])

find.1.in.2 will be a matrix with one row for every match. The first column will be the index of the row in file1.rl and the second that of file2.rl. new.rl will have one row for each match. The order of the rows in the RangedData objects may not match the original data frames, so beware.

For 300K rows, this would run pretty fast, I think. (Caveat: this is all untested code.)

Otherwise, without the IRanges package, something like

    gt.x <- findInterval(file1$X, file2$X)
    gt.y <- findInterval(file1$X, file2$Y)
    is.in.interval <- gt.x == gt.y + 1

will work iff the intervals defined in file2 do not overlap one another. If you need to keep 'Name's separate, rolling this into mapply() would be needed.

HTH,

Chuck

> How can it be done in R?
> thanks
> Adrian

Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
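The findInterval() idea above can be applied per Name roughly as follows. This is a sketch, not Chuck's code: it uses a plain loop over names instead of mapply(), and the helper in_any_interval is a made-up name. It assumes, as stated above, that the ranges for a given Name do not overlap one another. It also sorts the ranges first, because findInterval() expects its breakpoints in non-decreasing order.

```r
# Toy data from the thread
file1 <- data.frame(Name = c("UK", "UK", "UK", "UAE", "UAE"),
                    X = c(199, 230, 139, 194, 94),
                    stringsAsFactors = FALSE)
file2 <- data.frame(Name = c("UK", "UK", "UK"),
                    X = c(140, 195, 304),
                    Y = c(180, 240, 340),
                    stringsAsFactors = FALSE)

# Hypothetical helper: TRUE where lo[k] <= x < hi[k] for some k.
# Only valid when the [lo, hi) ranges do not overlap.
in_any_interval <- function(x, lo, hi) {
  o <- order(lo)   # sort ranges by lower bound; with no overlaps,
  lo <- lo[o]      # the upper bounds are then sorted as well
  hi <- hi[o]
  # x is inside a range when it has passed one more lower bound
  # than upper bounds
  findInterval(x, lo) == findInterval(x, hi) + 1
}

file1$match <- FALSE
# Names that have no ranges in file2 (UAE here) simply stay FALSE
for (nm in intersect(unique(file1$Name), unique(file2$Name))) {
  rows <- file1$Name == nm
  f2 <- file2[file2$Name == nm, ]
  file1$match[rows] <- in_any_interval(file1$X[rows], f2$X, f2$Y)
}

print(file1[file1$match, c("Name", "X")])
```

Each findInterval() call is a binary search per value, so this should scale to a few hundred thousand rows far better than looping over every row of file2.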