thr3ads.net - R help - [R] range and intersection [Mar 2010]

Adrian Johnson

2010-Mar-14 03:14 UTC

[R] range and intersection

Hi:

I have two large files (over 300K lines).

file 1:

Name    X
UK       199
UK       230
UK       139
......
UAE    194
UAE     94




File 2:

Name   X    Y
UK    140   180
UK    195    240
UK    304    340
....


I want to take X from File 1, check whether it falls within the range
of X and Y in File 2, and print only those lines of File 1 that fall
within the File 2 range.


How can this be done in R?

thanks
Adrian

David Winsemius

2010-Mar-14 04:21 UTC

[R] range and intersection

On Mar 13, 2010, at 10:14 PM, Adrian Johnson wrote:
> Hi:
>
> I have a two large files (over 300K lines).
>
> file 1:
>
> Name    X
> UK       199
> UK       230
> UK       139
> ......
> UAE    194
> UAE     94
>
>
>
>
> File 2:
>
> Name   X    Y
> UK    140   180
> UK    195    240
> UK    304    340
> ....
>
I haven't figured out what you are expecting. I cannot tell whether you
want to make this test a) for all file1$X within values of "Name", b)
for all file1$X across all values, c) for a specific line in file1,
or d) for file1$X on a line-by-line basis against file2. This implements
the last of those four and has the downside of generating warnings,
because file1 and file2 differ in length and the shorter vector is
recycled.

 > file1[file2[, "X"] < file1[, "X"] & file1[, "X"] < file2[, "Y"], ]
   Name   X
2   UK 230
Warning messages:
1: In file2[, "X"] < file1[, "X"] :
   longer object length is not a multiple of shorter object length
2: In file1[, "X"] < file2[, "Y"] :
   longer object length is not a multiple of shorter object length
>
> I want to select X of File 1 and search if it falls in range of X and
> Y of File 2 and Print only those lines of File 1 that are in range of
> File 2 X and Y
>
>
> How can it be done it in R.
>
> thanks
> Adrian
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD
West Hartford, CT

jim holtman

2010-Mar-14 04:29 UTC

[R] range and intersection

Try this:
> file1 <- read.table(textConnection("Name    X
+ UK       199
+ UK       230
+ UK       139
+ UAE    194
+ UAE     94"), header=TRUE, as.is=TRUE)
> file2 <- read.table(textConnection("Name   X    Y
+ UK    140   180
+ UK    195    240
+ UK    304    340"), header=TRUE, as.is=TRUE)
> closeAllConnections()
> # initialize the 'match' to FALSE
> file1$match <- FALSE
> # loop through the rows of file2 making the test (assuming it is the shorter file)
> for (i in seq(nrow(file2))){
+     file1$match <- file1$match | ((file1$X >= file2$X[i]) & (file1$X < file2$Y[i]))
+ }
>
> file1
  Name   X match
1   UK 199  TRUE
2   UK 230  TRUE
3   UK 139 FALSE
4  UAE 194 FALSE
5  UAE  94 FALSE
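
The loop above compares X values only and ignores Name. If the test should only count ranges that carry the same Name, one possible variation is sketched below. It is only a sketch: in_range() is a hypothetical helper name, the data frames are rebuilt inline from the sample in the question so the code runs on its own, and the half-open test (X <= value < Y) is carried over from the loop above as an assumption.

```r
# Sample data copied from the question, rebuilt inline so the sketch is self-contained
file1 <- data.frame(Name = c("UK", "UK", "UK", "UAE", "UAE"),
                    X    = c(199, 230, 139, 194, 94),
                    stringsAsFactors = FALSE)
file2 <- data.frame(Name = c("UK", "UK", "UK"),
                    X    = c(140, 195, 304),
                    Y    = c(180, 240, 340),
                    stringsAsFactors = FALSE)

# in_range() is a hypothetical helper, not part of the original reply.
# It returns TRUE for each row of f1 whose X lies in [X, Y) of at least
# one row of f2 that has the same Name.
in_range <- function(f1, f2) {
  hit <- logical(nrow(f1))                 # starts as all FALSE
  for (i in seq_len(nrow(f2))) {           # seq_len() is safe when f2 has 0 rows
    hit <- hit | (f1$Name == f2$Name[i] &
                  f1$X >= f2$X[i] &
                  f1$X <  f2$Y[i])
  }
  hit
}

file1$match <- in_range(file1, file2)
file1[file1$match, ]                       # only the rows of file1 that are in range
```

With the sample data the result is the same as above, because File 2 contains only UK rows. Like the original loop, this makes one pass over file1 per row of file2, so it suits the case where file2 is the shorter file.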




-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?

Charles C. Berry

2010-Mar-14 06:45 UTC

[R] range and intersection

On Sat, 13 Mar 2010, Adrian Johnson wrote:
> Hi:
>
> I have a two large files (over 300K lines).
>
> file 1:
>
> Name    X
> UK       199
> UK       230
> UK       139
> ......
> UAE    194
> UAE     94
>
>
>
>
> File 2:
>
> Name   X    Y
> UK    140   180
> UK    195    240
> UK    304    340
> ....
>
>
> I want to select X of File 1 and search if it falls in range of X and
> Y of File 2 and Print only those lines of File 1 that are in range of
> File 2 X and Y
Probably, I'd use findOverlaps() in the IRanges Bioconductor package.

If you want to do the UK search apart from the UAE search and so on, the
use of RangedData objects provided by IRanges is a nice, clean way to go.

Something like:

library(IRanges)

file1 <- read.table("File1", header=TRUE)
file2 <- read.table("File2", header=TRUE)

file1.rl <- RangedData( IRanges(start=file1$X, width=1), space = file1$Name )
file2.rl <- RangedData( IRanges(start=file2$X, end=file2$Y),
 			space = file2$Name )

find.1.in.2 <- as.matrix( findOverlaps( file1.rl , file2.rl ) )

new.rl <- cbind( file1.rl[ find.1.in.2[,1], ],
 			file2X = start(file2.rl)[ find.1.in.2[,2] ],
 			file2Y = end(file2.rl)[ find.1.in.2[,2] ])

find.1.in.2 will be a matrix with one row for every match. The first 
column will be the index of the row in file1.rl and the second that of 
file2.rl.

new.rl will have one row for each match.

The order of the rows in the RangedData objects may not match the original 
data frames, so beware.

For 300K rows, this would run pretty fast, I think.

(caveat: This is all untested code.)

Otherwise, without the IRanges package something like


gt.x <- findInterval( file1$X, file2$X )
gt.y <- findInterval( file1$X, file2$Y )

is.in.interval <- gt.x == gt.y + 1

will work iff the intervals defined in file2 do not overlap one another
and file2 is sorted by X, since findInterval() needs its break points in
non-decreasing order.

If you need to keep 'Name's separate, rolling this into mapply() would
be needed.
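
As a rough illustration of keeping the 'Name's separate, the findInterval() test can be applied to one Name at a time. The sketch below uses a plain loop over the names rather than mapply(), and in_interval() is a hypothetical helper name. It assumes the intervals within one Name do not overlap, and it rebuilds the sample data from the question inline so that it runs on its own.

```r
# Sample data copied from the question, rebuilt inline so the sketch is self-contained
file1 <- data.frame(Name = c("UK", "UK", "UK", "UAE", "UAE"),
                    X    = c(199, 230, 139, 194, 94),
                    stringsAsFactors = FALSE)
file2 <- data.frame(Name = c("UK", "UK", "UK"),
                    X    = c(140, 195, 304),
                    Y    = c(180, 240, 340),
                    stringsAsFactors = FALSE)

# in_interval() is a hypothetical helper, not part of the original reply.
# Assumes the intervals [lo, hi) passed in do not overlap one another.
in_interval <- function(x, lo, hi) {
  if (length(lo) == 0) return(logical(length(x)))  # no ranges for this Name
  ord <- order(lo)                # findInterval() needs sorted break points
  # x is inside an interval when it has passed one more start than end
  findInterval(x, lo[ord]) == findInterval(x, hi[ord]) + 1
}

is.in.interval <- logical(nrow(file1))
for (nm in unique(file1$Name)) {
  rows1 <- file1$Name == nm       # rows of file1 with this Name
  rows2 <- file2$Name == nm       # ranges in file2 with this Name
  is.in.interval[rows1] <- in_interval(file1$X[rows1],
                                       file2$X[rows2], file2$Y[rows2])
}

file1[is.in.interval, ]           # rows of file1 that fall in a same-Name range
```

Because findInterval() uses a binary search, each Name costs roughly one sort of its ranges plus one lookup per value, which should scale better than testing every value against every range.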

HTH,

Chuck
>
>
> How can it be done it in R.
>
> thanks
> Adrian
>
Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
