Hello,

I was hoping that someone would be able to help me, or at least point me in the right direction, with a problem I am having. I am a new R user, and I've been reading tutorials, but they haven't been much help to me so far.

The problem is relatively simple, as I've already created working solutions in Java and Perl, but I need a solution in R as well.

I have two text files, say pos.txt and reg.txt. In pos.txt, the data is listed like this:

c22 1445 - CG 1 4
c22 1542 + CG 2 3
c22 1678 + CG 13 15
...

and so on for thousands of lines. The most important column is column 2, which lists the "position" (e.g. 1445, 1542, 1678). In reg.txt, the data is listed as:

c22 1440 1500 cpg: 44 56 ......
c22 1520 1700 cpg: 56 87 ......
c22 1800 1900 cpg: 58 90 ......
...

where the values in column 2 are the "start" positions and the values in column 3 are the "end" positions. There are 10 columns in total, but I have listed only the first few. Also, the two text files have different numbers of lines.

Essentially, I need to take each position listed in column 2 of pos.txt and find the region in reg.txt (based on its start and end positions) that contains it. Then I need to print:

c22 "start" "end" "position" + 1 5

where the last 3 columns come from pos.txt as well (i.e. not every line ends in + 1 5; each line ends with the values from the matching line of pos.txt). Also, the position needs to be within the start and end positions.

So far I've been able to use read.table to create a data frame for each text file, I've named each column (e.g. reg.data$end), and I can output each column individually. However, the problem I keep facing is how to compare the numbers for "position" in pos.txt with the numbers for "start" and "end" in reg.txt. I tried to use:

if ((pos >= start) | (pos <= end))..

but an error comes up that says the files aren't the same length.

In Java and Perl I used nested loops to cycle through each element in one file, compare it to every element in the other file, and then print the result to a new text file. As such, I was trying to learn a bit more about arrays in R, but if you know of a better way to do this in R, please let me know.

Any help is greatly appreciated.

Thank you,
AJ

--
View this message in context: http://r.789695.n4.nabble.com/Help-with-isolating-and-comparing-data-from-two-files-tp3543170p3543170.html
Sent from the R help mailing list archive at Nabble.com.
jim holtman
2011-May-23 12:23 UTC
[R] Help with isolating and comparing data from two files.
Is this what you are after?

> pos
   V1   V2 V3 V4 V5 V6
1 c22 1445  - CG  1  4
2 c22 1542  + CG  2  3
3 c22 1678  + CG 13 15
> reg
   V1   V2   V3   V4 V5 V6     V7
1 c22 1440 1500 cpg: 44 56 ......
2 c22 1520 1700 cpg: 56 87 ......
3 c22 1800 1900 cpg: 58 90 ......
> # iterate through 'reg', printing out the matching 'pos' entries
> result <- lapply(seq(nrow(reg)), function(i){
+     # get indices of match
+     indx <- (pos$V2 >= reg$V2[i]) & (pos$V2 <= reg$V3[i])
+     if (!any(indx)) return(NULL)  # no match
+     # create new dataframe
+     cbind(reg[rep(i, sum(indx)), 1:3], pos[indx, ])
+ })
> do.call(rbind, result)
     V1   V2   V3  V1   V2 V3 V4 V5 V6
1   c22 1440 1500 c22 1445  - CG  1  4
2   c22 1520 1700 c22 1542  + CG  2  3
2.1 c22 1520 1700 c22 1678  + CG 13 15
>

On Mon, May 23, 2011 at 12:00 AM, ajn21 <ajn21 at case.edu> wrote:
> [...]

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
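
A follow-up sketch, not part of Jim's reply: the same matching can also be written without an explicit loop over regions, by pairing every region with every position on the same chromosome using merge() and then keeping only the pairs where the position falls inside the region. The column names (chr, start, end, position, strand, context, n1, n2 and so on) are made up here for readability, and the toy data is copied from the thread with the elided trailing columns of reg.txt left out. Note the & in the filter. The | in the original attempt is true for every position whenever start <= end, and if() expects a single TRUE/FALSE rather than a whole column of them, which is why the comparison has to be done as a vectorised filter instead.

```r
# Toy data from the thread. Column names are assumptions chosen for
# readability; the elided trailing columns of reg.txt are left out.
pos <- read.table(text = "
c22 1445 - CG 1 4
c22 1542 + CG 2 3
c22 1678 + CG 13 15
", col.names = c("chr", "position", "strand", "context", "n1", "n2"),
   colClasses = c("character", "integer", "character", "character",
                  "integer", "integer"))

reg <- read.table(text = "
c22 1440 1500 cpg: 44 56
c22 1520 1700 cpg: 56 87
c22 1800 1900 cpg: 58 90
", col.names = c("chr", "start", "end", "label", "r1", "r2"),
   colClasses = c("character", "integer", "integer", "character",
                  "integer", "integer"))

# Pair every region with every position on the same chromosome.
pairs <- merge(reg[, c("chr", "start", "end")], pos, by = "chr")

# Keep the pairs where the position lies inside the region (inclusive).
inside <- pairs$position >= pairs$start & pairs$position <= pairs$end
hits <- pairs[inside, ]

# Order the rows and keep the columns asked for in the question.
hits <- hits[order(hits$start, hits$position),
             c("chr", "start", "end", "position", "strand", "n1", "n2")]

# With no file argument, write.table prints to the console; pass
# file = "some_output.txt" (a hypothetical name) to write a text file.
write.table(hits, quote = FALSE, row.names = FALSE, col.names = FALSE)
# c22 1440 1500 1445 - 1 4
# c22 1520 1700 1542 + 2 3
# c22 1520 1700 1678 + 13 15
```

One trade-off to keep in mind: merge() materialises every region/position pair per chromosome before filtering, so memory use grows with the product of the two file lengths. For large inputs, the per-region lapply shown in the reply, or a dedicated interval-overlap tool such as findOverlaps from the Bioconductor IRanges package, is likely the better fit.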