thr3ads.net - R help - [R] Identifying a change in events between bins [Mar 2012]

If this information is useful, please help other people find it:
Share via:

Mark Hills

2012-Mar-16 20:55 UTC

[R] Identifying a change in events between bins

Hi there,

First off, despite this being my first post here, I have scanned the R help
forums a lot in the past few months to help with some questions, so a big thank
you to the community as a whole for being so helpful!

I'm somewhat of an R newbie, and have run up against a problem that I
can't seem to solve.  If anyone is able to help I would really appreciate
it!

I'm looking at a number of events across a chromosome, and have written a
program that collects them into different bins, based on a specified binsize. 
The events are directional, either forward or reverse, and a chromosome can
either be fwd/fwd (all the events fall into the fwd bins), rev/rev (all the
events fall into the rev bins) or fwd/rev (events are evenly split).  In some
cases, chromosomes switch from one state to another (eg fwd/fwd to fwd/rev). 
There are a number of rules that dictate my data.  First, while there is
stochastic variation, the sum of fwd and rev in each bin should have
approximately the same value. If I were to take the total number of events and
divide them by the number of bins to get an average count per bin, I would
expect approximately that value in each bin; in the case of fwd/fwd it would be
about average number in the fwd column and close to zero in the rev column, in
rev/rev it would be about the average number in the rev column and close to zero
in the fwd column, and in fwd/rev it would be about half the average number in
both.

Hopefully my png attachment worked and you can see an example.  The top plot
shows fwd reads, the 2nd shows rev reads and the third shows fwd minus rev
reads.

What I would like to be able to do is to automatically assign regions in which
the chromosome switches from one state to another. From the graphs (and from the
read.table output below) you can see that this particular chromosome is fwd/fwd
from bin 1 to 59, fwd/rev from bin 61 to 73, and rev/rev for the remainder of
the chromosomes.

These are generated from a read.table that looks like this:

bin  fwd  rev
50  484   2
51  366   4
52  527   6
53  635   2
54  573   6
55  506   4
56  600   6
57  560   2
58  504   2
59  545   0
60  501  68
61  419 223
62  252 109
63  259 138
64  355 189
65  218 125
66  140  57
67   45  31
68  276 144
69  263 152
70  330 193
71  439 204
72  347 207
73   10 611
74    6 619
75    2 578
76    7 372
77    6 436
78    4 373
79    8 417
80    2 276

My question is this:

1. Is there an obvious way to automatically identify these regions?  

I am not sure how I can go about scanning previous lines within a read.table to
find a point at which the values change.  In the above example, I would like the
program to identify that the fwd graph shifts from ~1x the average to ~0.5x the
average between bin 61 and 62, and from ~0.5x the average to ~0x the average
between bin 72 and 73. Conversely I'd like to identify the rev graph
shifting from ~0x average to ~0.5x average between bins 59 and 60, and from 0.5x
average to 1x average from bin 72 to 73.  Finally, I'd like to
cross-reference the output from fwd and rev to  only pull out reciprocal
switches (ie those that occur within 3 bins of each other in both fwd and rev
data sets).

What I've been trying to gt to work is to generate values based on 0, 0.5
and 1x the average events, and trying to pull out the range of bins that fall
into each of those categories (possibly 1 SD higher or lower to account for the
stochastic variation), but I'm not really sure how to go about that.

2. If I can find a way to identify a shift between bins, is there any way to
then look in smaller bin sizes across those regions.  The bins shown above are
for 200,000 bases of DNA.  If my program automatically found an event between
bin 72 and 73 (14,400,000 bases to 14,600,000), is it possible to feed that
region into say 50,000 base bins to narrow down the location of the event?

I'm not sure if either of these questions is a simple task or incredibly
time consuming and inappropriate to ask a message board.  If you can offer any
advice though, I would really appreciate it!

Thanks,

Mark
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Chr13.png
Type: image/png
Size: 7618 bytes
Desc: Chr13.png
URL:
<https://stat.ethz.ch/pipermail/r-help/attachments/20120316/ec3d18e3/attachment.png>

Robert Baer

2012-Mar-19 15:07 UTC

head link

[R] Identifying a change in events between bins

Your description was too general for me to know exactly what you want but 
perhaps this will help you solve your own problem> set.seed(123)
> evtlist = sample(c('fwd','rev'),100,replace=TRUE)
> evtlist  [1] "fwd" "rev" "fwd" "rev"
"rev" "fwd" "rev" "rev" "rev"
"fwd" "rev"
"fwd" "rev"
[14] "rev" "fwd" "rev" "fwd"
"fwd" "fwd" "rev" "rev" "rev"
"rev" "rev" "rev"
"rev"
[27] "rev" "rev" "fwd" "fwd"
"rev" "rev" "rev" "rev" "fwd"
"fwd" "rev" "fwd"
"fwd"
[40] "fwd" "fwd" "fwd" "fwd"
"fwd" "fwd" "fwd" "fwd" "fwd"
"fwd" "rev" "fwd"
"fwd"
[53] "rev" "fwd" "rev" "fwd"
"fwd" "rev" "rev" "fwd" "rev"
"fwd" "fwd" "fwd"
"rev"
[66] "fwd" "rev" "rev" "rev"
"fwd" "rev" "rev" "rev" "fwd"
"fwd" "fwd" "fwd"
"rev"
[79] "fwd" "fwd" "fwd" "rev"
"fwd" "rev" "fwd" "fwd" "rev"
"rev" "rev" "fwd"
"fwd"
[92] "rev" "fwd" "rev" "fwd"
"fwd" "rev" "fwd" "fwd"
"rev"> rle(evtlist)Run Length Encoding
  lengths: int [1:50] 1 1 1 2 1 3 1 1 1 2 ...
  values : chr [1:50] "fwd" "rev" "fwd"
"rev" "fwd" "rev" "fwd" "rev"
"fwd"
...> rle(evtlist)$lengths[1]  1  1  1  2  1  3  1  1  1  2  1  1  3  9  2  4  2  1 12  1  2  1  1  1 
2  2  1  1
[29]  3  1  1  3  1  3  4  1  3  1  1  1  2  3  2  1  1  1  2  1  2 
1>see ?rle

Rob
-----Original Message----- 
From: Mark Hills
Sent: Friday, March 16, 2012 3:55 PM
To: r-help at r-project.org
Subject: [R] Identifying a change in events between bins

Hi there,

First off, despite this being my first post here, I have scanned the R help 
forums a lot in the past few months to help with some questions, so a big 
thank you to the community as a whole for being so helpful!

I'm somewhat of an R newbie, and have run up against a problem that I
can't
seem to solve.  If anyone is able to help I would really appreciate it!

I'm looking at a number of events across a chromosome, and have written a 
program that collects them into different bins, based on a specified 
binsize.  The events are directional, either forward or reverse, and a 
chromosome can either be fwd/fwd (all the events fall into the fwd bins), 
rev/rev (all the events fall into the rev bins) or fwd/rev (events are 
evenly split).  In some cases, chromosomes switch from one state to another 
(eg fwd/fwd to fwd/rev).  There are a number of rules that dictate my data. 
First, while there is stochastic variation, the sum of fwd and rev in each 
bin should have approximately the same value. If I were to take the total 
number of events and divide them by the number of bins to get an average 
count per bin, I would expect approximately that value in each bin; in the 
case of fwd/fwd it would be about average number in the fwd column and close 
to zero in the rev column, in rev/rev it would be about the average number 
in the rev column and close to zero in the fwd column, and in fwd/rev it 
would be about half the average number in both.

Hopefully my png attachment worked and you can see an example.  The top plot 
shows fwd reads, the 2nd shows rev reads and the third shows fwd minus rev 
reads.

What I would like to be able to do is to automatically assign regions in 
which the chromosome switches from one state to another. From the graphs 
(and from the read.table output below) you can see that this particular 
chromosome is fwd/fwd from bin 1 to 59, fwd/rev from bin 61 to 73, and 
rev/rev for the remainder of the chromosomes.

These are generated from a read.table that looks like this:

bin  fwd  rev
50  484   2
51  366   4
52  527   6
53  635   2
54  573   6
55  506   4
56  600   6
57  560   2
58  504   2
59  545   0
60  501  68
61  419 223
62  252 109
63  259 138
64  355 189
65  218 125
66  140  57
67   45  31
68  276 144
69  263 152
70  330 193
71  439 204
72  347 207
73   10 611
74    6 619
75    2 578
76    7 372
77    6 436
78    4 373
79    8 417
80    2 276

My question is this:

1. Is there an obvious way to automatically identify these regions?

I am not sure how I can go about scanning previous lines within a read.table 
to find a point at which the values change.  In the above example, I would 
like the program to identify that the fwd graph shifts from ~1x the average 
to ~0.5x the average between bin 61 and 62, and from ~0.5x the average to 
~0x the average between bin 72 and 73. Conversely I'd like to identify the 
rev graph shifting from ~0x average to ~0.5x average between bins 59 and 60, 
and from 0.5x average to 1x average from bin 72 to 73.  Finally, I'd like to
cross-reference the output from fwd and rev to  only pull out reciprocal 
switches (ie those that occur within 3 bins of each other in both fwd and 
rev data sets).

What I've been trying to gt to work is to generate values based on 0, 0.5 
and 1x the average events, and trying to pull out the range of bins that 
fall into each of those categories (possibly 1 SD higher or lower to account 
for the stochastic variation), but I'm not really sure how to go about that.

2. If I can find a way to identify a shift between bins, is there any way to 
then look in smaller bin sizes across those regions.  The bins shown above 
are for 200,000 bases of DNA.  If my program automatically found an event 
between bin 72 and 73 (14,400,000 bases to 14,600,000), is it possible to 
feed that region into say 50,000 base bins to narrow down the location of 
the event?

I'm not sure if either of these questions is a simple task or incredibly 
time consuming and inappropriate to ask a message board.  If you can offer 
any advice though, I would really appreciate it!

Thanks,

Mark





______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


------------------------------------------
Robert W. Baer, Ph.D.
Professor of Physiology
Kirksville College of Osteopathic Medicine
A. T. Still University of Health Sciences
800 W. Jefferson St.
Kirksville, MO 63501
660-626-2322
FAX 660-626-2965

Seemingly Similar Threads

Search for more seemingly similar threads

R help - Mar 2012 - Identifying a change in events between bins

[R] Identifying a change in events between bins

[R] Identifying a change in events between bins

Seemingly Similar Threads