thr3ads.net - R help - [R] two difficult loop [Jun 2016]

If this information is useful, please help other people find it:
Share via:

greg holly

2016-Jun-12 20:14 UTC

[R] two difficult loop

Dear all;



I have two data sets, data=map and data=ref). A small part of each data set
are given below. Data map has more than 27 million and data ref has about
560 rows. Basically I need run two different task. My R codes for these
task are given below but they do not work properly.

I sincerely do appreciate your helps.


Regards,

Greg



Task 1)

For example, the first and second columns for row 1 in data ref are 29220
63933. So I need write an R code normally first look the first row in ref
(which they are 29220 and 63933) than summing the column of "map$rate"
and
give the number of rows that >0.85. Then do the same for the second,
third....in ref. At the end I would like a table gave below (the results I
need). Please notice the all value specified in ref data file are exist in
map$reg column.



Task2)

Again example, the first and second columns for row 1 in data ref are 29220
63933. So I need write an R code give the minimum map$p for the 29220
-63933 intervals in map file. Than

do the same for the second, third....in ref.




#my attempt for the first question

temp<-map[order(map$reg, map$p),]

count<-1

temp<-unique(temp$reg

for(i in 1:length(ref) {

  for(j in 1:length(ref)

  {

temp1<-if (temp[pos[i]==ref[ref$reg1,] & (temp[pos[j]==ref[ref$reg2,]
& temp[cumsum(temp$rate)>0.70,])
count=count+1

    }

}

#my attempt for the second question



temp<-map[order(map$reg, map$p),]

count<-1

temp<-unique(temp$reg

for(i in 1:length(ref) {

  for(j in 1:length(ref)

  {

temp2<-if (temp[pos[i]==ref[ref$reg1,] & (temp[pos[j]==ref[ref$reg2,])

output<-temp2[temp2$p==min(temp2$p),]

    }

}



Data sets


  Data= map

  reg   p      rate

 10276 0.700  3.867e-18

 71608 0.830  4.542e-16

 29220 0.430  1.948e-15

 99542 0.220  1.084e-15

 26441 0.880  9.675e-14

 95082 0.090  7.349e-13

 36169 0.480  9.715e-13

 55572 0.500  9.071e-12

 65255 0.300  1.688e-11

 51960 0.970  1.163e-10

 55652 0.388  3.750e-10

 63933 0.250  9.128e-10

 35170 0.720  7.355e-09

 06491 0.370  1.634e-08

 85508 0.470  1.057e-07

 86666 0.580  7.862e-07

 04758 0.810  9.501e-07

 06169 0.440  1.104e-06

 63933 0.750  2.624e-06

 41838 0.960  8.119e-06


 data=ref

  reg1         reg2

  29220     63933

  26441     41838

  06169     10276

  74806     92643

  73732     82451

  86042     93502

  85508     95082



       the results I need

     reg1      reg2 n

   29220   63933  12

   26441   41838   78

   06169 10276  125

   74806 92643   11

   73732 82451   47

   86042 93502   98

   85508 95082  219

	[[alternative HTML version deleted]]

Bert Gunter

2016-Jun-12 22:36 UTC

head link

[R] two difficult loop

Greg:

I was not able to understand your task 1. Perhaps others can.

My understanding of your task 2 is that for each row of ref, you wish
to find all rows,of map such that the reg values in those rows fall
between the reg1 and reg2 values in ref (inclusive change <= to < if
you don't want the endpoints), and then you want the minimum map$p
values of all those rows. If that is correct, I believe this will do
it (but caution, untested, as you failed to provide data in a
convenient form, e.g. using dput() )

task2 <- with(map,vapply(seq_len(nrow(ref)),function(i)
min(p[ref[i,1]<=reg & reg <= ref[i,2] ]),0))


If my understanding is incorrect, please ignore both the above and the
following:


The "solution" I have given above seems inefficient, so others may be
able to significantly improve it if you find that it takes too long.
OTOH, my understanding of your specification is that you need to
search for all rows in map data frame that meet the criterion for each
row of ref, and without further information, I don't know how to do
this without just repeating the search 560 times.


Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sun, Jun 12, 2016 at 1:14 PM, greg holly <mak.hholly at gmail.com>
wrote:> Dear all;
>
>
>
> I have two data sets, data=map and data=ref). A small part of each data set
> are given below. Data map has more than 27 million and data ref has about
> 560 rows. Basically I need run two different task. My R codes for these
> task are given below but they do not work properly.
>
> I sincerely do appreciate your helps.
>
>
> Regards,
>
> Greg
>
>
>
> Task 1)
>
> For example, the first and second columns for row 1 in data ref are 29220
> 63933. So I need write an R code normally first look the first row in ref
> (which they are 29220 and 63933) than summing the column of
"map$rate" and
> give the number of rows that >0.85. Then do the same for the second,
> third....in ref. At the end I would like a table gave below (the results I
> need). Please notice the all value specified in ref data file are exist in
> map$reg column.
>
>
>
> Task2)
>
> Again example, the first and second columns for row 1 in data ref are 29220
> 63933. So I need write an R code give the minimum map$p for the 29220
> -63933 intervals in map file. Than
>
> do the same for the second, third....in ref.
>
>
>
>
> #my attempt for the first question
>
> temp<-map[order(map$reg, map$p),]
>
> count<-1
>
> temp<-unique(temp$reg
>
> for(i in 1:length(ref) {
>
>   for(j in 1:length(ref)
>
>   {
>
> temp1<-if (temp[pos[i]==ref[ref$reg1,] &
(temp[pos[j]==ref[ref$reg2,]
> & temp[cumsum(temp$rate)
>>0.70,])
>
> count=count+1
>
>     }
>
> }
>
> #my attempt for the second question
>
>
>
> temp<-map[order(map$reg, map$p),]
>
> count<-1
>
> temp<-unique(temp$reg
>
> for(i in 1:length(ref) {
>
>   for(j in 1:length(ref)
>
>   {
>
> temp2<-if (temp[pos[i]==ref[ref$reg1,] &
(temp[pos[j]==ref[ref$reg2,])
>
> output<-temp2[temp2$p==min(temp2$p),]
>
>     }
>
> }
>
>
>
> Data sets
>
>
>   Data= map
>
>   reg   p      rate
>
>  10276 0.700  3.867e-18
>
>  71608 0.830  4.542e-16
>
>  29220 0.430  1.948e-15
>
>  99542 0.220  1.084e-15
>
>  26441 0.880  9.675e-14
>
>  95082 0.090  7.349e-13
>
>  36169 0.480  9.715e-13
>
>  55572 0.500  9.071e-12
>
>  65255 0.300  1.688e-11
>
>  51960 0.970  1.163e-10
>
>  55652 0.388  3.750e-10
>
>  63933 0.250  9.128e-10
>
>  35170 0.720  7.355e-09
>
>  06491 0.370  1.634e-08
>
>  85508 0.470  1.057e-07
>
>  86666 0.580  7.862e-07
>
>  04758 0.810  9.501e-07
>
>  06169 0.440  1.104e-06
>
>  63933 0.750  2.624e-06
>
>  41838 0.960  8.119e-06
>
>
>  data=ref
>
>   reg1         reg2
>
>   29220     63933
>
>   26441     41838
>
>   06169     10276
>
>   74806     92643
>
>   73732     82451
>
>   86042     93502
>
>   85508     95082
>
>
>
>        the results I need
>
>      reg1      reg2 n
>
>    29220   63933  12
>
>    26441   41838   78
>
>    06169 10276  125
>
>    74806 92643   11
>
>    73732 82451   47
>
>    86042 93502   98
>
>    85508 95082  219
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Jim Lemon

2016-Jun-12 23:06 UTC

head link

[R] two difficult loop

Hi Greg,
You've got a problem that you don't seem to have identified. Your
"reg" field in the "map" data frame can define at most
100000 unique
values. This means that each value will be repeated about 270 times.
Unless there are constraints you haven't mentioned, we would expect
that in 135 cases for each value, the values in each "ref" row will be
in the reverse order and the spans may overlap. I notice that you may
have tried to get around this by sorting the "map" data frame, but
then the order of the rows is different, and the number of rows
"between" any two values changes. Apart from this, it is almost
certain that the number of values of "p > 0.85" in the multiple
runs
between each set of "ref" values will be different. It is possible to
perform both tasks that you mention, but only the second will yield an
unique or tied value for all of the cases. So your result data frame
will have an unspecified number of values for each row in "ref" for
the first task.

Jim


On Mon, Jun 13, 2016 at 6:14 AM, greg holly <mak.hholly at gmail.com>
wrote:> Dear all;
>
>
>
> I have two data sets, data=map and data=ref). A small part of each data set
> are given below. Data map has more than 27 million and data ref has about
> 560 rows. Basically I need run two different task. My R codes for these
> task are given below but they do not work properly.
>
> I sincerely do appreciate your helps.
>
>
> Regards,
>
> Greg
>
>
>
> Task 1)
>
> For example, the first and second columns for row 1 in data ref are 29220
> 63933. So I need write an R code normally first look the first row in ref
> (which they are 29220 and 63933) than summing the column of
"map$rate" and
> give the number of rows that >0.85. Then do the same for the second,
> third....in ref. At the end I would like a table gave below (the results I
> need). Please notice the all value specified in ref data file are exist in
> map$reg column.
>
>
>
> Task2)
>
> Again example, the first and second columns for row 1 in data ref are 29220
> 63933. So I need write an R code give the minimum map$p for the 29220
> -63933 intervals in map file. Than
>
> do the same for the second, third....in ref.
>
>
>
>
> #my attempt for the first question
>
> temp<-map[order(map$reg, map$p),]
>
> count<-1
>
> temp<-unique(temp$reg
>
> for(i in 1:length(ref) {
>
>   for(j in 1:length(ref)
>
>   {
>
> temp1<-if (temp[pos[i]==ref[ref$reg1,] &
(temp[pos[j]==ref[ref$reg2,]
> & temp[cumsum(temp$rate)
>>0.70,])
>
> count=count+1
>
>     }
>
> }
>
> #my attempt for the second question
>
>
>
> temp<-map[order(map$reg, map$p),]
>
> count<-1
>
> temp<-unique(temp$reg
>
> for(i in 1:length(ref) {
>
>   for(j in 1:length(ref)
>
>   {
>
> temp2<-if (temp[pos[i]==ref[ref$reg1,] &
(temp[pos[j]==ref[ref$reg2,])
>
> output<-temp2[temp2$p==min(temp2$p),]
>
>     }
>
> }
>
>
>
> Data sets
>
>
>   Data= map
>
>   reg   p      rate
>
>  10276 0.700  3.867e-18
>
>  71608 0.830  4.542e-16
>
>  29220 0.430  1.948e-15
>
>  99542 0.220  1.084e-15
>
>  26441 0.880  9.675e-14
>
>  95082 0.090  7.349e-13
>
>  36169 0.480  9.715e-13
>
>  55572 0.500  9.071e-12
>
>  65255 0.300  1.688e-11
>
>  51960 0.970  1.163e-10
>
>  55652 0.388  3.750e-10
>
>  63933 0.250  9.128e-10
>
>  35170 0.720  7.355e-09
>
>  06491 0.370  1.634e-08
>
>  85508 0.470  1.057e-07
>
>  86666 0.580  7.862e-07
>
>  04758 0.810  9.501e-07
>
>  06169 0.440  1.104e-06
>
>  63933 0.750  2.624e-06
>
>  41838 0.960  8.119e-06
>
>
>  data=ref
>
>   reg1         reg2
>
>   29220     63933
>
>   26441     41838
>
>   06169     10276
>
>   74806     92643
>
>   73732     82451
>
>   86042     93502
>
>   85508     95082
>
>
>
>        the results I need
>
>      reg1      reg2 n
>
>    29220   63933  12
>
>    26441   41838   78
>
>    06169 10276  125
>
>    74806 92643   11
>
>    73732 82451   47
>
>    86042 93502   98
>
>    85508 95082  219
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

greg holly

2016-Jun-13 02:35 UTC

head link

[R] two difficult loop

Hi Bert;

I do appreciate for this. I need check your codes on task2 tomorrow at my
office on the real data as I have difficulty (because a technical issue) to
remote connection. I am sure it will work well.

I am sorry that I was not able to explain my first question. Basically

Values in ref data represent the region of chromosome. I need choose these
regions in map (all regions values in ref data are exist in map data in the
first column -column map$reg). And then summing up the column "map$rate and
count the numbers that gives >0.85. For example, consider  the first row in
data ref. They are 29220   and  63933. After sorting the first column in
map then summing column "map$rate" only between 29220   to  63933 in
sorted
map and cut off at >0.85. Then count how many rows in sorted map
gives>0.85. For example consider there are 38 rows between 29220   in  63933 in
sortedmap$reg and only summing first 12 of them  gives>0.85. Then my answer is
going to be 12 for 29220   -  63933 in ref.

Thanks I lot for your patience.

Cheers,
Greg

On Sun, Jun 12, 2016 at 10:35 PM, greg holly <mak.hholly at gmail.com>
wrote:
> Hi Bert;
>
> I do appreciate for this. I need check your codes on task2 tomorrow at my
> office on the real data as I have difficulty (because a technical issue) to
> remote connection. I am sure it will work well.
>
> I am sorry that I was not able to explain my first question. Basically
>
> Values in ref data represent the region of chromosome. I need choose these
> regions in map (all regions values in ref data are exist in map data in the
> first column -column map$reg). And then summing up the column
"map$rate and
> count the numbers that gives >0.85. For example, consider  the first row
in
> data ref. They are 29220   and  63933. After sorting the first column in
> map then summing column "map$rate" only between 29220   to  63933
in
> sorted map and cut off at >0.85. Then count how many rows in sorted map
> gives >0.85. For example consider there are 38 rows between 29220   in
>  63933 in sorted map$reg and only summing first 12 of them  gives>0.85.
> Then my answer is going to be 12 for 29220   -  63933 in ref.
>
> Thanks I lot for your patience.
>
> Cheers,
> Greg
>
> On Sun, Jun 12, 2016 at 6:36 PM, Bert Gunter <bgunter.4567 at
gmail.com>
> wrote:
>
>> Greg:
>>
>> I was not able to understand your task 1. Perhaps others can.
>>
>> My understanding of your task 2 is that for each row of ref, you wish
>> to find all rows,of map such that the reg values in those rows fall
>> between the reg1 and reg2 values in ref (inclusive change <= to <
if
>> you don't want the endpoints), and then you want the minimum map$p
>> values of all those rows. If that is correct, I believe this will do
>> it (but caution, untested, as you failed to provide data in a
>> convenient form, e.g. using dput() )
>>
>> task2 <- with(map,vapply(seq_len(nrow(ref)),function(i)
>> min(p[ref[i,1]<=reg & reg <= ref[i,2] ]),0))
>>
>>
>> If my understanding is incorrect, please ignore both the above and the
>> following:
>>
>>
>> The "solution" I have given above seems inefficient, so
others may be
>> able to significantly improve it if you find that it takes too long.
>> OTOH, my understanding of your specification is that you need to
>> search for all rows in map data frame that meet the criterion for each
>> row of ref, and without further information, I don't know how to do
>> this without just repeating the search 560 times.
>>
>>
>> Cheers,
>> Bert
>>
>>
>> Bert Gunter
>>
>> "The trouble with having an open mind is that people keep coming
along
>> and sticking things into it."
>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic
strip )
>>
>>
>> On Sun, Jun 12, 2016 at 1:14 PM, greg holly <mak.hholly at
gmail.com> wrote:
>> > Dear all;
>> >
>> >
>> >
>> > I have two data sets, data=map and data=ref). A small part of each
data
>> set
>> > are given below. Data map has more than 27 million and data ref
has
>> about
>> > 560 rows. Basically I need run two different task. My R codes for
these
>> > task are given below but they do not work properly.
>> >
>> > I sincerely do appreciate your helps.
>> >
>> >
>> > Regards,
>> >
>> > Greg
>> >
>> >
>> >
>> > Task 1)
>> >
>> > For example, the first and second columns for row 1 in data ref
are
>> 29220
>> > 63933. So I need write an R code normally first look the first row
in
>> ref
>> > (which they are 29220 and 63933) than summing the column of
"map$rate"
>> and
>> > give the number of rows that >0.85. Then do the same for the
second,
>> > third....in ref. At the end I would like a table gave below (the
>> results I
>> > need). Please notice the all value specified in ref data file are
exist
>> in
>> > map$reg column.
>> >
>> >
>> >
>> > Task2)
>> >
>> > Again example, the first and second columns for row 1 in data ref
are
>> 29220
>> > 63933. So I need write an R code give the minimum map$p for the
29220
>> > -63933 intervals in map file. Than
>> >
>> > do the same for the second, third....in ref.
>> >
>> >
>> >
>> >
>> > #my attempt for the first question
>> >
>> > temp<-map[order(map$reg, map$p),]
>> >
>> > count<-1
>> >
>> > temp<-unique(temp$reg
>> >
>> > for(i in 1:length(ref) {
>> >
>> >   for(j in 1:length(ref)
>> >
>> >   {
>> >
>> > temp1<-if (temp[pos[i]==ref[ref$reg1,] &
(temp[pos[j]==ref[ref$reg2,]
>> > & temp[cumsum(temp$rate)
>> >>0.70,])
>> >
>> > count=count+1
>> >
>> >     }
>> >
>> > }
>> >
>> > #my attempt for the second question
>> >
>> >
>> >
>> > temp<-map[order(map$reg, map$p),]
>> >
>> > count<-1
>> >
>> > temp<-unique(temp$reg
>> >
>> > for(i in 1:length(ref) {
>> >
>> >   for(j in 1:length(ref)
>> >
>> >   {
>> >
>> > temp2<-if (temp[pos[i]==ref[ref$reg1,] &
(temp[pos[j]==ref[ref$reg2,])
>> >
>> > output<-temp2[temp2$p==min(temp2$p),]
>> >
>> >     }
>> >
>> > }
>> >
>> >
>> >
>> > Data sets
>> >
>> >
>> >   Data= map
>> >
>> >   reg   p      rate
>> >
>> >  10276 0.700  3.867e-18
>> >
>> >  71608 0.830  4.542e-16
>> >
>> >  29220 0.430  1.948e-15
>> >
>> >  99542 0.220  1.084e-15
>> >
>> >  26441 0.880  9.675e-14
>> >
>> >  95082 0.090  7.349e-13
>> >
>> >  36169 0.480  9.715e-13
>> >
>> >  55572 0.500  9.071e-12
>> >
>> >  65255 0.300  1.688e-11
>> >
>> >  51960 0.970  1.163e-10
>> >
>> >  55652 0.388  3.750e-10
>> >
>> >  63933 0.250  9.128e-10
>> >
>> >  35170 0.720  7.355e-09
>> >
>> >  06491 0.370  1.634e-08
>> >
>> >  85508 0.470  1.057e-07
>> >
>> >  86666 0.580  7.862e-07
>> >
>> >  04758 0.810  9.501e-07
>> >
>> >  06169 0.440  1.104e-06
>> >
>> >  63933 0.750  2.624e-06
>> >
>> >  41838 0.960  8.119e-06
>> >
>> >
>> >  data=ref
>> >
>> >   reg1         reg2
>> >
>> >   29220     63933
>> >
>> >   26441     41838
>> >
>> >   06169     10276
>> >
>> >   74806     92643
>> >
>> >   73732     82451
>> >
>> >   86042     93502
>> >
>> >   85508     95082
>> >
>> >
>> >
>> >        the results I need
>> >
>> >      reg1      reg2 n
>> >
>> >    29220   63933  12
>> >
>> >    26441   41838   78
>> >
>> >    06169 10276  125
>> >
>> >    74806 92643   11
>> >
>> >    73732 82451   47
>> >
>> >    86042 93502   98
>> >
>> >    85508 95082  219
>> >
>> >         [[alternative HTML version deleted]]
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more,
see
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>>
>
>
	[[alternative HTML version deleted]]

greg holly

2016-Jun-13 02:41 UTC

head link

[R] two difficult loop

Hi Jim;

Thanks so much for this info. I did not know this as I am very much new in
R, So do you think that, rather than using unique !duplicated would be
better to use?

Thanks in advance,

Greg

On Sun, Jun 12, 2016 at 7:06 PM, Jim Lemon <drjimlemon at gmail.com>
wrote:
> Hi Greg,
> You've got a problem that you don't seem to have identified. Your
> "reg" field in the "map" data frame can define at most
100000 unique
> values. This means that each value will be repeated about 270 times.
> Unless there are constraints you haven't mentioned, we would expect
> that in 135 cases for each value, the values in each "ref" row
will be
> in the reverse order and the spans may overlap. I notice that you may
> have tried to get around this by sorting the "map" data frame,
but
> then the order of the rows is different, and the number of rows
> "between" any two values changes. Apart from this, it is almost
> certain that the number of values of "p > 0.85" in the
multiple runs
> between each set of "ref" values will be different. It is
possible to
> perform both tasks that you mention, but only the second will yield an
> unique or tied value for all of the cases. So your result data frame
> will have an unspecified number of values for each row in "ref"
for
> the first task.
>
> Jim
>
>
> On Mon, Jun 13, 2016 at 6:14 AM, greg holly <mak.hholly at gmail.com>
wrote:
> > Dear all;
> >
> >
> >
> > I have two data sets, data=map and data=ref). A small part of each
data
> set
> > are given below. Data map has more than 27 million and data ref has
about
> > 560 rows. Basically I need run two different task. My R codes for
these
> > task are given below but they do not work properly.
> >
> > I sincerely do appreciate your helps.
> >
> >
> > Regards,
> >
> > Greg
> >
> >
> >
> > Task 1)
> >
> > For example, the first and second columns for row 1 in data ref are
29220
> > 63933. So I need write an R code normally first look the first row in
ref
> > (which they are 29220 and 63933) than summing the column of
"map$rate"
> and
> > give the number of rows that >0.85. Then do the same for the
second,
> > third....in ref. At the end I would like a table gave below (the
results
> I
> > need). Please notice the all value specified in ref data file are
exist
> in
> > map$reg column.
> >
> >
> >
> > Task2)
> >
> > Again example, the first and second columns for row 1 in data ref are
> 29220
> > 63933. So I need write an R code give the minimum map$p for the 29220
> > -63933 intervals in map file. Than
> >
> > do the same for the second, third....in ref.
> >
> >
> >
> >
> > #my attempt for the first question
> >
> > temp<-map[order(map$reg, map$p),]
> >
> > count<-1
> >
> > temp<-unique(temp$reg
> >
> > for(i in 1:length(ref) {
> >
> >   for(j in 1:length(ref)
> >
> >   {
> >
> > temp1<-if (temp[pos[i]==ref[ref$reg1,] &
(temp[pos[j]==ref[ref$reg2,]
> > & temp[cumsum(temp$rate)
> >>0.70,])
> >
> > count=count+1
> >
> >     }
> >
> > }
> >
> > #my attempt for the second question
> >
> >
> >
> > temp<-map[order(map$reg, map$p),]
> >
> > count<-1
> >
> > temp<-unique(temp$reg
> >
> > for(i in 1:length(ref) {
> >
> >   for(j in 1:length(ref)
> >
> >   {
> >
> > temp2<-if (temp[pos[i]==ref[ref$reg1,] &
(temp[pos[j]==ref[ref$reg2,])
> >
> > output<-temp2[temp2$p==min(temp2$p),]
> >
> >     }
> >
> > }
> >
> >
> >
> > Data sets
> >
> >
> >   Data= map
> >
> >   reg   p      rate
> >
> >  10276 0.700  3.867e-18
> >
> >  71608 0.830  4.542e-16
> >
> >  29220 0.430  1.948e-15
> >
> >  99542 0.220  1.084e-15
> >
> >  26441 0.880  9.675e-14
> >
> >  95082 0.090  7.349e-13
> >
> >  36169 0.480  9.715e-13
> >
> >  55572 0.500  9.071e-12
> >
> >  65255 0.300  1.688e-11
> >
> >  51960 0.970  1.163e-10
> >
> >  55652 0.388  3.750e-10
> >
> >  63933 0.250  9.128e-10
> >
> >  35170 0.720  7.355e-09
> >
> >  06491 0.370  1.634e-08
> >
> >  85508 0.470  1.057e-07
> >
> >  86666 0.580  7.862e-07
> >
> >  04758 0.810  9.501e-07
> >
> >  06169 0.440  1.104e-06
> >
> >  63933 0.750  2.624e-06
> >
> >  41838 0.960  8.119e-06
> >
> >
> >  data=ref
> >
> >   reg1         reg2
> >
> >   29220     63933
> >
> >   26441     41838
> >
> >   06169     10276
> >
> >   74806     92643
> >
> >   73732     82451
> >
> >   86042     93502
> >
> >   85508     95082
> >
> >
> >
> >        the results I need
> >
> >      reg1      reg2 n
> >
> >    29220   63933  12
> >
> >    26441   41838   78
> >
> >    06169 10276  125
> >
> >    74806 92643   11
> >
> >    73732 82451   47
> >
> >    86042 93502   98
> >
> >    85508 95082  219
> >
> >         [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
	[[alternative HTML version deleted]]

Jim Lemon

2016-Jun-13 07:19 UTC

head link

[R] two difficult loop

Hi Greg,
Okay, I have a better idea now of what you want. The problem of
multiple matches is still there, but here is a start:

# this data frame actually contains all the values in ref in the "reg"
field
map<-read.table(text="reg p rate
 10276 0.700  3.867e-18
 71608 0.830  4.542e-16
 29220 0.430  1.948e-15
 99542 0.220  1.084e-15
 26441 0.880  9.675e-14
 95082 0.090  7.349e-13
 36169 0.480  9.715e-13
 55572 0.500  9.071e-12
 65255 0.300  1.688e-11
 51960 0.970  1.163e-10
 55652 0.388  3.750e-10
 63933 0.250  9.128e-10
 35170 0.720  7.355e-09
 06491 0.370  1.634e-08
 85508 0.470  1.057e-07
 86666 0.580  7.862e-07
 04758 0.810  9.501e-07
 06169 0.440  1.104e-06
 63933 0.750  2.624e-06
 41838 0.960  8.119e-06
 74806 0.810  9.501e-07
 92643 0.470  1.057e-07
 73732 0.090  7.349e-13
 82451 0.960  8.119e-06
 86042 0.480  9.715e-13
 93502 0.500  9.071e-12
 85508 0.370  1.634e-08
 95082 0.830  4.542e-16",
 header=TRUE)
# same as in your example
ref<-read.table(text="reg1 reg2
 29220     63933
 26441     41838
 06169     10276
 74806     92643
 73732     82451
 86042     93502
 85508     95082",
 header=TRUE)
# sort the "map" data frame
map2<-map[order(map$reg),]
# get a field for the counts
ref$n<-NA
# and a field for the minimum p values
ref$min_p<-NA
# get the number of rows in "ref"
nref<-dim(ref)[1]
for(i in 1:nref) {
 start<-which(map2$reg==ref$reg1[i])
 end<-which(map2$reg==ref$reg2[i])
 cat("start",start,"end",end,"\n")
 # get the range of matches
 regrange<-range(c(start,end))
 # convert this to a sequence spanning all matches
 allreg<-regrange[1]:regrange[2]
 ref$n[i]<-sum(map2$p[allreg] > 0.85)
 ref$min_p[i]<-min(map2$p[allreg])
}

This example uses the span from the first match of "reg1" to the last
match of "reg2". This may not be what you want, so let me know if
there are further constraints.

Jim

On Mon, Jun 13, 2016 at 12:35 PM, greg holly <mak.hholly at gmail.com>
wrote:> Hi Bert;
>
> I do appreciate for this. I need check your codes on task2 tomorrow at my
> office on the real data as I have difficulty (because a technical issue) to
> remote connection. I am sure it will work well.
>
> I am sorry that I was not able to explain my first question. Basically
>
> Values in ref data represent the region of chromosome. I need choose these
> regions in map (all regions values in ref data are exist in map data in the
> first column -column map$reg). And then summing up the column
"map$rate and
> count the numbers that gives >0.85. For example, consider  the first row
in
> data ref. They are 29220   and  63933. After sorting the first column in
> map then summing column "map$rate" only between 29220   to  63933
in sorted
> map and cut off at >0.85. Then count how many rows in sorted map gives
>>0.85. For example consider there are 38 rows between 29220   in  63933
in sorted
> map$reg and only summing first 12 of them  gives>0.85. Then my answer is
> going to be 12 for 29220   -  63933 in ref.
>
> Thanks I lot for your patience.
>
> Cheers,
> Greg
>
> On Sun, Jun 12, 2016 at 10:35 PM, greg holly <mak.hholly at
gmail.com> wrote:
>
>> Hi Bert;
>>
>> I do appreciate for this. I need check your codes on task2 tomorrow at
my
>> office on the real data as I have difficulty (because a technical
issue) to
>> remote connection. I am sure it will work well.
>>
>> I am sorry that I was not able to explain my first question. Basically
>>
>> Values in ref data represent the region of chromosome. I need choose
these
>> regions in map (all regions values in ref data are exist in map data in
the
>> first column -column map$reg). And then summing up the column
"map$rate and
>> count the numbers that gives >0.85. For example, consider  the first
row in
>> data ref. They are 29220   and  63933. After sorting the first column
in
>> map then summing column "map$rate" only between 29220   to 
63933 in
>> sorted map and cut off at >0.85. Then count how many rows in sorted
map
>> gives >0.85. For example consider there are 38 rows between 29220  
in
>>  63933 in sorted map$reg and only summing first 12 of them 
gives>0.85.
>> Then my answer is going to be 12 for 29220   -  63933 in ref.
>>
>> Thanks I lot for your patience.
>>
>> Cheers,
>> Greg
>>
>> On Sun, Jun 12, 2016 at 6:36 PM, Bert Gunter <bgunter.4567 at
gmail.com>
>> wrote:
>>
>>> Greg:
>>>
>>> I was not able to understand your task 1. Perhaps others can.
>>>
>>> My understanding of your task 2 is that for each row of ref, you
wish
>>> to find all rows,of map such that the reg values in those rows fall
>>> between the reg1 and reg2 values in ref (inclusive change <= to
< if
>>> you don't want the endpoints), and then you want the minimum
map$p
>>> values of all those rows. If that is correct, I believe this will
do
>>> it (but caution, untested, as you failed to provide data in a
>>> convenient form, e.g. using dput() )
>>>
>>> task2 <- with(map,vapply(seq_len(nrow(ref)),function(i)
>>> min(p[ref[i,1]<=reg & reg <= ref[i,2] ]),0))
>>>
>>>
>>> If my understanding is incorrect, please ignore both the above and
the
>>> following:
>>>
>>>
>>> The "solution" I have given above seems inefficient, so
others may be
>>> able to significantly improve it if you find that it takes too
long.
>>> OTOH, my understanding of your specification is that you need to
>>> search for all rows in map data frame that meet the criterion for
each
>>> row of ref, and without further information, I don't know how
to do
>>> this without just repeating the search 560 times.
>>>
>>>
>>> Cheers,
>>> Bert
>>>
>>>
>>> Bert Gunter
>>>
>>> "The trouble with having an open mind is that people keep
coming along
>>> and sticking things into it."
>>> -- Opus (aka Berkeley Breathed in his "Bloom County"
comic strip )
>>>
>>>
>>> On Sun, Jun 12, 2016 at 1:14 PM, greg holly <mak.hholly at
gmail.com> wrote:
>>> > Dear all;
>>> >
>>> >
>>> >
>>> > I have two data sets, data=map and data=ref). A small part of
each data
>>> set
>>> > are given below. Data map has more than 27 million and data
ref has
>>> about
>>> > 560 rows. Basically I need run two different task. My R codes
for these
>>> > task are given below but they do not work properly.
>>> >
>>> > I sincerely do appreciate your helps.
>>> >
>>> >
>>> > Regards,
>>> >
>>> > Greg
>>> >
>>> >
>>> >
>>> > Task 1)
>>> >
>>> > For example, the first and second columns for row 1 in data
ref are
>>> 29220
>>> > 63933. So I need write an R code normally first look the first
row in
>>> ref
>>> > (which they are 29220 and 63933) than summing the column of
"map$rate"
>>> and
>>> > give the number of rows that >0.85. Then do the same for
the second,
>>> > third....in ref. At the end I would like a table gave below
(the
>>> results I
>>> > need). Please notice the all value specified in ref data file
are exist
>>> in
>>> > map$reg column.
>>> >
>>> >
>>> >
>>> > Task2)
>>> >
>>> > Again example, the first and second columns for row 1 in data
ref are
>>> 29220
>>> > 63933. So I need write an R code give the minimum map$p for
the 29220
>>> > -63933 intervals in map file. Than
>>> >
>>> > do the same for the second, third....in ref.
>>> >
>>> >
>>> >
>>> >
>>> > #my attempt for the first question
>>> >
>>> > temp<-map[order(map$reg, map$p),]
>>> >
>>> > count<-1
>>> >
>>> > temp<-unique(temp$reg
>>> >
>>> > for(i in 1:length(ref) {
>>> >
>>> >   for(j in 1:length(ref)
>>> >
>>> >   {
>>> >
>>> > temp1<-if (temp[pos[i]==ref[ref$reg1,] &
(temp[pos[j]==ref[ref$reg2,]
>>> > & temp[cumsum(temp$rate)
>>> >>0.70,])
>>> >
>>> > count=count+1
>>> >
>>> >     }
>>> >
>>> > }
>>> >
>>> > #my attempt for the second question
>>> >
>>> >
>>> >
>>> > temp<-map[order(map$reg, map$p),]
>>> >
>>> > count<-1
>>> >
>>> > temp<-unique(temp$reg
>>> >
>>> > for(i in 1:length(ref) {
>>> >
>>> >   for(j in 1:length(ref)
>>> >
>>> >   {
>>> >
>>> > temp2<-if (temp[pos[i]==ref[ref$reg1,] &
(temp[pos[j]==ref[ref$reg2,])
>>> >
>>> > output<-temp2[temp2$p==min(temp2$p),]
>>> >
>>> >     }
>>> >
>>> > }
>>> >
>>> >
>>> >
>>> > Data sets
>>> >
>>> >
>>> >   Data= map
>>> >
>>> >   reg   p      rate
>>> >
>>> >  10276 0.700  3.867e-18
>>> >
>>> >  71608 0.830  4.542e-16
>>> >
>>> >  29220 0.430  1.948e-15
>>> >
>>> >  99542 0.220  1.084e-15
>>> >
>>> >  26441 0.880  9.675e-14
>>> >
>>> >  95082 0.090  7.349e-13
>>> >
>>> >  36169 0.480  9.715e-13
>>> >
>>> >  55572 0.500  9.071e-12
>>> >
>>> >  65255 0.300  1.688e-11
>>> >
>>> >  51960 0.970  1.163e-10
>>> >
>>> >  55652 0.388  3.750e-10
>>> >
>>> >  63933 0.250  9.128e-10
>>> >
>>> >  35170 0.720  7.355e-09
>>> >
>>> >  06491 0.370  1.634e-08
>>> >
>>> >  85508 0.470  1.057e-07
>>> >
>>> >  86666 0.580  7.862e-07
>>> >
>>> >  04758 0.810  9.501e-07
>>> >
>>> >  06169 0.440  1.104e-06
>>> >
>>> >  63933 0.750  2.624e-06
>>> >
>>> >  41838 0.960  8.119e-06
>>> >
>>> >
>>> >  data=ref
>>> >
>>> >   reg1         reg2
>>> >
>>> >   29220     63933
>>> >
>>> >   26441     41838
>>> >
>>> >   06169     10276
>>> >
>>> >   74806     92643
>>> >
>>> >   73732     82451
>>> >
>>> >   86042     93502
>>> >
>>> >   85508     95082
>>> >
>>> >
>>> >
>>> >        the results I need
>>> >
>>> >      reg1      reg2 n
>>> >
>>> >    29220   63933  12
>>> >
>>> >    26441   41838   78
>>> >
>>> >    06169 10276  125
>>> >
>>> >    74806 92643   11
>>> >
>>> >    73732 82451   47
>>> >
>>> >    86042 93502   98
>>> >
>>> >    85508 95082  219
>>> >
>>> >         [[alternative HTML version deleted]]
>>> >
>>> > ______________________________________________
>>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and
more, see
>>> > https://stat.ethz.ch/mailman/listinfo/r-help
>>> > PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> > and provide commented, minimal, self-contained, reproducible
code.
>>>
>>
>>
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

greg holly

2016-Jun-13 14:29 UTC

head link

[R] two difficult loop

Hi Jim;

I do apologize if bothering. I have run on the real data and here is the
error message I got:

Thanks,

Greg

start  end
Error in regrange[1]:regrange[2] : result would be too long a vector
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf

On Mon, Jun 13, 2016 at 10:28 AM, greg holly <mak.hholly at gmail.com>
wrote:
> Hi Jim;
>
> I do apologize if bothering. I have run on the real data and here is the
> error message I got:
>
> Thanks,
>
> Greg
>
> start  end
> Error in regrange[1]:regrange[2] : result would be too long a vector
> In addition: Warning messages:
> 1: In min(x) : no non-missing arguments to min; returning Inf
> 2: In max(x) : no non-missing arguments to max; returning -Inf
>
>
> On Mon, Jun 13, 2016 at 3:19 AM, Jim Lemon <drjimlemon at gmail.com>
wrote:
>
>> Hi Greg,
>> Okay, I have a better idea now of what you want. The problem of
>> multiple matches is still there, but here is a start:
>>
>> # this data frame actually contains all the values in ref in the
"reg"
>> field
>> map<-read.table(text="reg p rate
>>  10276 0.700  3.867e-18
>>  71608 0.830  4.542e-16
>>  29220 0.430  1.948e-15
>>  99542 0.220  1.084e-15
>>  26441 0.880  9.675e-14
>>  95082 0.090  7.349e-13
>>  36169 0.480  9.715e-13
>>  55572 0.500  9.071e-12
>>  65255 0.300  1.688e-11
>>  51960 0.970  1.163e-10
>>  55652 0.388  3.750e-10
>>  63933 0.250  9.128e-10
>>  35170 0.720  7.355e-09
>>  06491 0.370  1.634e-08
>>  85508 0.470  1.057e-07
>>  86666 0.580  7.862e-07
>>  04758 0.810  9.501e-07
>>  06169 0.440  1.104e-06
>>  63933 0.750  2.624e-06
>>  41838 0.960  8.119e-06
>>  74806 0.810  9.501e-07
>>  92643 0.470  1.057e-07
>>  73732 0.090  7.349e-13
>>  82451 0.960  8.119e-06
>>  86042 0.480  9.715e-13
>>  93502 0.500  9.071e-12
>>  85508 0.370  1.634e-08
>>  95082 0.830  4.542e-16",
>>  header=TRUE)
>> # same as in your example
>> ref<-read.table(text="reg1 reg2
>>  29220     63933
>>  26441     41838
>>  06169     10276
>>  74806     92643
>>  73732     82451
>>  86042     93502
>>  85508     95082",
>>  header=TRUE)
>> # sort the "map" data frame
>> map2<-map[order(map$reg),]
>> # get a field for the counts
>> ref$n<-NA
>> # and a field for the minimum p values
>> ref$min_p<-NA
>> # get the number of rows in "ref"
>> nref<-dim(ref)[1]
>> for(i in 1:nref) {
>>  start<-which(map2$reg==ref$reg1[i])
>>  end<-which(map2$reg==ref$reg2[i])
>>  cat("start",start,"end",end,"\n")
>>  # get the range of matches
>>  regrange<-range(c(start,end))
>>  # convert this to a sequence spanning all matches
>>  allreg<-regrange[1]:regrange[2]
>>  ref$n[i]<-sum(map2$p[allreg] > 0.85)
>>  ref$min_p[i]<-min(map2$p[allreg])
>> }
>>
>> This example uses the span from the first match of "reg1" to
the last
>> match of "reg2". This may not be what you want, so let me
know if
>> there are further constraints.
>>
>> Jim
>>
>> On Mon, Jun 13, 2016 at 12:35 PM, greg holly <mak.hholly at
gmail.com>
>> wrote:
>> > Hi Bert;
>> >
>> > I do appreciate for this. I need check your codes on task2
tomorrow at
>> my
>> > office on the real data as I have difficulty (because a technical
>> issue) to
>> > remote connection. I am sure it will work well.
>> >
>> > I am sorry that I was not able to explain my first question.
Basically
>> >
>> > Values in ref data represent the region of chromosome. I need
choose
>> these
>> > regions in map (all regions values in ref data are exist in map
data in
>> the
>> > first column -column map$reg). And then summing up the column
"map$rate
>> and
>> > count the numbers that gives >0.85. For example, consider  the
first
>> row in
>> > data ref. They are 29220   and  63933. After sorting the first
column in
>> > map then summing column "map$rate" only between 29220  
to  63933 in
>> sorted
>> > map and cut off at >0.85. Then count how many rows in sorted
map gives
>> >>0.85. For example consider there are 38 rows between 29220   in
63933
>> in sorted
>> > map$reg and only summing first 12 of them  gives>0.85. Then my
answer is
>> > going to be 12 for 29220   -  63933 in ref.
>> >
>> > Thanks I lot for your patience.
>> >
>> > Cheers,
>> > Greg
>> >
>> > On Sun, Jun 12, 2016 at 10:35 PM, greg holly <mak.hholly at
gmail.com>
>> wrote:
>> >
>> >> Hi Bert;
>> >>
>> >> I do appreciate for this. I need check your codes on task2
tomorrow at
>> my
>> >> office on the real data as I have difficulty (because a
technical
>> issue) to
>> >> remote connection. I am sure it will work well.
>> >>
>> >> I am sorry that I was not able to explain my first question.
Basically
>> >>
>> >> Values in ref data represent the region of chromosome. I need
choose
>> these
>> >> regions in map (all regions values in ref data are exist in
map data
>> in the
>> >> first column -column map$reg). And then summing up the column
>> "map$rate and
>> >> count the numbers that gives >0.85. For example, consider 
the first
>> row in
>> >> data ref. They are 29220   and  63933. After sorting the first
column
>> in
>> >> map then summing column "map$rate" only between
29220   to  63933 in
>> >> sorted map and cut off at >0.85. Then count how many rows
in sorted map
>> >> gives >0.85. For example consider there are 38 rows between
29220   in
>> >>  63933 in sorted map$reg and only summing first 12 of them 
gives>0.85.
>> >> Then my answer is going to be 12 for 29220   -  63933 in ref.
>> >>
>> >> Thanks I lot for your patience.
>> >>
>> >> Cheers,
>> >> Greg
>> >>
>> >> On Sun, Jun 12, 2016 at 6:36 PM, Bert Gunter <bgunter.4567
at gmail.com>
>> >> wrote:
>> >>
>> >>> Greg:
>> >>>
>> >>> I was not able to understand your task 1. Perhaps others
can.
>> >>>
>> >>> My understanding of your task 2 is that for each row of
ref, you wish
>> >>> to find all rows,of map such that the reg values in those
rows fall
>> >>> between the reg1 and reg2 values in ref (inclusive change
<= to < if
>> >>> you don't want the endpoints), and then you want the
minimum map$p
>> >>> values of all those rows. If that is correct, I believe
this will do
>> >>> it (but caution, untested, as you failed to provide data
in a
>> >>> convenient form, e.g. using dput() )
>> >>>
>> >>> task2 <- with(map,vapply(seq_len(nrow(ref)),function(i)
>> >>> min(p[ref[i,1]<=reg & reg <= ref[i,2] ]),0))
>> >>>
>> >>>
>> >>> If my understanding is incorrect, please ignore both the
above and the
>> >>> following:
>> >>>
>> >>>
>> >>> The "solution" I have given above seems
inefficient, so others may be
>> >>> able to significantly improve it if you find that it takes
too long.
>> >>> OTOH, my understanding of your specification is that you
need to
>> >>> search for all rows in map data frame that meet the
criterion for each
>> >>> row of ref, and without further information, I don't
know how to do
>> >>> this without just repeating the search 560 times.
>> >>>
>> >>>
>> >>> Cheers,
>> >>> Bert
>> >>>
>> >>>
>> >>> Bert Gunter
>> >>>
>> >>> "The trouble with having an open mind is that people
keep coming along
>> >>> and sticking things into it."
>> >>> -- Opus (aka Berkeley Breathed in his "Bloom
County" comic strip )
>> >>>
>> >>>
>> >>> On Sun, Jun 12, 2016 at 1:14 PM, greg holly <mak.hholly
at gmail.com>
>> wrote:
>> >>> > Dear all;
>> >>> >
>> >>> >
>> >>> >
>> >>> > I have two data sets, data=map and data=ref). A small
part of each
>> data
>> >>> set
>> >>> > are given below. Data map has more than 27 million
and data ref has
>> >>> about
>> >>> > 560 rows. Basically I need run two different task. My
R codes for
>> these
>> >>> > task are given below but they do not work properly.
>> >>> >
>> >>> > I sincerely do appreciate your helps.
>> >>> >
>> >>> >
>> >>> > Regards,
>> >>> >
>> >>> > Greg
>> >>> >
>> >>> >
>> >>> >
>> >>> > Task 1)
>> >>> >
>> >>> > For example, the first and second columns for row 1
in data ref are
>> >>> 29220
>> >>> > 63933. So I need write an R code normally first look
the first row
>> in
>> >>> ref
>> >>> > (which they are 29220 and 63933) than summing the
column of
>> "map$rate"
>> >>> and
>> >>> > give the number of rows that >0.85. Then do the
same for the second,
>> >>> > third....in ref. At the end I would like a table gave
below (the
>> >>> results I
>> >>> > need). Please notice the all value specified in ref
data file are
>> exist
>> >>> in
>> >>> > map$reg column.
>> >>> >
>> >>> >
>> >>> >
>> >>> > Task2)
>> >>> >
>> >>> > Again example, the first and second columns for row 1
in data ref
>> are
>> >>> 29220
>> >>> > 63933. So I need write an R code give the minimum
map$p for the
>> 29220
>> >>> > -63933 intervals in map file. Than
>> >>> >
>> >>> > do the same for the second, third....in ref.
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> > #my attempt for the first question
>> >>> >
>> >>> > temp<-map[order(map$reg, map$p),]
>> >>> >
>> >>> > count<-1
>> >>> >
>> >>> > temp<-unique(temp$reg
>> >>> >
>> >>> > for(i in 1:length(ref) {
>> >>> >
>> >>> >   for(j in 1:length(ref)
>> >>> >
>> >>> >   {
>> >>> >
>> >>> > temp1<-if (temp[pos[i]==ref[ref$reg1,] &
>> (temp[pos[j]==ref[ref$reg2,]
>> >>> > & temp[cumsum(temp$rate)
>> >>> >>0.70,])
>> >>> >
>> >>> > count=count+1
>> >>> >
>> >>> >     }
>> >>> >
>> >>> > }
>> >>> >
>> >>> > #my attempt for the second question
>> >>> >
>> >>> >
>> >>> >
>> >>> > temp<-map[order(map$reg, map$p),]
>> >>> >
>> >>> > count<-1
>> >>> >
>> >>> > temp<-unique(temp$reg
>> >>> >
>> >>> > for(i in 1:length(ref) {
>> >>> >
>> >>> >   for(j in 1:length(ref)
>> >>> >
>> >>> >   {
>> >>> >
>> >>> > temp2<-if (temp[pos[i]==ref[ref$reg1,] &
>> (temp[pos[j]==ref[ref$reg2,])
>> >>> >
>> >>> > output<-temp2[temp2$p==min(temp2$p),]
>> >>> >
>> >>> >     }
>> >>> >
>> >>> > }
>> >>> >
>> >>> >
>> >>> >
>> >>> > Data sets
>> >>> >
>> >>> >
>> >>> >   Data= map
>> >>> >
>> >>> >   reg   p      rate
>> >>> >
>> >>> >  10276 0.700  3.867e-18
>> >>> >
>> >>> >  71608 0.830  4.542e-16
>> >>> >
>> >>> >  29220 0.430  1.948e-15
>> >>> >
>> >>> >  99542 0.220  1.084e-15
>> >>> >
>> >>> >  26441 0.880  9.675e-14
>> >>> >
>> >>> >  95082 0.090  7.349e-13
>> >>> >
>> >>> >  36169 0.480  9.715e-13
>> >>> >
>> >>> >  55572 0.500  9.071e-12
>> >>> >
>> >>> >  65255 0.300  1.688e-11
>> >>> >
>> >>> >  51960 0.970  1.163e-10
>> >>> >
>> >>> >  55652 0.388  3.750e-10
>> >>> >
>> >>> >  63933 0.250  9.128e-10
>> >>> >
>> >>> >  35170 0.720  7.355e-09
>> >>> >
>> >>> >  06491 0.370  1.634e-08
>> >>> >
>> >>> >  85508 0.470  1.057e-07
>> >>> >
>> >>> >  86666 0.580  7.862e-07
>> >>> >
>> >>> >  04758 0.810  9.501e-07
>> >>> >
>> >>> >  06169 0.440  1.104e-06
>> >>> >
>> >>> >  63933 0.750  2.624e-06
>> >>> >
>> >>> >  41838 0.960  8.119e-06
>> >>> >
>> >>> >
>> >>> >  data=ref
>> >>> >
>> >>> >   reg1         reg2
>> >>> >
>> >>> >   29220     63933
>> >>> >
>> >>> >   26441     41838
>> >>> >
>> >>> >   06169     10276
>> >>> >
>> >>> >   74806     92643
>> >>> >
>> >>> >   73732     82451
>> >>> >
>> >>> >   86042     93502
>> >>> >
>> >>> >   85508     95082
>> >>> >
>> >>> >
>> >>> >
>> >>> >        the results I need
>> >>> >
>> >>> >      reg1      reg2 n
>> >>> >
>> >>> >    29220   63933  12
>> >>> >
>> >>> >    26441   41838   78
>> >>> >
>> >>> >    06169 10276  125
>> >>> >
>> >>> >    74806 92643   11
>> >>> >
>> >>> >    73732 82451   47
>> >>> >
>> >>> >    86042 93502   98
>> >>> >
>> >>> >    85508 95082  219
>> >>> >
>> >>> >         [[alternative HTML version deleted]]
>> >>> >
>> >>> > ______________________________________________
>> >>> > R-help at r-project.org mailing list -- To
UNSUBSCRIBE and more, see
>> >>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> >>> > PLEASE do read the posting guide
>> >>> http://www.R-project.org/posting-guide.html
>> >>> > and provide commented, minimal, self-contained,
reproducible code.
>> >>>
>> >>
>> >>
>> >
>> >         [[alternative HTML version deleted]]
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more,
see
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>>
>
>
	[[alternative HTML version deleted]]

Jim Lemon

2016-Jun-14 07:39 UTC

head link

[R] two difficult loop

Hi Greg,
This is obviously a problem with the data. The first error indicates
that the sequence of integers regrange[1]:regrange[2] is too long to
be allocated. Most likely this is because one or both of the endpoints
are infinite. Maybe if you can find where the NAs are you can fix it.

Jim

On Tue, Jun 14, 2016 at 12:29 AM, greg holly <mak.hholly at gmail.com>
wrote:> Hi Jim;
>
> I do apologize if bothering. I have run on the real data and here is the
> error message I got:
>
> Thanks,
>
> Greg
>
> start  end
> Error in regrange[1]:regrange[2] : result would be too long a vector
> In addition: Warning messages:
> 1: In min(x) : no non-missing arguments to min; returning Inf
> 2: In max(x) : no non-missing arguments to max; returning -Inf
>
> On Mon, Jun 13, 2016 at 10:28 AM, greg holly <mak.hholly at
gmail.com> wrote:
>>
>> Hi Jim;
>>
>> I do apologize if bothering. I have run on the real data and here is
the
>> error message I got:
>>
>> Thanks,
>>
>> Greg
>>
>> start  end
>> Error in regrange[1]:regrange[2] : result would be too long a vector
>> In addition: Warning messages:
>> 1: In min(x) : no non-missing arguments to min; returning Inf
>> 2: In max(x) : no non-missing arguments to max; returning -Inf
>>
>>
>> On Mon, Jun 13, 2016 at 3:19 AM, Jim Lemon <drjimlemon at
gmail.com> wrote:
>>>
>>> Hi Greg,
>>> Okay, I have a better idea now of what you want. The problem of
>>> multiple matches is still there, but here is a start:
>>>
>>> # this data frame actually contains all the values in ref in the
"reg"
>>> field
>>> map<-read.table(text="reg p rate
>>>  10276 0.700  3.867e-18
>>>  71608 0.830  4.542e-16
>>>  29220 0.430  1.948e-15
>>>  99542 0.220  1.084e-15
>>>  26441 0.880  9.675e-14
>>>  95082 0.090  7.349e-13
>>>  36169 0.480  9.715e-13
>>>  55572 0.500  9.071e-12
>>>  65255 0.300  1.688e-11
>>>  51960 0.970  1.163e-10
>>>  55652 0.388  3.750e-10
>>>  63933 0.250  9.128e-10
>>>  35170 0.720  7.355e-09
>>>  06491 0.370  1.634e-08
>>>  85508 0.470  1.057e-07
>>>  86666 0.580  7.862e-07
>>>  04758 0.810  9.501e-07
>>>  06169 0.440  1.104e-06
>>>  63933 0.750  2.624e-06
>>>  41838 0.960  8.119e-06
>>>  74806 0.810  9.501e-07
>>>  92643 0.470  1.057e-07
>>>  73732 0.090  7.349e-13
>>>  82451 0.960  8.119e-06
>>>  86042 0.480  9.715e-13
>>>  93502 0.500  9.071e-12
>>>  85508 0.370  1.634e-08
>>>  95082 0.830  4.542e-16",
>>>  header=TRUE)
>>> # same as in your example
>>> ref<-read.table(text="reg1 reg2
>>>  29220     63933
>>>  26441     41838
>>>  06169     10276
>>>  74806     92643
>>>  73732     82451
>>>  86042     93502
>>>  85508     95082",
>>>  header=TRUE)
>>> # sort the "map" data frame
>>> map2<-map[order(map$reg),]
>>> # get a field for the counts
>>> ref$n<-NA
>>> # and a field for the minimum p values
>>> ref$min_p<-NA
>>> # get the number of rows in "ref"
>>> nref<-dim(ref)[1]
>>> for(i in 1:nref) {
>>>  start<-which(map2$reg==ref$reg1[i])
>>>  end<-which(map2$reg==ref$reg2[i])
>>>  cat("start",start,"end",end,"\n")
>>>  # get the range of matches
>>>  regrange<-range(c(start,end))
>>>  # convert this to a sequence spanning all matches
>>>  allreg<-regrange[1]:regrange[2]
>>>  ref$n[i]<-sum(map2$p[allreg] > 0.85)
>>>  ref$min_p[i]<-min(map2$p[allreg])
>>> }
>>>
>>> This example uses the span from the first match of "reg1"
to the last
>>> match of "reg2". This may not be what you want, so let me
know if
>>> there are further constraints.
>>>
>>> Jim
>>>
>>> On Mon, Jun 13, 2016 at 12:35 PM, greg holly <mak.hholly at
gmail.com>
>>> wrote:
>>> > Hi Bert;
>>> >
>>> > I do appreciate for this. I need check your codes on task2
tomorrow at
>>> > my
>>> > office on the real data as I have difficulty (because a
technical
>>> > issue) to
>>> > remote connection. I am sure it will work well.
>>> >
>>> > I am sorry that I was not able to explain my first question.
Basically
>>> >
>>> > Values in ref data represent the region of chromosome. I need
choose
>>> > these
>>> > regions in map (all regions values in ref data are exist in
map data in
>>> > the
>>> > first column -column map$reg). And then summing up the column
"map$rate
>>> > and
>>> > count the numbers that gives >0.85. For example, consider 
the first
>>> > row in
>>> > data ref. They are 29220   and  63933. After sorting the first
column
>>> > in
>>> > map then summing column "map$rate" only between
29220   to  63933 in
>>> > sorted
>>> > map and cut off at >0.85. Then count how many rows in
sorted map gives
>>> >>0.85. For example consider there are 38 rows between 29220 
in  63933
>>> >> in sorted
>>> > map$reg and only summing first 12 of them  gives>0.85. Then
my answer
>>> > is
>>> > going to be 12 for 29220   -  63933 in ref.
>>> >
>>> > Thanks I lot for your patience.
>>> >
>>> > Cheers,
>>> > Greg
>>> >
>>> > On Sun, Jun 12, 2016 at 10:35 PM, greg holly <mak.hholly at
gmail.com>
>>> > wrote:
>>> >
>>> >> Hi Bert;
>>> >>
>>> >> I do appreciate for this. I need check your codes on task2
tomorrow at
>>> >> my
>>> >> office on the real data as I have difficulty (because a
technical
>>> >> issue) to
>>> >> remote connection. I am sure it will work well.
>>> >>
>>> >> I am sorry that I was not able to explain my first
question. Basically
>>> >>
>>> >> Values in ref data represent the region of chromosome. I
need choose
>>> >> these
>>> >> regions in map (all regions values in ref data are exist
in map data
>>> >> in the
>>> >> first column -column map$reg). And then summing up the
column
>>> >> "map$rate and
>>> >> count the numbers that gives >0.85. For example,
consider  the first
>>> >> row in
>>> >> data ref. They are 29220   and  63933. After sorting the
first column
>>> >> in
>>> >> map then summing column "map$rate" only between
29220   to  63933 in
>>> >> sorted map and cut off at >0.85. Then count how many
rows in sorted
>>> >> map
>>> >> gives >0.85. For example consider there are 38 rows
between 29220   in
>>> >>  63933 in sorted map$reg and only summing first 12 of them
>>> >> gives>0.85.
>>> >> Then my answer is going to be 12 for 29220   -  63933 in
ref.
>>> >>
>>> >> Thanks I lot for your patience.
>>> >>
>>> >> Cheers,
>>> >> Greg
>>> >>
>>> >> On Sun, Jun 12, 2016 at 6:36 PM, Bert Gunter
<bgunter.4567 at gmail.com>
>>> >> wrote:
>>> >>
>>> >>> Greg:
>>> >>>
>>> >>> I was not able to understand your task 1. Perhaps
others can.
>>> >>>
>>> >>> My understanding of your task 2 is that for each row
of ref, you wish
>>> >>> to find all rows,of map such that the reg values in
those rows fall
>>> >>> between the reg1 and reg2 values in ref (inclusive
change <= to < if
>>> >>> you don't want the endpoints), and then you want
the minimum map$p
>>> >>> values of all those rows. If that is correct, I
believe this will do
>>> >>> it (but caution, untested, as you failed to provide
data in a
>>> >>> convenient form, e.g. using dput() )
>>> >>>
>>> >>> task2 <-
with(map,vapply(seq_len(nrow(ref)),function(i)
>>> >>> min(p[ref[i,1]<=reg & reg <= ref[i,2] ]),0))
>>> >>>
>>> >>>
>>> >>> If my understanding is incorrect, please ignore both
the above and
>>> >>> the
>>> >>> following:
>>> >>>
>>> >>>
>>> >>> The "solution" I have given above seems
inefficient, so others may be
>>> >>> able to significantly improve it if you find that it
takes too long.
>>> >>> OTOH, my understanding of your specification is that
you need to
>>> >>> search for all rows in map data frame that meet the
criterion for
>>> >>> each
>>> >>> row of ref, and without further information, I
don't know how to do
>>> >>> this without just repeating the search 560 times.
>>> >>>
>>> >>>
>>> >>> Cheers,
>>> >>> Bert
>>> >>>
>>> >>>
>>> >>> Bert Gunter
>>> >>>
>>> >>> "The trouble with having an open mind is that
people keep coming
>>> >>> along
>>> >>> and sticking things into it."
>>> >>> -- Opus (aka Berkeley Breathed in his "Bloom
County" comic strip )
>>> >>>
>>> >>>
>>> >>> On Sun, Jun 12, 2016 at 1:14 PM, greg holly
<mak.hholly at gmail.com>
>>> >>> wrote:
>>> >>> > Dear all;
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > I have two data sets, data=map and data=ref). A
small part of each
>>> >>> > data
>>> >>> set
>>> >>> > are given below. Data map has more than 27
million and data ref has
>>> >>> about
>>> >>> > 560 rows. Basically I need run two different
task. My R codes for
>>> >>> > these
>>> >>> > task are given below but they do not work
properly.
>>> >>> >
>>> >>> > I sincerely do appreciate your helps.
>>> >>> >
>>> >>> >
>>> >>> > Regards,
>>> >>> >
>>> >>> > Greg
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > Task 1)
>>> >>> >
>>> >>> > For example, the first and second columns for row
1 in data ref are
>>> >>> 29220
>>> >>> > 63933. So I need write an R code normally first
look the first row
>>> >>> > in
>>> >>> ref
>>> >>> > (which they are 29220 and 63933) than summing the
column of
>>> >>> > "map$rate"
>>> >>> and
>>> >>> > give the number of rows that >0.85. Then do
the same for the
>>> >>> > second,
>>> >>> > third....in ref. At the end I would like a table
gave below (the
>>> >>> results I
>>> >>> > need). Please notice the all value specified in
ref data file are
>>> >>> > exist
>>> >>> in
>>> >>> > map$reg column.
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > Task2)
>>> >>> >
>>> >>> > Again example, the first and second columns for
row 1 in data ref
>>> >>> > are
>>> >>> 29220
>>> >>> > 63933. So I need write an R code give the minimum
map$p for the
>>> >>> > 29220
>>> >>> > -63933 intervals in map file. Than
>>> >>> >
>>> >>> > do the same for the second, third....in ref.
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > #my attempt for the first question
>>> >>> >
>>> >>> > temp<-map[order(map$reg, map$p),]
>>> >>> >
>>> >>> > count<-1
>>> >>> >
>>> >>> > temp<-unique(temp$reg
>>> >>> >
>>> >>> > for(i in 1:length(ref) {
>>> >>> >
>>> >>> >   for(j in 1:length(ref)
>>> >>> >
>>> >>> >   {
>>> >>> >
>>> >>> > temp1<-if (temp[pos[i]==ref[ref$reg1,] &
>>> >>> > (temp[pos[j]==ref[ref$reg2,]
>>> >>> > & temp[cumsum(temp$rate)
>>> >>> >>0.70,])
>>> >>> >
>>> >>> > count=count+1
>>> >>> >
>>> >>> >     }
>>> >>> >
>>> >>> > }
>>> >>> >
>>> >>> > #my attempt for the second question
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > temp<-map[order(map$reg, map$p),]
>>> >>> >
>>> >>> > count<-1
>>> >>> >
>>> >>> > temp<-unique(temp$reg
>>> >>> >
>>> >>> > for(i in 1:length(ref) {
>>> >>> >
>>> >>> >   for(j in 1:length(ref)
>>> >>> >
>>> >>> >   {
>>> >>> >
>>> >>> > temp2<-if (temp[pos[i]==ref[ref$reg1,] &
>>> >>> > (temp[pos[j]==ref[ref$reg2,])
>>> >>> >
>>> >>> > output<-temp2[temp2$p==min(temp2$p),]
>>> >>> >
>>> >>> >     }
>>> >>> >
>>> >>> > }
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > Data sets
>>> >>> >
>>> >>> >
>>> >>> >   Data= map
>>> >>> >
>>> >>> >   reg   p      rate
>>> >>> >
>>> >>> >  10276 0.700  3.867e-18
>>> >>> >
>>> >>> >  71608 0.830  4.542e-16
>>> >>> >
>>> >>> >  29220 0.430  1.948e-15
>>> >>> >
>>> >>> >  99542 0.220  1.084e-15
>>> >>> >
>>> >>> >  26441 0.880  9.675e-14
>>> >>> >
>>> >>> >  95082 0.090  7.349e-13
>>> >>> >
>>> >>> >  36169 0.480  9.715e-13
>>> >>> >
>>> >>> >  55572 0.500  9.071e-12
>>> >>> >
>>> >>> >  65255 0.300  1.688e-11
>>> >>> >
>>> >>> >  51960 0.970  1.163e-10
>>> >>> >
>>> >>> >  55652 0.388  3.750e-10
>>> >>> >
>>> >>> >  63933 0.250  9.128e-10
>>> >>> >
>>> >>> >  35170 0.720  7.355e-09
>>> >>> >
>>> >>> >  06491 0.370  1.634e-08
>>> >>> >
>>> >>> >  85508 0.470  1.057e-07
>>> >>> >
>>> >>> >  86666 0.580  7.862e-07
>>> >>> >
>>> >>> >  04758 0.810  9.501e-07
>>> >>> >
>>> >>> >  06169 0.440  1.104e-06
>>> >>> >
>>> >>> >  63933 0.750  2.624e-06
>>> >>> >
>>> >>> >  41838 0.960  8.119e-06
>>> >>> >
>>> >>> >
>>> >>> >  data=ref
>>> >>> >
>>> >>> >   reg1         reg2
>>> >>> >
>>> >>> >   29220     63933
>>> >>> >
>>> >>> >   26441     41838
>>> >>> >
>>> >>> >   06169     10276
>>> >>> >
>>> >>> >   74806     92643
>>> >>> >
>>> >>> >   73732     82451
>>> >>> >
>>> >>> >   86042     93502
>>> >>> >
>>> >>> >   85508     95082
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> >        the results I need
>>> >>> >
>>> >>> >      reg1      reg2 n
>>> >>> >
>>> >>> >    29220   63933  12
>>> >>> >
>>> >>> >    26441   41838   78
>>> >>> >
>>> >>> >    06169 10276  125
>>> >>> >
>>> >>> >    74806 92643   11
>>> >>> >
>>> >>> >    73732 82451   47
>>> >>> >
>>> >>> >    86042 93502   98
>>> >>> >
>>> >>> >    85508 95082  219
>>> >>> >
>>> >>> >         [[alternative HTML version deleted]]
>>> >>> >
>>> >>> > ______________________________________________
>>> >>> > R-help at r-project.org mailing list -- To
UNSUBSCRIBE and more, see
>>> >>> > https://stat.ethz.ch/mailman/listinfo/r-help
>>> >>> > PLEASE do read the posting guide
>>> >>> http://www.R-project.org/posting-guide.html
>>> >>> > and provide commented, minimal, self-contained,
reproducible code.
>>> >>>
>>> >>
>>> >>
>>> >
>>> >         [[alternative HTML version deleted]]
>>> >
>>> > ______________________________________________
>>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and
more, see
>>> > https://stat.ethz.ch/mailman/listinfo/r-help
>>> > PLEASE do read the posting guide
>>> > http://www.R-project.org/posting-guide.html
>>> > and provide commented, minimal, self-contained, reproducible
code.
>>
>>
>

R help - Jun 2016 - two difficult loop

[R] two difficult loop

[R] two difficult loop

[R] two difficult loop

[R] two difficult loop

[R] two difficult loop

[R] two difficult loop

[R] two difficult loop

[R] two difficult loop