thr3ads.net - R help - [R] queue waiting times comparison [Aug 2011]

Petr PIKAL

2011-Aug-18 11:49 UTC

[R] queue waiting times comparison

Hello all,

I am trying to find a way to compare sets of waiting times from different
periods. I tried to learn something from queueing theory and also used the R
search. There are plenty of approaches, but I need the easiest and simplest
one.
Here is a list with the actual waiting times.

ml <- structure(list(y1 = c(10, 9, 9, 10, 8, 20, 16, 47, 4, 7, 15, 
18, 36, 5, 24, 15, 40, 10), y2 = c(97, 10, 26, 11, 11, 10, 5, 
13, 19, 5, 5, 59, 4, 16, 10)), .Names = c("y1", "y2"))

par(mfrow=c(1,2))
lapply(ml, hist)

shows that there are more long waiting times in the first year

lapply(ml, mean)

shows (incorrectly) that the average waiting time is longer in the
second year.

lapply(ml, mean)

gives me completely reversed values.

Can you please give me some hints on what to use for a "correct" and
"simple" comparison of waiting times in two or more periods?

Thank you
Petr

jim holtman

2011-Aug-18 12:09 UTC

[R] queue waiting times comparison

I am not sure why you say that "lapply(ml, mean)" shows (incorrectly)
that the second year has a larger average; it is correct for the data:
> lapply(ml, my.func)
$y1
    Count      Mean        SD       Min    Median       90%       95%       Max       Sum
 18.00000  16.83333  12.42980   4.00000  12.50000  37.20000  41.05000  47.00000 303.00000

$y2
    Count      Mean        SD       Min    Median       90%       95%       Max       Sum
 15.00000  20.06667  25.27694   4.00000  11.00000  45.80000  70.40000  97.00000 301.00000


You have a larger "outlier" in the second year that causes the mean to
be higher.  The median is lower, but I usually look at the 90th
percentile if I am looking at response time from a system and again
the second year has a higher value.

So exactly why do you not "trust" your data?
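
Note that my.func is not defined anywhere in this thread. A minimal sketch of a summary function that gives output of the same shape is shown below; the function body is an assumption (not necessarily what was actually run), and it relies on the default quantile() method.

```r
# Waiting times from the original post
ml <- list(
  y1 = c(10, 9, 9, 10, 8, 20, 16, 47, 4, 7, 15, 18, 36, 5, 24, 15, 40, 10),
  y2 = c(97, 10, 26, 11, 11, 10, 5, 13, 19, 5, 5, 59, 4, 16, 10)
)

# Hypothetical stand-in for the undefined my.func:
# count, mean, sd, min, median, 90th/95th percentile, max and sum
my.func <- function(x) {
  q <- quantile(x, probs = c(0.90, 0.95), names = FALSE)
  c(Count = length(x), Mean = mean(x), SD = sd(x), Min = min(x),
    Median = median(x), "90%" = q[1], "95%" = q[2],
    Max = max(x), Sum = sum(x))
}

res <- lapply(ml, my.func)
print(res)
```

The mean and the upper percentiles are both higher for y2, while the median is lower, which is the pattern discussed above.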


-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?

Petr PIKAL

2011-Aug-18 12:52 UTC

[R] queue waiting times comparison

Hello Jim,

Thank you; see my comments inline.

jim holtman <jholtman at gmail.com> wrote on 18.08.2011 14:09:11:
> So exactly why do you not "trust" your data?
Well, I trust them; however, the mean is a "correct" central value only when
the data are normally distributed or at least symmetrical. As the values are
heavily skewed, I feel that I should not use the mean for comparing such
sets. Anyway, t.test tells me that there is no difference between y2 and
y1.
> t.test(ml[[1]], ml[[2]])
        Welch Two Sample t-test

data:  ml[[1]] and ml[[2]] 
t = -0.452, df = 19.557, p-value = 0.6563
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -18.17781  11.71115 
sample estimates:
mean of x mean of y 
 16.83333  20.06667 

So based on this I will probably never get a conclusive result, as the sd
will be quite high due to the "outliers".

When I do
plot(ecdf(ml[[2]]))
plot(ecdf(ml[[1]]), add=TRUE, col=2)

it seems to me that both sets are almost the same and that they differ
substantially only in those "outlier" values.

If I halve the small values of y2, e.g.

ml[[2]][ml[[2]]<20] <- ml[[2]][ml[[2]]<20]/2

I get nearly the same mean

lapply(ml, mean)
$y1
[1] 16.83333

$y2
[1] 16.1

and t.test tells me that there is no difference between those two sets,
although I know that most events take half the time and only a few last
longer. For me such a set is better (we improved performance for most
events, although there are still rare events which take a long time).

plot(ecdf(ml[[2]]))
plot(ecdf(ml[[1]]), add=TRUE, col=2)

So the question remains: what procedure should I use to compare two or
more sets with such long-tailed distributions? Trimmed mean? Median?
...
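
For illustration, the options named in this question can be tried directly on the original data. The sketch below is only one possible approach, not a recommendation from the thread: the 10% trimming fraction, the helper name robust_summary and the use of a rank-based test (wilcox.test with the normal approximation, because the data contain ties) are all assumptions.

```r
# Original (unmodified) waiting times from the first post
ml <- list(
  y1 = c(10, 9, 9, 10, 8, 20, 16, 47, 4, 7, 15, 18, 36, 5, 24, 15, 40, 10),
  y2 = c(97, 10, 26, 11, 11, 10, 5, 13, 19, 5, 5, 59, 4, 16, 10)
)

# Hypothetical helper: location and spread measures that are less
# sensitive to a few very long waiting times than mean and sd
robust_summary <- function(x, trim = 0.1) {
  c(Median = median(x),
    TrimmedMean = mean(x, trim = trim),  # drops trim fraction at each end
    MAD = mad(x))                        # median absolute deviation
}

rs <- sapply(ml, robust_summary)
print(rs)

# Rank-based two-sample comparison; exact = FALSE avoids the
# "cannot compute exact p-value with ties" warning
wt <- wilcox.test(ml$y1, ml$y2, exact = FALSE)
print(wt$p.value)
```

These measures describe the bulk of each distribution and deliberately downweight the tail, so they answer a different question than a comparison of upper percentiles.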

Thanks.

Regards
Petr

jim holtman

2011-Aug-18 13:39 UTC

[R] queue waiting times comparison

If those values represent response times in a system, then when I was
responsible for characterizing what the system would do from the
viewpoint of an SLA (service level agreement) with customers using the
system, we usually specified that "90% of the transactions would have
a response time of --- or less".  This took care of most "long tails".
So it depends on how you are planning to use this data.  We usually
monitored the 90th or 95th percentile to see how a system was
operating day to day.
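
As a rough sketch of this percentile view applied to the data from the original post: the snippet below computes the 90th and 95th percentiles per period and the share of waiting times within a limit. The limit of 40 (sla_limit) is purely hypothetical and chosen only for illustration; no threshold is given in the thread.

```r
# Waiting times from the original post
ml <- list(
  y1 = c(10, 9, 9, 10, 8, 20, 16, 47, 4, 7, 15, 18, 36, 5, 24, 15, 40, 10),
  y2 = c(97, 10, 26, 11, 11, 10, 5, 13, 19, 5, 5, 59, 4, 16, 10)
)

# 90th and 95th percentile for each period (rows "90%" and "95%")
pct <- sapply(ml, quantile, probs = c(0.90, 0.95))
print(pct)

# Share of waiting times at or below a hypothetical SLA limit
sla_limit <- 40  # assumed value, for illustration only
within_sla <- sapply(ml, function(x) mean(x <= sla_limit))
print(round(within_sla, 3))
```

The same two lines can be rerun for each new period to track how the tail behaves over time.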



Petr PIKAL

2011-Aug-18 14:12 UTC

[R] queue waiting times comparison

Hi Jim
I get the point. This can be an option. I will discuss it with my 
colleagues.

Thank you for your time and your answer.

Best regards
Petr

Gabor Grothendieck

2011-Aug-18 15:08 UTC

[R] queue waiting times comparison

Here are two more plots, each of which shows that the main mass of the
two distributions is the same but the right tails differ:

# 1
plot(density(ml$y2, adjust = 2), col = 2)
lines(density(ml$y1, adjust = 2), col = 1)
legend("topright", legend = 1:2, col = 1:2, lty = 1)

# 2
qqplot(ml$y1, ml$y2, xlim = c(0, 100), ylim = c(0, 100))
abline(0, 1)

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
