Huntsinger, Reid
2005-Apr-28 18:40 UTC
[R] have to point it out again: a distribution question
Stock returns and other financial data have often been found to be heavy-tailed.
Even Cauchy distributions (which lack even a first absolute moment) have been
entertained as models.
Your qq function subtracts numbers on the scale of a normal(0,1)
distribution from the input data. When the input data vary so little that
their spread is insignificant compared to 1, say, then l$y - l$x gives you
back essentially the "theoretical quantiles", i.e. the "x" component of the
list, up to a sign change and a constant shift. l$x is basically a sample
from a normal(0,1) distribution, so the differences do line up perfectly in
the second qqnorm(). Is that what's happening?
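
A minimal sketch of this effect, assuming simulated data in place of the unposted returns. The names kk_sim and d and the parameter values are illustrative assumptions, loosely based on the fitdistr() output quoted in the original message:

```r
# Simulated stand-in for the returns: centred near 1 with a tiny spread.
# The location, scale and df values are assumptions for illustration only.
set.seed(1)
kk_sim <- 1 + 0.0077 * rt(10000, df = 3.76)

l <- qqnorm(kk_sim, plot.it = FALSE)
d <- l$y - l$x   # what qq() passes to its second qqnorm()

# d is almost exactly 1 - l$x: a reflected, shifted copy of the normal
# quantiles. Its correlation with l$x is therefore very close to -1.
cor_with_x <- cor(d, l$x)
print(cor_with_x)

# The second Q-Q plot is a straight line regardless of the shape of kk_sim.
qqnorm(d)
```

If the same steps are run on data whose spread is comparable to 1, for example rnorm(10000), the data no longer vanish next to the normal quantiles and the line is not reproduced in the same way.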
Reid Huntsinger
-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of WeiWei Shi
Sent: Thursday, April 28, 2005 1:38 PM
To: Vincent ZOONEKYND
Cc: R-help at stat.math.ethz.ch
Subject: [R] have to point it out again: a distribution question
Dear R-helpers:
I raised my question last time, but it was only partially answered.
So I would like to raise it again, since I think it is very
interesting, at least to me.
It is not a question about how to use R; rather, it is a
theoretical and practical question, expressed in R.
I came across this question while building a model for some stock returns.
That's the reason I cannot post the complete data here. But I would
like to attach some plots here (I zipped them since the original ones
are too big).
The first plot, qq1, is the qqnorm plot of my sample, which shows an
"S" shape. Since I am not very experienced, I am not sure what kind of
distribution my sample follows.
The second plot, qq2, was obtained via
qqnorm(rt(10000, 4))
since I ran
fitdistr(kk, 't')
and got
        m               s               df
   9.998789e-01    7.663799e-03    3.759726e+00
  (5.332631e-05)  (5.411400e-05)  (8.684956e-02)
The second plot seems to suggest that my sample follows a t distribution
(I am not sure of this).
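
One way to check the t fit more directly is to plot the sample against t quantiles instead of normal ones. The following is only a sketch: kk_sim is a hypothetical stand-in simulated from the rounded fitted values, since the real data were not posted.

```r
set.seed(2)
# Rounded values from the fitdistr() output; kk_sim is a simulated stand-in
# for kk and is an assumption, not the original data.
m <- 0.99988
s <- 0.00766
df <- 3.76
kk_sim <- m + s * rt(10000, df = df)

# Sample quantiles against theoretical t quantiles with the fitted df.
n <- length(kk_sim)
t_q <- qt(ppoints(n), df = df)
plot(t_q, sort(kk_sim), xlab = "t quantiles", ylab = "Sample quantiles")

# Under the fitted location-scale t model the points follow m + s * t.
abline(m, s, col = "red")
```

With the real data, sort(kk) would replace sort(kk_sim); systematic curvature away from the red line would suggest the t model does not describe the tails well.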
BTW, what are the commands for simulating other distributions, such as
log-normal, exponential, and so on?
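
For reference, the stats package that ships with R names its random generators r followed by the distribution name. A short sketch, with parameter values chosen arbitrarily for illustration:

```r
set.seed(3)
n <- 1000

x_lnorm <- rlnorm(n, meanlog = 0, sdlog = 1)     # log-normal
x_exp   <- rexp(n, rate = 2)                     # exponential, mean 1/rate
x_gamma <- rgamma(n, shape = 2, rate = 1)        # gamma
x_weib  <- rweibull(n, shape = 1.5, scale = 1)   # Weibull

# Each sample can be passed to qqnorm() in the same way as rt() output.
qqnorm(x_lnorm)
```

The matching density, distribution and quantile functions use the prefixes d, p and q (for example dlnorm, plnorm, qlnorm).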
The third one was obtained by running the following R code.
Suppose my data are read into dataset k from the file "f392.txt":
k <- read.table("f392.txt", header=FALSE) # read into k
kk <- k[[1]]
qq(kk)
The qq function is defined as follows:
qq <- function(dataset){
  l <- qqnorm(dataset, plot.it=FALSE)
  diff <- l$y - l$x # difference between the sample and its theoretical quantile
  qqnorm(diff)
}
The most interesting thing is that (if I have not made a silly mistake,
and if my sample follows some kind of distribution, whether or not that
distribution has a name) my qq function seems like a way to
evaluate it. What worries me is that the line is too "perfect",
which indicates that something goofy is going on, and there is probably
a mathematical argument that explains it. However, I tried
qq(rnorm(10000))
qq(rt(10000, 3.7))
qq(rf(....))
and none of them gave me this perfect line!
Sorry for the long question, but I wanted to make it clear to everybody.
I tried my best :)
Thanks for reading,
Weiwei (Ed) Shi, Ph.D
On 4/23/05, Vincent ZOONEKYND <zoonek at gmail.com> wrote:
> If I understand your problem, you are computing the difference between
> your data and the quantiles of a standard gaussian variable -- in
> other words, the difference between the data and the red line, in the
> following picture.
>
> N <- 100 # Sample size
> m <- 1 # Mean
> s <- 2 # dispersion
> x <- m + s * rt(N, df=2) # Non-gaussian data
>
> qqnorm(x)
> abline(0,1, col="red")
>
> And you get
>
> y <- sort(x) - qnorm(ppoints(N))
> hist(y)
>
> This is probably not the right line (not only because your mean is 1,
> the slope is wrong as well -- if the data were gaussian, you could
> estimate it with the standard deviation).
>
> You can use the "qqline" function to get the line passing through the
> first and third quartiles, which is probably closer to what you have
> in mind.
>
> qqnorm(x)
> abline(0,1, col="red")
> qqline(x, col="blue")
>
> The differences are
>
> x1 <- quantile(x, .25)
> x2 <- quantile(x, .75)
> b <- (x2-x1) / (qnorm(.75)-qnorm(.25))
> a <- x1 - b * qnorm(.25)
> y <- sort(x) - (a + b * qnorm(ppoints(N)))
> hist(y)
>
> And you want to know when the differences cease to be "significantly"
> different from zero.
>
> plot(y)
> abline(h=0, lty=3)
>
> You can use the plot to fix a threshold, but unless you have a model
> describing how non-gaussian your data are, this will be empirical.
>
> You will note that, in those simulations, the differences (either
> yours or those from the lines through the first and third quartiles)
> are not gaussian at all.
>
> -- Vincent
>
>
> On 4/22/05, WeiWei Shi <helprhelp at gmail.com> wrote:
> > hope it is not because of some central limit theory, otherwise my initial
> > plan will fail :)
> >
> > On 4/22/05, WeiWei Shi <helprhelp at gmail.com> wrote:
> > > Hi, r-gurus:
> > >
> > > I happened to have a question in my work:
> > >
> > > I have a dataset, which has only one dimension, like
> > > 0.99037297527605
> > > 0.991179836732708
> > > 0.995635340631367
> > > 0.997186769599305
> > > 0.991632565640424
> > > 0.984047197106486
> > > 0.99225943762649
> > > 1.00555642128421
> > > 0.993725402926564
> > > ....
> > >
> > > the data is saved in a file called f392.txt.
> > >
> > > I used the following code to play around :)
> > >
> > > k<-read.table("f392.txt", header=F) # read into k
> > > kk<-k[[1]]
> > > l<-qqnorm(kk)
> > > diff=c()
> > > lenk<-length(kk)
> > > i=1
> > > while (i<=lenk){
> > >   # save the difference between the sample quantile and the
> > >   # theoretical quantile; remember, my sample mean is around 1
> > >   # while the theoretical one is 0
> > >   diff[i]=l$y[i]-l$x[i]
> > >   i<-i+1
> > > }
> > > hist(diff, breaks=300) # analyze the distribution of the differences
> > > qqnorm(diff)
> > >
> > > my question is:
> > > from l<-qqnorm(kk), I wanted to know from which point (or cut) the
> > > sample points start to move away from the theoretical ones. That's the
> > > reason I played around with the "diff" list, which gives me the difference.
> > > To my surprise, the diff is perfectly normal. I tried some
> > > kk<-c(1, 2, 5, -1 , ...) to test, and concluded that it must be the
> > > distribution my sample follows that gives this finding.
> > >
> > > So, any suggestion on the distribution of my sample? I think there
> > > might be some mathematical argument that leads to this observation,
> > > but I am not quite sure.
> > >
> > > btw,
> > > > fitdistr(kk, 't')
> > >        m               s               df
> > >   9.999965e-01    7.630770e-03    3.742244e+00
> > >  (5.317674e-05)  (5.373884e-05)  (8.584725e-02)
> > >
> > > btw2, can anyone suggest a way to find the "cut" or "threshold" from
> > > my sample to discretize it into 3 groups: two tail groups and one
> > > main group? This is my main focus.
> > >
> > > Thanks,
> > >
> > > Ed
> > >
> >
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide!
> > http://www.R-project.org/posting-guide.html
> >
>
Here is a summary of
l<-qqnorm(kk) # kk is my sample
l$y (which is my sample)
l$x (which is the theoretical quantile)
diff<-l$y-l$x
and
> summary(l$y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.9007  0.9942  0.9998  0.9999  1.0060  1.1070
> summary(l$x)
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
-4.145e+00 -6.745e-01  0.000e+00  2.383e-17  6.745e-01  4.145e+00
> summary(diff)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-3.0380  0.3311  0.9998  0.9999  1.6690  5.0460
Comparing diff with l$x, though the 1st Qu. and 3rd Qu. are different,
diff and l$x seem similar to each other, which is confirmed by
qqnorm(l$x) and qqnorm(diff).
Running the following code:
r<-rnorm(1000)+1 # since my sample is shifted from zero to 1
qq(r[r>0.9 & r<1.2]) # select the central part
gives me a straight line now.
Thanks for the good explanation of the phenomenon.
Then, Reid, or other r-gurus, is there a good way to discretize the
sample into 3 categories: 2 tails and the body?
Thanks again,
Weiwei
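
The summary(diff) figures can be checked by hand from the other two summaries. The sketch below assumes that diff decreases as l$x increases (the sample changes far more slowly than the normal quantiles), so the ordering flips: the smallest difference pairs the largest sample value with the largest theoretical quantile, and so on. The vector names are illustrative.

```r
# Values copied from the summary() output for l$y and l$x.
y_q <- c(min = 0.9007, q1 = 0.9942, q3 = 1.0060, max = 1.1070)
x_q <- c(min = -4.145, q1 = -0.6745, q3 = 0.6745, max = 4.145)

# diff = y - x is decreasing in x, so its quantiles pair up in reverse.
diff_q <- c(min = y_q[["max"]] - x_q[["max"]],
            q1  = y_q[["q3"]]  - x_q[["q3"]],
            q3  = y_q[["q1"]]  - x_q[["q1"]],
            max = y_q[["min"]] - x_q[["min"]])
print(round(diff_q, 4))
```

The results agree with the reported summary(diff) values (-3.0380, 0.3311, 1.6690, 5.0460) to within rounding, which is consistent with diff being essentially 1 minus the theoretical quantiles.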
bogdan romocea
2005-Apr-29 18:30 UTC
[R] have to point it out again: a distribution question
> Then, Reid, or other r-gurus, is there a good way to discretize
> the sample into 3 categories: 2 tails and the body?
Out of curiosity, how do you plan to use that information? What would
you do if you knew that the 'body' starts here and ends there?
-----Original Message-----
From: WeiWei Shi [mailto:helprhelp at gmail.com]
Sent: Thursday, April 28, 2005 4:18 PM
To: Huntsinger, Reid
Cc: R-help at stat.math.ethz.ch
Subject: Re: [R] have to point it out again: a distribution question
Huntsinger, Reid
2005-Apr-29 20:28 UTC
[R] have to point it out again: a distribution question
There are many ways to discretize data. That's one way of looking at
clustering ("vector quantization"). You might also look into modelling
approaches which don't require it: splines, trees, etc. What sort of data
mining are you trying to do?
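
Two of the many possible discretizations can be sketched as follows. This is an illustration under stated assumptions, not a recommendation: x is a simulated stand-in for the returns, the 5% and 95% cut-offs are arbitrary, and the group labels are made up for the example.

```r
set.seed(4)
# Hypothetical stand-in for the returns (parameter values are assumptions).
x <- 1 + 0.0077 * rt(5000, df = 3.76)

# Option 1: fixed quantile cut-offs (5% and 95% are arbitrary choices).
cuts <- quantile(x, c(0.05, 0.95))
grp_q <- cut(x, breaks = c(-Inf, cuts, Inf),
             labels = c("lower tail", "body", "upper tail"))
print(table(grp_q))

# Option 2: one-dimensional k-means, a simple vector quantizer.
km <- kmeans(x, centers = 3, nstart = 10)
# Relabel clusters by the order of their centres so the labels are stable.
centre_rank <- rank(km$centers[, 1])
grp_km <- factor(centre_rank[km$cluster], labels = c("low", "mid", "high"))
print(table(grp_km))
```

Note that k-means minimizes within-group variance and does not know about tails: with heavy-tailed data it may isolate a few extreme points instead of producing two tails and a body, so the result should be inspected before use.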
Reid Huntsinger
-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of WeiWei Shi
Sent: Friday, April 29, 2005 3:22 PM
To: bogdan romocea
Cc: R-help at stat.math.ethz.ch
Subject: Re: [R] have to point it out again: a distribution question
I want to discretize from a continuous domain to a categorical one so that
some data mining algorithms can be applied to it. Maybe there should be
more than 3 categories; I don't know.
I have googled some papers in the financial field, and any further
suggestions or references would be helpful.
Ed
On 4/29/05, bogdan romocea <br44114 at gmail.com> wrote:
> > Then, Reid, or other r-gurus, is there a good way to discretize
> > the sample into 3 categories: 2 tails and the body?
>
> Out of curiosity, how do you plan to use that information? What would
> you do if you knew that the 'body' starts here and ends there?