Huntsinger, Reid
2005-Apr-28 18:40 UTC
[R] have to point it out again: a distribution question
Stock returns and other financial data have often been found to be heavy-tailed.
Even Cauchy distributions (without even a first absolute moment) have been
entertained as models.

Your qq function subtracts numbers on the scale of a normal(0,1)
distribution from the input data. When the input data are scaled so that
their spread is insignificant compared to 1, say, then you get essentially the
"theoretical quantiles", i.e. the "x" component of the list, back from
l$y - l$x (up to sign and a constant shift). l$x is basically a sample from a
normal(0,1) distribution, so they do line up perfectly in the second qqnorm().
Is that what's happening?

Reid Huntsinger

-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of WeiWei Shi
Sent: Thursday, April 28, 2005 1:38 PM
To: Vincent ZOONEKYND
Cc: R-help at stat.math.ethz.ch
Subject: [R] have to point it out again: a distribution question

Dear R-helpers:

I pointed out my question last time but it is only partially solved.
So I would like to point it out again since I think it is very
interesting, at least to me.
It is not a question about how to use R; instead it is a kind of
theoretical plus practical question, represented by R.

I came upon this question when I built a model for some stock returns.
That's the reason I cannot post the complete data here. But I would
like to attach some plots here (I zipped them since the original ones
are too big).

The first plot, qq1, is the qqnorm plot of my sample, giving me some
"S"-shape. Since I am not very experienced, I am not sure what kind of
distribution my sample follows.

The second plot, qq2, is obtained via
  qqnorm(rt(10000, 4))
since I ran
  fitdistr(kk, 't')
and got
        m              s              df
   9.998789e-01   7.663799e-03   3.759726e+00
  (5.332631e-05) (5.411400e-05) (8.684956e-02)

The second plot seems to say my sample distr follows a t-distr. (not sure of
this)

BTW, what are the commands for simulating other distr like log-norm,
exponential, and so on?

The third one was obtained by running the following R code:

Suppose my data is read into dataset k from file "f392.txt":
  k <- read.table("f392.txt", header=FALSE)  # read into k
  kk <- k[[1]]
  qq(kk)

The qq function is defined as below:
  qq <- function(dataset){
    l <- qqnorm(dataset, plot.it=FALSE)
    diff <- l$y - l$x  # difference b/w sample and its theoretical quantile
    qqnorm(diff)
  }

The most interesting thing is (if there is not any stupid game here,
and if my sample follows some kind of distribution (no matter if such
distr has been found or not)), my qq function seems like a way to
evaluate it. But what I am worried about is that the line is too "perfect",
which indicates there is something goofy here, which can be proved via
some mathematical inference. However, I used
  qq(rnorm(10000))
  qq(rt(10000, 3.7))
  qq(rf(....))

None of them gave me this perfect line!

Sorry for the long question but I want to make my question clear to
everybody. I tried my best :)

Thanks for your reading,

Weiwei (Ed) Shi, Ph.D

On 4/23/05, Vincent ZOONEKYND <zoonek at gmail.com> wrote:
> If I understand your problem, you are computing the difference between
> your data and the quantiles of a standard gaussian variable -- in
> other words, the difference between the data and the red line, in the
> following picture.
>
>   N <- 100  # Sample size
>   m <- 1    # Mean
>   s <- 2    # dispersion
>   x <- m + s * rt(N, df=2)  # Non-gaussian data
>
>   qqnorm(x)
>   abline(0,1, col="red")
>
> And you get
>
>   y <- sort(x) - qnorm(ppoints(N))
>   hist(y)
>
> This is probably not the right line (not only because your mean is 1,
> the slope is wrong as well -- if the data were gaussian, you could
> estimate it with the standard deviation).
>
> You can use the "qqline" function to get the line passing through the
> first and third quartiles, which is probably closer to what you have
> in mind.
>
>   qqnorm(x)
>   abline(0,1, col="red")
>   qqline(x, col="blue")
>
> The differences are
>
>   x1 <- quantile(x, .25)
>   x2 <- quantile(x, .75)
>   b <- (x2-x1) / (qnorm(.75)-qnorm(.25))
>   a <- x1 - b * qnorm(.25)
>   y <- sort(x) - (a + b * qnorm(ppoints(N)))
>   hist(y)
>
> And you want to know when the differences cease to be "significantly"
> different from zero.
>
>   plot(y)
>   abline(h=0, lty=3)
>
> You can use the plot to fix a threshold, but unless you have a model
> describing how non-gaussian your data are, this will be empirical.
>
> You will note that, in those simulations, the differences (either
> yours or those from the lines through the first and third quartiles)
> are not gaussian at all.
>
> -- Vincent
>
>
> On 4/22/05, WeiWei Shi <helprhelp at gmail.com> wrote:
> > hope it is not b/c of some central limit theory, otherwise my initial
> > plan will fail :)
> >
> > On 4/22/05, WeiWei Shi <helprhelp at gmail.com> wrote:
> > > Hi, r-gurus:
> > >
> > > I happened to have a question in my work:
> > >
> > > I have a dataset, which has only one dimension, like
> > > 0.99037297527605
> > > 0.991179836732708
> > > 0.995635340631367
> > > 0.997186769599305
> > > 0.991632565640424
> > > 0.984047197106486
> > > 0.99225943762649
> > > 1.00555642128421
> > > 0.993725402926564
> > > ....
> > >
> > > the data is saved in a file called f392.txt.
> > >
> > > I used the following code to play around :)
> > >
> > > k<-read.table("f392.txt", header=F)  # read into k
> > > kk<-k[[1]]
> > > l<-qqnorm(kk)
> > > diff=c()
> > > lenk<-length(kk)
> > > i=1
> > > while (i<=lenk){
> > >   diff[i]=l$y[i]-l$x[i]  # save the difference of theoretical quantile
> > >                          # and sample quantile
> > >                          # remember, my sample mean is around 1
> > >                          # while the theoretical one, 0
> > >   i<-i+1
> > > }
> > > hist(diff, breaks=300)  # analyze the distr of such diff
> > > qqnorm(diff)
> > >
> > > my question is:
> > > from l<-qqnorm(kk), I wanted to know from which point (or cut) the
> > > sample points start to move away from the theoretical ones. That's the
> > > reason I played around with the "diff" list, which gives me the difference.
> > > To my surprise, the diff is perfectly normal. I tried to use some
> > > kk<-c(1, 2, 5, -1 , ...) to test; I concluded it must be some
> > > distribution my sample follows that gives this finding.
> > >
> > > So, any suggestion on the distribution of my sample? I think there
> > > might be some mathematical inference which leads to this observation,
> > > but I am not quite sure.
> > >
> > > btw,
> > > > fitdistr(kk, 't')
> > >         m              s              df
> > >    9.999965e-01   7.630770e-03   3.742244e+00
> > >   (5.317674e-05) (5.373884e-05) (8.584725e-02)
> > >
> > > btw2, can anyone suggest a way to find the "cut" or "threshold" from
> > > my sample to discretize them into 3 groups: two tail groups and one
> > > main group. --------- my focus.
> > >
> > > Thanks,
> > >
> > > Ed
> > >
> >
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide!
> > http://www.R-project.org/posting-guide.html
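
Reid's explanation can be checked with a small simulation. The sketch below is not from the thread. It assumes a stand-in sample, here called kk_sim, drawn from a scaled t distribution using roughly the fitdistr() estimates quoted in the thread (m about 1, s about 0.0077, df about 3.76), because the real f392.txt data were never posted. Since the spread of such a sample is tiny next to the N(0,1) quantiles, l$y - l$x is close to (constant - l$x), so its normal Q-Q plot should be nearly a straight line.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in for the unposted data: location ~1, tiny scale, heavy tails
kk_sim <- 1 + 0.0077 * rt(10000, df = 3.76)

l <- qqnorm(kk_sim, plot.it = FALSE)  # l$x: theoretical N(0,1) quantiles, l$y: the sample
d <- l$y - l$x                        # what qq() hands to the second qqnorm()

# d is almost exactly a shifted, sign-flipped copy of the theoretical quantiles
r_flip <- cor(d, -l$x)

# Straightness of the second Q-Q plot, measured as the correlation of its points
qq2 <- qqnorm(d, plot.it = FALSE)
r_line <- cor(qq2$x, qq2$y)

print(c(r_flip = r_flip, r_line = r_line))  # both should be very close to 1
# Calling qqnorm(d) instead draws the plot itself.
```

Under these assumptions the "perfect" line says nothing about the distribution of the sample; it mostly reflects the N(0,1) quantiles that were subtracted.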
Here is a summary of

  l <- qqnorm(kk)    # kk is my sample
  l$y                # (which is my sample)
  l$x                # (which is the theoretical quantile)
  diff <- l$y - l$x

and

> summary(l$y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.9007  0.9942  0.9998  0.9999  1.0060  1.1070
> summary(l$x)
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
-4.145e+00 -6.745e-01  0.000e+00  2.383e-17  6.745e-01  4.145e+00
> summary(diff)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-3.0380  0.3311  0.9998  0.9999  1.6690  5.0460

Comparing diff with l$x, though the 1st Qu. and 3rd Qu. are different,
diff and l$x seem similar to each other, which is confirmed by
qqnorm(l$x) and qqnorm(diff).

Running the following code:

  r <- rnorm(1000) + 1       # since my sample is shifted from zero to 1
  qq(r[r > 0.9 & r < 1.2])   # select the central part

this gives me a straight line now.

Thanks for the good explanation of the phenomenon.

Then, Reid, or other r-gurus, is there a good way to discretize the
sample into 3 categories: 2 tails and the body?

Thanks again,

Weiwei

On 4/28/05, Huntsinger, Reid <reid_huntsinger at merck.com> wrote:
> Stock returns and other financial data have often been found to be heavy-tailed.
> Even Cauchy distributions (without even a first absolute moment) have been
> entertained as models.
>
> Your qq function subtracts numbers on the scale of a normal(0,1)
> distribution from the input data. When the input data are scaled so that
> their spread is insignificant compared to 1, say, then you get essentially the
> "theoretical quantiles", i.e. the "x" component of the list, back from
> l$y - l$x (up to sign and a constant shift). l$x is basically a sample from a
> normal(0,1) distribution, so they do line up perfectly in the second qqnorm().
> Is that what's happening?
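
The summary() output in this message already contains the explanation. As a rough check (a sketch that uses only the rounded numbers printed in the three summaries, so agreement can only be expected to about three decimals): the sample barely varies while l$x spans roughly -4.1 to 4.1, so the largest sample value is paired with the largest theoretical quantile and produces the smallest diff, and so on down the five-number summary. The variable names below are made up for the illustration.

```r
# Rounded values copied from the summary() output: Min, 1st Qu., Median, 3rd Qu., Max
y_q    <- c(0.9007, 0.9942, 0.9998, 1.0060, 1.1070)   # summary(l$y)
x_q    <- c(-4.145, -0.6745, 0, 0.6745, 4.145)        # summary(l$x)
diff_q <- c(-3.0380, 0.3311, 0.9998, 1.6690, 5.0460)  # summary(diff)

# diff = y - x decreases as x increases, so its quantiles come out in reverse order
approx_q <- rev(y_q - x_q)

print(rbind(reported = diff_q, reconstructed = approx_q))
max_gap <- max(abs(diff_q - approx_q))
print(max_gap)  # only rounding-level disagreement should remain
```

In other words, the quartiles of diff are approximately the sample quartiles minus the N(0,1) quartiles of about 0.6745 in absolute value, which is why diff looks like a shifted copy of l$x.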
bogdan romocea
2005-Apr-29 18:30 UTC
[R] have to point it out again: a distribution question
> Then, Reid, or other r-gurus, is there a good way to discretize
> the sample into 3 categories: 2 tails and the body?

Out of curiosity, how do you plan to use that information? What would
you do if you knew that the 'body' starts here and ends there?
Huntsinger, Reid
2005-Apr-29 20:28 UTC
[R] have to point it out again: a distribution question
There are many ways to discretize data. That's one way of looking at
clustering ("vector quantization"). You might also look into modelling
approaches which don't require it: splines, trees, etc. What sort of data
mining are you trying to do?

Reid Huntsinger

-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of WeiWei Shi
Sent: Friday, April 29, 2005 3:22 PM
To: bogdan romocea
Cc: R-help at stat.math.ethz.ch
Subject: Re: [R] have to point it out again: a distribution question

The plan is discretization from a continuous domain to a categorical one, so
that some data mining algorithm can be applied to it. Maybe there should be
more than 3 categories, I don't know. I googled some papers in the financial
field, and any more suggestions or references will be helpful.

Ed

On 4/29/05, bogdan romocea <br44114 at gmail.com> wrote:
> > Then, Reid, or other r-gurus, is there a good way to discretize
> > the sample into 3 categories: 2 tails and the body?
>
> Out of curiosity, how do you plan to use that information? What would
> you do if you knew that the 'body' starts here and ends there?
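
To make the discretization options a little more concrete, here is a minimal sketch, not a recommendation from anyone in the thread. It uses simulated stand-in data (the real sample was never posted), and the 5% and 95% cut points as well as the choice of three clusters are arbitrary assumptions for illustration. The first part cuts at fixed sample quantiles; the second uses k-means, one simple form of the clustering ("vector quantization") view of discretization.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in for the returns data (location ~1, tiny scale, heavy tails)
x <- 1 + 0.0077 * rt(5000, df = 3.76)

# Route 1: fixed quantile thresholds (5% / 95% chosen arbitrarily)
cuts <- quantile(x, probs = c(0.05, 0.95))
grp_q <- cut(x, breaks = c(-Inf, cuts, Inf),
             labels = c("lower tail", "body", "upper tail"))
print(table(grp_q))

# Route 2: k-means on the one-dimensional sample
km <- kmeans(x, centers = 3, nstart = 20)
# Relabel clusters by the order of their centers so labels are comparable across runs
grp_k <- factor(rank(km$centers[, 1])[km$cluster],
                levels = 1:3, labels = c("low", "middle", "high"))
print(table(grp_k))
print(tapply(x, grp_k, range))  # the value range each cluster covers
```

The quantile route fixes the group sizes in advance, while k-means lets the data decide where the boundaries fall; with heavy tails the k-means groups can be very unequal in size, so the two routes need not agree.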