Dylan Beaudette
2006-Jan-26 19:11 UTC
[R] understanding patterns in categorical vs. continuous data
Greetings,

I have a set of bivariate data: one variable (vegetation type) is categorical, and the other (computed annual insolation) is continuous. Plotting veg_type ~ insolation produces a nice overview of the patterns that I can see in the source data. However, due to the large number of samples (1,000) and the apparent "spread" of a single vegetation type over a range of insolation values, I am having a hard time quantitatively describing the relationship between the two variables.

Here is a link to a sample graph: http://casoilresource.lawr.ucdavis.edu/drupal/node/162

Since the data along each vegetation type "line" are not a distribution in the traditional sense, I am having problems applying descriptive statistical methods. Conceptually, I would like to somehow describe the variation with insolation along each vegetation type "line".

Any guidance or suggested reading material would be greatly appreciated.

--
Dylan Beaudette
Soils and Biogeochemistry Graduate Group
University of California at Davis
530.754.7341
Berton Gunter
2006-Jan-26 19:25 UTC
[R] understanding patterns in categorical vs. continuous data
UC Davis has a statistics department; I would suggest you get consulting help from them. Do they have a consulting service?

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA

"The business of the statistician is to catalyze the scientific learning process." - George E. P. Box
Dave Roberts
2006-Jan-26 19:48 UTC
[R] understanding patterns in categorical vs. continuous data
You might prefer boxplot(insolation ~ veg_type) as a graphic. That will show the quantiles graphically. To get the actual numeric values you could

    # veg_type must be a factor for levels() to work
    for (i in levels(veg_type)) {
        print(i)
        # explicit print() is needed: values are not auto-printed inside a loop
        print(quantile(insolation[veg_type == i]))
    }

see ?quantile for more help.

--
David W. Roberts
Professor and Head, Department of Ecology
Montana State University, Bozeman, MT 59717-3460
office 406-994-4548, FAX 406-994-3190
email droberts at montana.edu
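For anyone wanting to try the suggestion above without the original data, here is a minimal sketch on simulated values. The vegetation names, the sample size, and the insolation numbers are all made up for illustration; only boxplot(), split(), sapply() and quantile() from base R are used.

```r
# Simulated stand-in for the real data (names and numbers are hypothetical)
set.seed(1)
veg_type <- factor(sample(c("chaparral", "grassland", "oak_woodland"),
                          1000, replace = TRUE))
insolation <- rnorm(1000, mean = 150000, sd = 20000)

# One box per vegetation type
boxplot(insolation ~ veg_type, horizontal = TRUE)

# Quantiles per vegetation type, collected as one row per type
q <- t(sapply(split(insolation, veg_type), quantile))
print(q)
```

Collecting the quantiles in a matrix rather than printing them in a loop makes it easier to compare vegetation types side by side or to write the table to a file.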
Gabor Grothendieck
2006-Jan-26 20:03 UTC
[R] understanding patterns in categorical vs. continuous data
Would this do?

    boxplot(Sepal.Length ~ Species, iris, horizontal = TRUE)
    library(Hmisc)
    summary(Sepal.Length ~ Species, iris, fun = summary)
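If Hmisc is not installed, a roughly comparable per-group table can be sketched in base R. This is a swapped-in alternative, not the Hmisc output format, and the object name res is only illustrative.

```r
# Base-R alternative to the Hmisc call above (different output layout)
# summary() is applied to Sepal.Length within each Species
res <- t(sapply(split(iris$Sepal.Length, iris$Species), summary))
print(res)
```

Each row is one species and each column one of the six summary() statistics (minimum, quartiles, median, mean, maximum).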
Liaw, Andy
2006-Jan-27 03:07 UTC
[R] understanding patterns in categorical vs. continuous data
From: Dave Roberts
> You might prefer boxplot(insolation~veg_type) as a graphic. That will
> give you quantiles. To get the actual numeric values you could
>
> for (i in levels(veg_type)) {
>     print(i)
>     quantile(insolation[veg_type==i])
> }
>
> see ?quantile for more help.

If you want the five-number summaries plotted in the boxplots, just look at the returned object of boxplot():

    > g <- factor(rep(1:3, 10))
    > y <- rnorm(30)
    > res <- boxplot(y ~ g)
    > str(res)
    List of 6
     $ stats: num [1:5, 1:3] -1.135 -0.757 -0.536 0.499 0.996 ...
     $ n    : num [1:3] 10 10 10
     $ conf : num [1:2, 1:3] -1.1639 0.0918 -0.5208 1.6546 -1.2487 ...
     $ out  : num(0)
     $ group: num(0)
     $ names: chr [1:3] "1" "2" "3"

If you just want to compute the summaries without the boxplots, use fivenum():

    > tapply(y, g, fivenum)
    $"1"
    [1] -1.1352456 -0.7571895 -0.5360496  0.4994445  0.9956749

    $"2"
    [1] -1.1408493 -0.3751730  0.5668747  1.8018146  2.0019303

    $"3"
    [1] -2.2309983 -0.9333305 -0.3402786  0.8849042  0.9833057

... and if you really want the quantiles, you can do that, too:

    > tapply(y, g, quantile)
    $"1"
            0%        25%        50%        75%       100%
    -1.1352456 -0.7391977 -0.5360496  0.3378861  0.9956749

    $"2"
            0%        25%        50%        75%       100%
    -1.1408493 -0.3039648  0.5668747  1.6669879  2.0019303

    $"3"
            0%        25%        50%        75%       100%
    -2.2309983 -0.8389260 -0.3402786  0.6746950  0.9833057

... but note how the quartiles and hinges are not necessarily the same.

Andy
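The closing remark about hinges versus quartiles can be checked on a tiny vector. This is a minimal sketch, assuming quantile()'s default interpolation (type 7); the vector x is made up for illustration.

```r
x <- c(1, 2, 3, 4, 5, 6)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
f <- fivenum(x)

# Default quantile() (type 7) interpolates between order statistics
q <- quantile(x)

print(f)          # 1.0 2.0 3.5 5.0 6.0
print(unname(q))  # 1.00 2.25 3.50 4.75 6.00
```

The hinges (2 and 5) are medians of the lower and upper halves of the data, while the type 7 quartiles (2.25 and 4.75) are interpolated positions, which is why the two summaries in the session above disagree in their second and fourth values.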
Dylan Beaudette
2006-Jan-27 21:09 UTC
[R] understanding patterns in categorical vs. continuous data
Thanks to all for the helpful suggestions; I was able to get a good start from there.

Cheers,
Dylan