Hello all,

I'm trying to draw a scatterplot of a large data set (5,000 x,y coordinates), with the caveat that many of the data points overlap with each other (share the same x AND y coordinates). Using the usual "plot" command,

> plot(education, xlab="etc", ylab="etc")

it seems that the overlap of points is not shown in the graph. Namely, there are 5,000 points that should be plotted, as I mentioned above, but because so many of the points overlap with each other exactly, only about 50-60 points are actually visible on the graph. Thus, there's no indication that Point A shares its coordinates with 200 other pieces of data and thus is very common, while Point B doesn't share its coordinates with any other pieces of data and thus isn't common at all.

Is there any way to indicate the frequency of such points on such a graph? Should I be using a different command than "plot"?

Thanks,

Wayne
Use 'hexbin' from Bioconductor to show how many points fall in each cell of a grid on the graph.

On Dec 17, 2007 8:14 PM, Wayne Aldo Gavioli <wgavioli at fas.harvard.edu> wrote:
> Is there any way to indicate the frequency of such points
> on such a graph? Should I be using a different command than "plot"?

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?
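A minimal sketch of the hexbin suggestion, under stated assumptions: the poster's `education` object is not available, so `x` and `y` below are simulated stand-ins, and the package is assumed to be installed (it is distributed on CRAN today). The code bins the points into hexagonal cells and shades each cell by its count.

```r
# Simulated stand-in for the poster's data (assumption: the real
# `education` object is not shown in the thread). Rounding forces
# many exact duplicates, as in the original question.
set.seed(1)
x <- round(rnorm(5000), 1)
y <- round(x + rnorm(5000), 1)

# Only run the hexbin part if the package is installed.
if (require(hexbin)) {
  hb <- hexbin(x, y, xbins = 30)  # count points per hexagonal cell
  plot(hb)                        # cells are shaded by count, with a legend
  total_binned <- sum(hb@count)   # every point lands in exactly one cell
}
```

The `xbins` value controls how coarse the grid is; 30 is the package default and only a starting point.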
On 17/12/2007 8:14 PM, Wayne Aldo Gavioli wrote:
> Is there any way to indicate the frequency of such points
> on such a graph? Should I be using a different command than "plot"?

The jitter() function can add a bit of noise to your data, so that repeated points show up as groupings instead of isolated points.

Duncan Murdoch
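A small sketch of the jitter() idea, assuming simulated data in place of `education` (the variable names and the `amount` value are illustrative choices, not taken from the thread). It counts how many distinct locations a plain plot would show and then draws the raw and jittered versions side by side.

```r
# Simulated stand-in: 5,000 points restricted to an 8 x 8 grid,
# so most points coincide exactly with many others.
set.seed(1)
x <- sample(1:8, 5000, replace = TRUE)
y <- sample(1:8, 5000, replace = TRUE)

# Number of distinct locations a plain plot(x, y) can show
n_distinct <- nrow(unique(cbind(x, y)))

# jitter() adds uniform noise; `amount` is the maximum shift in data units
xj <- jitter(x, amount = 0.3)
yj <- jitter(y, amount = 0.3)

op <- par(mfrow = c(1, 2))
plot(x, y, main = "raw")                    # one symbol per distinct location
plot(xj, yj, pch = ".", main = "jittered")  # dense clouds mark common locations
par(op)
```

Keeping `amount` below half the grid spacing stops neighbouring clouds from running into each other.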
Wayne Aldo Gavioli <wgavioli at fas.harvard.edu> [Tue, Dec 18, 2007 at 02:14:23AM CET]:
> Is there any way to indicate the frequency of such points
> on such a graph? Should I be using a different command than "plot"?

?sunflowerplot

--
Johannes Hüsing
mailto:johannes at huesing.name
http://derwisch.wikidot.com

There is something fascinating about science. One gets such wholesale returns of conjecture from such a trifling investment of fact. (Mark Twain, "Life on the Mississippi")
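For readers without R at hand, a short sketch of what ?sunflowerplot leads to. The data are simulated and deliberately small (300 points on a 6 x 6 grid, an assumption made so the petals stay readable); xyTable() is the grDevices helper that tabulates coincident points.

```r
# Simulated stand-in data: 300 points on a 6 x 6 grid.
set.seed(1)
x <- sample(1:6, 300, replace = TRUE)
y <- sample(1:6, 300, replace = TRUE)

# Each location is drawn once, with one "petal" per coincident point.
sunflowerplot(x, y)

# xyTable() returns the distinct locations and their multiplicities.
tab <- xyTable(x, y)
```

`tab$number` holds the count for each distinct (x, y) pair, which is also useful on its own for checking how heavy the overlap is.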
Wayne,

I am fond of the bagplot (think 2D box plot) to replace scatter plots for large N. See http://www.wiwi.uni-bielefeld.de/~wolf/software/aplpack/ and aplpack on CRAN.

--
HTH,
Jim Porzak
Responsys, Inc.
San Francisco, CA
http://www.linkedin.com/in/jimporzak

On Dec 17, 2007 5:14 PM, Wayne Aldo Gavioli <wgavioli at fas.harvard.edu> wrote:
> Is there any way to indicate the frequency of such points
> on such a graph? Should I be using a different command than "plot"?
Wayne Aldo Gavioli wrote:
> Is there any way to indicate the frequency of such points
> on such a graph? Should I be using a different command than "plot"?

Hi Wayne,

While this is not a really pretty picture, you can get a viewable plot with count.overplot if the first two elements of "education" are named "x" and "y" and they are the coordinates you want to plot. Otherwise, pass the x and y coordinates separately.

    library(plotrix)
    count.overplot(education,
                   tol=c(diff(range(education$x))/10,
                         diff(range(education$y))/10))

Jim
Wayne Aldo Gavioli <wgavioli <at> fas.harvard.edu> writes:
> Is there any way to indicate the frequency of such points
> on such a graph? Should I be using a different command than "plot"?

One suggestion seems to be still missing: 'sunflowerplot' of base R. May look taggy, though, if you have 200 "petals".

Actually, the documentation of sunflowerplot is wrong in a botanical sense. Sunflowers have composite flowers in capitula, and the things called 'petals' in the documentation are ligulate, sterile ray-florets (each with vestigial petals which are not easily visible in sunflower, but in some other species you may see three (occasionally two) teeth).

cheers, jari oksanen
Wayne,

Try the iplot command in iPlots. You can then vary both the point size and the transparency of your scatterplot interactively and decide which scatterplot conveys the information best. Sometimes it's helpful to use more than one scatterplot when presenting your results.

(I must admit to being very surprised that jittering and sunflower plots have been suggested for a dataset of 5000 points. Do those who mentioned these methods have examples on that scale where they are effective?)

Antony Unwin
Professor of Computer-Oriented Statistics and Data Analysis,
University of Augsburg, Germany
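The interactive controls in iPlots cannot be shown in a static listing, so the following is a swapped-in base-graphics analogue rather than the iplot command itself: it fixes one point size and one transparency level by hand. The data are simulated, and the values of `alpha.f` and `cex` are arbitrary assumptions to be tuned by eye.

```r
# Simulated stand-in data with many exact duplicates.
set.seed(1)
x <- round(rnorm(5000), 1)
y <- round(x + rnorm(5000), 1)

# adjustcolor() (grDevices) returns the colour with an alpha channel;
# alpha.f = 0.1 means roughly ten stacked points are needed for a solid mark.
semi <- adjustcolor("black", alpha.f = 0.1)

# Overlapping points accumulate ink, so darker areas mean more data.
# Transparency needs a device that supports it (pdf does; some others do not).
plot(x, y, pch = 16, cex = 0.8, col = semi)
```

Rerunning the last line with different `alpha.f` and `cex` values gives a crude, non-interactive version of the exploration described above.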
>>> Antony Unwin <unwin at math.uni-augsburg.de>
> (I must admit to being very surprised that jittering and sunflower
> plots have been suggested for a dataset of 5000 points. Do those who
> mentioned these methods have examples on that scale where they are
> effective?)

You have a point. haha.

But check the microarray literature; scatterplots have been used - often - to display microarray data with 10000 observations at a time. And in their defence, even on screen, a 600x600 pixel plot window holds 360000 pixels - 5000 is not a large fraction of that. Jittering has visible effects on data at that resolution. Compare the two plots in

    library(MASS)
    # covariance matrix of a correlated bivariate normal
    Sigma <- matrix(c(10,4,4,2),2,2)
    # rounding to one decimal creates many exact duplicates
    xy <- round(mvrnorm(n=5000, rep(0, 2), Sigma), 1)
    plot(xy, pch=".")
    plot(jitter(xy, factor=2), pch=".")

But you're of course right to question how sensible this is. The best you can get is a visual impression of the 'shape' of the data, with a greater perceived density where multiple observations would otherwise have overlapped.

S.
Another approach which I'm pleased with, but which has not been suggested so far, is jitter + kde2d from MASS:

    library(MASS)                 # provides kde2d()
    plot(jitter(x), jitter(y))
    kdesamp <- 20000              # upper limit on points used; depends on your RAM
    # subsample only if there are more points than kdesamp
    forkde <- if (kdesamp < length(x)) sample(seq_along(x), kdesamp, replace=FALSE) else seq_along(x)
    d <- kde2d(x[forkde], y[forkde])
    contour(d, add=TRUE)          # overlay density contours on the jittered points

> -----Original Message-----
> From: r-help-bounces at r-project.org
> Subject: Re: [R] Scatterplot Showing All Points