Witold Eryk Wolski
2004-Nov-24 15:34 UTC
[R] scatterplot of 100000 points and pdf file format
Hi,

I want to draw a scatter plot with 1M and more points and save it as pdf.
This makes the pdf file large.
So I tried to save the file first as png and then convert it to pdf.
This looks OK if printed, but if viewed e.g. with Acrobat as a document
figure the quality is bad.

Does anyone know a way to reduce the size but keep the quality?

/E

--
Dipl. bio-chem. Witold Eryk Wolski
MPI-Moleculare Genetic
Ihnestrasse 63-73, 14195 Berlin
tel: 0049-30-83875219
http://www.molgen.mpg.de/~wolski
http://r4proteomics.sourceforge.net
mail: witek96 at users.sourceforge.net
      wolski at molgen.mpg.de
On 24-Nov-04 Witold Eryk Wolski wrote:
> I want to draw a scatter plot with 1M and more points
> and save it as pdf. This makes the pdf file large.
> So i tried to save the file first as png and than convert
> it to pdf. This looks OK if printed but if viewed e.g. with
> acrobat as document figure the quality is bad.
>
> Anyone knows a way to reduce the size but keep the quality?

If you want the PDF file to preserve the info about all the 1M points,
then the problem has no solution. The png file will already have
suppressed most of this (which is one reason for the poor quality).

I think you should give thought to reducing what you need to plot.
Think about it: suppose you plot with a resolution of 1/200 inch per
point (about the limit at which the eye begins to see rough edges).
Then you have 40000 points per square inch. If your 1M points are
separate but as closely packed as possible, this requires 25 square
inches, or a 5x5 inch (= 12.7x12.7 cm) square. And this would be
solid black!

Presumably in your plot there is a very large number of points which
are effectively indistinguishable from other points, so these could be
eliminated without spoiling the plot.

I don't have an obviously best strategy for reducing what you actually
plot, but perhaps one line to think along might be the following:

1. Multiply the data by some factor and then round the results to an
   integer (to avoid problems in step 2). The factor is chosen so that
   the result of (4) below is satisfactory.

2. Eliminate duplicates in the result of (1).

3. Divide by the factor you used in (1).

4. Plot the result; save the plot to PDF.

As to how to do it in R: the critical step is (2), which with so many
points could be very heavy unless done by a well-chosen procedure.
I'm not expert enough to advise about that, but no doubt others are.

Good luck!
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861 [NB: New number!]
Date: 24-Nov-04  Time: 16:16:28
------------------------------ XFMail ------------------------------
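[Ted's four steps can be sketched in R as follows. The scale factor of
1000 and the simulated data are illustrative assumptions, not from the
thread, and only 1e5 points are used to keep the sketch quick:]

```r
## Steps 1-4 from Ted's suggestion: scale, round, deduplicate, rescale,
## plot. The factor (1000) controls the discretisation and is assumed.
set.seed(42)
x <- rnorm(1e5)
y <- 2 * x + rnorm(1e5)

f <- 1000
xy <- unique(cbind(round(x * f), round(y * f)))  # steps 1-2: scale, round, dedup
xy <- xy / f                                     # step 3: undo the scaling

pdf(file.path(tempdir(), "reduced.pdf"))         # step 4: plot to PDF
plot(xy[, 1], xy[, 2], pch = ".")
dev.off()
```

The PDF then carries one drawing instruction per *distinct* rounded
point rather than one per raw observation.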
On Wed, 2004-11-24 at 16:34 +0100, Witold Eryk Wolski wrote:
> Anyone knows a way to reduce the size but keep the quality?

Hi Eryk!

Part of the problem is that in a pdf file, the vector based
instructions will need to be defined for each of your 10^6 points in
order to draw them.

When trying to create a simple example:

  pdf()
  plot(rnorm(1000000), rnorm(1000000))
  dev.off()

the pdf file is 55 Mb in size.

One immediate thought was to try a ps file; using the above plot, the
ps file was "only" 23 Mb in size. So note that ps can be more
efficient.

Going to a bitmap might result in a much smaller file, but as you
note, the quality does degrade as compared to a vector based image. I
tried the above to a png, then converted to a pdf (using 'convert'),
and as expected the image both viewed and printed was "pixelated",
since the pdf instructions are presumably drawing pixels and not
vector based objects.

Depending upon what you plan to do with the image, you may have to
choose among several options, resulting in tradeoffs between image
quality and file size.

If you can create the bitmap file explicitly in the size that you
require for printing or incorporating in a document, that is one way
to go and will preserve, to an extent, the overall fixed size image
quality, while keeping file size small.

Another option to consider for the pdf approach, if it does not
compromise the integrity of your plot, is to remove any duplicate data
points if any exist. Thus, you will not need what are in effect
redundant instructions in the pdf file. This may not be possible,
depending upon the nature of your data (i.e. doubles), without
considering some tolerance level for "equivalence".
Perhaps others will have additional ideas.

HTH,

Marc Schwartz
Marc/Eryk,

I have no experience with it, but I believe the hexbin package in BioC
was there for this purpose: avoid heavy over-plotting of lots of
points. You might want to look into that, if you have not done so yet.

Best,
Andy

> From: Marc Schwartz
On Wed, 24 Nov 2004, Witold Eryk Wolski wrote:
> I want to draw a scatter plot with 1M and more points and save it as pdf.

Try the "hexbin" Bioconductor package, which gives hexagonally-binned
density scatterplots. Even for tens of thousands of points this is
often much better than a scatterplot.

    -thomas
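[A minimal hexbin sketch; the simulated data and the xbins value are
illustrative assumptions, and the code checks first that the hexbin
package is actually installed:]

```r
## Hexagonal binning instead of a raw scatterplot: each hexagon
## records how many points fall inside it, so the PDF contains at most
## a few thousand hexagons rather than 1e5 individual points.
has_hexbin <- requireNamespace("hexbin", quietly = TRUE)
if (has_hexbin) {
  set.seed(1)
  x <- rnorm(1e5)
  y <- 2 * x + rnorm(1e5)

  bin <- hexbin::hexbin(x, y, xbins = 50)  # 50 bins across the x range

  pdf(file.path(tempdir(), "hexbin.pdf"))
  plot(bin)                                # shading encodes the counts
  dev.off()
}
```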
Witold,

I have found that plotting more than a few thousand data points at a
time quickly becomes a losing proposition. That is, the dense overlap
of data points tends to obscure the patterns of interest, with only
outliers distinctly visible. I typically deal with this in two ways.

The most straightforward is to select a much smaller subset of data
points to plot, say on the order of 100-1000, depending on the nature
of the data and the features you want to illustrate. How you sample
depends on the structure of your data set. E.g. you may want to sample
fixed proportions within subgroups. You can add loess lines or
confidence ellipses estimated from the complete data.

Another approach is to estimate the two dimensional density using
kde2d() (MASS package) and represent the result with a contour or
image plot. See ?kde2d for an example.

Both of these will result in much more manageable (and likely more
informative) figures.

Regards,
Matt

Matthew R. Nelson, Ph.D.
Director, Biostatistics
Sequenom, Inc.
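[The kde2d() route can be sketched as below; kde2d ships with the MASS
package, while the simulated data, grid size, and grey palette are
illustrative assumptions:]

```r
## Replace the scatterplot with a 2-d kernel density image plus
## contours, as suggested above.
library(MASS)
set.seed(1)
x <- rnorm(1e5)
y <- 2 * x + rnorm(1e5)

d <- kde2d(x, y, n = 100)    # density estimate on a 100x100 grid

pdf(file.path(tempdir(), "density.pdf"))
image(d, col = grey(seq(1, 0, length.out = 64)))  # darker = denser
contour(d, add = TRUE)
dev.off()
```

The resulting PDF stores a fixed 100x100 grid, so its size no longer
grows with the number of observations.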
james.holtman@convergys.com
2004-Nov-24 17:09 UTC
[R] scatterplot of 100000 points and pdf file format
Have you tried

  plot(..., pch = '.')

This will use the period as the plotting character instead of the
'circle' which is drawn by default. This should reduce the size of the
PDF file.

I have done scatter plots with 2M points and they are typically
meaningless with that many points overlaid.

Check out 'hexbin' on Bioconductor (you can download the package from
the RGui window). This is a much better way of showing some
information, since it will plot the number of points that are within a
hexagon. I have found this to be a better way of looking at some data.
__________________________________________________________
James Holtman       "What is the problem you are trying to solve?"
Executive Technical Consultant  --  Office of Technology, Convergys
james.holtman at convergys.com
+1 (513) 723-2929
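[The effect on file size can be checked directly; the point count is
reduced to 1e5 here, an assumption made only to keep the comparison
quick:]

```r
## Compare the PDF size of default circles against pch = "." dots.
set.seed(1)
x <- rnorm(1e5)
y <- rnorm(1e5)

f_circles <- file.path(tempdir(), "circles.pdf")
f_dots    <- file.path(tempdir(), "dots.pdf")

pdf(f_circles); plot(x, y);            dev.off()  # default open circles
pdf(f_dots);    plot(x, y, pch = "."); dev.off()  # period as symbol

file.info(c(f_circles, f_dots))$size   # compare the two sizes in bytes
```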
On Wednesday 24 November 2004 07:34, Witold Eryk Wolski wrote:
> Anyone knows a way to reduce the size but keep the quality?

I would strongly suggest a different method to present the data, such
as a contour plot or 3D bar plot. An XY plot with a million points is
unlikely to be readable unless it is produced as a large format print.
At 200 DPI printed, 1,000,000 discrete points require a minimum of a
5 inch (12.7 cm) by 5 inch area.

Besides, other than being visually overwhelming, what information
would such a plot offer a viewer?

John
Do you have a measure of "scatter", or can you pick "outliers", that
would allow you to produce a "mixed" plot using either density or
hexbinned data, with only the outliers placed after-the-fact using
points()?

Sean
How about the following to plot only the 1,000 or so most extreme
points (the outliers):

  x <- rnorm(1e6)
  y <- 2*x + rnorm(1e6)
  plot(x, y, pch='.')
  tmp <- chull(x, y)
  while( length(tmp) < 1000 ){
    tmp <- c(tmp, seq(along=x)[-tmp][ chull(x[-tmp], y[-tmp]) ])
  }
  points(x[tmp], y[tmp], col='red')

Now just replace the initial plot with a hexbin or contour plot and
you should have something that takes a lot less room but still shows
the locations of the outer points.

Greg Snow, Ph.D.
Statistical Data Center
greg.snow at ihc.com
(801) 408-8111
> -----Original Message-----
> From: Ted.Harding at nessie.mcc.ac.uk
> Sent: Wednesday, November 24, 2004 16:37 PM
> To: R Help Mailing List
> Subject: RE: [R] scatterplot of 100000 points and pdf file format
>
> On 24-Nov-04 Prof Brian Ripley wrote:
> > On Wed, 24 Nov 2004 Ted.Harding at nessie.mcc.ac.uk wrote:
> >
> >> 1. Multiply the data by some factor and then round the
> >>    results to an integer (to avoid problems in step 2).
> >>    Factor chosen so that the result of (4) below is
> >>    satisfactory.
> >>
> >> 2. Eliminate duplicates in the result of (1).
> >>
> >> 3. Divide by the factor you used in (1).
> >>
> >> 4. Plot the result; save plot to PDF.
> >>
> >> As to how to do it in R: the critical step is (2), which with so
> >> many points could be very heavy unless done by a well-chosen
> >> procedure. I'm not expert enough to advise about that, but no
> >> doubt others are.
> >
> > unique will eat that for breakfast
> >
> > > x <- runif(1e6)
> > > system.time(xx <- unique(round(x, 4)))
> > [1] 0.55 0.09 0.64 0.00 0.00
> > > length(xx)
> > [1] 10001
>
> 'unique' will eat x for breakfast, indeed, but will have some
> trouble chewing (x,y).

  > xx <- data.frame(x=round(runif(1000000),4), y=round(runif(1000000),4))
  > system.time(xx2 <- unique(xx))
  [1] 14.23 0.06 14.34 NA NA

The time does not seem too bad, depending on how many times it has to
be performed.

--Matt

Matt Austin
Statistician
Amgen
One Amgen Center Drive, M/S 24-2-C
Thousand Oaks CA 93021
(805) 447 - 7431

> I still can't think of a neat way of doing that.
>
> Best wishes,
> Ted.
> From: Ted.Harding at nessie.mcc.ac.uk
>
> On 25-Nov-04 Ted Harding wrote:
> > 'unique' will eat x for breakfast, indeed, but will have some
> > trouble chewing (x,y).
> >
> > I still can't think of a neat way of doing that.
>
> Sorry, I don't want to be misunderstood.
> I didn't mean that 'unique' won't work for arrays.
> What I meant was:
>
> > X<-round(rnorm(1e6),3); Y<-round(rnorm(1e6),3)
> > system.time(unique(X))
> [1] 0.74 0.07 0.81 0.00 0.00
> > system.time(unique(cbind(X,Y)))
> [1] 350.81 4.56 356.54 0.00 0.00

Do you know if the majority of that time is spent in unique() itself?
If so, in which method? What I see is:

  > X<-round(rnorm(1e6),3); Y<-round(rnorm(1e6),3)
  > system.time(unique(X), gcFirst=TRUE)
  [1] 0.25 0.01 0.26 NA NA
  > system.time(unique(cbind(X,Y)), gcFirst=TRUE)
  [1] 101.80 0.34 104.61 NA NA
  > system.time(dat <- data.frame(x=X, y=Y), gcFirst=TRUE)
  [1] 10.17 0.00 10.24 NA NA
  > system.time(unique(dat), gcFirst=TRUE)
  [1] 23.94 0.11 24.15 NA NA

Andy

> However, still rounding to 3 d.p., we can try packing:
>
> > Z <- 100000000*X + 1000*Y
> > system.time(W <- unique(Z))
> [1] 0.83 0.05 0.88 0.00 0.00
> > length(W)
> [1] 961523
>
> Though the runtime is small, we don't get much reduction, and W
> still has to be unpacked.
>
> With rounding to 2 d.p.:
>
> > X<-round(rnorm(1e6),2); Y<-round(rnorm(1e6),2)
> > Z <- 100000000*X + 1000*Y
> > system.time(W <- unique(Z))
> [1] 1.31 0.01 1.32 0.00 0.00
> > length(W)
> [1] 209882
>
> so now it's about 1/5, but visible discretisation must be getting
> close.
>
> With 1 d.p.:
>
> > X<-round(rnorm(1e6),1); Y<-round(rnorm(1e6),1)
> > Z <- 100000000*X + 1000*Y
> > system.time(W <- unique(Z))
> [1] 0.92 0.01 0.93 0.00 0.00
> > length(W)
> [1] 4953
>
> there's a good reduction (about 1/200), but the discretisation would
> definitely now be visible. However, as I suggested before, there's
> an issue of choice of constant (i.e. of the resolution of the
> discretisation, so that there's a useful reduction and also the plot
> is acceptable).
>
> I'd still like to learn of a method which avoids the above method of
> packing, which strikes me as clumsy (but maybe it's the best way
> after all).
>
> Ted.
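[Ted's pack/unpack trick end-to-end, with the unpacking step filled in.
The constants follow his 3 d.p. example; the unpacking arithmetic is an
assumption about how the pairs were meant to be recovered, and 1e5
points are used to keep the sketch quick:]

```r
## Pack each rounded (x, y) pair into a single number so that unique()
## runs on a plain vector, then unpack the surviving pairs.
set.seed(1)
X <- round(rnorm(1e5), 3)
Y <- round(rnorm(1e5), 3)

Z <- 100000000 * X + 1000 * Y   # pack: x in the high digits, y in the low
W <- unique(Z)                  # fast: unique() on an atomic vector

## Unpack: recover the deduplicated coordinates. Works because
## |1000*Y| stays far below the 1e5 spacing contributed by X.
x2 <- round(W / 100000000, 3)
y2 <- round((W - 100000000 * x2) / 1000, 3)
```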
Another possibility might be to use a 2d kernel density estimate
(e.g. kde2d from library(MASS)). Then for the high density areas plot
the density contours, and for the low density areas plot the
individual points.

Hadley
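[That mixed plot can be sketched as follows; the 1% density cut-off
used to decide which points count as "low density" and the simulated
data are illustrative assumptions:]

```r
## Contours for the dense regions, individual points for the sparse
## ones, in the spirit of the suggestion above.
library(MASS)
set.seed(1)
x <- rnorm(1e5)
y <- 2 * x + rnorm(1e5)

d <- kde2d(x, y, n = 100)

## Look up the estimated density at each point's grid cell.
ix <- pmin(pmax(findInterval(x, d$x), 1), 100)
iy <- pmin(pmax(findInterval(y, d$y), 1), 100)
dens <- d$z[cbind(ix, iy)]

low <- dens < quantile(dens, 0.01)   # roughly the sparsest 1% of points

pdf(file.path(tempdir(), "mixed.pdf"))
contour(d)                           # dense regions as contours
points(x[low], y[low], pch = ".")    # sparse points drawn individually
dev.off()
```

Only the low-density points generate per-point PDF instructions, so
the file stays small while genuine outliers remain visible.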