Witold Eryk Wolski
2004-Nov-24 15:34 UTC
[R] scatterplot of 100000 points and pdf file format
Hi,

I want to draw a scatter plot with 1M and more points and save it as pdf.
This makes the pdf file large.
So I tried to save the file first as png and then convert it to pdf.
This looks OK if printed, but if viewed e.g. with Acrobat as a document
figure the quality is bad.

Does anyone know a way to reduce the size but keep the quality?

/E

--
Dipl. bio-chem. Witold Eryk Wolski
MPI-Moleculare Genetic
Ihnestrasse 63-73, 14195 Berlin
tel: 0049-30-83875219
http://www.molgen.mpg.de/~wolski
http://r4proteomics.sourceforge.net
mail: witek96 at users.sourceforge.net
      wolski at molgen.mpg.de
On 24-Nov-04 Witold Eryk Wolski wrote:
> I want to draw a scatter plot with 1M and more points
> and save it as pdf. This makes the pdf file large.
> So i tried to save the file first as png and than convert
> it to pdf. This looks OK if printed but if viewed e.g. with
> acrobat as document figure the quality is bad.
>
> Anyone knows a way to reduce the size but keep the quality?

If you want the PDF file to preserve the info about all the 1M points,
then the problem has no solution. The png file will already have
suppressed most of this (which is one reason for the poor quality).

I think you should give thought to reducing what you need to plot.
Think about it: suppose you plot with a resolution of 1/200 inch per
point (about the limit at which the eye begins to see rough edges).
Then you have 40000 points per square inch. If your 1M points are
separate but as closely packed as possible, this requires 25 square
inches, or a 5x5 inch (= 12.7x12.7 cm) square. And this would be
solid black!

Presumably in your plot there is a very large number of points which
are effectively indistinguishable from other points, so these could be
eliminated without spoiling the plot.

I don't have an obviously best strategy for reducing what you actually
plot, but perhaps one line to think along might be the following:

1. Multiply the data by some factor and then round the results to an
   integer (to avoid problems in step 2). The factor is chosen so that
   the result of (4) below is satisfactory.

2. Eliminate duplicates in the result of (1).

3. Divide by the factor you used in (1).

4. Plot the result; save the plot to PDF.

As to how to do it in R: the critical step is (2), which with so many
points could be very heavy unless done by a well-chosen procedure.
I'm not expert enough to advise about that, but no doubt others are.

Good luck!
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861 [NB: New number!]
Date: 24-Nov-04  Time: 16:16:28
------------------------------ XFMail ------------------------------
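[Ted's four steps can be sketched in R as follows. The scale factor of
1000 and the simulated data are illustrative assumptions, not from the
thread, and only 1e5 points are used to keep the sketch quick:]

```r
## Steps 1-4 from Ted's suggestion: scale, round, deduplicate, rescale,
## plot. The factor (1000) controls the discretisation and is assumed.
set.seed(42)
x <- rnorm(1e5)
y <- 2 * x + rnorm(1e5)

f <- 1000
xy <- unique(cbind(round(x * f), round(y * f)))  # steps 1-2: scale, round, dedup
xy <- xy / f                                     # step 3: undo the scaling

pdf(file.path(tempdir(), "reduced.pdf"))         # step 4: plot to PDF
plot(xy[, 1], xy[, 2], pch = ".")
dev.off()
```

The PDF then carries one drawing instruction per *distinct* rounded
point rather than one per raw observation.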
On Wed, 2004-11-24 at 16:34 +0100, Witold Eryk Wolski wrote:
> Anyone knows a way to reduce the size but keep the quality?

Hi Eryk!

Part of the problem is that in a pdf file, the vector based
instructions will need to be defined for each of your 10^6 points in
order to draw them.

When trying to create a simple example:

  pdf()
  plot(rnorm(1000000), rnorm(1000000))
  dev.off()

the pdf file is 55 Mb in size.

One immediate thought was to try a ps file; using the above plot, the
ps file was "only" 23 Mb in size. So note that ps can be more
efficient.

Going to a bitmap might result in a much smaller file, but as you
note, the quality does degrade as compared to a vector based image. I
tried the above to a png, then converted to a pdf (using 'convert'),
and as expected the image both viewed and printed was "pixelated",
since the pdf instructions are presumably drawing pixels and not
vector based objects.

Depending upon what you plan to do with the image, you may have to
choose among several options, resulting in tradeoffs between image
quality and file size.

If you can create the bitmap file explicitly in the size that you
require for printing or incorporating in a document, that is one way
to go and will preserve, to an extent, the overall fixed size image
quality, while keeping file size small.

Another option to consider for the pdf approach, if it does not
compromise the integrity of your plot, is to remove any duplicate data
points if any exist. Thus, you will not need what are in effect
redundant instructions in the pdf file. This may not be possible,
depending upon the nature of your data (i.e. doubles), without
considering some tolerance level for "equivalence".
Perhaps others will have additional ideas.

HTH,

Marc Schwartz
Marc/Eryk,

I have no experience with it, but I believe the hexbin package in BioC
was there for this purpose: avoid heavy over-plotting of lots of
points. You might want to look into that, if you have not done so yet.

Best,
Andy

> From: Marc Schwartz
On Wed, 24 Nov 2004, Witold Eryk Wolski wrote:
> I want to draw a scatter plot with 1M and more points and save it as pdf.

Try the "hexbin" Bioconductor package, which gives hexagonally-binned
density scatterplots. Even for tens of thousands of points this is
often much better than a scatterplot.

    -thomas
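[A minimal hexbin sketch; the simulated data and the xbins value are
illustrative assumptions, and the code checks first that the hexbin
package is actually installed:]

```r
## Hexagonal binning instead of a raw scatterplot: each hexagon
## records how many points fall inside it, so the PDF contains at most
## a few thousand hexagons rather than 1e5 individual points.
has_hexbin <- requireNamespace("hexbin", quietly = TRUE)
if (has_hexbin) {
  set.seed(1)
  x <- rnorm(1e5)
  y <- 2 * x + rnorm(1e5)

  bin <- hexbin::hexbin(x, y, xbins = 50)  # 50 bins across the x range

  pdf(file.path(tempdir(), "hexbin.pdf"))
  plot(bin)                                # shading encodes the counts
  dev.off()
}
```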
Witold,

I have found that plotting more than a few thousand data points at a
time quickly becomes a losing proposition. That is, the dense overlap
of data points tends to obscure the patterns of interest, with only
outliers distinctly visible. I typically deal with this in two ways.

The most straightforward is to select a much smaller subset of data
points to plot, say on the order of 100-1000, depending on the nature
of the data and the features you want to illustrate. How you sample
depends on the structure of your data set. E.g. you may want to sample
fixed proportions within subgroups. You can add loess lines or
confidence ellipses estimated from the complete data.

Another approach is to estimate the two dimensional density using
kde2d() (MASS package) and represent the result with a contour or
image plot. See ?kde2d for an example.

Both of these will result in much more manageable (and likely more
informative) figures.

Regards,
Matt

Matthew R. Nelson, Ph.D.
Director, Biostatistics
Sequenom, Inc.
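[The kde2d() route can be sketched as below; kde2d ships with the MASS
package, while the simulated data, grid size, and grey palette are
illustrative assumptions:]

```r
## Replace the scatterplot with a 2-d kernel density image plus
## contours, as suggested above.
library(MASS)
set.seed(1)
x <- rnorm(1e5)
y <- 2 * x + rnorm(1e5)

d <- kde2d(x, y, n = 100)    # density estimate on a 100x100 grid

pdf(file.path(tempdir(), "density.pdf"))
image(d, col = grey(seq(1, 0, length.out = 64)))  # darker = denser
contour(d, add = TRUE)
dev.off()
```

The resulting PDF stores a fixed 100x100 grid, so its size no longer
grows with the number of observations.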
james.holtman@convergys.com
2004-Nov-24 17:09 UTC
[R] scatterplot of 100000 points and pdf file format
Have you tried

  plot(..., pch = '.')

This will use the period as the plotting character instead of the
'circle' which is drawn by default. This should reduce the size of the
PDF file.

I have done scatter plots with 2M points and they are typically
meaningless with that many points overlaid.

Check out 'hexbin' on Bioconductor (you can download the package from
the RGui window). This is a much better way of showing some
information, since it will plot the number of points that are within a
hexagon. I have found this to be a better way of looking at some data.
__________________________________________________________
James Holtman       "What is the problem you are trying to solve?"
Executive Technical Consultant  --  Office of Technology, Convergys
james.holtman at convergys.com
+1 (513) 723-2929
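[The effect on file size can be checked directly; the point count is
reduced to 1e5 here, an assumption made only to keep the comparison
quick:]

```r
## Compare the PDF size of default circles against pch = "." dots.
set.seed(1)
x <- rnorm(1e5)
y <- rnorm(1e5)

f_circles <- file.path(tempdir(), "circles.pdf")
f_dots    <- file.path(tempdir(), "dots.pdf")

pdf(f_circles); plot(x, y);            dev.off()  # default open circles
pdf(f_dots);    plot(x, y, pch = "."); dev.off()  # period as symbol

file.info(c(f_circles, f_dots))$size   # compare the two sizes in bytes
```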
On Wednesday 24 November 2004 07:34, Witold Eryk Wolski wrote:
> Anyone knows a way to reduce the size but keep the quality?

I would strongly suggest a different method to present the data, such
as a contour plot or 3D bar plot. An XY plot with a million points is
unlikely to be readable unless it is produced as a large format print.
At 200 DPI printed, 1,000,000 discrete points require a minimum of a
5 inch (12.7 cm) by 5 inch area.

Besides, other than being visually overwhelming, what information
would such a plot offer a viewer?

John
Do you have a measure of "scatter", or can you pick "outliers", that
would allow you to produce a "mixed" plot using either density or
hexbinned data, with only the outliers placed after-the-fact using
points()?

Sean
How about the following to plot only the 1,000 or so most extreme
points (the outliers):

  x <- rnorm(1e6)
  y <- 2*x + rnorm(1e6)
  plot(x, y, pch='.')
  tmp <- chull(x, y)
  while( length(tmp) < 1000 ){
    tmp <- c(tmp, seq(along=x)[-tmp][ chull(x[-tmp], y[-tmp]) ])
  }
  points(x[tmp], y[tmp], col='red')

Now just replace the initial plot with a hexbin or contour plot and
you should have something that takes a lot less room but still shows
the locations of the outer points.

Greg Snow, Ph.D.
Statistical Data Center
greg.snow at ihc.com
(801) 408-8111
> -----Original Message-----
> From: Ted.Harding at nessie.mcc.ac.uk
> Sent: Wednesday, November 24, 2004 16:37 PM
> To: R Help Mailing List
> Subject: RE: [R] scatterplot of 100000 points and pdf file format
>
> On 24-Nov-04 Prof Brian Ripley wrote:
> > On Wed, 24 Nov 2004 Ted.Harding at nessie.mcc.ac.uk wrote:
> >
> >> 1. Multiply the data by some factor and then round the
> >>    results to an integer (to avoid problems in step 2).
> >>    Factor chosen so that the result of (4) below is
> >>    satisfactory.
> >>
> >> 2. Eliminate duplicates in the result of (1).
> >>
> >> 3. Divide by the factor you used in (1).
> >>
> >> 4. Plot the result; save plot to PDF.
> >>
> >> As to how to do it in R: the critical step is (2), which with so
> >> many points could be very heavy unless done by a well-chosen
> >> procedure. I'm not expert enough to advise about that, but no
> >> doubt others are.
> >
> > unique will eat that for breakfast
> >
> > > x <- runif(1e6)
> > > system.time(xx <- unique(round(x, 4)))
> > [1] 0.55 0.09 0.64 0.00 0.00
> > > length(xx)
> > [1] 10001
>
> 'unique' will eat x for breakfast, indeed, but will have some
> trouble chewing (x,y).

  > xx <- data.frame(x=round(runif(1000000),4), y=round(runif(1000000),4))
  > system.time(xx2 <- unique(xx))
  [1] 14.23 0.06 14.34 NA NA

The time does not seem too bad, depending on how many times it has to
be performed.

--Matt

Matt Austin
Statistician
Amgen
One Amgen Center Drive, M/S 24-2-C
Thousand Oaks CA 93021
(805) 447 - 7431

> I still can't think of a neat way of doing that.
>
> Best wishes,
> Ted.
> From: Ted.Harding at nessie.mcc.ac.uk
>
> On 25-Nov-04 Ted Harding wrote:
> > 'unique' will eat x for breakfast, indeed, but will have some
> > trouble chewing (x,y).
> >
> > I still can't think of a neat way of doing that.
>
> Sorry, I don't want to be misunderstood.
> I didn't mean that 'unique' won't work for arrays.
> What I meant was:
>
> > X<-round(rnorm(1e6),3); Y<-round(rnorm(1e6),3)
> > system.time(unique(X))
> [1] 0.74 0.07 0.81 0.00 0.00
> > system.time(unique(cbind(X,Y)))
> [1] 350.81 4.56 356.54 0.00 0.00

Do you know if the majority of that time is spent in unique() itself?
If so, in which method? What I see is:

  > X<-round(rnorm(1e6),3); Y<-round(rnorm(1e6),3)
  > system.time(unique(X), gcFirst=TRUE)
  [1] 0.25 0.01 0.26 NA NA
  > system.time(unique(cbind(X,Y)), gcFirst=TRUE)
  [1] 101.80 0.34 104.61 NA NA
  > system.time(dat <- data.frame(x=X, y=Y), gcFirst=TRUE)
  [1] 10.17 0.00 10.24 NA NA
  > system.time(unique(dat), gcFirst=TRUE)
  [1] 23.94 0.11 24.15 NA NA

Andy

> However, still rounding to 3 d.p., we can try packing:
>
> > Z <- 100000000*X + 1000*Y
> > system.time(W <- unique(Z))
> [1] 0.83 0.05 0.88 0.00 0.00
> > length(W)
> [1] 961523
>
> Though the runtime is small, we don't get much reduction, and W
> still has to be unpacked.
>
> With rounding to 2 d.p.:
>
> > X<-round(rnorm(1e6),2); Y<-round(rnorm(1e6),2)
> > Z <- 100000000*X + 1000*Y
> > system.time(W <- unique(Z))
> [1] 1.31 0.01 1.32 0.00 0.00
> > length(W)
> [1] 209882
>
> so now it's about 1/5, but visible discretisation must be getting
> close.
>
> With 1 d.p.:
>
> > X<-round(rnorm(1e6),1); Y<-round(rnorm(1e6),1)
> > Z <- 100000000*X + 1000*Y
> > system.time(W <- unique(Z))
> [1] 0.92 0.01 0.93 0.00 0.00
> > length(W)
> [1] 4953
>
> there's a good reduction (about 1/200), but the discretisation would
> definitely now be visible. However, as I suggested before, there's
> an issue of choice of constant (i.e. of the resolution of the
> discretisation, so that there's a useful reduction and also the plot
> is acceptable).
>
> I'd still like to learn of a method which avoids the above method of
> packing, which strikes me as clumsy (but maybe it's the best way
> after all).
>
> Ted.
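[Ted's pack/unpack trick end-to-end, with the unpacking step filled in.
The constants follow his 3 d.p. example; the unpacking arithmetic is an
assumption about how the pairs were meant to be recovered, and 1e5
points are used to keep the sketch quick:]

```r
## Pack each rounded (x, y) pair into a single number so that unique()
## runs on a plain vector, then unpack the surviving pairs.
set.seed(1)
X <- round(rnorm(1e5), 3)
Y <- round(rnorm(1e5), 3)

Z <- 100000000 * X + 1000 * Y   # pack: x in the high digits, y in the low
W <- unique(Z)                  # fast: unique() on an atomic vector

## Unpack: recover the deduplicated coordinates. Works because
## |1000*Y| stays far below the 1e5 spacing contributed by X.
x2 <- round(W / 100000000, 3)
y2 <- round((W - 100000000 * x2) / 1000, 3)
```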
Another possibility might be to use a 2d kernel density estimate
(e.g. kde2d from library(MASS)). Then for the high density areas plot
the density contours, and for the low density areas plot the
individual points.

Hadley
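[That mixed plot can be sketched as follows; the 1% density cut-off
used to decide which points count as "low density" and the simulated
data are illustrative assumptions:]

```r
## Contours for the dense regions, individual points for the sparse
## ones, in the spirit of the suggestion above.
library(MASS)
set.seed(1)
x <- rnorm(1e5)
y <- 2 * x + rnorm(1e5)

d <- kde2d(x, y, n = 100)

## Look up the estimated density at each point's grid cell.
ix <- pmin(pmax(findInterval(x, d$x), 1), 100)
iy <- pmin(pmax(findInterval(y, d$y), 1), 100)
dens <- d$z[cbind(ix, iy)]

low <- dens < quantile(dens, 0.01)   # roughly the sparsest 1% of points

pdf(file.path(tempdir(), "mixed.pdf"))
contour(d)                           # dense regions as contours
points(x[low], y[low], pch = ".")    # sparse points drawn individually
dev.off()
```

Only the low-density points generate per-point PDF instructions, so
the file stays small while genuine outliers remain visible.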