Thomas Frööjd
2008-Jun-23 19:40 UTC
[R] Need ideas on how to show spikes in my data and how to code it in R
Hi I have recently been analyzing birthweight data from a clinic. The data has obvious defects in that there is digit preference on certain weights making them overrepresented. This shows as spikes in the histogram on certain well rounded weights like 2, 2.5, 3, etc. I would like to show this to government officials but can't figure out how I should present the finding in an easy to understand manner. My idea is this: I have a dataset of 20 000 childbirths from another nation that I would like to plot in a graph over the histograms of birth weights from the clinic. This dataset doesn't share the digit preference problem. The idea is similar to how people sometimes plot a fitted normal density function over a histogram to show how data is distributed. To do this I need to do three steps. None which I succeeded with so far 1. Shift the mean and std on the reference dataset to the mean and std of my clinic birth weight data. 2. Scale the data so they can be plotted on the same axis. The reference dataset has around 20 000 observations and my data from the clinic only around 3000 so I have to fix this otherwise the plot of the reference datset will be much bigger in the graph. 3. Plot both on the same graph. The reference dataset like a density plot and my dataset as a histogram, that means weight bins on the x axis and number of observations on y. It should be added that my reference dataset isn't truly continuous but recorded at 100g intervals. This means both datasets have the same grouping however plotting both as histogram would probably make it harder to understand for a person with little training in statistics. This means that the reference dataset "density function" has to be smoothed somehow. I would be very thankful for help on any of those steps. Also if you think this approach is wrong for some reason please tell me. Best regards Thomas Fr?jd
Daniel Folkinshteyn
2008-Jun-23 20:10 UTC
[R] Need ideas on how to show spikes in my data and how to code it in R
on 06/23/2008 03:40 PM Thomas Fr??jd said the following:> 1. Shift the mean and std on the reference dataset to the mean > and std of my clinic birth weight data.to shift the mean by any distance, just add or subtract that distance from each observation (e.g., to move mean from m1 to m2, to each observation add (m2 -m1) ). to shift the stddev, from, say, s1 to s2, multiply each observation by s2/s1 instead of shifting ref dataset to mean/sdev of other dataset, it might be more intuitive to transform both to mean=0, sdev=1.> 2. Scale the data so they can be plotted on the same axis. The > reference dataset has around 20 000 observations and my data from the > clinic only around 3000 so I have to fix this otherwise the plot of > the reference datset will be much bigger in the graph.if you do a density plot (see ?density in R), it will automatically be scaled. if you want the histogram scaled too, then after calculating the histogram frequencies, multiply them by a ratio of numberofobs for your data, and number of obs for reference data (i.e.: NOBS_yourdata / NOSB_refdata) but i'd say, you might do better to just work with a density plot and set the appropriate bandwidth parameter, rather than working with a histogram, for presentational purposes.> 3. Plot both on the same graph. The reference dataset like a > density plot and my dataset as a histogram, that means weight bins on > the x axis and number of observations on y. It should be added that my > reference dataset isn't truly continuous but recorded at 100g > intervals. This means both datasets have the same grouping however > plotting both as histogram would probably make it harder to understand > for a person with little training in statistics. This means that the > reference dataset "density function" has to be smoothed somehow.see ?density, set the appropriate bandwidth parameter to achieve your desired degree of smoothing.> I would be very thankful for help on any of those steps. Also if you > think this approach is wrong for some reason please tell me.i think you'd have a much easier time of it (and also a better-looking and more informative plot), if you plot both as density on the same plot, and forgo the histogram overlay. your reference dataset will be a nice smooth histogram, as long as you choose a wide enough bandwidth to avoid showing peaks every 100g, and your target dataset will have large peaks at 2, 2.5, 3, etc. will look very nice salient. :) also, unless the babies in both plots are from different species :), you probably don't need to transform the data to equalize means and variances. -d
paste(sep="", "graf", 1:250, ".jpg") See ?paste, G. On Mon, Jun 30, 2008 at 11:58:51AM -0300, Leandro Marino wrote:> Hi list, > > I want to make a lot of graphics to my end course project. So, i was using > this sintax: > > > jpeg(filename = "graf01.jpg", width = 1024, height = 1024, > units = "px", pointsize = 25, quality = 100, bg = "grey95", > res = NA, restoreConsole = TRUE) > i=1 > par(mfrow=c(4,1),col="grey90",font.lab=2) > hist(sul[,i+2],nclass=75,xlab="Regi?o > Sul",ylab="Freq??ncia",col="antiquewhite4",main="") > hist(PR[,i+2],nclass=75,xlab="Paran?",ylab="Freq??ncia",col="antiquewhite4", > main="") > hist(SC[,i+2],nclass=75,xlab="Santa > Catarina",ylab="Freq??ncia",col="antiquewhite4",main="") > hist(RS[,i+2],nclass=75,xlab="Rio Grande do > Sul",ylab="Freq??ncia",col="antiquewhite4",main="") > dev.off() > > But, I want to know how can I create an for to do that. Like that: > > for(i in 1:250){ > jpeg(filename = "graf01.jpg", width = 1024, height = 1024, > units = "px", pointsize = 25, quality = 100, bg = "grey95", > res = NA, restoreConsole = TRUE) > par(mfrow=c(4,1),col="grey90",font.lab=2) > hist(sul[,i+2],nclass=75,xlab="Regi?o > Sul",ylab="Freq??ncia",col="antiquewhite4",main="") > hist(PR[,i+2],nclass=75,xlab="Paran?",ylab="Freq??ncia",col="antiquewhite4", > main="") > hist(SC[,i+2],nclass=75,xlab="Santa > Catarina",ylab="Freq??ncia",col="antiquewhite4",main="") > hist(RS[,i+2],nclass=75,xlab="Rio Grande do > Sul",ylab="Freq??ncia",col="antiquewhite4",main="") > dev.off() > } > > The problem is the name of the file, I want to do something like grafi.jpg > where i goes from 1 to 250. > > Thanks a lot for the help. > > > Best Regards, > Leandro Marino > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.-- Csardi Gabor <csardi at rmki.kfki.hu> UNIL DGM
Thomas Frööjd
2008-Jul-07 10:24 UTC
[R] RES: Need ideas on how to show spikes in my data and how to codeit in R
Hi Try changing for(i in 1:250){ jpeg(filename = "graf01.jpg", width = 1024, height = 1024, to for(i in 1:250){ jpeg(filename = paste("graf", i, ".jpg", sep=""), width = 1024, height = 1024, I think that should work, otherwise read up on the paste() function at http://stat.ethz.ch/R-manual/R-patched/library/base/html/paste.html Best luck On Mon, Jun 30, 2008 at 4:58 PM, Leandro Marino <leandro at cesgranrio.org.br> wrote:> Hi list, > > I want to make a lot of graphics to my end course project. So, i was using > this sintax: > > > jpeg(filename = "graf01.jpg", width = 1024, height = 1024, > units = "px", pointsize = 25, quality = 100, bg = "grey95", > res = NA, restoreConsole = TRUE) > i=1 > par(mfrow=c(4,1),col="grey90",font.lab=2) > hist(sul[,i+2],nclass=75,xlab="Regi?o > Sul",ylab="Freq??ncia",col="antiquewhite4",main="") > hist(PR[,i+2],nclass=75,xlab="Paran?",ylab="Freq??ncia",col="antiquewhite4", > main="") > hist(SC[,i+2],nclass=75,xlab="Santa > Catarina",ylab="Freq??ncia",col="antiquewhite4",main="") > hist(RS[,i+2],nclass=75,xlab="Rio Grande do > Sul",ylab="Freq??ncia",col="antiquewhite4",main="") > dev.off() > > But, I want to know how can I create an for to do that. Like that: > > for(i in 1:250){ > jpeg(filename = "graf01.jpg", width = 1024, height = 1024, > units = "px", pointsize = 25, quality = 100, bg = "grey95", > res = NA, restoreConsole = TRUE) > par(mfrow=c(4,1),col="grey90",font.lab=2) > hist(sul[,i+2],nclass=75,xlab="Regi?o > Sul",ylab="Freq??ncia",col="antiquewhite4",main="") > hist(PR[,i+2],nclass=75,xlab="Paran?",ylab="Freq??ncia",col="antiquewhite4", > main="") > hist(SC[,i+2],nclass=75,xlab="Santa > Catarina",ylab="Freq??ncia",col="antiquewhite4",main="") > hist(RS[,i+2],nclass=75,xlab="Rio Grande do > Sul",ylab="Freq??ncia",col="antiquewhite4",main="") > dev.off() > } > > The problem is the name of the file, I want to do something like grafi.jpg > where i goes from 1 to 250. > > Thanks a lot for the help. > > > Best Regards, > Leandro Marino > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >
S Ellison
2008-Jul-08 11:45 UTC
[R] Need ideas on how to show spikes in my data and how to code it in R
Two thoughts: i) If you have a narrow distribution, the density can be higher than 1. The area comes out at 1 for density, and n for the frequency. ii) hist() will not show the same frequencies as density() unless hist has unit bin sizes. density*length is showing number per unit change in Weight; hist shows number per bin width. Try plotting a histogram first, then plot the density on top of that. If they disagree >>> "Thomas Fr?jd" <tfrojd at gmail.com> 07/08/08 12:29 PM >>> Hi! Sorry for bothering you again but I can't seem to get it right. When i multiply the density with the number of observations it seems to be way to high, The reference curve is drawn at maybe 20 times higher frequency count than it should be. I use the following code where "weights$Weight" is my weights data and "reference" is my reference dataset. # calucate the right breakpoints breakpoints <- seq(min(weights), max(weights), by=binwidth) #scale density dens <- density(reference) dens$y <- dens$y * (length(weights$Weight)) #graph it hist(weights$Weight, freq=TRUE, breaks=breakpoints, main=wfiles[i]) lines(dens) Any ideas are greatly appreciated. /Thomas On Fri, Jun 27, 2008 at 10:54 PM, Daniel Folkinshteyn <dfolkins at gmail.com> wrote:> if you want the "frequency" scale rather than density scale, then leave hist > as is (by default it uses the frequency scale), and rescale the density by > multiplying it by the appropriate NOBS. > > on 06/27/2008 01:16 PM Thomas Fr??jd said the following: >> >> Hi >> >> Thank you very much for taking time to answer. >> >> The solution of using hist(data) for the main dataset and adding >> lines(density(refdata)) for the reference data seem to work great. I >> forgot to mention one thing however, I need to have frequency on the y >> azis instead of density as now. >> >> I know this is not a "real" histogram but since the audience is not >> very statistically experienced I would prefer to do it this way. >> Anyone have an idea? >> >> Thanks again for your help. >> >> Thomas Fr?jd >> >> On Wed, Jun 25, 2008 at 6:16 PM, Daniel Folkinshteyn <dfolkins at gmail.com> >> wrote: >>>> >>>> I don't understand this. Why not just get hist() to plot on the >>>> density scale, >>>> thereby making its output commensurate with the output of density()? >>>> The hist() function will plot on the density scale if you ask it to. >>>> Set freq=FALSE >>>> (or prob=TRUE) in the call to hist. >>> >>> ehrm... because i didn't realize that option existed :) that certainly is >>> easier than manually scaling hist output by NOBS - thanks for the tip! >>> >> > >______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. ******************************************************************* This email and any attachments are confidential. Any use...{{dropped:8}}
Thomas Fröjd
2008-Jul-08 15:41 UTC
[R] Need ideas on how to show spikes in my data and how to code it in R
Hi thanks for your answer.> ii) hist() will not show the same frequencies as density() unless hist has unit bin sizes. density*length is showing number per unit change in Weight; hist shows number per bin width.I belive this is what is confounding me. I have a bin width of 0.1 in the histogram. Changing dens$y <- dens$y * (length(weights$Weight)) to dens$y <- dens$y * (length(weights$Weight)*binwidth) where binwidth=0.1 seems to output correct graphs. Can someone verify this is the right approach? On Tue, Jul 8, 2008 at 1:45 PM, S Ellison <S.Ellison at lgc.co.uk> wrote:> Two thoughts: > i) If you have a narrow distribution, the density can be higher than 1. The area comes out at 1 for density, and n for the frequency. > > ii) hist() will not show the same frequencies as density() unless hist has unit bin sizes. density*length is showing number per unit change in Weight; hist shows number per bin width. > > > > > > > Try plotting a histogram first, then plot the density on top of that. If they disagree >>> "Thomas Fr?jd" <tfrojd at gmail.com> 07/08/08 12:29 PM >>> > Hi! > > Sorry for bothering you again but I can't seem to get it right. > > When i multiply the density with the number of observations it seems > to be way to high, The reference curve is drawn at maybe 20 times > higher frequency count than it should be. > > I use the following code where "weights$Weight" is my weights data and > "reference" is my reference dataset. > > # calucate the right breakpoints > breakpoints <- seq(min(weights), max(weights), by=binwidth) > > > #scale density > dens <- density(reference) > dens$y <- dens$y * (length(weights$Weight)) > > #graph it > hist(weights$Weight, freq=TRUE, breaks=breakpoints, main=wfiles[i]) > > lines(dens) > > Any ideas are greatly appreciated. > > /Thomas > > > > On Fri, Jun 27, 2008 at 10:54 PM, Daniel Folkinshteyn > <dfolkins at gmail.com> wrote: >> if you want the "frequency" scale rather than density scale, then leave hist >> as is (by default it uses the frequency scale), and rescale the density by >> multiplying it by the appropriate NOBS. >> >> on 06/27/2008 01:16 PM Thomas Fr??jd said the following: >>> >>> Hi >>> >>> Thank you very much for taking time to answer. >>> >>> The solution of using hist(data) for the main dataset and adding >>> lines(density(refdata)) for the reference data seem to work great. I >>> forgot to mention one thing however, I need to have frequency on the y >>> azis instead of density as now. >>> >>> I know this is not a "real" histogram but since the audience is not >>> very statistically experienced I would prefer to do it this way. >>> Anyone have an idea? >>> >>> Thanks again for your help. >>> >>> Thomas Fr?jd >>> >>> On Wed, Jun 25, 2008 at 6:16 PM, Daniel Folkinshteyn <dfolkins at gmail.com> >>> wrote: >>>>> >>>>> I don't understand this. Why not just get hist() to plot on the >>>>> density scale, >>>>> thereby making its output commensurate with the output of density()? >>>>> The hist() function will plot on the density scale if you ask it to. >>>>> Set freq=FALSE >>>>> (or prob=TRUE) in the call to hist. >>>> >>>> ehrm... because i didn't realize that option existed :) that certainly is >>>> easier than manually scaling hist output by NOBS - thanks for the tip! >>>> >>> >> >> > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > > ******************************************************************* > This email and any attachments are confidential. Any u...{{dropped:8}}