thr3ads.net - R help - [R] Need ideas on how to show spikes in my data and how to code it in R [Jun 2008]

If this information is useful, please help other people find it:
Share via:

Thomas Frööjd

2008-Jun-23 19:40 UTC

[R] Need ideas on how to show spikes in my data and how to code it in R

Hi

I have recently been analyzing birthweight data from a clinic.  The
data has obvious defects in that there is digit preference on certain
weights making them overrepresented. This shows as spikes in the
histogram on certain well rounded weights like 2, 2.5, 3, etc.   I
would like to show this to government officials but can't figure out
how I should present the finding in an easy to understand manner.

My idea is this:

I have a dataset of 20 000 childbirths from another nation that I
would like to plot in a graph over the histograms of birth weights
from the clinic. This dataset doesn't share the digit preference
problem. The idea is similar to how people sometimes plot a fitted
normal density function over a histogram to show how data is
distributed.

To do this I need to do three steps. None which I succeeded with so far

1.       Shift the mean and std on the reference dataset to the mean
and std of my clinic birth weight data.

2.       Scale the data so they can be plotted on the same axis. The
reference dataset has around 20 000 observations and my data from the
clinic only around 3000 so I have to fix this otherwise the plot of
the reference datset will be much bigger in the graph.

3.       Plot both on the same graph. The reference dataset like a
density plot and my dataset as a histogram, that means weight bins on
the x axis and number of observations on y. It should be added that my
reference dataset isn't truly continuous but recorded at 100g
intervals. This means both datasets have the same grouping however
plotting both as histogram would probably make it harder to understand
for a person with little training in statistics. This means that the
reference dataset "density function" has to be smoothed somehow.

I would be very thankful for help on any of those steps. Also if you
think this approach is wrong for some reason please tell me.

Best regards

Thomas Fr?jd

Daniel Folkinshteyn

2008-Jun-23 20:10 UTC

head link

[R] Need ideas on how to show spikes in my data and how to code it in R

on 06/23/2008 03:40 PM Thomas Fr??jd said the following:> 1.       Shift the mean and std on the reference dataset to the mean
> and std of my clinic birth weight data.
to shift the mean by any distance, just add or subtract that distance 
from each observation (e.g., to move mean from m1 to m2, to each 
observation add (m2 -m1) ).

to shift the stddev, from, say, s1 to s2, multiply each observation by s2/s1

instead of shifting ref dataset to mean/sdev of other dataset, it might 
be more intuitive to transform both to mean=0, sdev=1.
> 2.       Scale the data so they can be plotted on the same axis. The
> reference dataset has around 20 000 observations and my data from the
> clinic only around 3000 so I have to fix this otherwise the plot of
> the reference datset will be much bigger in the graph.
if you do a density plot (see ?density in R), it will automatically be 
scaled. if you want the histogram scaled too, then after calculating the 
histogram frequencies, multiply them by a ratio of numberofobs for your 
data, and number of obs for reference data (i.e.: NOBS_yourdata / 
NOSB_refdata)

but i'd say, you might do better to just work with a density plot and 
set the appropriate bandwidth parameter, rather than working with a 
histogram, for presentational purposes.
> 3.       Plot both on the same graph. The reference dataset like a
> density plot and my dataset as a histogram, that means weight bins on
> the x axis and number of observations on y. It should be added that my
> reference dataset isn't truly continuous but recorded at 100g
> intervals. This means both datasets have the same grouping however
> plotting both as histogram would probably make it harder to understand
> for a person with little training in statistics. This means that the
> reference dataset "density function" has to be smoothed somehow.
see ?density, set the appropriate bandwidth parameter to achieve your 
desired degree of smoothing.
> I would be very thankful for help on any of those steps. Also if you
> think this approach is wrong for some reason please tell me.
i think you'd have a much easier time of it (and also a better-looking 
and more informative plot), if you plot both as density on the same 
plot, and forgo the histogram overlay. your reference dataset will be a 
nice smooth histogram, as long as you choose a wide enough bandwidth to 
avoid showing peaks every 100g, and your target dataset will have large 
peaks at 2, 2.5, 3, etc. will look very nice salient. :)

also, unless the babies in both plots are from different species :), you 
probably don't need to transform the data to equalize means and variances.

-d

Gabor Csardi

2008-Jun-30 15:05 UTC

head link

[R] Graphs in R

paste(sep="", "graf", 1:250, ".jpg")

See ?paste,
G.

On Mon, Jun 30, 2008 at 11:58:51AM -0300, Leandro Marino
wrote:> Hi list,
> 
> I want to make a lot of graphics to my end course project. So, i was using
> this sintax:
> 
> 
> jpeg(filename = "graf01.jpg", width = 1024, height = 1024,
>      units = "px", pointsize = 25, quality = 100, bg =
"grey95",
>      res = NA, restoreConsole = TRUE)
> i=1
> par(mfrow=c(4,1),col="grey90",font.lab=2)
> hist(sul[,i+2],nclass=75,xlab="Regi?o
>
Sul",ylab="Freq??ncia",col="antiquewhite4",main="")
>
hist(PR[,i+2],nclass=75,xlab="Paran?",ylab="Freq??ncia",col="antiquewhite4",
> main="")
> hist(SC[,i+2],nclass=75,xlab="Santa
>
Catarina",ylab="Freq??ncia",col="antiquewhite4",main="")
> hist(RS[,i+2],nclass=75,xlab="Rio Grande do
>
Sul",ylab="Freq??ncia",col="antiquewhite4",main="")
> dev.off()
> 
> But, I want to know how can I create an for to do that. Like that:
> 
> for(i in 1:250){
> jpeg(filename = "graf01.jpg", width = 1024, height = 1024,
>      units = "px", pointsize = 25, quality = 100, bg =
"grey95",
>      res = NA, restoreConsole = TRUE)
> par(mfrow=c(4,1),col="grey90",font.lab=2)
> hist(sul[,i+2],nclass=75,xlab="Regi?o
>
Sul",ylab="Freq??ncia",col="antiquewhite4",main="")
>
hist(PR[,i+2],nclass=75,xlab="Paran?",ylab="Freq??ncia",col="antiquewhite4",
> main="")
> hist(SC[,i+2],nclass=75,xlab="Santa
>
Catarina",ylab="Freq??ncia",col="antiquewhite4",main="")
> hist(RS[,i+2],nclass=75,xlab="Rio Grande do
>
Sul",ylab="Freq??ncia",col="antiquewhite4",main="")
> dev.off()
> }
> 
> The problem is the name of the file, I want to do something like grafi.jpg
> where i goes from 1 to 250.
> 
> Thanks a lot for the help.
> 
> 
> Best Regards,
> Leandro Marino
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
-- 
Csardi Gabor <csardi at rmki.kfki.hu>    UNIL DGM

Thomas Frööjd

2008-Jul-07 10:24 UTC

head link

[R] RES: Need ideas on how to show spikes in my data and how to codeit in R

Hi

Try changing

 for(i in 1:250){
 jpeg(filename = "graf01.jpg", width = 1024, height = 1024,

to

 for(i in 1:250){
 jpeg(filename = paste("graf", i, ".jpg", sep=""),
width = 1024, height = 1024,

I think that should work, otherwise read up on the paste() function at
http://stat.ethz.ch/R-manual/R-patched/library/base/html/paste.html

Best luck

On Mon, Jun 30, 2008 at 4:58 PM, Leandro Marino
<leandro at cesgranrio.org.br> wrote:> Hi list,
>
> I want to make a lot of graphics to my end course project. So, i was using
> this sintax:
>
>
> jpeg(filename = "graf01.jpg", width = 1024, height = 1024,
>     units = "px", pointsize = 25, quality = 100, bg =
"grey95",
>     res = NA, restoreConsole = TRUE)
> i=1
> par(mfrow=c(4,1),col="grey90",font.lab=2)
> hist(sul[,i+2],nclass=75,xlab="Regi?o
>
Sul",ylab="Freq??ncia",col="antiquewhite4",main="")
>
hist(PR[,i+2],nclass=75,xlab="Paran?",ylab="Freq??ncia",col="antiquewhite4",
> main="")
> hist(SC[,i+2],nclass=75,xlab="Santa
>
Catarina",ylab="Freq??ncia",col="antiquewhite4",main="")
> hist(RS[,i+2],nclass=75,xlab="Rio Grande do
>
Sul",ylab="Freq??ncia",col="antiquewhite4",main="")
> dev.off()
>
> But, I want to know how can I create an for to do that. Like that:
>
> for(i in 1:250){
> jpeg(filename = "graf01.jpg", width = 1024, height = 1024,
>     units = "px", pointsize = 25, quality = 100, bg =
"grey95",
>     res = NA, restoreConsole = TRUE)
> par(mfrow=c(4,1),col="grey90",font.lab=2)
> hist(sul[,i+2],nclass=75,xlab="Regi?o
>
Sul",ylab="Freq??ncia",col="antiquewhite4",main="")
>
hist(PR[,i+2],nclass=75,xlab="Paran?",ylab="Freq??ncia",col="antiquewhite4",
> main="")
> hist(SC[,i+2],nclass=75,xlab="Santa
>
Catarina",ylab="Freq??ncia",col="antiquewhite4",main="")
> hist(RS[,i+2],nclass=75,xlab="Rio Grande do
>
Sul",ylab="Freq??ncia",col="antiquewhite4",main="")
> dev.off()
> }
>
> The problem is the name of the file, I want to do something like grafi.jpg
> where i goes from 1 to 250.
>
> Thanks a lot for the help.
>
>
> Best Regards,
> Leandro Marino
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

S Ellison

2008-Jul-08 11:45 UTC

head link

[R] Need ideas on how to show spikes in my data and how to code it in R

Two thoughts:
i) If you have a narrow distribution, the density can be higher than 1. The area
comes out at 1 for density, and n for the frequency.

ii) hist() will not show the same frequencies as density() unless hist has unit
bin sizes. density*length is showing number per unit change in Weight; hist
shows number per bin width.

Try plotting a histogram first, then plot the density on top of that. If they
disagree >>> "Thomas Fr?jd" <tfrojd at gmail.com>
07/08/08 12:29 PM >>>
Hi!

Sorry for bothering you again but I can't seem to get it right.

When i multiply the density with the number of observations it seems
to be way to high, The reference curve is drawn at maybe 20 times
higher frequency count than it should be.

I use the following code where "weights$Weight" is my weights data and
"reference" is my reference dataset.

# calucate the right breakpoints
breakpoints <- seq(min(weights), max(weights), by=binwidth)

#scale density
dens <- density(reference)
dens$y <- dens$y * (length(weights$Weight))

#graph it
hist(weights$Weight, freq=TRUE, breaks=breakpoints, main=wfiles[i])

lines(dens)

Any ideas are greatly appreciated.

/Thomas

On Fri, Jun 27, 2008 at 10:54 PM, Daniel Folkinshteyn
<dfolkins at gmail.com> wrote:> if you want the "frequency" scale rather than density scale, then
leave hist
> as is (by default it uses the frequency scale), and rescale the density by
> multiplying it by the appropriate NOBS.
>
> on 06/27/2008 01:16 PM Thomas Fr??jd said the following:
>>
>> Hi
>>
>> Thank you very much for taking time to answer.
>>
>> The solution of using hist(data) for the main dataset and adding
>> lines(density(refdata)) for the reference data seem to work great. I
>> forgot to mention one thing however, I need to have frequency on the y
>> azis instead of density as now.
>>
>>  I know this is not a "real" histogram but since the audience
is not
>> very statistically experienced I would prefer to do it this way.
>> Anyone have an idea?
>>
>> Thanks again for your help.
>>
>> Thomas Fr?jd
>>
>> On Wed, Jun 25, 2008 at 6:16 PM, Daniel Folkinshteyn <dfolkins at
gmail.com>
>> wrote:
>>>>
>>>>   I don't understand this.  Why not just get hist() to plot
on the
>>>> density scale,
>>>>   thereby making its output commensurate with the output of
density()?
>>>>   The hist() function will plot on the density scale if you ask
it to.
>>>>  Set freq=FALSE
>>>>   (or prob=TRUE) in the call to hist.
>>>
>>> ehrm... because i didn't realize that option existed :) that
certainly is
>>> easier than manually scaling hist output by NOBS - thanks for the
tip!
>>>
>>
>
>
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

*******************************************************************
This email and any attachments are confidential. Any use...{{dropped:8}}

Thomas Fröjd

2008-Jul-08 15:41 UTC

head link

[R] Need ideas on how to show spikes in my data and how to code it in R

Hi thanks for your answer.
> ii) hist() will not show the same frequencies as density() unless hist has
unit bin sizes. density*length is showing number per unit change in Weight; hist
shows number per bin width.
I belive this is what is confounding me. I have a bin width of 0.1 in
the histogram. Changing

dens$y <- dens$y * (length(weights$Weight))

to

dens$y <- dens$y * (length(weights$Weight)*binwidth)

where binwidth=0.1 seems to output correct graphs.

Can someone verify this is the right approach?

On Tue, Jul 8, 2008 at 1:45 PM, S Ellison <S.Ellison at lgc.co.uk>
wrote:> Two thoughts:
> i) If you have a narrow distribution, the density can be higher than 1. The
area comes out at 1 for density, and n for the frequency.
>
> ii) hist() will not show the same frequencies as density() unless hist has
unit bin sizes. density*length is showing number per unit change in Weight; hist
shows number per bin width.
>
>
>
>
>
>
> Try plotting a histogram first, then plot the density on top of that. If
they disagree >>> "Thomas Fr?jd" <tfrojd at gmail.com>
07/08/08 12:29 PM >>>
> Hi!
>
> Sorry for bothering you again but I can't seem to get it right.
>
> When i multiply the density with the number of observations it seems
> to be way to high, The reference curve is drawn at maybe 20 times
> higher frequency count than it should be.
>
> I use the following code where "weights$Weight" is my weights
data and
> "reference" is my reference dataset.
>
> # calucate the right breakpoints
> breakpoints <- seq(min(weights), max(weights), by=binwidth)
>
>
> #scale density
> dens <- density(reference)
> dens$y <- dens$y * (length(weights$Weight))
>
> #graph it
> hist(weights$Weight, freq=TRUE, breaks=breakpoints, main=wfiles[i])
>
> lines(dens)
>
> Any ideas are greatly appreciated.
>
> /Thomas
>
>
>
> On Fri, Jun 27, 2008 at 10:54 PM, Daniel Folkinshteyn
> <dfolkins at gmail.com> wrote:
>> if you want the "frequency" scale rather than density scale,
then leave hist
>> as is (by default it uses the frequency scale), and rescale the density
by
>> multiplying it by the appropriate NOBS.
>>
>> on 06/27/2008 01:16 PM Thomas Fr??jd said the following:
>>>
>>> Hi
>>>
>>> Thank you very much for taking time to answer.
>>>
>>> The solution of using hist(data) for the main dataset and adding
>>> lines(density(refdata)) for the reference data seem to work great.
I
>>> forgot to mention one thing however, I need to have frequency on
the y
>>> azis instead of density as now.
>>>
>>>  I know this is not a "real" histogram but since the
audience is not
>>> very statistically experienced I would prefer to do it this way.
>>> Anyone have an idea?
>>>
>>> Thanks again for your help.
>>>
>>> Thomas Fr?jd
>>>
>>> On Wed, Jun 25, 2008 at 6:16 PM, Daniel Folkinshteyn <dfolkins
at gmail.com>
>>> wrote:
>>>>>
>>>>>   I don't understand this.  Why not just get hist() to
plot on the
>>>>> density scale,
>>>>>   thereby making its output commensurate with the output of
density()?
>>>>>   The hist() function will plot on the density scale if you
ask it to.
>>>>>  Set freq=FALSE
>>>>>   (or prob=TRUE) in the call to hist.
>>>>
>>>> ehrm... because i didn't realize that option existed :)
that certainly is
>>>> easier than manually scaling hist output by NOBS - thanks for
the tip!
>>>>
>>>
>>
>>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
> *******************************************************************
> This email and any attachments are confidential. Any u...{{dropped:8}}

Apparently Analagous Threads

Search for more reasonably related threads

R help - Jun 2008 - Need ideas on how to show spikes in my data and how to code it in R

[R] Need ideas on how to show spikes in my data and how to code it in R

[R] Need ideas on how to show spikes in my data and how to code it in R

[R] Graphs in R

[R] RES: Need ideas on how to show spikes in my data and how to codeit in R

[R] Need ideas on how to show spikes in my data and how to code it in R

[R] Need ideas on how to show spikes in my data and how to code it in R

Apparently Analagous Threads