matthias-gondan
2021-Aug-04 12:08 UTC
[R] What are the pros and cons of the log.p parameter in (p|q)norm and similar?
Response to 1:
You need the log version, e.g., in maximum likelihood; otherwise the
product of the densities and probabilities can become very small and underflow to zero.
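
A minimal sketch of a likelihood that multiplies densities and probabilities, assuming hypothetical right-censored normal data. The simulated data, the censoring point `cens`, and the function name `negloglik` are illustrative and not from the thread. Summing `dnorm(..., log = TRUE)` and `pnorm(..., log.p = TRUE)` terms avoids forming the product at all:

```r
# Hypothetical data: normal observations, right-censored at cens.
set.seed(1)
x <- rnorm(500, mean = 1, sd = 2)
cens <- 1.5
observed <- pmin(x, cens)   # values above cens are recorded as cens
is_cens <- x > cens

# Negative log-likelihood; par = c(mean, log(sd)) keeps sd positive.
negloglik <- function(par) {
  mu <- par[1]
  sigma <- exp(par[2])
  # Uncensored points contribute log densities.
  ll_obs <- sum(dnorm(observed[!is_cens], mu, sigma, log = TRUE))
  # Censored points contribute log upper-tail probabilities.
  ll_cens <- sum(is_cens) *
    pnorm(cens, mu, sigma, lower.tail = FALSE, log.p = TRUE)
  -(ll_obs + ll_cens)
}

fit <- optim(c(0, 0), negloglik)
est <- c(mean = fit$par[1], sd = exp(fit$par[2]))
print(est)
```

Parameterising the standard deviation on the log scale is a design choice of this sketch, so that the optimizer cannot propose a negative value.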
-------- Original message --------
From: r-help-request at r-project.org
Date: 04.08.21 12:01 (GMT+01:00)
To: r-help at r-project.org
Subject: R-help Digest, Vol 222, Issue 4

Send R-help mailing list submissions to
    r-help at r-project.org

To subscribe or unsubscribe via the World Wide Web, visit
    https://stat.ethz.ch/mailman/listinfo/r-help
or, via email, send a message with subject or body 'help' to
    r-help-request at r-project.org

You can reach the person managing the list at
    r-help-owner at r-project.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of R-help digest..."

Today's Topics:

   1. What are the pros and cons of the log.p parameter in
      (p|q)norm and similar? (Michael Dewey)
   2. Help with package EasyPubmed (bharat rawlley)
   3. Re: Help with package EasyPubmed (bharat rawlley)
   4. Re: What are the pros and cons of the log.p parameter in
      (p|q)norm and similar? (Duncan Murdoch)
   5. Re: What are the pros and cons of the log.p parameter in
      (p|q)norm and similar? (Bill Dunlap)
   6. Creating a log-transformed histogram of multiclass data
      (Tom Woolman)
   7. Re: Creating a log-transformed histogram of multiclass data
      (Tom Woolman)

----------------------------------------------------------------------

Message: 1
Date: Tue, 3 Aug 2021 17:20:12 +0100
From: Michael Dewey <lists at dewey.myzen.co.uk>
To: "r-help at r-project.org" <r-help at r-project.org>
Subject: [R] What are the pros and cons of the log.p parameter in
    (p|q)norm and similar?
Message-ID: <e17bdaaa-7945-4f37-ee69-941eb8270f16 at dewey.myzen.co.uk>
Content-Type: text/plain; charset="utf-8"; Format="flowed"

Short version

Apart from the ability to work with values
of p too small to be of much practical use what are the advantages and
disadvantages of setting this to TRUE?

Longer version

I am contemplating upgrading
various functions in one of my packages to use this and as far as I can see it
would only have the advantage of allowing people to use very small p-values but
before I go ahead have I missed anything? I am most concerned with negatives but
if there is any other advantage I would mention that in the vignette. I am not
concerned about speed or the extra effort in coding and expanding the
documentation.

--
Michael
http://www.dewey.myzen.co.uk/home.html

------------------------------

Message: 2
Date: Tue, 3 Aug 2021 18:20:52 +0000 (UTC)
From: bharat rawlley <bharat_m_all at yahoo.co.in>
To: R-help Mailing List <r-help at r-project.org>
Subject: [R] Help with package EasyPubmed
Message-ID: <1046636584.2205366.1628014852065 at mail.yahoo.com>
Content-Type: text/plain; charset="utf-8"

Hello,

When I try to run the following code using the package Easypubmed, I get a
null result:

> batch_pubmed_download(query_7)
NULL

# query_7 <- "Cardiology AND randomizedcontrolledtrial[Filter] AND 2011[PDAT]"

However, the exact same search string yields 668 results on Pubmed.

I am unable to figure out why this is happening. If I use the search string
"Cardiology AND 2011[PDAT]", then it works just fine.

Any help would be greatly appreciated.

Thank you!

------------------------------

Message: 3
Date: Tue, 3 Aug 2021 18:26:40 +0000 (UTC)
From: bharat rawlley <bharat_m_all at yahoo.co.in>
To: R-help Mailing List <r-help at r-project.org>
Subject: Re: [R] Help with package EasyPubmed
Message-ID: <712126143.2207911.1628015200446 at mail.yahoo.com>
Content-Type: text/plain; charset="utf-8"

Okay, the following search string resolved my issue:

"Cardiology AND randomized controlled trial[Publication type] AND 2011[PDAT]"

Thank you!

------------------------------

Message: 4
Date: Tue, 3 Aug 2021 14:53:28 -0400
From: Duncan Murdoch <murdoch.duncan at gmail.com>
To: Michael Dewey <lists at dewey.myzen.co.uk>, "r-help at r-project.org"
    <r-help at r-project.org>
Subject: Re: [R] What are the pros and cons of the log.p parameter in
    (p|q)norm and similar?
Message-ID: <c15f610b-7a16-9d84-884c-54cc170bbad8 at gmail.com>
Content-Type: text/plain; charset="utf-8"; Format="flowed"

These are often needed in likelihood problems. In just about
any problem where the normal density shows up in the likelihood, you're
better off working with the log likelihood and setting log = TRUE in dnorm,
because sometimes you want to evaluate the likelihood very far from its mode.

The same sort of thing happens with pnorm for similar reasons. Some likelihoods
involve normal integrals and will need it.

I can't think of an example for qnorm off the top of my head, but I imagine
there are some: maybe involving simulation way out in the tails.

The main negative about using logs is that they aren't always needed.

Duncan Murdoch

------------------------------

Message: 5
Date: Tue, 3 Aug 2021 13:24:08 -0700
From: Bill Dunlap <williamwdunlap at gmail.com>
To: Duncan Murdoch <murdoch.duncan at gmail.com>
Cc: Michael Dewey <lists at dewey.myzen.co.uk>, "r-help at r-project.org"
    <r-help at r-project.org>
Subject: Re: [R] What are the pros and cons of the log.p parameter in
    (p|q)norm and similar?
Message-ID: <CAHqSRuSBQyuyJ5a9YrHk3BHXPn5UmbxQ54bKhAU3G6yroCnG4A at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

In maximum likelihood problems, even when the individual density values are
fairly far from zero, their product may underflow to zero. Optimizers have
problems when there is a large flat area.

   > q <- runif(n=1000, -0.1, +0.1)
   > prod(dnorm(q))
   [1] 0
   > sum(dnorm(q, log=TRUE))
   [1] -920.6556

A more minor advantage for some probability-related functions is speed.
E.g., dnorm(log=TRUE,...) does not need to evaluate exp().

   > q <- runif(1e6, -10, 10)
   > system.time(for(i in 1:100)dnorm(q, log=FALSE))
      user  system elapsed
      9.13    0.11    9.23
   > system.time(for(i in 1:100)dnorm(q, log=TRUE))
      user  system elapsed
      4.60    0.19    4.78
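
The tail behaviour of pnorm and qnorm discussed in this thread can be sketched as follows. The point z = -40 is an arbitrary illustrative choice, far enough out that the plain probability underflows in double precision while its logarithm remains an ordinary finite number:

```r
z <- -40

# On the probability scale the lower tail underflows to exactly zero.
p_plain <- pnorm(z)

# On the log scale the same quantity is finite and usable.
p_log <- pnorm(z, log.p = TRUE)

# qnorm accepts a log probability and recovers the quantile (approximately).
z_back <- qnorm(p_log, log.p = TRUE)

# Upper tail on the log scale, avoiding the cancellation in 1 - pnorm(40).
log_upper <- pnorm(40, lower.tail = FALSE, log.p = TRUE)

print(c(p_plain = p_plain, p_log = p_log, z_back = z_back,
        log_upper = log_upper))
```

A function that accepts or returns probabilities only on the plain scale cannot represent this case at all, which is one argument for exposing log.p to users.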

-Bill

------------------------------

Message: 6
Date: Tue, 03 Aug 2021 18:56:08 -0400
From: Tom Woolman <twoolman at ontargettek.com>
To: r-help at r-project.org
Subject: [R] Creating a log-transformed histogram of multiclass data
Message-ID: <2bc87c25f161bac1d8e5101e20bf2237 at ontargettek.com>
Content-Type: text/plain; charset="us-ascii"; Format="flowed"

# Resending this message since the original email was
held in queue by the listserv software because of a "suspicious"
subject line, and/or because of attached .png histogram chart attachments.
I'm guessing that the listserv software doesn't like multiple image file
attachments.

Hi everyone. I'm working on a research model now that is
calculating anomaly scores (RMSE values) for three distinct groups within a
large dataset. The anomaly scores are a continuous data type and are quite
small, ranging from approximately 1e-04 to 1e-07 across a population of
approximately 1 million observations.

I have all of the summary and descriptive
statistics for each of the anomaly score distributions across each group label
in the dataset, and I am able to create some useful histograms showing how each
of the three groups is uniquely distributed across the range of scores. However,
because of the large variance within the frequency of score values and the high
density peaks within much of the anomaly scores, I need to use a log
transformation within the histogram to show both the log frequency count of each
binned observation range (y-axis) and a log transformation of the binned score
values (x-axis) to be able to appropriately illustrate the distributions within
the data and make it more readily understandable.

Fortunately, ggplot2 is really
useful for creating some really attractive dual-axis log transformed
histograms.

However, I cannot figure out a way to create the log transformed
histograms to show each of my three groups by color within the same histogram. I
would want it to look like this, BUT use a log transformation for each axis.
This plot below shows the 3 groups in one histogram but uses the default normal
values.

For log transformed axis values, the best I can do so far is produce
three separate histograms, one for each group.

Below is sample R code to illustrate my problem with a randomly-generated
example dataset and the ggplot2 approaches that I have taken so far:

# Sample R code below:

library(ggplot2)
library(dplyr)
library(hrbrthemes)

# I created some simple random sample data to produce an example dataset.
# This produces an example dataframe called d, which contains a class label IV
# of either A, B or C for each observation. The target variable is the
# anomaly_score continuous value for each observation.
# There are 300 rows of dummy data in this dataframe.

DV_score_generator = round(runif(300,0.001,0.999), 3)
d <- data.frame( label = sample( LETTERS[1:3], 300, replace=TRUE,
                 prob=c(0.65, 0.30, 0.05) ), anomaly_score = DV_score_generator)

# First, I use ggplot to create the normal distribution histogram that shows
# all 3 groups on the same plot, by color.
# Please note that with this small set of randomized sample data it doesn't
# appear to be necessary to use an x and y-axis log transformation to show the
# distribution patterns, but it does become an issue with my vastly larger and
# more complex score values in the DV of the actual data.

p <- d %>%
  ggplot( aes(x=anomaly_score, fill=label)) +
  geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity') +
  scale_fill_manual(values=c("#69b3a2", "blue", "#404080")) +
  theme_ipsum() +
  labs(fill="")

p

# Produces a normal multiclass histogram.

# Now produce a series of x and y-axis
# log-transformed histograms, producing one histogram for each distinct label
# class in the dataset:

# Group A, log transformed

ggplot(group_a, aes(x = anomaly_score)) +
     geom_histogram(aes(y = ..count..), binwidth = 0.05,
     colour = "darkgoldenrod1", fill = "darkgoldenrod2") +
     scale_x_continuous(name = "Log-scale Anomaly Score", trans="log2") +
     scale_y_continuous(trans="log2", name="Log-transformed Frequency Counts") +
     ggtitle("Transformed Anomaly Scores - Group A Only")

# Group A transformed histogram is produced here.

# Group B, log transformed

ggplot(group_b, aes(x = anomaly_score)) +
     geom_histogram(aes(y = ..count..), binwidth = 0.05,
     colour = "green", fill = "darkgreen") +
     scale_x_continuous(name = "Log-scale Anomaly Score", trans="log2") +
     scale_y_continuous(trans="log2", name="Log-transformed Frequency Counts") +
     ggtitle("Transformed Anomaly Scores - Group B Only")

# Group B transformed histogram is produced here.

# Group C, log transformed

ggplot(group_c, aes(x = anomaly_score)) +
     geom_histogram(aes(y = ..count..), binwidth = 0.05,
     colour = "red", fill = "darkred") +
     scale_x_continuous(name = "Log-scale Anomaly Score", trans="log2") +
     scale_y_continuous(trans="log2", name="Log-transformed Frequency Counts") +
     ggtitle("Transformed Anomaly Scores - Group C Only")

# Group C transformed histogram is produced here.

# End.

Thanks in advance, everyone!

- Tom

Thomas A. Woolman, PhD Candidate (Indiana State University), MBA, MS, MS
On Target Technologies, Inc.
Virginia, USA

------------------------------

Message: 7
Date: Tue, 03 Aug 2021 19:04:29 -0400
From: Tom Woolman <twoolman at ontargettek.com>
To: r-help at r-project.org
Subject: Re: [R] Creating a log-transformed histogram of multiclass data
Message-ID: <ba170db0581b2b7f5c79448355685e92 at ontargettek.com>
Content-Type: text/plain; charset="us-ascii"; Format="flowed"

Apologies, I left out 3 critical lines of code after
the randomized sample dataframe is created:

group_a <- d[ which(d$label == 'A'), ]
group_b <- d[ which(d$label == 'B'), ]
group_c <- d[ which(d$label == 'C'), ]

------------------------------

End of R-help Digest, Vol 222, Issue 4
**************************************
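
One possible way to put all three groups on a single log-log histogram, as asked in Messages 6 and 7, is to bin every group with the same log-spaced breaks and then draw the per-group counts on logarithmic axes. The following is only a base-R sketch of that idea, not a ggplot2 solution and not an answer given on the list. The bin edges, the seed, and the "count + 1" offset (used so that empty bins can be drawn on a log axis) are assumptions of this example:

```r
# Hypothetical example data, mirroring the sample data in Message 6.
set.seed(42)
d <- data.frame(
  label = sample(LETTERS[1:3], 300, replace = TRUE,
                 prob = c(0.65, 0.30, 0.05)),
  anomaly_score = round(runif(300, 0.001, 0.999), 3)
)

# Bin edges equally spaced on the log10 scale, shared by all groups.
breaks <- 10^seq(-3.5, 0, length.out = 16)
bin <- cut(d$anomaly_score, breaks = breaks, include.lowest = TRUE)

# Rows are bins, columns are the groups A, B and C.
counts <- table(bin, d$label)

# Geometric bin midpoints, suitable for a log-scaled x-axis.
mids <- sqrt(head(breaks, -1) * tail(breaks, -1))

# Step lines per group on log-log axes; counts are shifted by 1 so that
# empty bins stay drawable.
cols <- c("#69b3a2", "blue", "#404080")
matplot(mids, unclass(counts) + 1, type = "s", log = "xy", lty = 1,
        col = cols, xlab = "Anomaly score (log scale)",
        ylab = "Count + 1 (log scale)")
legend("topleft", legend = colnames(counts), lty = 1, col = cols)
```

Because the breaks are shared, the three step lines are directly comparable bin by bin, which separate per-group histograms do not guarantee.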