thr3ads.net - R help - [R] Use of geometric mean .. in good data analysis [Jan 2024]


Bert Gunter

2024-Jan-22 20:23 UTC

[R] Use of geometric mean .. in good data analysis

Ah.... LOD's, typically LLOD's ("lower limits of detection").

Disclaimer: I am *NOT* in any sense an expert on such matters. What follows
are just some comments based on my personal experience. Please filter
accordingly. Also, while I kept it on list as Martin suggested it might be
useful to do so, most folks probably can safely ignore the rant that
follows as off topic and not of interest. So you've been warned!!

The rant:
My experience is: data that contain a "bunch" of values that are, e.g.
below a LLOD, are frequently reported and/or analyzed by various ad hoc,
and imho, uniformly bad methods. e.g.:

1) The censored values are recorded and analyzed as at the LLOD;
2) The censored values are recorded and analyzed at some arbitrary value
below the LLOD, like LLOD/2;
3) The censored values are "imputed" by ad hoc methods, e.g. uniform
random values between 0 and the LLOD for left censoring.

To repeat, *IMO*, all of this is junk and will produce misleading
statistical results. Whether they mislead enough to substantively affect
the science or regulatory decisions depends on the specifics of the
circumstances. I accept no general claim as to their innocuousness.
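
To make the concern about the three substitution rules concrete, here is a
minimal simulation sketch in Python (not part of the original post). It
assumes lognormal "true" values and a single LLOD of 0.5, both chosen only
for illustration, and the function names are hypothetical. It compares the
geometric mean of the full, uncensored data with what each ad hoc rule
yields.

```python
import math
import random


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def substitution_estimates(true_values, llod, rng):
    """Geometric means after three ad hoc treatments of values below llod."""

    def substituted(fill):
        # Keep detected values; replace every non-detect with fill().
        return [v if v >= llod else fill() for v in true_values]

    return {
        "true (uncensored)": geometric_mean(true_values),
        "1) at LLOD": geometric_mean(substituted(lambda: llod)),
        "2) at LLOD/2": geometric_mean(substituted(lambda: llod / 2)),
        # 1 - random() lies in (0, 1], so the log is always defined.
        "3) uniform(0, LLOD)": geometric_mean(
            substituted(lambda: llod * (1.0 - rng.random()))
        ),
    }


# Assumed scenario: lognormal(0, 1) data, LLOD = 0.5 (about a quarter censored).
rng = random.Random(1)
truth = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]
llod = 0.5
censored_fraction = sum(v < llod for v in truth) / len(truth)
estimates = substitution_estimates(truth, llod, rng)

print(f"fraction below LLOD: {censored_fraction:.2f}")
for name, gm in estimates.items():
    print(f"{name:22s} geometric mean = {gm:.3f}")
```

Rule 1 necessarily inflates the geometric mean, since every non-detect is
moved up to the LLOD. The direction and size of the error for rules 2 and 3
depend on the assumed distribution and on the LLOD, so what this sketch shows
for them does not generalize.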

Further:

a) When you have a "lot" of censored values -- 50%? 75%?, 25%? -- face
facts: you have (practically) no useful information from the values that
you do have to infer what the distribution of values that you don't have
looks like.
All one can sensibly do is say that x% of the values are below a LOD and
here's the distribution of what lies above. Presumably, if you have such
data conditional on covariates with the obvious intent to determine the
relationship to those covariates, you could analyze the percentages of
LLOD's and known values separately. There are undoubtedly more
sophisticated methods out there, so this is where you need to go to the
literature to see what might suit; though I think it will still have to
come down to looking at these separately (e.g. with extra parameters to
account for unmeasurable values). Another way of saying this is: any
analysis which treats all the data as arising from a single distribution
will depend more on the assumptions you make than on the data. So good luck
with that!
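
The "report the two pieces separately" idea in (a) can be sketched as
follows. This is only an illustration in Python: the readings are made up,
a non-detect is assumed to be stored as None, and the function name is
hypothetical.

```python
import statistics


def summarize_separately(readings, llod):
    """Report the share of non-detects and, separately, the detected values.

    readings: list of measurements, with None marking a result below llod.
    """
    detected = [r for r in readings if r is not None]
    if any(r < llod for r in detected):
        raise ValueError("a detected value lies below the stated LLOD")
    n_below = len(readings) - len(detected)
    # Quartiles describe only what was actually measured above the LLOD.
    q1, median, q3 = statistics.quantiles(detected, n=4)
    return {
        "n": len(readings),
        "fraction_below_llod": n_below / len(readings),
        "q1_detected": q1,
        "median_detected": median,
        "q3_detected": q3,
    }


# Made-up example data: None means "reported as < LLOD".
readings = [None, 0.8, None, 1.4, 2.9, None, 0.6, 12.0, None, 1.1]
summary = summarize_separately(readings, llod=0.5)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Nothing here pretends to know what lies below the LLOD: the output is a
proportion plus a description of the detected values, and each could be
related to covariates in its own model.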

b) If you have a "modest" amount of (known) censoring -- 5%?, 20%? 10%? --
methods for the analysis of censored data should be useful. My
understanding is that MI (multiple imputation) is regarded as a generally
useful approach, and there are many R packages that can do various flavors
of this. Again, you should consult the literature: there are very likely
nontechnical reviews of this topic, too, as well as online discussions and
tutorials.
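
As one illustration of what a method designed for censored data looks like,
here is a minimal Python sketch. Note that it is a censored-likelihood fit
computed by EM, not multiple imputation. It assumes the logs of the values
are normally distributed and that there is a single known LLOD; the
simulated data, the parameter values, and the function name are all
hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev

STD_NORMAL = NormalDist()


def fit_left_censored_lognormal(detected, n_censored, llod, max_iter=500, tol=1e-10):
    """EM estimate of (mu, sigma) of log-values with n_censored values < llod."""
    logs = [math.log(v) for v in detected]
    c = math.log(llod)
    n = len(logs) + n_censored
    sum_y = sum(logs)
    sum_y2 = sum(y * y for y in logs)
    # Start from the detected values only (biased, but a usable start).
    mu, sigma = fmean(logs), pstdev(logs)
    for _ in range(max_iter):
        # E-step: moments of a normal truncated to (-inf, c).
        alpha = (c - mu) / sigma
        lam = STD_NORMAL.pdf(alpha) / STD_NORMAL.cdf(alpha)
        e1 = mu - sigma * lam
        e2 = sigma**2 * (1.0 - alpha * lam - lam**2) + e1**2
        # M-step: update the mean and standard deviation.
        new_mu = (sum_y + n_censored * e1) / n
        new_sigma = math.sqrt((sum_y2 + n_censored * e2) / n - new_mu**2)
        converged = abs(new_mu - mu) < tol and abs(new_sigma - sigma) < tol
        mu, sigma = new_mu, new_sigma
        if converged:
            break
    return mu, sigma


# Simulated check: true mu = 1.0, sigma = 0.8, roughly 15% below the LLOD.
rng = random.Random(7)
samples = [rng.lognormvariate(1.0, 0.8) for _ in range(5000)]
llod = 1.2
detected = [v for v in samples if v >= llod]
n_censored = len(samples) - len(detected)
mu_hat, sigma_hat = fit_left_censored_lognormal(detected, n_censored, llod)
print(f"censored: {n_censored} of {len(samples)}")
print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
print(f"estimated geometric mean = {math.exp(mu_hat):.3f}")
```

The estimate is only as good as the lognormal assumption, which is exactly
the caveat raised in (a): the heavier the censoring, the more the answer is
driven by that assumption rather than by the data.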

So if you are serious about dealing with this and have a lot of data with
these issues, my advice would be to stop looking for ad hoc advice and dig
into the literature: it's one of the many areas of "data science" where
seemingly simple but pervasive questions require complex answers.

And, again, heed my personal caveats.

Thus endeth my rant.

Cheers to all,
Bert

On Mon, Jan 22, 2024 at 9:29 AM Rich Shepard <rshepard at appl-ecosys.com>
wrote:
> On Mon, 22 Jan 2024, Martin Maechler wrote:
>
> > I think it is a good question, not really only about geo-chemistry, but
> > about statistics in applied sciences (and engineering for that matter).
>
> > John W Tukey (and several other of the grands of the time) had the log
> > transform among the "First aid transformations":
> >
> > If the data for a continuous variable must all be positive it is also
> > typically the case that the distribution is considerably skewed to the
> > right. In such a case behave as a good human who sees another human in
> > health distress: apply First Aid -- do the things you learned to do
> > quickly without too much thought, because things must happen fast ---to
> > hopefully save the other's life.
>
> Martin,
>
> Thanks very much. I will look further into this because toxic metals and
> organic compounds in geochemical collections almost always have censored
> lab results (below method detection limits) that range from about 15% to
> 80% or more, and there almost always are very high extreme values.
>
> I'll learn to understand what benefits log transforms have over
> compositional data analyses.
>
> Best regards,
>
> Rich
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Bert Gunter

2024-Jan-22 20:47 UTC


[R] Use of geometric mean .. in good data analysis

In the spirit of Martin's comments, it is perhaps worthwhile to note one of
the pertinent quotes of John Tukey (whom I actually knew):
"The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data."
<https://www.azquotes.com/quote/603406>

"Sunset Salvo" by John Tukey in The American Statistician, Volume 40, No. 1
(pp. 72-76), February 1986. www.jstor.org

Cheers,
Bert

<https://www.azquotes.com/author/14847-John_Tukey>


Jeff Newmiller

2024-Jan-22 20:57 UTC


[R] Use of geometric mean .. in good data analysis

Still OT... but here is my own (I think previously mentioned here) rant on
people thrashing about with log transformation and an all-too-common kludge to
deal with zeros mixed among small numbers...
https://gist.github.com/jdnewmil/99301a88de702ad2fcbaef33326b08b4
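
The gist itself is not reproduced here. Purely as a hypothetical
illustration of one common kludge of this kind, assumed here to be adding a
small constant before taking logs (the gist may discuss something
different), the Python sketch below shows how strongly a back-transformed
mean of logs can depend on the arbitrary constant. The readings are made up
and the function name is hypothetical.

```python
import math


def shifted_log_mean(values, shift):
    """Back-transformed mean of log(value + shift), with the shift removed."""
    mean_log = sum(math.log(v + shift) for v in values) / len(values)
    return math.exp(mean_log) - shift


# Made-up readings with zeros mixed among small numbers.
readings = [0.0, 0.0, 0.02, 0.05, 0.3, 1.5, 4.0]
results = {shift: shifted_log_mean(readings, shift) for shift in (1e-6, 1e-3, 1.0)}

for shift, value in results.items():
    print(f"shift = {shift:g}: pseudo geometric mean = {value:.4f}")
```

With these numbers the "geometric mean" changes by well over an order of
magnitude across the three shifts, so the result reflects the choice of
constant more than the data.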

OP: perhaps posting a link here to your question, wherever you end up posing
it, will help shorten this thread.

-- 
Sent from my phone. Please excuse my brevity.
