Ah.... LOD's, typically LLOD's ("lower limits of detection").

Disclaimer: I am *NOT* in any sense an expert on such matters. What follows
are just some comments based on my personal experience. Please filter
accordingly. Also, while I kept it on list as Martin suggested it might be
useful to do so, most folks can probably safely ignore the rant that
follows as off topic and not of interest. So you've been warned!!

The rant:
My experience is that data containing a "bunch" of values that are, e.g.,
below an LLOD are frequently reported and/or analyzed by various ad hoc
and, imho, uniformly bad methods, e.g.:

1) The censored values are recorded and analyzed as if they were at the LLOD;
2) The censored values are recorded and analyzed at some arbitrary value
below the LLOD, like LLOD/2;
3) The censored values are "imputed" by ad hoc methods, e.g. uniform
random values between 0 and the LLOD for left censoring.

To repeat, *IMO*, all of this is junk and will produce misleading
statistical results. Whether they mislead enough to substantively affect
the science or regulatory decisions depends on the specifics of the
circumstances. I accept no general claim as to their innocuousness.

Further:

a) When you have a "lot" of censored values -- 50%? 75%? 25%? -- face
facts: you have (practically) no useful information in the values that you
do have from which to infer what the distribution of the values that you
don't have looks like. All one can sensibly do is say that x% of the values
are below an LOD and here's the distribution of what lies above.
Presumably, if you have such data conditional on covariates, with the
obvious intent to determine the relationship to those covariates, you could
analyze the percentages of LLOD's and the known values separately. There
are undoubtedly more sophisticated methods out there, so this is where you
need to go to the literature to see what might suit; though I think it will
still have to come down to looking at these separately (e.g. with extra
parameters to account for unmeasurable values). Another way of saying this
is: any analysis which treats all the data as arising from a single
distribution will depend more on the assumptions you make than on the data.
So good luck with that!

b) If you have a "modest" amount of (known) censoring -- 5%? 10%? 20%? --
methods for the analysis of censored data should be useful. My
understanding is that MI (multiple imputation) is regarded as a generally
useful approach, and there are many R packages that can do various flavors
of this. Again, you should consult the literature: there are very likely
nontechnical reviews of this topic, too, as well as online discussions and
tutorials.

So if you are serious about dealing with this and have a lot of data with
these issues, my advice would be to stop looking for ad hoc advice and dig
into the literature: it's one of the many areas of "data science" where
seemingly simple but pervasive questions require complex answers.

And, again, heed my personal caveats.

Thus endeth my rant.

Cheers to all,
Bert

On Mon, Jan 22, 2024 at 9:29 AM Rich Shepard <rshepard at appl-ecosys.com> wrote:
> On Mon, 22 Jan 2024, Martin Maechler wrote:
>
> > I think it is a good question, not really only about geo-chemistry, but
> > about statistics in applied sciences (and engineering for that matter).
>
> > John W Tukey (and several other of the grands of the time) had the log
> > transform among the "First aid transformations":
> >
> > If the data for a continuous variable must all be positive it is also
> > typically the case that the distribution is considerably skewed to the
> > right. In such a case behave as a good human who sees another human in
> > health distress: apply First Aid -- do the things you learned to do
> > quickly without too much thought, because things must happen fast -- to
> > hopefully save the other's life.
>
> Martin,
>
> Thanks very much. I will look further into this because toxic metals and
> organic compounds in geochemical collections almost always have censored
> lab results (below method detection limits) that range from about 15% to
> 80% or more, and there almost always are very high extreme values.
>
> I'll learn to understand what benefits log transforms have over
> compositional data analyses.
>
> Best regards,
>
> Rich
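
To make substitution methods 1) and 2) from the rant above concrete, here is a minimal simulation sketch. Nothing in it comes from the thread: it assumes lognormal concentrations, a single known LLOD, and roughly 40% non-detects, and the function names are made up for illustration. It compares the mean and SD of the log-concentration after substituting LLOD or LLOD/2 with a censored-likelihood fit (a small EM loop for a left-censored normal, which is one standard censored-data method, not the multiple imputation mentioned in b)). It uses only the Python standard library.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev

STD_NORMAL = NormalDist()

def simulate(n, mu, sigma, llod, seed=1):
    """Draw n lognormal concentrations; return (detected values, number below llod)."""
    rng = random.Random(seed)
    x = [math.exp(rng.gauss(mu, sigma)) for _ in range(n)]
    detected = [v for v in x if v >= llod]
    return detected, n - len(detected)

def substitution_fit(detected, n_cens, fill):
    """Mean and SD of log-concentration after replacing every non-detect by `fill`."""
    logs = [math.log(v) for v in detected] + [math.log(fill)] * n_cens
    return fmean(logs), pstdev(logs)

def censored_mle(detected, n_cens, llod, tol=1e-10, max_iter=1000):
    """EM fit of (mu, sigma) on the log scale with left-censoring at one llod."""
    z = [math.log(v) for v in detected]
    n = len(z) + n_cens
    s1, s2, c = sum(z), sum(v * v for v in z), math.log(llod)
    mu, sigma = fmean(z), pstdev(z)  # crude start: detected values only
    for _ in range(max_iter):
        a = (c - mu) / sigma
        lam = STD_NORMAL.pdf(a) / STD_NORMAL.cdf(a)
        e1 = mu - sigma * lam                             # E[Z | Z < c]
        e2 = sigma**2 * (1 - a * lam - lam**2) + e1**2    # E[Z^2 | Z < c]
        new_mu = (s1 + n_cens * e1) / n
        new_sigma = math.sqrt((s2 + n_cens * e2) / n - new_mu**2)
        done = abs(new_mu - mu) + abs(new_sigma - sigma) < tol
        mu, sigma = new_mu, new_sigma
        if done:
            break
    return mu, sigma

# Assumed "truth" for the simulation: log-concentration ~ Normal(0, 1).
true_mu, true_sigma = 0.0, 1.0
llod = math.exp(NormalDist(true_mu, true_sigma).inv_cdf(0.40))  # ~40% non-detects
detected, n_cens = simulate(20000, true_mu, true_sigma, llod)

fits = {
    "substitute LLOD": substitution_fit(detected, n_cens, llod),
    "substitute LLOD/2": substitution_fit(detected, n_cens, llod / 2),
    "censored likelihood": censored_mle(detected, n_cens, llod),
}
print(f"fraction censored: {n_cens / 20000:.3f}")
for label, (m, s) in fits.items():
    print(f"{label:20s} mean(log x)={m:7.3f}  sd(log x)={s:6.3f}")
```

The comparison is only possible because the simulation knows the true distribution. With real data the lognormal shape is itself an assumption, and the more censoring there is, the more the censored-likelihood answer rests on that assumption rather than on the data, which is the concern raised in point a).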
In the spirit of Martin's comments, it is perhaps worthwhile to note one of
the pertinent quotes of John Tukey (whom I actually knew):

"The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data."
<https://www.azquotes.com/quote/603406>

"Sunset Salvo" by John Tukey in The American Statistician, Volume 40, No. 1
(pp. 72-76), February 1986, www.jstor.org.
<https://www.azquotes.com/author/14847-John_Tukey>

Cheers,
Bert

On Mon, Jan 22, 2024 at 12:23 PM Bert Gunter <bgunter.4567 at gmail.com> wrote:
> Ah.... LOD's, typically LLOD's ("lower limits of detection").
> [...]
Still OT... but here is my own rant (I think previously mentioned here) on
people thrashing about with log transformation and an all-too-common kludge
to deal with zeros mixed among small numbers:

https://gist.github.com/jdnewmil/99301a88de702ad2fcbaef33326b08b4

OP: perhaps posting a link here to your question, wherever you end up
posing it, will help shorten this thread.

On January 22, 2024 12:23:20 PM PST, Bert Gunter <bgunter.4567 at gmail.com> wrote:
> Ah.... LOD's, typically LLOD's ("lower limits of detection").
> [...]

--
Sent from my phone. Please excuse my brevity.
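
The gist linked above is not reproduced here. As a generic illustration only, and assuming the kludge in question is the familiar "add a small constant before taking logs" workaround, the sketch below shows how strongly a log-scale summary can depend on that arbitrary constant. The data, the offsets, and the function name are all made up for the example.

```python
import math

def geometric_mean_with_offset(values, offset):
    """exp(mean(log(x + offset))): the 'add a small constant' workaround."""
    return math.exp(sum(math.log(v + offset) for v in values) / len(values))

# Made-up concentrations; the two zeros stand in for non-detects recorded as 0.
sample = [0.0, 0.0, 0.02, 0.05, 0.3, 1.2, 4.0, 15.0]

# Each offset looks equally "small", yet the summary changes with the choice.
for offset in (1e-6, 1e-4, 1e-2):
    gm = geometric_mean_with_offset(sample, offset)
    print(f"offset={offset:g}  geometric mean={gm:.4f}")
```

Because the zeros contribute log(offset) to the average, the result is driven by a number the analyst picked rather than by anything that was measured.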