Jim Lemon
2017-Nov-30 00:00 UTC
[R] Data cleaning & Data preparation, what do R users want?
Hi again,
Typo in the last email. Should read "about 40 standard deviations".

Jim

On Thu, Nov 30, 2017 at 10:54 AM, Jim Lemon <drjimlemon at gmail.com> wrote:
> Hi Robert,
> People want different levels of automation in the software they use.
> What concerns many of us is the desire for the function
> "figure-out-what-this-data-is-import-it-and-get-rid-of-bad-values".
> Such users typically want something that justifies its use by being
> written by someone who seems to know what they're doing and lots of
> other people use it. One advantage of many R functions is their
> modular construction. This encourages users to at least consider the
> steps that are taken rather than just accept what comes out of that
> long tube.
>
> Take the contentious problem of outlier identification. If I just let
> the black box peel off some values, I don't know what I have lost. On
> the other hand, if I import data and examine it with a summary
> function, I may find that one woman has a height of 5.2 meters. I can
> range check by looking up the Guinness Book of Records. It's an
> outlier. I can estimate the probability of such a height. Hmm, about
> 4 standard deviations above the mean. It's an outlier. I can attempt a
> Sherlock Holmes. "Watson, I conclude that an imperial measure (5'2")
> has been recorded as a metric value". It's not an outlier.
>
> The more R gravitates toward "black box" functions, the more some
> users are encouraged to let them do the work. You pays your money and
> you takes your chances.
>
> Jim
>
> On Thu, Nov 30, 2017 at 3:37 AM, Robert Wilkins <iwritecode2 at gmail.com> wrote:
>> R has a very wide audience, clinical research, astronomy, psychology, and
>> so on and so on.
>> I would consider data analysis work to be three stages: data preparation,
>> statistical analysis, and producing the report.
>> This regards the process of getting the data ready for analysis and
>> reporting, sometimes called "data cleaning" or "data munging" or "data
>> wrangling".
>>
>> So as regards tools for data preparation, speaking to the highly diverse
>> audience mentioned, here is my question:
>>
>> What do you want?
>> Or are you already quite happy with the range of tools that is currently
>> before you?
>>
>> [BTW, I posed the same question last week to the r-devel list, and was
>> advised that r-help might be a more suitable audience by one of the
>> moderators.]
>>
>> Robert Wilkins
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
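The manual steps in the outlier example (summary, range check, distance from the mean, unit hypothesis) might be sketched in base R roughly as follows. This is only an illustration: the heights, the 3 m plausibility bound, and the reference mean and standard deviation are all made-up assumptions, not values from the thread.

```r
# Illustrative data only (assumed): heights in metres, one suspicious entry.
heights <- c(1.58, 1.65, 1.71, 1.60, 5.2, 1.68)

# Step 1: look at the data before removing anything.
print(summary(heights))

# Step 2: range check against a generous upper bound (assumed: 3 m).
max_plausible <- 3
suspect <- heights > max_plausible

# Step 3: distance from the mean in standard deviations, using an
# assumed reference mean and sd rather than the contaminated sample.
ref_mean <- 1.62
ref_sd <- 0.09
z <- (heights[suspect] - ref_mean) / ref_sd

# Step 4: test the unit hypothesis that "5.2" was really 5 ft 2 in.
as_metres <- (5 * 12 + 2) * 0.0254
z_converted <- (as_metres - ref_mean) / ref_sd

cat("flagged value:", heights[suspect], "z =", round(z, 1), "\n")
cat("as 5'2\":", as_metres, "m, z =", round(z_converted, 1), "\n")
```

Each step is a separate, visible decision, so the value is corrected rather than silently dropped.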
Dominik Schneider
2017-Nov-30 09:11 UTC
[R] Data cleaning & Data preparation, what do R users want?
I would agree that getting data into R from various sources is the biggest pain point. Even if there is an API, the results are not always consistent and you have to do lots of dimension checking to get it right. Or there isn't an open API at all and you have to hack it by web scraping or otherwise: http://enpiar.com/2017/08/11/one-hour-package/
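One way to make the dimension checking explicit is a small guard that fails early when a result does not have the expected shape. The sketch below is an assumption, not something from the thread: `check_shape` is a hypothetical helper, not part of any package, and the data frames stand in for parsed API responses.

```r
# Hypothetical helper: stop early if a result lacks the expected
# columns or has too few rows.
check_shape <- function(df, expected_cols, min_rows = 1) {
  missing_cols <- setdiff(expected_cols, names(df))
  if (length(missing_cols) > 0) {
    stop("missing columns: ", paste(missing_cols, collapse = ", "))
  }
  if (nrow(df) < min_rows) {
    stop("expected at least ", min_rows, " rows, got ", nrow(df))
  }
  invisible(df)
}

# Stand-in for a parsed API response (made-up data).
result <- data.frame(station = c("A", "B"), temp_c = c(1.5, -3.2))
check_shape(result, c("station", "temp_c"))

# A response that silently dropped a column is caught, not passed on.
bad <- result["station"]
outcome <- tryCatch(
  check_shape(bad, c("station", "temp_c")),
  error = function(e) conditionMessage(e)
)
print(outcome)
```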
Robert Wilkins
2017-Dec-11 17:35 UTC
[R] Data cleaning & Data preparation, what do R users want?
Dominik (and others),

If it is indeed still the biggest pain point, even in 2017, then maybe we can do something about that, with more effort on different user interface designs and try-outs of them on specialized datasets. [The fact that in some specialties, such as clinical trials, it is hard to get access to public domain datasets (and so to avoid using a tiny "toy" dataset, which nobody will pay attention to) does make it harder.]

It would help if academia (both comp-sci and statistics departments) would support those who invest resources in drafting and test-driving new product designs. If, in the year 2017, it is still a big pain point, doesn't that make sense? More speculative work in statistical programming language design has not been a priority in academia since before 1980.