thr3ads.net - R help - [R] Data cleaning & Data preparation, what do R users want? [Dec 2017]


Jim Lemon

2017-Nov-30 00:00 UTC

[R] Data cleaning & Data preparation, what do R users want?

Hi again,
Typo in the last email. Should read "about 40 standard deviations".

Jim

On Thu, Nov 30, 2017 at 10:54 AM, Jim Lemon <drjimlemon at gmail.com> wrote:
> Hi Robert,
> People want different levels of automation in the software they use.
> What concerns many of us is the desire for the function
> "figure-out-what-this-data-is-import-it-and-get-rid-of-bad-values".
> Such users typically want something that justifies its use by being
> written by someone who seems to know what they're doing and lots of
> other people use it. One advantage of many R functions is their
> modular construction. This encourages users to at least consider the
> steps that are taken rather than just accept what comes out of that
> long tube.
>
> Take the contentious problem of outlier identification. If I just let
> the black box peel off some values, I don't know what I have lost. On
> the other hand, if I import data and examine it with a summary
> function, I may find that one woman has a height of 5.2 meters. I can
> range check by looking up the Guinness Book of Records. It's an
> outlier. I can estimate the probability of such a height.  Hmm, about
> 4 standard deviations above the mean. It's an outlier. I can attempt a
> Sherlock Holmes. "Watson, I conclude that an imperial measure (5'2")
> has been recorded as a metric value". It's not an outlier.
>
> The more R gravitates toward "black box" functions, the more some
> users are encouraged to let them do the work. You pays your money and
> you takes your chances.
>
> Jim
>
>
> On Thu, Nov 30, 2017 at 3:37 AM, Robert Wilkins <iwritecode2 at gmail.com> wrote:
>> R has a very wide audience, clinical research, astronomy, psychology,
>> and so on and so on.
>> I would consider data analysis work to be three stages: data preparation,
>> statistical analysis, and producing the report.
>> This regards the process of getting the data ready for analysis and
>> reporting, sometimes called "data cleaning" or "data munging" or "data
>> wrangling".
>>
>> So as regards tools for data preparation, speaking to the highly diverse
>> audience mentioned, here is my question:
>>
>> What do you want?
>> Or are you already quite happy with the range of tools that is currently
>> before you?
>>
>> [BTW,  I posed the same question last week to the r-devel list, and was
>> advised that r-help might be a more suitable audience by one of the
>> moderators.]
>>
>> Robert Wilkins
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
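
The sequence in Jim's height example (range check, distance from the mean in standard deviations, then a guess at how the value was recorded) can be sketched in code. What follows is only an illustration in Python, not anything posted in the thread: the sample heights, the plausible range, and the names review_heights and feet_inches_to_metres are all assumptions made up for the sketch. It reports a suggested repair and leaves the decision to the analyst, which is the point of the example.

```python
import statistics

# Hypothetical sample: adult heights in metres, with one suspicious entry.
heights_m = [1.62, 1.70, 1.58, 1.75, 1.66, 5.2, 1.68, 1.71]

# Assumed plausibility range for adult height in metres (illustrative only).
PLAUSIBLE_RANGE = (0.5, 2.8)


def feet_inches_to_metres(feet, inches):
    """Convert an imperial height to metres (1 inch = 0.0254 m)."""
    return (feet * 12 + inches) * 0.0254


def review_heights(values, plausible=PLAUSIBLE_RANGE):
    """Flag out-of-range values and suggest, but never apply, a repair."""
    low, high = plausible
    in_range = [v for v in values if low <= v <= high]
    # Mean and SD come from the in-range values only, so that the suspect
    # value does not inflate the spread it is being compared against.
    # Assumes at least two in-range values are present.
    mean = statistics.mean(in_range)
    sd = statistics.stdev(in_range)
    findings = []
    for index, value in enumerate(values):
        if low <= value <= high:
            continue
        # Read 5.2 as 5 feet 2 inches. This assumes a single digit of
        # inches after the point, as in the example from the thread.
        feet = int(value)
        inches = round((value - feet) * 10)
        candidate = feet_inches_to_metres(feet, inches)
        findings.append({
            "index": index,
            "value": value,
            "z_score": (value - mean) / sd,
            "imperial_guess_m": candidate if low <= candidate <= high else None,
        })
    return findings


for finding in review_heights(heights_m):
    print(finding)
```

Nothing is deleted or overwritten here. A value that fails the range check is listed together with its z-score and, where the imperial reading lands inside the plausible range, the converted height, so the person doing the cleaning can see what would otherwise have been silently peeled off.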

Dominik Schneider

2017-Nov-30 09:11 UTC

[R] Data cleaning & Data preparation, what do R users want?

I would agree that getting data into R from various sources is the biggest
pain point. Even if there is an API, the results are not always consistent
and you have to do lots of dimension checking to get it right. Or there
isn't an open API at all and you have to hack it by web scraping or
otherwise: http://enpiar.com/2017/08/11/one-hour-package/
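
The "dimension checking" mentioned above might look something like the following. This is a rough sketch in Python under stated assumptions, not code from the thread or from the linked post: the payload, the field names, and the helper check_records are hypothetical, and a real API would need its own list of expected fields.

```python
def check_records(records, expected_fields):
    """Separate records with exactly the expected fields from the rest."""
    expected = set(expected_fields)
    good, problems = [], []
    for position, record in enumerate(records):
        fields = set(record)
        missing = sorted(expected - fields)
        extra = sorted(fields - expected)
        if missing or extra:
            # Report the mismatch instead of padding or dropping silently.
            problems.append(
                {"position": position, "missing": missing, "extra": extra}
            )
        else:
            good.append(record)
    return good, problems


# Hypothetical API response in which records do not all share one shape.
payload = [
    {"station": "A1", "date": "2017-11-01", "depth_cm": 42},
    {"station": "A2", "date": "2017-11-01"},
    {"station": "A3", "date": "2017-11-01", "depth_cm": 38, "flag": "est"},
]

good, problems = check_records(payload, ["station", "date", "depth_cm"])
print(len(good), "consistent record(s)")
for problem in problems:
    print(problem)
```

Checking the shape before binding the records into a table means an inconsistent response shows up as an explicit report rather than as misaligned columns later on.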


Robert Wilkins

2017-Dec-11 17:35 UTC

[R] Data cleaning & Data preparation, what do R users want?

Dominik (and others)

If it is indeed still the biggest pain point, even in 2017, then maybe we
can do something about that, with more effort on different user interface
designs and try-outs with them on specialized datasets.
[In some specialties, such as clinical trials, it is hard to get access to
public domain datasets, and having to fall back on a tiny "toy" dataset,
which nobody will pay attention to, does make it harder.]

It would help if academia (both comp-sci and statistics departments) would
support those who invest resources in drafting and test-driving new product
designs. If, in the year 2017, it is still a big pain point, doesn't that
make sense? More speculative work in statistical programming language
design has not been a priority in academia since before 1980.

