Robert Wilkins
2017-Nov-29 16:37 UTC
[R] Data cleaning & Data preparation, what do R users want?
R has a very wide audience: clinical research, astronomy, psychology, and so on.

I would consider data analysis work to be three stages: data preparation, statistical analysis, and producing the report. This question concerns the process of getting the data ready for analysis and reporting, sometimes called "data cleaning", "data munging", or "data wrangling".

So, as regards tools for data preparation, and speaking to the highly diverse audience mentioned, here is my question: What do you want? Or are you already quite happy with the range of tools currently before you?

[BTW, I posed the same question last week to the r-devel list, and one of the moderators advised that r-help might be a more suitable audience.]

Robert Wilkins
Bert Gunter
2017-Nov-29 16:48 UTC
[R] Data cleaning & Data preparation, what do R users want?
I don't think my view is of interest to many, so offlist.

I reject this: "I would consider data analysis work to be three stages: data preparation, statistical analysis, and producing the report."

For example, there is no such thing as "outliers" -- data to be removed as part of cleaning/preparation -- without a statistical model to be an "outlier" **from**, which is part of the statistical analysis. And the structure of the data (data preparation) may need to change depending on the course of the analysis (including graphics, also part of the analysis). So I think your view reflects a naïve view of the nature of data analysis, which is an iterative and holistic process. I suspect your training is as a computer scientist and you have not done much 1-1 consulting with researchers, though you should certainly feel free to reject this canard. Building software for large-scale automated analysis of data requires a much different analytical paradigm than the statistical consulting model, which is largely my background.

No reply necessary. Just my opinion, which you are of course free to trash.

Cheers,
Bert

Bert Gunter
"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
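Bert's point that an outlier only exists relative to a model can be illustrated with a small sketch. The data below are invented for illustration: the eighth response is unremarkable on its own scale, yet it stands out once a straight-line fit is assumed. The linear model and the use of standardized residuals are assumptions of this example, not something proposed in the message above.

```r
# Invented data: y follows 2 * x exactly, except for one altered value.
x <- 1:10
y <- 2 * x
y[8] <- 4   # 4 lies well inside the range of y (2 to 20)

# Without a model: distance of y[8] from the mean of y, in SD units.
marginal_z <- (y[8] - mean(y)) / sd(y)

# With a model: standardized residuals from an assumed linear fit.
fit <- lm(y ~ x)
res <- rstandard(fit)
flagged <- which.max(abs(res))   # index of the largest residual

print(marginal_z)
print(flagged)
```

Whether observation 8 is an error or a genuine finding is still a question for the analyst; the fit only shows where to look.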
Bert Gunter
2017-Nov-29 16:49 UTC
[R] Data cleaning & Data preparation, what do R users want?
Oh Crap! I mistakenly replied onlist. PLEASE IGNORE -- these are only my ignorant opinions.

-- Bert
Christopher W. Ryan
2017-Nov-29 16:52 UTC
[R] Data cleaning & Data preparation, what do R users want?
Great question. What do I want? I want my co-workers to stop using Excel spreadsheets for data entry, storage, and sharing! I want them to understand the value of data discipline. But alas . . .

I work in a county health department in the US. Between dplyr, stringr, grep, grepl, and the base R read.*() functions, I'm doing OK.

I need to learn more about APIs, so I can see if I can make R directly grab data from, e.g., our state health department sources. My biggest hassle is having to download a data file, save it somewhere, and then open R and read it in. I'd like to be able to do it all in R. It would make the generation of recurring reports easier.

--Chris Ryan
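For the download-save-open hassle Chris describes, base R can often read a file straight from a URL, because read.csv() accepts a complete URL in its file argument. The sketch below is a minimal illustration: read_source() is a hypothetical helper name, the example.org address is a placeholder and not a real data source, and a temporary local file stands in for the remote one so the code runs offline.

```r
# Hypothetical helper: read a CSV from a URL or a local path.
# read.csv() accepts http://, https://, ftp:// and file:// URLs directly.
read_source <- function(src) {
  read.csv(src, stringsAsFactors = FALSE)
}

# Stand-in for a remote file so the sketch runs without network access.
tmp <- tempfile(fileext = ".csv")
writeLines(c("county,cases", "A,12", "B,3"), tmp)

cases <- read_source(tmp)
str(cases)

# With a real endpoint the call has the same shape (placeholder URL):
# cases <- read_source("https://example.org/health/cases.csv")
```

Sources that sit behind a login or return JSON would need more than this, typically an HTTP client package, which is outside what this sketch covers.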
Robert Wilkins
2017-Nov-29 17:08 UTC
[R] Data cleaning & Data preparation, what do R users want?
Christopher,

OK, well what about a range of functions in an R package that automatically, with very little syntax, pull in data from a variety of formats (CSV, SQLite, and so on) and convert them to an R data frame? You seem to be pointing to something like that. Something like that, in some form or another, probably already exists, though it might be imperfect (not as user-friendly as possible), not well publicised, or both.

Or another tangent: your co-workers are not going to stop using Excel, whether you like it or not, and many end users are stuck in the exact same position as you (co-workers who deliver the data in Excel). I will guess that data stored in Excel tends to be dirty in somewhat predictable ways. (And again, those other end users' co-workers are not going to change their behaviour.) And so: a data munging tool that makes it as easy as possible to clean up the data in Excel spreadsheets and export them to R data frames. One prerequisite: an understanding of what tends to go wrong with data in Excel (the data in Excel tends to be dirty, but dirty in what way?).

Thank you for your response, Christopher. What state are you in?
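Robert's idea of one low-syntax entry point that turns several file formats into a data frame could look roughly like the sketch below. read_any() is a hypothetical name, the set of formats is an arbitrary choice for illustration, and Excel support assumes the add-on package readxl is installed. SQLite is left out because it also needs a table name and the DBI and RSQLite packages.

```r
# Hypothetical dispatcher: choose a reader from the file extension and
# always return a plain data frame.
read_any <- function(path, ...) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    csv = read.csv(path, stringsAsFactors = FALSE, ...),
    tsv = read.delim(path, stringsAsFactors = FALSE, ...),
    rds = as.data.frame(readRDS(path)),
    xlsx = ,
    xls = {
      # readxl is an add-on package, assumed to be installed for Excel input.
      if (!requireNamespace("readxl", quietly = TRUE)) {
        stop("install the 'readxl' package to read Excel files")
      }
      as.data.frame(readxl::read_excel(path, ...))
    },
    stop("unsupported file type: ", ext)
  )
}

# Usage with a temporary CSV file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,value", "1,10", "2,20"), tmp)
dat <- read_any(tmp)
print(dat)
```

A wrapper like this only hides the choice of reader; it does nothing about the cleaning question raised above, which still depends on knowing how the data tend to be dirty.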
Jim Lemon
2017-Nov-29 23:54 UTC
[R] Data cleaning & Data preparation, what do R users want?
Hi Robert,

People want different levels of automation in the software they use. What concerns many of us is the desire for the function "figure-out-what-this-data-is-import-it-and-get-rid-of-bad-values". Such users typically want something whose use is justified by its being written by someone who seems to know what they're doing and by lots of other people using it. One advantage of many R functions is their modular construction. This encourages users to at least consider the steps that are taken rather than just accept what comes out of that long tube.

Take the contentious problem of outlier identification. If I just let the black box peel off some values, I don't know what I have lost. On the other hand, if I import data and examine it with a summary function, I may find that one woman has a height of 5.2 meters. I can range check by looking up the Guinness Book of Records. It's an outlier. I can estimate the probability of such a height. Hmm, about 4 standard deviations above the mean. It's an outlier. I can attempt a Sherlock Holmes. "Watson, I conclude that an imperial measure (5'2") has been recorded as a metric value". It's not an outlier.

The more R gravitates toward "black box" functions, the more some users are encouraged to let them do the work. You pays your money and you takes your chances.

Jim
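Jim's sequence of checks on the 5.2 meter height can be written out step by step. The sketch below uses invented heights, an assumed plausibility ceiling of 2.5 m, and a hypothetical helper feet_inches_to_m(); none of these values come from the thread apart from the 5.2 and the 5'2" reading.

```r
# Invented heights in metres; 5.2 is the suspicious entry.
height_m <- c(1.58, 1.63, 1.70, 1.55, 1.66, 5.2, 1.61)

# Step 1: look at the data before removing anything.
print(summary(height_m))

# Step 2: range check against an assumed plausible ceiling (2.5 m).
suspect <- height_m > 2.5

# Step 3: distance from the remaining values in standard deviations,
# estimating mean and sd from the non-suspect values only.
ref <- height_m[!suspect]
z <- (height_m[suspect] - mean(ref)) / sd(ref)

# Step 4: the "Sherlock Holmes" check. Does 5.2 make sense as 5 ft 2 in?
feet_inches_to_m <- function(feet, inches) feet * 0.3048 + inches * 0.0254
recovered <- feet_inches_to_m(5, 2)   # 1.5748 m

# The converted value falls inside the range of the other heights,
# so a unit mix-up is a plausible explanation.
plausible <- recovered >= min(ref) && recovered <= max(ref)
print(list(z = z, recovered = recovered, plausible = plausible))
```

Each step leaves a visible trace of what was flagged and why, which is the contrast with a black box that silently drops values.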
Jim Lemon
2017-Nov-30 00:00 UTC
[R] Data cleaning & Data preparation, what do R users want?
Hi again,

Typo in the last email. Should read "about 40 standard deviations".

Jim