On 07.05.2020 at 11:19 Deepayan Sarkar wrote:
> On Thu, May 7, 2020 at 12:58 AM Thomas Petzoldt <thpe at simecol.de> wrote:
>>
>> Sorry if I'm joining a little bit late.
>>
>> I've put some related links and scripts together a few weeks ago. Then I
>> stopped with this, because there is so much.
>>
>> The data format employed by Johns Hopkins CSSE was sort of a big surprise
>> to me.
>
> Why? I find it quite convenient to drop the first few columns and
> extract the data as a matrix (using data.matrix()).
>
> -Deepayan

Many thanks for the hint to use data.matrix.

My aim was not to say that it is difficult, especially as R has all the
tools for data mangling.

My surprise was that "wide tables" and non-ISO dates as column names are
not the "database way" that we generally teach to our students.

With reshape2::melt or tidyr::gather resp. pivot_longer, conversion is
quite easy, regardless of whether one wants to use the tidyverse or not,
see the example below.

Again, thanks, Thomas


library("dplyr")
library("readr")
library("tidyr")

file <-
"https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"

dat <- read_delim(file, delim=",")
names(dat)[1:2] <- c("Province_State", "Country_Region")
dat2 <-
  dat %>%
  ## summarize Country/Region duplicates
  group_by(Country_Region) %>% summarise_at(vars(-(1:4)), sum) %>%
  ## make it a long table
  pivot_longer(cols = -Country_Region, names_to = "time") %>%
  ## convert to ISO 8601 date
  mutate(time = as.POSIXct(time, format="%m/%e/%y"))


>> An opposite approach was taken in Germany, where the data were organized
>> as big JSON trees.
>>
>> Fortunately, both can be "tidied" with R, and represent good didactic
>> examples for our students.
>>
>> Here is yet another repo linking to the data:
>>
>> https://github.com/tpetzoldt/covid
>>
>> Thomas
>>
>> On 04.05.2020 at 20:48 James Spottiswoode wrote:
>>> Sure. COVID-19 Data Repository by the Center for Systems Science and
>>> Engineering (CSSE) at Johns Hopkins University is available here:
>>>
>>> https://github.com/CSSEGISandData/COVID-19
>>>
>>> All in csv format.
>>>
>>>> On May 4, 2020, at 11:31 AM, Bernard McGarvey <mcgarvey.bernard at comcast.net> wrote:
>>>>
>>>> Just curious, does anyone know of a website that has data available
>>>> in a format that R can download and analyze?
>>>>
>>>> Thanks
>>>>
>>>> Bernard McGarvey
>>>> Director, Fort Myers Beach Lions Foundation, Inc.
>>>> Retired (Lilly Engineering Fellow).
>>>
>>> James Spottiswoode
>>> Applied Mathematics & Statistics
>>> (310) 270 6220
>>> jamesspottiswoode Skype
>>> james at jsasoc.com

--
Dr. Thomas Petzoldt
senior scientist

Technische Universitaet Dresden
Faculty of Environmental Sciences
Institute of Hydrobiology
01062 Dresden, Germany

https://tu-dresden.de/Members/thomas.petzoldt
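
For readers who do not want to use the tidyverse, the wide-to-long
conversion discussed in this message can also be sketched in base R. This
is only an illustration under stated assumptions: the data frame `wide`
is a made-up stand-in with the same column layout as the CSSE file (four
id columns followed by one column per date), and the names `wide`, `m`
and `long` are hypothetical, not taken from the thread.

```r
## Toy stand-in for the CSSE wide table; all counts are invented.
wide <- data.frame(
  Province_State = c(NA, "A", "B"),
  Country_Region = c("X", "Y", "Y"),
  Lat  = c(0, 1, 2),
  Long = c(0, 1, 2),
  `1/22/20` = c(1, 2, 3),
  `1/23/20` = c(2, 4, 6),
  check.names = FALSE          # keep the date strings as column names
)

## Drop the four id columns and sum provinces within each country.
## The result is a matrix with one row per country, one column per date.
m <- rowsum(data.matrix(wide[, -(1:4)]), group = wide$Country_Region)

## Unroll the matrix column by column into a long table and
## convert the m/d/yy column names to proper Date values.
long <- data.frame(
  Country_Region = rep(rownames(m), times = ncol(m)),
  time  = as.Date(rep(colnames(m), each = nrow(m)), format = "%m/%d/%y"),
  value = as.vector(m)
)

print(long)
```

Using `as.Date()` rather than `as.POSIXct()` is a choice made for this
sketch, since the column names carry no time of day; printing a Date
gives the ISO 8601 form (for example 2020-01-22).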
On Thu, May 7, 2020 at 4:16 PM Thomas Petzoldt <thpe at simecol.de> wrote:
>
> Many thanks for the hint to use data.matrix
>
> My aim was not to say that it is difficult, especially as R has all the
> tools for data mangling.
>
> My surprise was that "wide tables" and non-ISO dates as column names are
> not the "data base way" that we in general teach to our students

Well, I am all for long format data when it makes sense, but I would
disagree that that is always the "right approach". In the case of
regular multiple time series, as in this context, a matrix-like
structure seems much more natural (and nicely handled by ts() in R),
and I wouldn't even bother reshaping the data in the first place.

See, for example,

https://github.com/deepayan/deepayan.github.io/blob/master/covid-19/deaths.rmd

and

https://deepayan.github.io/covid-19/deaths.html

-Deepayan
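
The matrix-based route described in this message (drop the leading id
columns, call data.matrix(), hand the result to ts()) might look roughly
like the sketch below. It is not the code from the linked analysis: the
data frame `wide` is an invented stand-in for the CSSE layout, and the
names `m`, `series` and `new_cases`, as well as the choice to index time
by day number, are assumptions made for illustration.

```r
## Toy stand-in for the CSSE wide table; all counts are invented.
wide <- data.frame(
  Province_State = c(NA, "A", "B"),
  Country_Region = c("X", "Y", "Y"),
  Lat  = c(0, 1, 2),
  Long = c(0, 1, 2),
  `1/22/20` = c(1, 2, 3),
  `1/23/20` = c(2, 4, 6),
  `1/24/20` = c(4, 8, 12),
  check.names = FALSE
)

## Drop the four id columns and transpose: rows = days, columns = regions.
m <- t(data.matrix(wide[, -(1:4)]))
colnames(m) <- ifelse(is.na(wide$Province_State),
                      wide$Country_Region,
                      paste(wide$Country_Region, wide$Province_State, sep = "/"))

## A regular multiple time series; day 1 corresponds to the first date column.
series <- ts(m, start = 1, frequency = 1)

## Cumulative counts to daily new cases, for all regions at once.
new_cases <- diff(series)

print(series)
print(new_cases)
```

No reshaping is needed here: operations such as diff() or plot() apply
to every column of the multiple time series in one call.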
On 07.05.2020 at 13:12 Deepayan Sarkar wrote:
> Well, I am all for long format data when it makes sense, but I would
> disagree that that is always the "right approach". In the case of
> regular multiple time series, as in this context, a matrix-like
> structure seems much more natural (and nicely handled by ts() in R),
> and I wouldn't even bother reshaping the data in the first place.
>
> See, for example,
>
> https://github.com/deepayan/deepayan.github.io/blob/master/covid-19/deaths.rmd
>
> and
>
> https://deepayan.github.io/covid-19/deaths.html
>
> -Deepayan

Great, thank you for the link with the comprehensive lattice graphs and
the explanations. I like your package very much and have used it often
since it appeared on CRAN (3 of my CRAN packages depend on it).

As a "dynamic modeller", I always consider time as the first column, but
I agree that long tables are often, but not always, the right approach;
think of gridded multi-dimensional netCDF data, for example.

Many thanks for sharing your analysis publicly, I'll add your repo to my
link list.

Thomas