Hello. I am trying to find a way to retrieve data from the Harvard Dataverse website. I usually have no problem web-scraping data, but here there are a bunch of file formats such as .tab, .7z and so on, and I just can't find a single solution that retrieves the data I am interested in. Any hint?
Thomas Levine
2018-May-13 11:13 UTC
[R] Dataverse (reading files with .tab and .7z suffixes)
Ilio Fornasero writes:
> I am trying to find a way to retrieve data from Harvard Dataverse website.
> I usually don't have problem in web-scraping data but the problem here is
> that there are a bunch of data formats such as .tab, .7z and so and
> I just can't find a way to retrieve the data I am interested in with an
> unique solution.
> Any hint?

A .tab suffix does not identify a file format. That file might be in a read.csv format or a read.fwf format.

No 7z decompressor seems to exist on CRAN (I checked `findFn('7z')`), so you could use system/system2: `system2('7z', c('e', ...))`, or I think 7z.exe on Windows. You would need to install p7zip and read the manual (`man 7z` on a Unix-like system).

Please send an example.
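The system2 suggestion above could be wrapped roughly as follows. This is only a sketch: the helper names `sevenzip_args` and `extract_7z` are made up for illustration, it assumes a 7-Zip command-line binary (`7z`, `7za` or `7zr`) is on the PATH, and it uses the `x` command (extract with full paths) plus the `-o` and `-y` switches rather than the `e` command mentioned in the message.

```r
# Build the argument vector for: 7z x <archive> -o<outdir> -y
# (x = extract keeping paths, -o = output directory, -y = answer yes to prompts).
sevenzip_args <- function(archive, outdir) {
  c("x", archive, paste0("-o", outdir), "-y")
}

# Hypothetical wrapper: extract `archive` into `outdir` and list what came out.
extract_7z <- function(archive, outdir = tempfile("unpacked_")) {
  exe <- Sys.which(c("7z", "7za", "7zr"))
  exe <- exe[nzchar(exe)]
  if (length(exe) == 0L) {
    stop("No 7z executable found on PATH; install p7zip or 7-Zip first.")
  }
  dir.create(outdir, recursive = TRUE, showWarnings = FALSE)
  # shQuote() protects paths that contain spaces.
  args <- shQuote(sevenzip_args(normalizePath(archive), outdir))
  status <- system2(exe[[1]], args, stdout = FALSE)
  if (status != 0) stop("7z exited with status ", status)
  list.files(outdir, recursive = TRUE, full.names = TRUE)
}
```

Calling `extract_7z("some_archive.7z")` (a placeholder file name) would then return the paths of the extracted files, which can be inspected one by one.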
Thomas Levine
2018-May-13 12:04 UTC
[R] Dataverse (reading files with .tab and .7z suffixes)
Ilio Fornasero writes:
> Yet, I am at this point.
>
> ## 01. Finding the dataverse server and making a search
> Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
> dataverse_search(".Hunger")
>
> ## 02. Loading the dataset (in this example, I have chosen the word ".Hunger" to get
> # one list and then picked up one out of hundreds results.
> # The get-dataset() function has to be picked on the dynamic web address)
> (dataset_ifpri <- get_dataset("https://doi.org/10.7910/DVN/ZTCWYQ"))
>
> ## 03. Grabbing the (1st) file we are interested on
> AppendixC <- get_file("001_AppendixC.tab",
>                       "https://doi.org/10.7910/DVN/ZTCWYQ")
> writeBin(AppendixC, "001_AppendixC.tab")
>
> read.table("001_AppendixC.tab")

I imagine you are using the dataverse package.

7z is more straightforward because the file format is clear.

You need to figure out the 001_AppendixC.tab file format. On first glance it looks to me like a spreadsheet.

    $ file /tmp/001_AppendixC.tab
    /tmp/001_AppendixC.tab: Zip archive data, at least v2.0 to extract
    $ cd /tmp && unzip 001_AppendixC.tab
    $ head -n2 /tmp/xl/workbook.xml | cut -c 1-75
    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"

Once you figure out the format manually, write an R function that figures out the format, and ask again here to find an R function that reads the format.
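An R function that "figures out the format", as suggested above, might start out like this. It is a minimal sketch under stated assumptions, not a complete detector: the name `sniff_format` is invented here, it only knows the 7z and zip magic numbers, it treats a zip that contains `xl/workbook.xml` as a spreadsheet (the same check done by hand above), and it calls anything whose first line contains a tab "tsv".

```r
# Guess a file's format from its leading bytes rather than its suffix.
sniff_format <- function(path) {
  magic <- readBin(path, what = "raw", n = 6L)
  zip_sig    <- as.raw(c(0x50, 0x4b, 0x03, 0x04))              # "PK\003\004"
  sevenz_sig <- as.raw(c(0x37, 0x7a, 0xbc, 0xaf, 0x27, 0x1c))  # "7z" + 4 bytes
  starts_with <- function(sig) {
    length(magic) >= length(sig) && identical(magic[seq_along(sig)], sig)
  }

  if (starts_with(sevenz_sig)) return("7z")

  if (starts_with(zip_sig)) {
    # An .xlsx workbook is a zip archive with xl/workbook.xml inside.
    entries <- tryCatch(
      suppressWarnings(utils::unzip(path, list = TRUE))$Name,
      error = function(e) character(0)
    )
    if ("xl/workbook.xml" %in% entries) return("xlsx")
    return("zip")
  }

  # Fall back to a crude text check on the first line only.
  first_line <- readLines(path, n = 1L, warn = FALSE)
  if (length(first_line) == 1L && grepl("\t", first_line, fixed = TRUE)) {
    return("tsv")
  }
  "unknown"
}
```

The result can then drive a `switch()` that picks a reader, for example `read.delim()` for "tsv" and a spreadsheet-reading function for "xlsx".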
Jorge Cimentada
2018-May-13 12:37 UTC
[R] Dataverse (reading files with .tab and .7z suffixes)
Or just use the dataverse package already on CRAN:
https://cran.r-project.org/web/packages/dataverse/index.html

-----------------------------------
Jorge Cimentada
https://cimentadaj.github.io/

On Sun, May 13, 2018 at 2:04 PM, Thomas Levine <_ at thomaslevine.com> wrote:
> [...]
> Once you figure out the format manually, write an R function that
> figures out the format, and ask again here to find an R function that
> reads the format.
______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
David Winsemius
2018-May-13 17:05 UTC
[R] Dataverse (reading files with .tab and .7z suffixes)
> On May 13, 2018, at 5:04 AM, Thomas Levine <_ at thomaslevine.com> wrote:
>
> [...]
> You need to figure out the 001_AppendixC.tab file format.
> On first glance it looks to me like a spreadsheet.

That website says it's tab-delimited. The read.delim function (in base R) is designed for that possibility. However, the download pull-down menu that appears seems to offer delivery in a variety of formats:

[A non-text attachment was scrubbed. Name: Untitled.pdf, Type: application/pdf, Size: 21204 bytes, URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20180513/6ed61785/attachment.pdf>]

When I choose the RData option I get:

    fil <- load("/Users/davidwinsemius/001_AppendixC.RData")
    fil
    #[1] "x"
    str(x)
    #-------------------
    'data.frame': 132 obs. of 17 variables:
     $ Country :Class 'AsIs' atomic [1:132] Afghanistan Albania Algeria Angola ...
     .. ..- attr(*, "comment")= chr "Country"
     $ UN9193 :Class 'AsIs' atomic [1:132] 37.4 7.7 9.1 65.400000000000006 ...
     .. ..- attr(*, "comment")= chr "UN9193"
     $ UN9901 :Class 'AsIs' atomic [1:132] 46.1 7.2 10.7 50 ...
    ------ snipped --------

--
David.

David Winsemius
Alameda, CA, USA

'Any technology distinguishable from magic is insufficiently advanced.' -Gehm's Corollary to Clarke's Third Law
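To make the two routes above concrete, here is a small sketch. The four rows are copied from the str() output quoted in this message, written to a temporary tab-delimited file purely for illustration; they are not a download of the real dataset. The helper `unwrap_asis` is a hypothetical name, and it assumes the 'AsIs' columns in the RData version may hold values stored as text that should be re-typed.

```r
# Route 1: a genuinely tab-delimited file reads directly with read.delim().
tsv <- tempfile(fileext = ".tab")
writeLines(c(
  "Country\tUN9193\tUN9901",
  "Afghanistan\t37.4\t46.1",
  "Albania\t7.7\t7.2",
  "Algeria\t9.1\t10.7",
  "Angola\t65.400000000000006\t50"
), tsv)
appendix <- read.delim(tsv, stringsAsFactors = FALSE)

# Route 2: columns of class 'AsIs' (as in the RData download) can be
# stripped of that class and re-typed column by column.
unwrap_asis <- function(df) {
  df[] <- lapply(df, function(col) {
    # unclass() drops 'AsIs'; type.convert() turns numeric-looking text
    # into numbers and leaves other text as character.
    type.convert(as.character(unclass(col)), as.is = TRUE)
  })
  df
}

wrapped <- data.frame(Country = I(c("Afghanistan", "Albania")),
                      UN9193  = I(c("37.4", "7.7")))
clean <- unwrap_asis(wrapped)
str(clean)
```

For the real file, `unwrap_asis(x)` would be applied to the data frame `x` that load() created.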
Use the dataverse package: https://cran.rstudio.com/web/packages/dataverse/

--Ista

On Sun, May 13, 2018 at 5:21 AM, Ilio Fornasero <iliofornasero at hotmail.com> wrote:
> Hello.
>
> I am trying to find a way to retrieve data from Harvard Dataverse website.
> I usually don't have problem in web-scraping data but the problem here is that there are a bunch of data formats such as .tab, .7z and so and I just can't find a way to retrieve the data I am interested in with an unique solution.
> Any hint?