thr3ads.net - R help - [R] Dataverse [May 2018]

Ilio Fornasero

2018-May-13 09:21 UTC

[R] Dataverse

Hello.

I am trying to find a way to retrieve data from the Harvard Dataverse website.
I usually don't have problems web-scraping data, but the problem here is
that there are a bunch of data formats, such as .tab, .7z and so on, and I just
can't find a way to retrieve the data I am interested in with a single
solution.
Any hints?




Thomas Levine

2018-May-13 10:34 UTC

[R] R-help Digest, Vol 183, Issue 13

Ilio Fornasero writes:
> Hello.
>
> I am trying to find a way to retrieve data from the Harvard Dataverse website.
> I usually don't have problems web-scraping data, but the problem here is
> that there are a bunch of data formats, such as .tab, .7z and so on, and I just
> can't find a way to retrieve the data I am interested in with a single
> solution.
> Any hints?

The .tab suffix does not identify a file format. The file might be in a
read.csv format or a read.fwf format.

No 7z decompressor seems to exist on CRAN (I checked `findFn('7z')`),
so you could use system/system2: `system2('7z', c('e', ...))`, or,
I think, 7z.exe on Windows. You would need to install p7zip and read the
manual (`man 7z` on a Unix-like system).

Please send an example.
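
The system2 route suggested above might be wrapped along these lines. This is
a minimal sketch, not a tested recipe: it assumes a 7z, 7za or 7zr executable
(from p7zip or 7-Zip) is on the PATH, and the helper names sevenzip_args and
extract_7z are made up for illustration.

```r
# Sketch only: assumes an external 7z/7za/7zr binary is installed.
# `sevenzip_args` and `extract_7z` are hypothetical helper names.

sevenzip_args <- function(archive, outdir) {
  # "e"       extract files, ignoring the directory structure in the archive
  # "-o<dir>" output directory (7z expects no space after -o)
  # "-y"      answer yes to every prompt so the call cannot block
  c("e", shQuote(archive), paste0("-o", shQuote(outdir)), "-y")
}

extract_7z <- function(archive, outdir = tempfile("unpacked_")) {
  exe <- Sys.which(c("7z", "7za", "7zr"))
  exe <- exe[nzchar(exe)]
  if (length(exe) == 0L) {
    stop("no 7z executable found on the PATH; install p7zip or 7-Zip first")
  }
  dir.create(outdir, recursive = TRUE, showWarnings = FALSE)
  status <- system2(exe[[1]], sevenzip_args(archive, outdir))
  if (!identical(status, 0L)) {
    stop("7z exited with status ", status)
  }
  # Return the paths of whatever was unpacked.
  list.files(outdir, full.names = TRUE)
}
```

Something like `extract_7z("some_download.7z")` would then return the paths of
the unpacked files, which can be inspected one by one.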

Thomas Levine

2018-May-13 11:13 UTC

[R] Dataverse (reading files with .tab and .7z suffixes)

Ilio Fornasero writes:
> I am trying to find a way to retrieve data from the Harvard Dataverse website.
> I usually don't have problems web-scraping data, but the problem here is
> that there are a bunch of data formats, such as .tab, .7z and so on, and
> I just can't find a way to retrieve the data I am interested in with a
> single solution.
> Any hints?

The .tab suffix does not identify a file format. That file might be in a
read.csv format or a read.fwf format.

No 7z decompressor seems to exist on CRAN (I checked `findFn('7z')`),
so you could use system/system2: `system2('7z', c('e', ...))`, or,
I think, 7z.exe on Windows. You would need to install p7zip and read the
manual (`man 7z` on a Unix-like system).

Please send an example.

Thomas Levine

2018-May-13 12:04 UTC

[R] Dataverse (reading files with .tab and .7z suffixes)

Ilio Fornasero writes:
> Yet, I am at this point.
>
> ## 01. Finding the dataverse server and making a search
> Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
> dataverse_search(".Hunger")
>
> ## 02. Loading the dataset (in this example, I have chosen the word
> ##     ".Hunger" to get one list and then picked one out of hundreds of
> ##     results. The get_dataset() function has to be pointed at the
> ##     dynamic web address)
> (dataset_ifpri <- get_dataset("https://doi.org/10.7910/DVN/ZTCWYQ"))
>
> ## 03. Grabbing the (1st) file we are interested in
> AppendixC <- get_file("001_AppendixC.tab",
>                       "https://doi.org/10.7910/DVN/ZTCWYQ")
> writeBin(AppendixC, "001_AppendixC.tab")
>
> read.table("001_AppendixC.tab")

I imagine you are using the dataverse package.

7z is more straightforward because the file format is clear.

You need to figure out the 001_AppendixC.tab file format.
On first glance it looks to me like a spreadsheet.

  $ file /tmp/001_AppendixC.tab
  /tmp/001_AppendixC.tab: Zip archive data, at least v2.0 to extract
  $ cd /tmp && unzip 001_AppendixC.tab
  $ head -n2 /tmp/xl/workbook.xml | cut -c 1-75
  <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
  <workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"

Once you figure out the format manually, write an R function that
figures out the format, and ask again here to find an R function that
reads the format.
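
The "R function that figures out the format" could start from the first few
bytes of the file, much as file(1) does. The following is a minimal sketch
under stated assumptions: sniff_format and read_any are hypothetical names,
only the ZIP and 7z signatures are checked, and anything else is assumed to be
tab-delimited text, which will not always be true.

```r
# Sketch only: `sniff_format` and `read_any` are hypothetical helper names.

sniff_format <- function(path) {
  magic <- readBin(path, what = "raw", n = 6L)
  starts_with <- function(sig) {
    length(magic) >= length(sig) && identical(magic[seq_along(sig)], sig)
  }
  if (starts_with(as.raw(c(0x50, 0x4B, 0x03, 0x04)))) {
    "zip"    # "PK\003\004": ZIP container; .xlsx workbooks are ZIP archives
  } else if (starts_with(as.raw(c(0x37, 0x7A, 0xBC, 0xAF, 0x27, 0x1C)))) {
    "7z"     # 7z archive signature
  } else {
    "text"   # fallback assumption: treat everything else as plain text
  }
}

read_any <- function(path) {
  switch(sniff_format(path),
    "text" = read.delim(path, stringsAsFactors = FALSE),
    "zip"  = stop("ZIP container (possibly .xlsx): unzip it or use a spreadsheet reader"),
    "7z"   = stop("7z archive: extract it with an external 7z program first")
  )
}
```

For a file like the one inspected above, sniff_format would report "zip". A
spreadsheet reader such as readxl::read_excel (an assumption: the readxl
package would need to be installed) is one candidate for the next step.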

Jorge Cimentada

2018-May-13 12:37 UTC

[R] Dataverse (reading files with .tab and .7z suffixes)

Or just use the dataverse package already on CRAN:
https://cran.r-project.org/web/packages/dataverse/index.html

Jorge Cimentada
https://cimentadaj.github.io/



David Winsemius

2018-May-13 17:05 UTC

[R] Dataverse (reading files with .tab and .7z suffixes)

> On May 13, 2018, at 5:04 AM, Thomas Levine <_ at thomaslevine.com> wrote:
>
> You need to figure out the 001_AppendixC.tab file format.
> On first glance it looks to me like a spreadsheet.

That website says it's tab-delimited. The read.delim function (in base R) is
designed for that possibility. However, the download pull-down menu that
appears seems to offer delivery in a variety of formats:


[Attachment scrubbed by the list archive: Untitled.pdf, application/pdf, 21204 bytes,
<https://stat.ethz.ch/pipermail/r-help/attachments/20180513/6ed61785/attachment.pdf>]


When I choose the Rdata option I get:

 fil <- load("/Users/davidwinsemius/001_AppendixC.RData")
 fil
#[1] "x"

str(x)
#-------------------
'data.frame':	132 obs. of  17 variables:
 $ Country :Class 'AsIs'  atomic [1:132] Afghanistan Albania Algeria Angola ...
  .. ..- attr(*, "comment")= chr "Country"
 $ UN9193  :Class 'AsIs'  atomic [1:132] 37.4 7.7 9.1 65.400000000000006 ...
  .. ..- attr(*, "comment")= chr "UN9193"
 $ UN9901  :Class 'AsIs'  atomic [1:132] 46.1 7.2 10.7 50 ...
------ snipped --------


-- 
David.

David Winsemius
Alameda, CA, USA

'Any technology distinguishable from magic is insufficiently advanced.' 
-Gehm's Corollary to Clarke's Third Law
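
The RData route shown in this message leaves an object named by load() in the
workspace, with columns of class 'AsIs'. A minimal sketch of tidying that up
follows. It assumes the file holds exactly one object, and load_single and
strip_asis are hypothetical helper names, not functions from any package.

```r
# Sketch only: `load_single` and `strip_asis` are hypothetical helper names.

load_single <- function(path) {
  env <- new.env()
  objs <- load(path, envir = env)  # load() returns the names of restored objects
  stopifnot(length(objs) == 1L)    # assumption: one object per file
  get(objs, envir = env)
}

# Drop the 'AsIs' class from every column so that the columns behave like
# ordinary vectors; any other classes on a column are kept.
strip_asis <- function(df) {
  df[] <- lapply(df, function(col) {
    cls <- setdiff(oldClass(col), "AsIs")
    oldClass(col) <- if (length(cls) > 0L) cls else NULL
    col
  })
  df
}
```

With these, `strip_asis(load_single("001_AppendixC.RData"))` would return the
data frame directly, without creating a variable called x in the global
environment.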

Ista Zahn

2018-May-13 23:16 UTC

[R] Dataverse

Use https://cran.rstudio.com/web/packages/dataverse/

--Ista

