Jeroen Ooms
2015-Aug-26 22:04 UTC
[Rd] Issues with libcurl + HTTP status codes (eg. 403, 404)
On Tue, Aug 25, 2015 at 10:33 PM, Martin Morgan <mtmorgan at fredhutch.org> wrote:
> actually I don't know that it does -- it addresses the symptom but I think there should be an error from libcurl on the 403 / 404 rather than from read.dcf on error page...

Indeed, the only correct behavior is to turn the protocol error code into an R exception. When the server returns a status code >= 400, it indicates that the request was unsuccessful and that the response body does not contain the content the client requested; the body should instead be interpreted as an error message/page. Ignoring this fact and proceeding to parse the body as usual is incorrect and leads to all kinds of strange errors downstream.

The other download methods did this correctly; it is unclear why the current implementation of the "libcurl" method does not. Not only does it lead to hard-to-interpret downstream parsing errors, it also makes the behavior of R ambiguous, as it depends on which download method is in use. It is certainly not a limitation of the libcurl library: the 'curl' package has alternative implementations of url() and download.file() which exhibit the correct behavior.

I can only speculate, but if the motivation is to explicitly support retrieval of error pages, perhaps the download.file() and url() functions could gain an argument 'stop_on_error' or something similar which gives the user an option to ignore server errors. However, this behavior should certainly not be the default. When a function or script contains a line like this:

    download.file("https://someserver.com/mydata.csv", "mydata.csv")

then in the next line of code we must be able to expect that the file "mydata.csv" we have downloaded to our disk is in fact the file "mydata.csv" that was requested from the server. An implementation that instead saves an error page (likely HTML content) to the "mydata.csv" file is simply incorrect and will lead to obvious problems, even with a warning.
[1] https://www.opencpu.org/posts/cran-https/
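The semantics argued for above -- mapping an HTTP status >= 400 to an exception rather than saving the error page -- can be sketched in Python, whose urllib happens to behave this way by default. A throwaway local server stands in for someserver.com so no network access is assumed; this is an illustration of the desired behavior, not R code:

```python
# Sketch: a request for a missing file raises an exception instead of
# silently yielding the server's HTML error page as the "content".
import threading
import urllib.request
import urllib.error
from http.server import BaseHTTPRequestHandler, HTTPServer

class NotFoundHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer every request with a 404 and an HTML "error page" body.
        body = b"<html><body>Not Found</body></html>"
        self.send_response(404)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

server = HTTPServer(("127.0.0.1", 0), NotFoundHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/mydata.csv"

try:
    urllib.request.urlopen(url)
    status = 200  # never reached for this server
except urllib.error.HTTPError as err:
    status = err.code  # the 404 surfaces as an exception

server.shutdown()
print("server answered with HTTP status", status)
```

The same effect is available from the curl command line via the --fail (-f) flag, which makes curl exit non-zero on HTTP errors instead of writing the error page to the output file.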
Duncan Murdoch
2015-Aug-26 23:07 UTC
[Rd] Issues with libcurl + HTTP status codes (eg. 403, 404)
On 26/08/2015 6:04 PM, Jeroen Ooms wrote:
> On Tue, Aug 25, 2015 at 10:33 PM, Martin Morgan <mtmorgan at fredhutch.org> wrote:
>> actually I don't know that it does -- it addresses the symptom but I think there should be an error from libcurl on the 403 / 404 rather than from read.dcf on error page...
>
> Indeed, the only correct behavior is to turn the protocol error code
> into an R exception. When the server returns a status code >= 400, it
> indicates that the request was unsuccessful and the response body does
> not contain the content the client had requested, but should instead
> be interpreted as an error message/page. Ignoring this fact and
> proceeding with parsing the body as usual is incorrect and leads to
> all kind of strange errors downstream.

Yes. I haven't been following this long thread. Is it only in R-devel, or is this happening in 3.2.2 or R-patched?

If the latter, please submit a bug report. If it is only R-devel, please just be patient. When R-devel becomes R-alpha next year, if the bug still exists, please report it.

Duncan Murdoch

> The other download methods did this correctly, it is unclear why the
> current implementation of the "libcurl" method does not. Not only does
> it lead to hard to interpret downstream parsing errors, it also makes
> the behavior of R ambiguous as it is dependent on which download
> method is in use. It is certainly not a limitation of the libcurl
> library: the 'curl' package has alternative implementations of url()
> and download.file() which exercise the correct behavior.
>
> I can only speculate, but if the motivation is to explicitly support
> retrieval of error pages, perhaps the download.file() and url()
> functions can gain an argument 'stop_on_error' or something similar
> which give the user an option to ignore server errors. However this
> behavior should certainly not be the default.
>
> When a function or script contains a line like this:
>
>     download.file("https://someserver.com/mydata.csv", "mydata.csv")
>
> Then in the next line of code we must be able to expect that the file
> "mydata.csv" we have downloaded to our disk is in fact the file
> "mydata.csv" that was requested from the server. An implementation
> that instead saves an error page (likely html content) to the
> "mydata.csv" file is simply incorrect and will lead to obvious
> problems, even with a warning.
>
> [1] https://www.opencpu.org/posts/cran-https/
Martin Maechler
2015-Aug-27 15:16 UTC
[Rd] Issues with libcurl + HTTP status codes (eg. 403, 404)
>>>>> "DM" == Duncan Murdoch <murdoch.duncan at gmail.com>
>>>>>     on Wed, 26 Aug 2015 19:07:23 -0400 writes:

DM> On 26/08/2015 6:04 PM, Jeroen Ooms wrote:
>> On Tue, Aug 25, 2015 at 10:33 PM, Martin Morgan <mtmorgan at fredhutch.org> wrote:
>>> actually I don't know that it does -- it addresses the symptom but I think there should be an error from libcurl on the 403 / 404 rather than from read.dcf on error page...
>>
>> Indeed, the only correct behavior is to turn the protocol error code
>> into an R exception. When the server returns a status code >= 400, it
>> indicates that the request was unsuccessful and the response body does
>> not contain the content the client had requested, but should instead
>> be interpreted as an error message/page. Ignoring this fact and
>> proceeding with parsing the body as usual is incorrect and leads to
>> all kind of strange errors downstream.

DM> Yes. I haven't been following this long thread. Is it only in R-devel,
DM> or is this happening in 3.2.2 or R-patched?
DM> If the latter, please submit a bug report. If it is only R-devel,
DM> please just be patient. When R-devel becomes R-alpha next year, if the
DM> bug still exists, please report it.
DM> Duncan Murdoch

Probably I'm confused now... Both R-patched and R-devel give an error (after a *long* wait!) for

    download.file("https://someserver.com/mydata.csv", "mydata.csv")

so that problem is, I think, solved now. Ideally, though, it would be nice to be able to set the *timeout* ourselves as an R function argument.

Kevin Ushey's original problem, however, is still present in R-patched and R-devel:

    ap <- available.packages("http://www.stats.ox.ac.uk/pub/RWin", method="libcurl")
    ap

giving

    > ap <- available.packages("http://www.stats.ox.ac.uk/pub/RWin", method="libcurl")
    Warning: unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin:
      Line starting '<!DOCTYPE HTML PUBLI ...' is malformed!
    > ap
    Package Version Priority Depends Imports LinkingTo Suggests Enhances License License_is_FOSS License_restricts_use OS_type Archs MD5sum NeedsCompilation File Repository
    >

and the resulting 'ap' is the same as, e.g., with the default method, which also gives a warning and then an empty list (well, "data.frame") of packages.

I don't see a big problem with the above. It would be better if the warning did not contain the extra "Line starting '<!DOCTYPE HTML PUBLI ...' is malformed!" part, but apart from that I'd say the behavior is not bogus: we ask for the available packages and get as the answer 'zero packages', which is correct.
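The warning's "<!DOCTYPE HTML PUBLI ..." fragment shows an HTML error page being handed to the DCF parser. A download helper with the semantics discussed in this thread would check the status before writing anything and expose the timeout as an argument. The sketch below is illustrative Python against a throwaway local server (the function name, defaults, and server are all hypothetical, not an R or libcurl API):

```python
import os
import tempfile
import threading
import urllib.request
import urllib.error
from http.server import BaseHTTPRequestHandler, HTTPServer

def download_file(url, destfile, timeout=60.0):
    """Save `url` to `destfile`; raise on HTTP status >= 400 instead of
    saving the server's error page, and honor a caller-set timeout."""
    # urlopen raises urllib.error.HTTPError before `destfile` is opened,
    # so an HTML error page can never end up in it.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        with open(destfile, "wb") as out:
            out.write(resp.read())

# Demo: a local server answering 404 with an HTML error page stands in
# for the repository index request above.
class ErrorPageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<!DOCTYPE HTML PUBLIC ...>"  # the page the warning quotes
        self.send_response(404)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

server = HTTPServer(("127.0.0.1", 0), ErrorPageHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
dest = os.path.join(tempfile.mkdtemp(), "PACKAGES")

try:
    download_file(f"http://127.0.0.1:{server.server_port}/PACKAGES",
                  dest, timeout=5.0)
    outcome = "saved"
except urllib.error.HTTPError as err:
    outcome = f"HTTP {err.code}"

server.shutdown()
print(outcome, "| file written:", os.path.exists(dest))
```

With these semantics the caller sees "HTTP 404" and no PACKAGES file on disk, so nothing downstream ever tries to parse the HTML page as a package index.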