thr3ads.net - R devel - [Rd] Issues with libcurl + HTTP status codes (eg. 403, 404) [Aug 2015]


Martin Maechler

2015-Aug-27 15:16 UTC

[Rd] Issues with libcurl + HTTP status codes (eg. 403, 404)

>>>>> "DM" == Duncan Murdoch <murdoch.duncan at gmail.com>
>>>>>     on Wed, 26 Aug 2015 19:07:23 -0400 writes:

    DM> On 26/08/2015 6:04 PM, Jeroen Ooms wrote:
    >> On Tue, Aug 25, 2015 at 10:33 PM, Martin Morgan <mtmorgan at fredhutch.org> wrote:
    >>>
    >>> actually I don't know that it does -- it addresses the symptom but I
    >>> think there should be an error from libcurl on the 403 / 404 rather
    >>> than from read.dcf on error page...
    >>
    >> Indeed, the only correct behavior is to turn the protocol error code
    >> into an R exception. When the server returns a status code >= 400, it
    >> indicates that the request was unsuccessful and the response body does
    >> not contain the content the client had requested, but should instead
    >> be interpreted as an error message/page. Ignoring this fact and
    >> proceeding with parsing the body as usual is incorrect and leads to
    >> all kinds of strange errors downstream.

    DM> Yes.  I haven't been following this long thread.  Is it only in R-devel,
    DM> or is this happening in 3.2.2 or R-patched?

    DM> If the latter, please submit a bug report.  If it is only R-devel,
    DM> please just be patient.  When R-devel becomes R-alpha next year, if the
    DM> bug still exists, please report it.

    DM> Duncan Murdoch

Probably I'm confused now...
Both R-patched and R-devel give an error (after a *long* wait!)
for
       download.file("https://someserver.com/mydata.csv", "mydata.csv")

So that problem is, I think, solved now.
Ideally, though, it would be nice to be able to set the *timeout*
ourselves as an R function argument.

Kevin Ushey's original problem however is still in R-patched and
R-devel:

ap <- available.packages("http://www.stats.ox.ac.uk/pub/RWin", method="libcurl")
ap

giving

> ap <- available.packages("http://www.stats.ox.ac.uk/pub/RWin", method="libcurl")
Warning: unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin:
  Line starting '<!DOCTYPE HTML PUBLI ...' is malformed!
> ap
     Package Version Priority Depends Imports LinkingTo Suggests Enhances
     License License_is_FOSS License_restricts_use OS_type Archs
     MD5sum NeedsCompilation File Repository
>

and the resulting 'ap' is the same as, e.g., with the default
method, which also gives a warning and then an empty list (well,
"data.frame") of packages.


I don't see a big problem with the above.
It would be better if the warning did not contain the extra
   "Line starting '<!DOCTYPE HTML PUBLI ...' is malformed!"
part, but apart from that I'd say the behavior is not bogus:

We ask for the available packages and get 'zero packages' as the
answer, which is correct.

Martin Morgan

2015-Aug-27 15:27 UTC


On 08/27/2015 08:16 AM, Martin Maechler wrote:
>>>>>> "DM" == Duncan Murdoch <murdoch.duncan at gmail.com>
>>>>>>      on Wed, 26 Aug 2015 19:07:23 -0400 writes:
>
>      DM> On 26/08/2015 6:04 PM, Jeroen Ooms wrote:
>      >> On Tue, Aug 25, 2015 at 10:33 PM, Martin Morgan <mtmorgan at fredhutch.org> wrote:
>      >>>
>      >>> actually I don't know that it does -- it addresses the symptom but I
>      >>> think there should be an error from libcurl on the 403 / 404 rather
>      >>> than from read.dcf on error page...
>      >>
>      >> Indeed, the only correct behavior is to turn the protocol error code
>      >> into an R exception. When the server returns a status code >= 400, it
>      >> indicates that the request was unsuccessful and the response body does
>      >> not contain the content the client had requested, but should instead
>      >> be interpreted as an error message/page. Ignoring this fact and
>      >> proceeding with parsing the body as usual is incorrect and leads to
>      >> all kinds of strange errors downstream.
>
>      DM> Yes.  I haven't been following this long thread.  Is it only in R-devel,
>      DM> or is this happening in 3.2.2 or R-patched?
>
>      DM> If the latter, please submit a bug report.  If it is only R-devel,
>      DM> please just be patient.  When R-devel becomes R-alpha next year, if the
>      DM> bug still exists, please report it.
>
>      DM> Duncan Murdoch
>
> Probably I'm confused now...
> Both R-patched and R-devel give an error (after a *long* wait!)
> for
>         download.file("https://someserver.com/mydata.csv", "mydata.csv")
>
> So that problem is, I think, solved now.
> Ideally, though, it would be nice to be able to set the *timeout*
> ourselves as an R function argument.
>
> Kevin Ushey's original problem however is still in R-patched and
> R-devel:
>
> ap <- available.packages("http://www.stats.ox.ac.uk/pub/RWin", method="libcurl")
> ap
>
> giving
>
>> ap <- available.packages("http://www.stats.ox.ac.uk/pub/RWin", method="libcurl")
> Warning: unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin:
>    Line starting '<!DOCTYPE HTML PUBLI ...' is malformed!
>> ap
>       Package Version Priority Depends Imports LinkingTo Suggests Enhances
>       License License_is_FOSS License_restricts_use OS_type Archs
>       MD5sum NeedsCompilation File Repository
>>
>
> and the resulting 'ap' is the same as, e.g., with the default
> method, which also gives a warning and then an empty list (well,
> "data.frame") of packages.
>
>
> I don't see a big problem with the above.
> It would be better if the warning did not contain the extra
>     "Line starting '<!DOCTYPE HTML PUBLI ...' is malformed!"
> part, but apart from that I'd say the behavior is not bogus:
>
> We ask for the available packages and get 'zero packages' as the
> answer, which is correct.
>
In Kevin's original post, he was using an earlier version of R, and the code in
available.packages was returning an error.

The code had been updated (by me) in the version that you are using to return a 
warning, which was the original design and intention (to convert errors during 
repository queries into warnings, so other repositories could be queried; this 
was Kevin's original point).
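
The error-to-warning design described above can be sketched roughly as follows. This is only an illustration and not the actual available.packages code: `query_repos` and `fetch_index` are hypothetical names, and the fetcher is passed in so the pattern can be tried without network access.

```r
# Hypothetical sketch: query several repositories, turning a failure for
# one repository into a warning so the remaining ones are still queried.
query_repos <- function(repos, fetch_index) {
  results <- list()
  for (repo in repos) {
    index <- tryCatch(
      fetch_index(repo),
      error = function(e) {
        # Downgrade the error to a warning and skip this repository.
        warning(sprintf("unable to access index for repository %s:\n  %s",
                        repo, conditionMessage(e)), call. = FALSE)
        NULL
      }
    )
    if (!is.null(index)) results[[repo]] <- index
  }
  results
}

# Stand-in fetcher (an assumption for the example): fails for "bad".
fake_fetch <- function(repo) {
  if (repo == "bad") stop("404 Not Found")
  data.frame(Package = "pkgA", stringsAsFactors = FALSE)
}

res <- suppressWarnings(query_repos(c("bad", "good"), fake_fetch))
```

With this shape, one unreachable repository costs a warning rather than aborting the whole query.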

The fix I provided does not address the underlying problem, which is that

   download.file("http://www.stats.ox.ac.uk/pub/RWin/PACKAGES.gz",
                 fl <- tempfile(), method="libcurl")

actually downloads the error file, without throwing an error:

 > download.file("http://www.stats.ox.ac.uk/pub/RWin/PACKAGES.gz",
                 fl <- tempfile(), method="libcurl")
trying URL 'http://www.stats.ox.ac.uk/pub/RWin/PACKAGES.gz'
Content type 'text/html; charset=iso-8859-1' length 302 bytes
=================================================
downloaded 302 bytes

 > cat(paste(readLines(fl), collapse="\n"))
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /pub/RWin/PACKAGES.gz was not found on this server.</p>
<hr>
<address>Apache/2.2.22 (Debian) Server at www.stats.ox.ac.uk Port 80</address>
</body></html>
 >


I do have a patch for this, which I will share off-list before committing.
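
Until such a fix is in place, a caller could guard against the symptom by checking whether the downloaded file is really an HTML page. The following is a rough sketch and not the patch mentioned above; `looks_like_html` is a hypothetical helper, and it would misfire on files that are legitimately HTML.

```r
# Hypothetical helper: TRUE if the first few lines of a file look like
# the start of an HTML document (for instance a server error page).
looks_like_html <- function(path, n = 5L) {
  head_lines <- readLines(path, n = n, warn = FALSE)
  any(grepl("^\\s*<(!DOCTYPE\\s+html|html)\\b", head_lines,
            ignore.case = TRUE, perl = TRUE, useBytes = TRUE))
}

# Offline demonstration using the start of the 404 page shown above.
err_page <- tempfile()
writeLines(c('<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">',
             "<html><head>",
             "<title>404 Not Found</title>"), err_page)

dcf_file <- tempfile()
writeLines(c("Package: foo", "Version: 1.0"), dcf_file)

is_err <- looks_like_html(err_page)
is_dcf_html <- looks_like_html(dcf_file)
```

This only treats the symptom, since it inspects the body instead of the HTTP status code.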

Martin
-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

Jeroen Ooms

2015-Aug-27 15:46 UTC


On Thu, Aug 27, 2015 at 5:16 PM, Martin Maechler
<maechler at stat.math.ethz.ch> wrote:
> Probably I'm confused now...
> Both R-patched and R-devel give an error (after a *long* wait!)
> for
>        download.file("https://someserver.com/mydata.csv", "mydata.csv")
>
> So that problem is I think  solved now.

I'm sorry for the confusion, this was a hypothetical example.
Connection failures are different from http status errors. Below are some
real examples of servers returning http errors. For each example the
"internal" method correctly raises an R error, whereas the "libcurl"
method does not.

# File not found (404)
download.file("http://httpbin.org/data.csv", "data.csv", method = "internal")
download.file("http://httpbin.org/data.csv", "data.csv", method = "libcurl")
readLines(url("http://httpbin.org/data.csv", method = "internal"))
readLines(url("http://httpbin.org/data.csv", method = "libcurl"))

# Unauthorized (401)
download.file("https://httpbin.org/basic-auth/user/passwd", "data.csv", method = "internal")
download.file("https://httpbin.org/basic-auth/user/passwd", "data.csv", method = "libcurl")
readLines(url("https://httpbin.org/basic-auth/user/passwd", method = "internal"))
readLines(url("https://httpbin.org/basic-auth/user/passwd", method = "libcurl"))
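
As a stop-gap on the user side, the status code can be checked before the body is fetched. The sketch below assumes an R build with libcurl support, where base R's curlGetHeaders() returns the headers with a "status" attribute. `check_status` and `download_checked` are hypothetical names, and a server may answer the header request differently from the real download.

```r
# Hypothetical helper: signal an R error for HTTP status codes >= 400.
check_status <- function(status, url) {
  if (status >= 400L) {
    stop(sprintf("cannot open URL '%s': HTTP status was '%d'", url, status),
         call. = FALSE)
  }
  invisible(status)
}

# Hypothetical wrapper: look at the status first, then download.
download_checked <- function(url, destfile, ...) {
  status <- attr(curlGetHeaders(url), "status")
  check_status(status, url)
  download.file(url, destfile, ...)
}

# Example call (needs network access, so it is left commented out):
# download_checked("http://httpbin.org/data.csv", "data.csv", method = "libcurl")
```

The cost is one extra request per download, which is why handling the status inside the libcurl method itself is the better fix.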

Martin Morgan

2015-Aug-27 17:27 UTC


R-devel r69197 returns appropriate errors for the cases below; I know of a few
rough edges

- ftp error codes are not reported correctly
- download.file creates destfile before discovering that http fails, leaving an
  empty file on disk

and am happy to hear of more.
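
The leftover destfile mentioned in the list above can be worked around by the caller. This is a minimal sketch under stated assumptions, not the behaviour of R itself: `download_or_cleanup` is a hypothetical wrapper, and its `downloader` argument exists only so the example can run without a network.

```r
# Hypothetical wrapper: if the download fails, remove a destination file
# that this call created, then re-signal the original error.
download_or_cleanup <- function(url, destfile,
                                downloader = utils::download.file, ...) {
  existed <- file.exists(destfile)
  tryCatch(
    downloader(url, destfile, ...),
    error = function(e) {
      # Only delete files that did not exist before this call.
      if (!existed && file.exists(destfile)) unlink(destfile)
      stop(e)
    }
  )
}

# Stand-in downloader (an assumption for the example): it creates the
# file and then fails, mimicking the rough edge described above.
fake_downloader <- function(url, destfile, ...) {
  file.create(destfile)
  stop("HTTP status was '404 Not Found'")
}

dest <- tempfile()
res <- try(download_or_cleanup("http://example.invalid/data.csv", dest,
                               downloader = fake_downloader), silent = TRUE)
```

A file that already existed before the call is deliberately left alone, so a failed refresh does not destroy an older copy.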

Martin

On 08/27/2015 08:46 AM, Jeroen Ooms wrote:
> On Thu, Aug 27, 2015 at 5:16 PM, Martin Maechler
> <maechler at stat.math.ethz.ch> wrote:
>> Probably I'm confused now...
>> Both R-patched and R-devel give an error (after a *long* wait!)
>> for
>>         download.file("https://someserver.com/mydata.csv", "mydata.csv")
>>
>> So that problem is I think  solved now.
>
> I'm sorry for the confusion, this was a hypothetical example.
> Connection failures are different from http status errors. Below are some
> real examples of servers returning http errors. For each example the
> "internal" method correctly raises an R error, whereas the "libcurl"
> method does not.
>
> # File not found (404)
> download.file("http://httpbin.org/data.csv", "data.csv", method = "internal")
> download.file("http://httpbin.org/data.csv", "data.csv", method = "libcurl")
> readLines(url("http://httpbin.org/data.csv", method = "internal"))
> readLines(url("http://httpbin.org/data.csv", method = "libcurl"))
>
> # Unauthorized (401)
> download.file("https://httpbin.org/basic-auth/user/passwd", "data.csv", method = "internal")
> download.file("https://httpbin.org/basic-auth/user/passwd", "data.csv", method = "libcurl")
> readLines(url("https://httpbin.org/basic-auth/user/passwd", method = "internal"))
> readLines(url("https://httpbin.org/basic-auth/user/passwd", method = "libcurl"))
>
