I would also be pleased to be allowed to provide "a list of known
false positives/exceptions" to the URL tests. I've been challenged
multiple times about URLs that worked fine when I checked them. We
should not be required to perform a partial lobotomy to pass R CMD
check ;-)
Spencer Graves
On 2021-01-07 09:53, Hugo Gruson wrote:
>
> I encountered the same issue today with https://astrostatistics.psu.edu/.
>
> This is a trust chain issue, as explained here:
> https://whatsmychaincert.com/?astrostatistics.psu.edu.
>
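> For anyone who wants to diagnose this locally: openssl can show the
> chain the server actually sends (a diagnostic sketch, nothing
> R-specific):
>
> $ openssl s_client -connect astrostatistics.psu.edu:443 \
>       -servername astrostatistics.psu.edu < /dev/null
>
> If the "Certificate chain" section lists only the leaf certificate,
> the server is not sending the intermediates, and clients that don't
> already have them cached will reject the connection.
>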
> I've worked for a couple of years on a project to increase HTTPS
> adoption on the web, and we noticed that this type of error is very
> common and that website maintainers are often unresponsive to
> requests to fix it.
>
> Therefore, I totally agree with Kirill that a list of known false
> positives/exceptions would be a great addition to save time for both
> the CRAN team and package developers.
>
> Hugo
>
> On 07/01/2021 15:45, Kirill Müller via R-devel wrote:
>> One other failure mode: SSL certificates that are trusted by
>> browsers but not installed on the check machine, e.g. the "GEANT
>> Vereniging" certificate from https://relational.fit.cvut.cz/.
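>>
>> On a machine without that certificate in its trust store, the check
>> fails with the usual libcurl error (illustrative; the exact wording
>> depends on the curl build):
>>
>> $ curl -i https://relational.fit.cvut.cz/
>> curl: (60) SSL certificate problem: unable to get local issuer certificate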
>>
>>
>> K
>>
>>
>> On 07.01.21 12:14, Kirill Müller via R-devel wrote:
>>> Hi
>>>
>>>
>>> The URL checks in R CMD check test all links in the README and
>>> vignettes for breakage or redirection. While this improves the
>>> documentation in many cases, I see problems with this approach,
>>> which I have detailed below.
>>>
>>> I'm writing to this mailing list because I think the change needs
>>> to happen in R's check routines. I propose to introduce an
>>> "allow-list" for URLs, to reduce the burden on both CRAN and
>>> package maintainers.
>>>
>>> Comments are greatly appreciated.
>>>
>>>
>>> Best regards
>>>
>>> Kirill
>>>
>>>
>>> # Problems with the detection of broken/redirected URLs
>>>
>>> ## 301 should often be 307 -- but how to change it?
>>>
>>> Many web sites use a 301 redirection code that probably should be a
>>> 307. A 301 means "moved permanently, please update your links" --
>>> which is exactly what R CMD check then asks maintainers to do --
>>> whereas a 307 marks the redirect as temporary, so the original URL
>>> remains the correct one to cite. For example, https://www.oracle.com
>>> and https://www.oracle.com/ both redirect to
>>> https://www.oracle.com/index.html with a 301. I suspect the company
>>> still wants oracle.com to be recognized as the primary entry point
>>> of their web presence (reserving the right to point the redirect
>>> elsewhere later), though I haven't checked with their PR department.
>>> If that's true, the redirect probably should be a 307 -- a fix for
>>> their IT department, which I haven't contacted yet either.
>>>
>>> $ curl -i https://www.oracle.com
>>> HTTP/2 301
>>> server: AkamaiGHost
>>> content-length: 0
>>> location: https://www.oracle.com/index.html
>>> ...
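>>>
>>> The status can be reproduced from R with curlGetHeaders() (in the
>>> utils package), which as far as I can tell is also what the URL
>>> checks use under the hood:
>>>
>>> h <- curlGetHeaders("https://www.oracle.com", redirect = FALSE)
>>> attr(h, "status")
>>> #> [1] 301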
>>>
>>> ## User agent detection
>>>
>>> twitter.com responds with a 400 error to requests whose user agent
>>> string does not hint at an accepted browser.
>>>
>>> $ curl -i https://twitter.com/
>>> HTTP/2 400
>>> ...
>>> <body>...<p>Please switch to a supported browser...</p>...</body>
>>>
>>> $ curl -s -i https://twitter.com/ \
>>>       -A "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0" |
>>>       head -n 1
>>> HTTP/2 200
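>>>
>>> In R, the user agent for libcurl-based connections comes from the
>>> HTTPUserAgent option, so (if I read the docs correctly) the same
>>> work-around can be tried without leaving R:
>>>
>>> old <- options(HTTPUserAgent =
>>>   "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0")
>>> attr(curlGetHeaders("https://twitter.com/"), "status")
>>> #> [1] 200
>>> options(old)  # restore the previous user agent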
>>>
>>> # Impact
>>>
>>> While the latter problem *could* be fixed by supplying a
>>> browser-like user agent string, the former problem is virtually
>>> unfixable -- so many web sites should use 307 instead of 301 but
>>> don't. The above list is also incomplete -- think of unreliable
>>> links, HTTP links, other failure modes...
>>>
>>> This affects me as a package maintainer: I have the choice to
>>> either change the links to incorrect versions or remove them
>>> altogether.
>>>
>>> I could also explain each broken link to CRAN, but I think this
>>> subjects the team to undue burden. Submitting a package with NOTEs
>>> delays the release, and I must release this package very soon to
>>> avoid having it pulled from CRAN -- I'd rather not risk that, so I
>>> need to remove the links now and put them back later.
>>>
>>> I'm aware of https://github.com/r-lib/urlchecker, which alleviates
>>> the problem but ultimately doesn't solve it.
>>>
>>> # Proposed solution
>>>
>>> ## Allow-list
>>>
>>> A file inst/URL that lists all URLs where failures are allowed --
>>> possibly with a list of the HTTP codes accepted for that link.
>>>
>>> Example:
>>>
>>> https://oracle.com/ 301
>>> https://twitter.com/drob/status/1224851726068527106 400
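>>>
>>> A minimal sketch of how the check side could consume such a file
>>> (file name, format, and function names are part of this proposal,
>>> nothing that exists in R today):
>>>
>>> read_url_allow_list <- function(path) {
>>>   # Each line: <URL> [comma-separated HTTP codes accepted for it]
>>>   fields <- strsplit(readLines(path), "[[:space:]]+")
>>>   data.frame(
>>>     url = vapply(fields, `[[`, "", 1L),
>>>     codes = I(lapply(fields, function(x)
>>>       as.integer(strsplit(x[2], ",")[[1]])))
>>>   )
>>> }
>>>
>>> url_failure_allowed <- function(url, status, allow) {
>>>   i <- match(url, allow$url)
>>>   if (is.na(i)) return(FALSE)
>>>   codes <- allow$codes[[i]]
>>>   # No codes listed for this URL: allow any failure
>>>   all(is.na(codes)) || status %in% codes
>>> }
>>>
>>> allow <- read_url_allow_list("inst/URL")
>>> url_failure_allowed("https://oracle.com/", 301, allow)
>>> #> [1] TRUE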
>>>