>>>>> Viechtbauer, Wolfgang (SP)
>>>>>     on Fri, 8 Jan 2021 13:50:14 +0000 writes:

    > Instead of a separate file to store such a list, would it be an
    > idea to add versions of the \href{}{} and \url{} markup commands
    > that are skipped by the URL checks?
    > Best,
    > Wolfgang

I think John Nash and you misunderstood -- or else I misunderstood --
the original proposal: my understanding has been that there should be
a "central repository" of URL exceptions that is maintained by
volunteers, and rather *not* that package authors should get ways to
skip URL checking.

Martin

>> -----Original Message-----
>> From: R-devel [mailto:r-devel-bounces at r-project.org] On Behalf Of
>> Spencer Graves
>> Sent: Friday, 08 January, 2021 13:04
>> To: r-devel at r-project.org
>> Subject: Re: [Rd] URL checks
>>
>> I also would be pleased to be allowed to provide "a list of known
>> false positives/exceptions" to the URL tests. I've been challenged
>> multiple times regarding URLs that worked fine when I checked them.
>> We should not be required to do a partial lobotomy to pass
>> R CMD check ;-)
>>
>> Spencer Graves
>>
>> On 2021-01-07 09:53, Hugo Gruson wrote:
>>>
>>> I encountered the same issue today with
>>> https://astrostatistics.psu.edu/.
>>>
>>> This is a trust chain issue, as explained here:
>>> https://whatsmychaincert.com/?astrostatistics.psu.edu.
>>>
>>> I've worked for a couple of years on a project to increase HTTPS
>>> adoption on the web, and we noticed that this type of error is very
>>> common and that website maintainers are often unresponsive to
>>> requests to fix it.
>>>
>>> Therefore, I totally agree with Kirill that a list of known
>>> false positives/exceptions would be a great addition to save time
>>> for both the CRAN team and package developers.
>>>
>>> Hugo
>>>
>>> On 07/01/2021 15:45, Kirill Müller via R-devel wrote:
>>>> One other failure mode: SSL certificates that browsers trust but
>>>> that are not installed on the check machine, e.g. the
>>>> "GEANT Vereniging" certificate from https://relational.fit.cvut.cz/ .
>>>>
>>>> K
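Both certificate failures above, the incomplete chain and the missing
CA certificate, can be reproduced from a plain R session. A minimal
probe, using base R's curlGetHeaders() (essentially the call behind
R's URL checks) and one of the URLs from the messages above:

    ## Fails with a certificate verification error when the chain is
    ## incomplete or the CA is missing on the check machine:
    try(curlGetHeaders("https://relational.fit.cvut.cz/"))

    ## Turning verification off shows the site itself is reachable, so
    ## this is a trust problem, not a dead link:
    curlGetHeaders("https://relational.fit.cvut.cz/", verify = FALSE)[[1]]

The same probe against https://astrostatistics.psu.edu/ distinguishes
Hugo's chain issue from an actual outage.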
>>>> On 07.01.21 12:14, Kirill Müller via R-devel wrote:
>>>>> Hi
>>>>>
>>>>> The URL checks in R CMD check test all links in the README and
>>>>> vignettes for broken or redirected links. In many cases this
>>>>> improves documentation, but I see problems with this approach,
>>>>> which I have detailed below.
>>>>>
>>>>> I'm writing to this mailing list because I think the change needs
>>>>> to happen in R's check routines. I propose to introduce an
>>>>> "allow-list" for URLs, to reduce the burden on both CRAN and
>>>>> package maintainers.
>>>>>
>>>>> Comments are greatly appreciated.
>>>>>
>>>>> Best regards
>>>>>
>>>>> Kirill
>>>>>
>>>>> # Problems with the detection of broken/redirected URLs
>>>>>
>>>>> ## 301 should often be 307, how to change?
>>>>>
>>>>> Many web sites use a 301 redirection code that probably should be
>>>>> a 307. For example, https://www.oracle.com and
>>>>> https://www.oracle.com/ both redirect to
>>>>> https://www.oracle.com/index.html with a 301. I suspect the
>>>>> company still wants oracle.com to be recognized as the primary
>>>>> entry point of their web presence (to reserve the right to move
>>>>> the redirection to a different location later), though I haven't
>>>>> checked with their PR department. If that's true, the redirect
>>>>> probably should be a 307, which would have to be fixed by their
>>>>> IT department, which I haven't contacted yet either.
>>>>>
>>>>> $ curl -i https://www.oracle.com
>>>>> HTTP/2 301
>>>>> server: AkamaiGHost
>>>>> content-length: 0
>>>>> location: https://www.oracle.com/index.html
>>>>> ...
>>>>>
>>>>> ## User agent detection
>>>>>
>>>>> twitter.com responds with a 400 error to requests whose
>>>>> user-agent string does not hint at an accepted browser.
>>>>>
>>>>> $ curl -i https://twitter.com/
>>>>> HTTP/2 400
>>>>> ...
>>>>> <body>...<p>Please switch to a supported browser...</p>...</body>
>>>>>
>>>>> $ curl -s -i https://twitter.com/ -A "Mozilla/5.0 (X11; Ubuntu; Linux
>>>>> x86_64; rv:84.0) Gecko/20100101 Firefox/84.0" | head -n 1
>>>>> HTTP/2 200
>>>>>
>>>>> # Impact
>>>>>
>>>>> While the latter problem *could* be fixed by supplying a
>>>>> browser-like user-agent string, the former is virtually
>>>>> unfixable: far too many web sites should use 307 instead of 301
>>>>> but don't. The above list is also incomplete -- think of
>>>>> unreliable links, HTTP links, and other failure modes...
>>>>>
>>>>> This affects me as a package maintainer: I have the choice to
>>>>> either change the links to incorrect versions or remove them
>>>>> altogether.
>>>>>
>>>>> I can also choose to explain each broken link to CRAN, but I
>>>>> think this subjects the team to undue burden. Submitting a
>>>>> package with NOTEs delays the release; for a package which I must
>>>>> release very soon to avoid having it pulled from CRAN, I'd rather
>>>>> not risk that -- hence I need to remove the link and put it back
>>>>> later.
>>>>>
>>>>> I'm aware of https://github.com/r-lib/urlchecker; it alleviates
>>>>> the problem but ultimately doesn't solve it.
>>>>>
>>>>> # Proposed solution
>>>>>
>>>>> ## Allow-list
>>>>>
>>>>> A file inst/URL that lists all URLs where failures are allowed --
>>>>> possibly with a list of the HTTP codes accepted for that link.
>>>>>
>>>>> Example:
>>>>>
>>>>> https://oracle.com/ 301
>>>>> https://twitter.com/drob/status/1224851726068527106 400
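The user-agent failure can likewise be reproduced, and worked around,
from R. A sketch, assuming R >= 4.0.0 (which added the headers
argument to url()) and reusing the Firefox user-agent string from the
curl call above; the 400 response reflects twitter.com's behaviour at
the time of this thread:

    ua <- "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0"

    ## Without a User-Agent header the request is answered with
    ## HTTP 400 and open() fails; with a browser-like one it succeeds:
    con <- url("https://twitter.com/", method = "libcurl",
               headers = c(`User-Agent` = ua))
    open(con)
    close(con)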
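As for the allow-list itself, a hypothetical sketch of how the
proposed inst/URL format could be parsed and consulted. The layout
(URL, then optional accepted HTTP codes) follows Kirill's example;
both helper functions are invented for illustration and are not an
existing tools API:

    ## Read "URL [code ...]" lines from the proposed allow-list file:
    read_url_allowlist <- function(path = "inst/URL") {
      lines <- trimws(readLines(path))
      parts <- strsplit(lines[nzchar(lines)], "[[:space:]]+")
      list(url   = vapply(parts, `[[`, "", 1L),
           codes = lapply(parts, function(p) as.integer(p[-1])))
    }

    ## TRUE if a failing status for 'url' is covered by the list; an
    ## entry without codes allows any failure for that URL:
    failure_allowed <- function(allow, url, status) {
      i <- match(url, allow$url)
      !is.na(i) &&
        (length(allow$codes[[i]]) == 0L || status %in% allow$codes[[i]])
    }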
>>>>>> Viechtbauer, Wolfgang (SP)
>>>>>>     on Fri, 8 Jan 2021 13:50:14 +0000 writes:
>
> > Instead of a separate file to store such a list, would it be an
> > idea to add versions of the \href{}{} and \url{} markup commands
> > that are skipped by the URL checks?
> > Best,
> > Wolfgang
>
> I think John Nash and you misunderstood -- or else I misunderstood --
> the original proposal: my understanding has been that there should be
> a "central repository" of URL exceptions that is maintained by
> volunteers, and rather *not* that package authors should get ways to
> skip URL checking.
>
> Martin

Hi Martin,

Kirill suggested: "A file inst/URL that lists all URLs where failures
are allowed -- possibly with a list of the HTTP codes accepted for
that link."

So, if it is a file in inst/, then this sounds to me like it is part
of the package and not part of some central repository.

Best,
Wolfgang
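For contrast, the markup-level alternative Wolfgang floated earlier
would look roughly like this in an Rd file; \urlnocheck is purely
hypothetical, no such macro exists:

    \url{https://cran.r-project.org/}   % checked as usual
    \urlnocheck{https://twitter.com/drob/status/1224851726068527106}
                             % hypothetical: skipped by the URL checks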
Sorry, Martin, but I've NOT commented on this matter, unless someone
has been impersonating me. Someone else?

JN

On 2021-01-11 4:51 a.m., Martin Maechler wrote:
> [...]