>>>>> "Chris" == Chris Evans <chrishold at
psyctc.org> writes:
Chris> I use R a great deal but its web-crawling capabilities
Chris> aren't an area I've used. I don't want to reinvent a
Chris> cyberwheel and I suspect someone has done what I want.
Chris> That is a program that would run once a day (easy for
Chris> me to set up as a cron task) and would crawl a single
Chris> root of a web site (mine) and get the file size and a
Chris> CRC or some similar check value for each page as pulled
Chris> off the site (and, obviously, I'd want it not to follow
Chris> off site links). The other key thing would be for it to
Chris> store the values and URLs and be capable of being run
Chris> in "create/update database" mode or in "check pages"
Chris> mode and for the change mode run to Email me a warning
Chris> if a page changes. The reason I want this is that two
Chris> of my sites have recently had content "disappear":
Chris> neither I nor the ISP can see what's happened and we
Chris> are lacking the very useful diagnostic of the date when
Chris> the change happened, which might have mapped it to some
Chris> component of WordPress, plugins or themes having
Chris> updated.
Chris> I am failing to find any such thing, and all the services
Chris> that offer site checking of this sort are prohibitively
Chris> expensive for me (my sites are zero income and either
Chris> personal or offering free utilities and information).
Chris> If anyone has done this, or something similar, I'd love
Chris> to hear if you were willing to share it. Failing that,
Chris> I think I will have to create this but I know it will
Chris> take me days as this isn't my area of R expertise and
Chris> as, to be brutally honest, I'm a pretty poor
Chris> programmer. If I go that way, I'm sure people may be
Chris> able to point me to things I may be (legitimately) able
Chris> to recycle in parts to help construct this.
Chris> Thanks in advance,
Chris> Chris
Chris> --
Chris> Chris Evans <chris at psyctc.org> Skype: chris-psyctc
Chris> Visiting Professor, University of Sheffield <chris.evans at
sheffield.ac.uk>
Chris> I do some consultation work for the University of Roehampton
<chris.evans at roehampton.ac.uk> and other places but this <chris at
psyctc.org> remains my main Email address.
Chris> I have "semigrated" to France, see:
https://www.psyctc.org/pelerinage2016/semigrating-to-france/ if you want to book
to talk, I am trying to keep that to Thursdays and my diary is now available at:
https://www.psyctc.org/pelerinage2016/ecwd_calendar/calendar/
Chris> Beware: French time, generally an hour ahead of UK. That page
will also take you to my blog which started with earlier joys in France and
Spain!
Not an answer, but perhaps two pointers/ideas:
1) Since you know cron, I suppose you work on a
Unix-like system, and you likely have a programme
called 'wget' either installed or easily
installable. 'wget' has an option '--mirror',
which allows you to mirror a website.
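For instance, you could drive 'wget' from R and run that from
cron. A rough sketch (the URL is a placeholder; 'wget' must be on
the PATH):

```r
## Sketch: mirror a single site with 'wget' called from R.
## 'wget --mirror' stays on the starting host by default, so
## off-site links are not followed.
mirror_args <- function(url, dest = "mirror") {
  c("--mirror",      # recursive retrieval with timestamping
    "--no-parent",   # do not ascend above the start URL
    "-P", dest,      # store the local copy under 'dest'
    url)
}

mirror_site <- function(url, dest = "mirror") {
  invisible(system2("wget", mirror_args(url, dest)))
}

## e.g.  mirror_site("https://example.org/")
```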
2) There is tools::md5sum for computing checksums. You
could store those in a file and check for changes in
the files' content (e.g. via 'diff').
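A minimal sketch of that second step, assuming the site has
already been mirrored into a local directory (the directory and
database file names are just placeholders):

```r
## Checksum every file under 'dir' with tools::md5sum(), compare
## against the checksums stored from the previous run, and return
## the paths that changed, appeared or disappeared.
library(tools)

check_site <- function(dir = "mirror", db = "checksums.rds") {
  files <- list.files(dir, recursive = TRUE, full.names = TRUE)
  new <- md5sum(files)             # named vector: path -> MD5
  if (!file.exists(db)) {          # first run: create the database
    saveRDS(new, db)
    return(character(0))
  }
  old <- readRDS(db)
  common  <- intersect(names(old), names(new))
  changed <- common[old[common] != new[common]]
  added   <- setdiff(names(new), names(old))
  removed <- setdiff(names(old), names(new))
  saveRDS(new, db)                 # update the database
  c(changed, added, removed)
}
```

Run once to create the database; on later runs a non-empty result
means something changed, which a cron wrapper could turn into an
email.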
regards
Enrico
--
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net