>>>>> "Chris" == Chris Evans <chrishold at
psyctc.org> writes:
Chris> I use R a great deal but its web-crawling capabilities
Chris> aren't an area I've used. I don't want to reinvent a
Chris> cyberwheel and I suspect someone has done what I want.
Chris> That is a program that would run once a day (easy for
Chris> me to set up as a cron task) and would crawl a single
Chris> root of a web site (mine) and get the file size and a
Chris> CRC or some similar check value for each page as pulled
Chris> off the site (and, obviously, I'd want it not to follow
Chris> off site links). The other key thing would be for it to
Chris> store the values and URLs and be capable of being run
Chris> in "create/update database" mode or in "check pages"
Chris> mode and for the change mode run to Email me a warning
Chris> if a page changes. The reason I want this is that two
Chris> of my sites have recently had content "disappear":
Chris> neither I nor the ISP can see what's happened and we
Chris> are lacking the very useful diagnostic of the date when
Chris> the change happened, which might have mapped it to some
Chris> component of WordPress, plugins or themes having
Chris> updated.
Chris> I am failing to find any such thing, and all the services
Chris> that offer site checking of this sort are prohibitively
Chris> expensive for me (my sites are zero income and either
Chris> personal or offering free utilities and information).
Chris> If anyone has done this, or something similar, I'd love
Chris> to hear if you were willing to share it. Failing that,
Chris> I think I will have to create this but I know it will
Chris> take me days as this isn't my area of R expertise and
Chris> as, to be brutally honest, I'm a pretty poor
Chris> programmer. If I go that way, I'm sure people may be
Chris> able to point me to things I may be (legitimately) able
Chris> to recycle in parts to help construct this.
Chris> Thanks in advance,
Chris> Chris
Chris> --
Chris> Chris Evans <chris at psyctc.org> Skype: chris-psyctc
Chris> Visiting Professor, University of Sheffield <chris.evans at
sheffield.ac.uk>
Chris> I do some consultation work for the University of Roehampton
<chris.evans at roehampton.ac.uk> and other places but this <chris at
psyctc.org> remains my main Email address.
Chris> I have "semigrated" to France, see:
https://www.psyctc.org/pelerinage2016/semigrating-to-france/ if you want to book
to talk, I am trying to keep that to Thursdays and my diary is now available at:
https://www.psyctc.org/pelerinage2016/ecwd_calendar/calendar/
Chris> Beware: French time, generally an hour ahead of UK. That page
will also take you to my blog which started with earlier joys in France and
Spain!
Not an answer, but perhaps two pointers/ideas:
1) Since you know cron, I suppose you work on a
Unix-like system, and you likely have a programme
called 'wget' either installed or easily
installable. 'wget' has an option '--mirror',
which allows you to mirror a website.
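For instance, you could drive 'wget' from R and run that from
cron. A rough sketch (the URL is a placeholder; 'wget' must be on
the PATH):

```r
## Sketch: mirror a single site with 'wget' called from R.
## 'wget --mirror' stays on the starting host by default, so
## off-site links are not followed.
mirror_args <- function(url, dest = "mirror") {
  c("--mirror",      # recursive retrieval with timestamping
    "--no-parent",   # do not ascend above the start URL
    "-P", dest,      # store the local copy under 'dest'
    url)
}

mirror_site <- function(url, dest = "mirror") {
  invisible(system2("wget", mirror_args(url, dest)))
}

## e.g.  mirror_site("https://example.org/")
```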
2) There is tools::md5sum for computing checksums. You
could store those in a file and check for changes in
the files' content (e.g. via 'diff').
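A minimal sketch of that second step, assuming the site has
already been mirrored into a local directory (the directory and
database file names are just placeholders):

```r
## Checksum every file under 'dir' with tools::md5sum(), compare
## against the checksums stored from the previous run, and return
## the paths that changed, appeared or disappeared.
library(tools)

check_site <- function(dir = "mirror", db = "checksums.rds") {
  files <- list.files(dir, recursive = TRUE, full.names = TRUE)
  new <- md5sum(files)             # named vector: path -> MD5
  if (!file.exists(db)) {          # first run: create the database
    saveRDS(new, db)
    return(character(0))
  }
  old <- readRDS(db)
  common  <- intersect(names(old), names(new))
  changed <- common[old[common] != new[common]]
  added   <- setdiff(names(new), names(old))
  removed <- setdiff(names(old), names(new))
  saveRDS(new, db)                 # update the database
  c(changed, added, removed)
}
```

Run once to create the database; on later runs a non-empty result
means something changed, which a cron wrapper could turn into an
email.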
regards
Enrico
--
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net