KathysKode-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
2008-Jun-19 17:21 UTC
Google, MSN, Yahoo spiders crawling off my 'database universe'?
I recently figured out how to create a fairly complex Google Sitemap file and am happy to share this code with anyone who asks. As I have a highly nested database, a common URL for me will look something like:

www.MyWebsite.com/parents/23/children/45

The spiders come to my website and 'crawl' along, incrementing these sequences, which eventually puts them at:

www.MyWebsite.com/parents/23/children/46

As this URL has gone off the edge of my database universe, my exception_notification plugin sends me an email. Is there a way to put logic somewhere so that if a spider (or person) is messing around and requests a URL that isn't there, a routine kicks in telling them "You've gone off the edge of my database universe" on a view, and then takes them back to where they were?

Thank you for any thoughts you may offer.
Kathleen

You received this message because you are subscribed to the Google Groups "Ruby on Rails: Talk" group. To post to this group, send email to rubyonrails-talk-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org. To unsubscribe from this group, send email to rubyonrails-talk-unsubscribe-/JYPxA39Uh5TLH3MbocFFw@public.gmane.org. For more options, visit this group at http://groups.google.com/group/rubyonrails-talk?hl=en
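[Editor's note] One way to do what Kathleen asks, sketched under the assumption of a Rails 2.x app (where `rescue_from` is available): catch `ActiveRecord::RecordNotFound` in `ApplicationController`, show a friendly message, and send the visitor back where they came from. The `record_not_found` handler name and flash wording are illustrative, not from the thread.

```ruby
class ApplicationController < ActionController::Base
  # Instead of letting a missing record bubble up to exception_notification
  # (and generate an email), handle it and route the visitor somewhere sensible.
  rescue_from ActiveRecord::RecordNotFound, :with => :record_not_found

  private

  # Hypothetical handler: show the message, then return the visitor to the
  # referring page, or the home page if there is no Referer header.
  def record_not_found
    flash[:notice] = "You've gone off the edge of my database universe"
    redirect_to(request.env["HTTP_REFERER"] || root_path)
  end
end
```

Note that for spiders specifically, rendering a page with a real 404 status (`render :status => 404`) is friendlier than a redirect, since it lets the dead URL fall out of the search engine's index.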
Dhaval Parikh
2008-Jun-20 04:54 UTC
Re: Google, MSN, Yahoo spiders crawling off my 'database universe'?
Well, you can set up a proper robots.txt and disallow certain things from being accessed on your site. This is the standard way to put restrictions on crawling.

Hope this helps.

Thanks,
Dhaval Parikh
Software Engineer
http://www.railshouse.com
sales(AT)railshouse(DOT)com

-- Posted via http://www.ruby-forum.com/.
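[Editor's note] A robots.txt for Dhaval's suggestion might look like the sketch below; the disallowed path is hypothetical, standing in for whichever deeply nested subtree Kathleen wants crawlers to stay out of. It lives at the site root (e.g. Rails' public/robots.txt).

```text
# Keep all well-behaved crawlers out of the nested resource tree.
User-agent: *
Disallow: /parents/

# Or target one crawler specifically:
User-agent: Googlebot
Disallow: /parents/
```

This only restrains well-behaved crawlers; it does nothing about a person typing URLs by hand, so it complements rather than replaces proper 404 handling.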
Matt Stone1
2008-Jun-20 06:30 UTC
Re: Google, MSN, Yahoo spiders crawling off my ''database universe''?
KathysKode-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org wrote:
> I recently figured out how to create a fairly complex Google Sitemap
> file [...] Is there a way to put logic somewhere so that if a spider
> (or person) is messing around and requests a URL that isn't there, a
> routine kicks in telling them "You've gone off the edge of my database
> universe" on a view, and then takes them back to where they were?

Hi there,

I don't think that spiders work that way. I thought that they follow existing links, rather than looking at the URL and guessing what another one could be. If there are no links to a page, a spider should not get to it.

Unless the link used to exist but does not now, i.e. you are dynamically generating your URLs using table ids and have removed some records from the table. In this case, the links will still be in the search engine's index and polled every so often. This is not a bad thing, as the link will eventually fall out of the search engine's index. You can request removal through the webmaster admin tools if this bugs you.

Yahoo's spider deliberately generates a dummy URL to trigger a "404 - page not found" error, to understand how your site handles this. So don't be too worried that Yahoo has weird links for your site in its index.

Otherwise you can put something like this in your logic:

  user_agent = request.user_agent.to_s.downcase  # to_s guards against a missing User-Agent
  if ![ 'msnbot', 'yahoo! slurp', 'googlebot' ].detect { |b| user_agent.include? b }
    # This request is not from one of the msn, yahoo, or google spiders,
    # so process accordingly
  end

rgds,
- matt
http://www.internetschminternet.com

-- Posted via http://www.ruby-forum.com/.
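[Editor's note] Matt's check can be pulled out into a small standalone helper, runnable outside Rails. The method name `bot_request?` is an editorial invention; the substring list is Matt's, and `to_s` handles requests that arrive with no User-Agent header at all.

```ruby
# Substrings that identify the big three crawlers, per Matt's snippet.
BOT_SIGNATURES = ['msnbot', 'yahoo! slurp', 'googlebot']

# True if the given User-Agent string looks like one of those spiders.
def bot_request?(user_agent)
  ua = user_agent.to_s.downcase  # to_s: the header may be missing (nil)
  BOT_SIGNATURES.any? { |sig| ua.include?(sig) }
end

puts bot_request?('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)')  # => true
puts bot_request?('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)')                        # => false
```

In a controller this could gate the "back to where you were" behaviour: redirect humans, but give spiders a plain 404 so the bad URL drops out of their index.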