Matt Barnicle
2007-Feb-08 07:18 UTC
[Xapian-discuss] Getting custom field data from the page through crawling
Now on to my next question.. I've got the search and indexing working well for now.. My next quest is to implement a system of creating custom fields in the index.

Our site is fully dynamic. That is, every page is generated in PHP and there are enough different kinds of pages that I wouldn't want to get into the business of indexing the DB directly, so I think that using htdig to crawl the site is the best way to go.. But, I would like to be able to search for things by field such as 'type', 'category', 'name', 'city', etc.

I thought about it a lot and also did a lot of reading and research in the list archives but couldn't come up with any way of passing this information from the built pages to the database. I was hoping I could store this in meta tags, like:

    <meta name="myorg.item.type" content="event" />
    <meta name="myorg.item.category" content="theatre" />
    <meta name="myorg.item.name" content="The Nutcracker Suite" />
    <meta name="myorg.item.start_date" content="2007-02-10" />
    <meta name="myorg.item.end_date" content="2007-02-16" />

That won't work the best though, because htdig won't store that information in a meaningful way to allow me to retrieve it in order to set the fields myself later.

So, the one workaround solution I could come up with was to maybe edit the htdig2omega script, and for each doc read from db.docs, I then do an HTTP request on the URL, read it, parse these tags and then print the fields, which will map to the settings I specify in htdig2omega.script. But of course, I'm doing two page lookups when I spider the site.. Once for the main htdig crawl, and a second time during the db conversion. Is there a better way to achieve this result?

A second question that goes along with that one.. Can I have multiple field datum with the same name? For example sometimes an event falls under more than one category, like 'theatre' and 'performing arts'. That's a basic example, but there are others where there are many options like if the page type is 'venue', which services that venue offers like wheelchair accessibility, closed caption, braille, and more.. Our site visitors will be searching on these attributes to find for example, events happening on a certain date at venues that offer certain services.

- Matt
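
A minimal sketch of the "parse these tags and then print the fields" step, assuming the page HTML has already been fetched and that the meta tags use the myorg.item. prefix shown in the message. The names MetaFieldParser and extract_fields and the name=value output lines are hypothetical, chosen for illustration; they are not part of htdig or htdig2omega. Repeated names (two category tags, say) are kept as separate pairs.

```python
from html.parser import HTMLParser

# Prefix taken from the example meta tags in the message.
PREFIX = "myorg.item."


class MetaFieldParser(HTMLParser):
    """Collect <meta name="myorg.item.*" content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        # List of (field_name, value) pairs; duplicate names are preserved.
        self.fields = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags also arrive here via handle_startendtag.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        content = attr.get("content")
        if name.startswith(PREFIX) and content is not None:
            self.fields.append((name[len(PREFIX):], content))


def extract_fields(html):
    """Return the custom (name, value) pairs found in an HTML string."""
    parser = MetaFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields


page = """<html><head>
<meta name="myorg.item.type" content="event" />
<meta name="myorg.item.category" content="theatre" />
<meta name="myorg.item.category" content="performing arts" />
<meta name="description" content="not a custom field" />
</head><body></body></html>"""

# Hypothetical output format: one name=value line per pair.
for name, value in extract_fields(page):
    print(f"{name}={value}")
```

The same function would work on pages read from a local mirror instead of a second HTTP request, since it only needs the HTML text.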
Olly Betts
2007-Feb-09 06:42 UTC
[Xapian-discuss] Getting custom field data from the page through crawling
On Wed, Feb 07, 2007 at 11:21:37PM -0800, Matt Barnicle wrote:
> Is there a better way to achieve this result?

I think you probably want to use a web crawling library which gives you full access to the page text for each page. I don't know such libraries well enough to recommend a particular one though.

Another approach is to mirror the site locally (with wget for example) and then index from this local mirror.

> A second question that goes along with that one.. Can I have multiple
> field datum with the same name?

Yes, Omega's $field{} command documents how it is handled:

    If multiple instances of field exist the field values are returned tab separated

I've always thought this was a slightly odd feature though - if you really want this, it seems better to just put the tab-separated values into a single field and save yourself the bytes required to repeat "FIELDNAME=" each time...

Cheers,
    Olly
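
A rough sketch of the single-field alternative suggested above: values that share a field name are joined with tabs into one line, and split on tabs again when read back. The helper names merge_fields and split_field are hypothetical, and the name=value line layout is an assumption for illustration rather than a statement of how Omega stores document data.

```python
def merge_fields(pairs):
    """Join values sharing a field name into one tab-separated name=value line."""
    merged = {}
    for name, value in pairs:
        merged.setdefault(name, []).append(value)
    # dicts keep insertion order, so fields appear in first-seen order.
    return [name + "=" + "\t".join(values) for name, values in merged.items()]


def split_field(value):
    """Split a tab-separated field value back into its individual values."""
    return value.split("\t") if value else []


pairs = [
    ("category", "theatre"),
    ("type", "event"),
    ("category", "performing arts"),
]

lines = merge_fields(pairs)
for line in lines:
    print(line)

# Reading the merged category field back into a list of values.
categories = split_field(lines[0].split("=", 1)[1])
print(categories)
```

This only works if the values themselves never contain a tab character, so any tabs in the page content would need to be stripped or replaced first.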