Hi all,

I recently tried out Xapian and used it to create an index of about 2000 PDF files. It took a while to index, but it served my needs very nicely and was simple to set up.

I have another idea for an application of the Xapian indexing system. I think that it's probably not exactly what Xapian is all about, but nevertheless, I wonder if you have any comments or alternative suggestions.

The aim is to provide a system that indexes Apache web server logs for a news-style website content management system. We index articles, issues, and sections of a set of monthly or weekly publications. Articles have topic tags, and we also have information about who (username) is visiting our site, and when and from where.

What we want to be able to do is to index the webserver's accesses so that we can do full drill-down and find all hits from people in a particular country on a particular day, or all hits on a particular article, etc.

I thought that Xapian, particularly using its boolean mode of operation, might be suitable for this type of indexing and querying in a way that perhaps conventional RDBMSs are not. Each 'hit' would become a 'document' in Xapian, so there would soon be millions of 'documents' but with relatively few 'keywords' such as username, date, article title, etc. Would you agree with that thought? If not, would you suggest a different approach, perhaps some more suitable software? I was thinking of Splunk and wondering how they might have implemented their system. Would such indexing and search be feasible with a single shared server?

Is it possible to output aggregate and time-series data from Xapian, or is it only possible to get ranked search results? My experience so far is just with Omega, so I'm not sure what the possibilities with the API might be here.

Has anyone used Xapian in this kind of way? Any suggestions much appreciated.

Cheers
JP

--
http://www.curioussymbols.com/
Michael Schlenker
2006-May-17 09:23 UTC
[Xapian-discuss] Using Xapian for webserver logs...?
John Pye wrote:
> I have another idea for an application of the Xapian indexing system. I
> think that it's probably not exactly what Xapian is all about, but
> nevertheless, I wonder if you have any comments or alternative suggestions.
>
> The aim is to provide a system that indexes Apache web server logs for a
> news-style website content management system. We index articles, issues,
> sections of a set of monthly or weekly publications. Articles have topic
> tags and we also have information about who (username) is visiting our
> site, and when and from where.
>
> What we want to be able to do is to index the webserver's accesses so
> that we can do full drill-down and find all hits from people in a
> particular country on a particular day, or all hits on a particular
> article, etc.

Sounds more like you want an RDBMS and to do data warehousing/decision support type stuff with it.

> I thought that Xapian, particularly using its boolean mode of operation,
> might be suitable for this type of indexing and querying in a way that
> perhaps conventional RDBMS are not. Each 'hit' would become a 'document'
> in Xapian, so there would soon be millions of 'documents' but with
> relatively few 'keywords' such as username, date, article title, etc.
> Would you agree with that thought? If not, would you suggest a different
> approach, perhaps some more suitable software? I was thinking of Splunk
> and wondering how they might have implemented their system. Would such
> indexing and search be feasible with a single shared server?

You could do things like that with Xapian's API; the main question is 'why?'. You don't seem to be doing any meaningful full-text search. I would simply parse the log files, store the 'dimensions' you're interested in in a suitable RDBMS, and then use that to drill down. An RDBMS is probably more suitable for this task, but you have to invest some time in designing proper table structures for the type of questions you want answered.
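As a rough illustration of the "parse the logs, store the dimensions, drill down" approach, here is a minimal sketch using Python's standard sqlite3 module. The table name `hits`, its columns, the sample log lines, the usernames, and the `COUNTRY_BY_HOST` lookup are all assumptions made up for illustration; a real setup would use a proper GeoIP database and whatever RDBMS and schema fit the questions being asked.

```python
import re
import sqlite3
from datetime import datetime

# Apache common/combined log prefix: host ident user [time] "request" status size
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

# Hypothetical stand-in for a GeoIP lookup (assumption, not a real data source).
COUNTRY_BY_HOST = {"203.0.113.7": "AU", "198.51.100.4": "DE"}


def parse_line(line):
    """Return (username, day, path, country) for one log line, or None if it does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    country = COUNTRY_BY_HOST.get(m.group("host"), "??")
    return (m.group("user"), ts.date().isoformat(), m.group("path"), country)


def load_hits(conn, lines):
    """Create the (assumed) hits table, insert one row per parsable line, return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits "
        "(username TEXT, day TEXT, path TEXT, country TEXT)"
    )
    rows = [row for row in map(parse_line, lines) if row is not None]
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?)", rows)
    # Index the dimensions used for drill-down.
    conn.execute("CREATE INDEX IF NOT EXISTS hits_day_country ON hits (day, country)")
    conn.execute("CREATE INDEX IF NOT EXISTS hits_path ON hits (path)")
    return len(rows)


def hits_from(conn, country, day):
    """Drill-down: all hits from one country on one day."""
    cur = conn.execute(
        "SELECT username, path FROM hits WHERE country = ? AND day = ? "
        "ORDER BY username, path",
        (country, day),
    )
    return cur.fetchall()


def hits_per_day(conn, path):
    """Time series: number of hits per day for one article path."""
    cur = conn.execute(
        "SELECT day, COUNT(*) FROM hits WHERE path = ? GROUP BY day ORDER BY day",
        (path,),
    )
    return cur.fetchall()


# Made-up sample data in Apache common log format.
SAMPLE = [
    '203.0.113.7 - jpye [17/May/2006:09:23:00 +0000] "GET /articles/42 HTTP/1.1" 200 5120',
    '198.51.100.4 - anna [17/May/2006:10:01:12 +0000] "GET /articles/42 HTTP/1.1" 200 5120',
    '203.0.113.7 - jpye [18/May/2006:08:00:00 +0000] "GET /issues/7 HTTP/1.1" 200 2048',
]

conn = sqlite3.connect(":memory:")
load_hits(conn, SAMPLE)
print(hits_from(conn, "AU", "2006-05-17"))
print(hits_per_day(conn, "/articles/42"))
```

The aggregate and time-series questions become ordinary GROUP BY queries here, which is the main reason an RDBMS fits this kind of data; the indexes would need to follow whichever dimensions are actually queried.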
What can be useful is combining Xapian with an RDBMS, using Xapian to index documents for full-text search as an alternative access path to metadata retrieved from the RDBMS. It depends on your application. For web server log files I don't see it.

Michael