Felix Antonius Wilhelm Ostmann
2006-Dec-07 09:02 UTC
[Xapian-discuss] using Xapian as backend for google
We want to build the next Google ... ok, not so big ;) Currently we are testing a little bit with Xapian and it is amazing! Respect!

Now I must figure out how we can use Xapian in the best way: generating many flint indexes so we can build them fast on many machines and merge them. The frontend will be a webserver with Apache and mod_perl ... is it the best way to run xapian-tcpsrv on other machines as backend? I think so ... or is another webserver with mod_perl and the Perl bindings the ideal solution?

My question: can someone tell me something about building the backend for the next Google? :) What is important? RAID 0 vs RAID 1, SCSI vs SATA, many smaller backends vs some big backends? What would be the bottleneck (I think disk I/O)? Is xapian-tcpsrv the best way? Can anyone tell me something about such a project?

One other question: "similar results from one domain". How can we achieve that goal? Should the MatchDecider watch the values holding the domain name and accept only two documents from one domain? Is that the way?

Thanks for your time :) And sorry for my poor English :(

Kind regards,
Felix Antonius Wilhelm Ostmann
On Thu, Dec 07, 2006 at 10:02:03AM +0100, Felix Antonius Wilhelm Ostmann wrote:
> Now I must figure out how we can use Xapian in the best way: generating
> many flint indexes so we can build them fast on many machines and merge
> them. The frontend will be a webserver with Apache and mod_perl ... is it
> the best way to run xapian-tcpsrv on other machines as backend? I think
> so ... or is another webserver with mod_perl and the Perl bindings the
> ideal solution? My question: can someone tell me something about building
> the backend for the next Google? :) What is important?

> RAID 0 vs RAID 1

RAID 1 should be faster for reading, and actually has redundancy so it can survive a disk dying, but you get half as much storage volume from the same disks. In other words, it'll cost about twice as much.

Incidentally, there are many more RAID configurations than just these two. Wikipedia has an overview:

http://en.wikipedia.org/wiki/RAID

> SCSI vs SATA

It depends on budget and how big you want to grow. SATA is cheaper and probably similar in speed to where SCSI was a few years ago, but iSCSI and Fibre Channel are likely to end up faster in most cases.

> Many smaller backends vs some big backends?

There are definitely downsides to having too many backend servers. But if you have a lot of data, splitting a search over several machines can be a win. You'll need to profile if you want to find the sweet spot for your setup, but I'd think it's likely to be nearer a few than a few hundred.

Note that there's some overhead to using the remote backend, and also some to using multiple databases. Another possible architecture is to just have several servers searching replicated copies of a single large database.

> What would be the bottleneck (I think disk I/O)?

It's likely to be. Note that there's scope for improving matters with enhancements to Xapian here - there are some obvious things to improve (which I'm working my way through), and profiling should reveal more. For a large operation, it's worth investing some time in such fine tuning as it can seriously reduce the amount of hardware you need to buy and house!

> Is xapian-tcpsrv the best way? Can anyone tell me something about
> such a project?

Webtop used xapian-tcpsrv to spread searches over a number of boxes (10 or so IIRC). The index size was around 500 million documents, but with modern hardware that's much less of a challenge than it was more than 6 years ago. Also the remote backend has been completely rewritten since then, and the local backend Webtop used was the legacy "muscat36 da" one, which flint should outperform by some margin.

> One other question: "similar results from one domain".
> How can we achieve that goal? Should the MatchDecider watch the values
> holding the domain name and accept only two documents from one domain?
> Is that the way?

If you just want two documents from any one domain, it wouldn't be hard to extend the collapse feature to leave N documents behind instead of just one.

Only collapsing "similar" results is harder - first you need to decide how to define "similar" I guess.

Cheers,
Olly
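
To illustrate the "leave N documents behind" idea from the last answer, here is a minimal sketch in plain Python. It does not use the Xapian API: it assumes the results have already been ranked and that each result carries its domain name, roughly as if it had been stored in a document value. The function name `collapse_by_key`, the dictionary layout, and the sample URLs are all hypothetical.

```python
from collections import defaultdict


def collapse_by_key(ranked_docs, key, max_per_key=2):
    """Keep at most max_per_key documents per key, preserving rank order.

    ranked_docs: iterable of results, best first (assumed already ranked).
    key: function returning the collapse key for a result (here: the domain).
    """
    kept = []
    seen = defaultdict(int)  # how many results we have kept for each key
    for doc in ranked_docs:
        k = key(doc)
        if seen[k] < max_per_key:
            kept.append(doc)
            seen[k] += 1
    return kept


# Hypothetical ranked result list, best match first.
results = [
    {"url": "http://a.example/1", "domain": "a.example"},
    {"url": "http://a.example/2", "domain": "a.example"},
    {"url": "http://b.example/1", "domain": "b.example"},
    {"url": "http://a.example/3", "domain": "a.example"},
    {"url": "http://b.example/2", "domain": "b.example"},
    {"url": "http://b.example/3", "domain": "b.example"},
    {"url": "http://c.example/1", "domain": "c.example"},
]

top = collapse_by_key(results, key=lambda d: d["domain"], max_per_key=2)
for doc in top:
    print(doc["url"])
```

Filtering after ranking like this means you have to fetch more results than you intend to show, since some will be dropped. Doing the collapse inside the matcher, as suggested above, avoids that extra work.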