thr3ads.net - Xapian discuss - [Xapian-discuss] using Xapian as backend for google [Dec 2006]

Felix Antonius Wilhelm Ostmann

2006-Dec-07 09:02 UTC

[Xapian-discuss] using Xapian as backend for google

We want to build the next Google ... OK, not so big ;) Currently we
are testing a little bit with Xapian and it is amazing! Respect!

Now I must figure out how we can use Xapian in the best way. We plan to
generate many flint indexes so we can build them quickly on many machines
and then merge them. The frontend will be a webserver with Apache and
mod_perl ... Is it the best way to run xapian-tcpsrv on other machines as
the backend? I think so ... or is another webserver with mod_perl and the
Perl bindings the ideal solution? My question: can someone tell me
something about building the backend for the next Google? :) What is
important? RAID 0 vs RAID 1, SCSI vs SATA, many smaller backends vs a few
big backends? What would be the bottleneck (I think disk I/O)? Is
xapian-tcpsrv the best way? Can anyone tell me something about such a
project?

One other question: "similar results from one domain".
How can we achieve that goal? Should a MatchDecider watch the values
holding the domain name and accept only two documents from each domain?
Is that the way?

Thanks for your time :)
And sorry for my poor English :(

Best regards,
Felix Antonius Wilhelm Ostmann

Olly Betts

2006-Dec-08 05:23 UTC

[Xapian-discuss] using Xapian as backend for google

On Thu, Dec 07, 2006 at 10:02:03AM +0100, Felix Antonius Wilhelm Ostmann wrote:
> Now I must figure out how we can use Xapian in the best way. We plan to
> generate many flint indexes so we can build them quickly on many machines
> and then merge them. The frontend will be a webserver with Apache and
> mod_perl ... Is it the best way to run xapian-tcpsrv on other machines as
> the backend? I think so ... or is another webserver with mod_perl and the
> Perl bindings the ideal solution? My question: can someone tell me
> something about building the backend for the next Google? :) What is
> important?
> RAID 0 vs RAID 1
RAID 1 should be faster for reading, and actually has redundancy so it
can survive a disk dying, but you get half as much storage volume from
the same disks.  In other words, it'll cost about twice as much.
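
To make the "twice as much" arithmetic concrete, here is a small sketch in Python. The function name, the disk count, and the 500 GB disk size are made-up values for illustration, and it assumes RAID 1 means disks mirrored in pairs.

```python
def usable_capacity_gb(disk_count, disk_size_gb, raid_level):
    """Usable capacity for a plain RAID 0 stripe or pairwise RAID 1 mirrors.

    Illustrative only: ignores filesystem overhead and other RAID levels.
    """
    raw = disk_count * disk_size_gb
    if raid_level == 0:
        # Striping: every disk contributes its full size, with no redundancy.
        return raw
    if raid_level == 1:
        if disk_count % 2 != 0:
            raise ValueError("pairwise mirroring needs an even number of disks")
        # Mirroring in pairs: half of the raw space holds copies.
        return raw // 2
    raise ValueError("only RAID 0 and RAID 1 are modelled here")


# Hypothetical example: two 500 GB disks.
raid0 = usable_capacity_gb(2, 500, 0)
raid1 = usable_capacity_gb(2, 500, 1)
```

With the same two disks, the mirror offers half the usable space of the stripe, so matching the stripe's capacity needs about twice as many disks.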

Incidentally, there are many more RAID configurations than just these
two.  Wikipedia has an overview:

http://en.wikipedia.org/wiki/RAID
> SCSI VS SATA
It depends on budget and how big you want to grow.  SATA is cheaper and
probably similar in speed to where SCSI was a few years ago, but iSCSI
and Fibre Channel are likely to end up faster in most cases.
> many smaller backends VS some big backends?
There are definitely downsides to having too many backend servers.  But
if you have a lot of data, splitting a search over several machines can
be a win.  You'll need to profile if you want to find the sweet spot for
your setup, but I'd think it's likely to be nearer a few than a few
hundred.

Note that there's some overhead to using the remote backend, and also
some to using multiple databases.  Another possible architecture is
to just have several servers searching replicated copies of a single
large database.
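
As a rough picture of what splitting a search over several machines involves, the sketch below merges ranked result lists from a few backends into one top-k list. It is plain Python and not how Xapian's remote backend is implemented; the function name, the (score, doc_id) tuples, and the assumption that scores from different shards are directly comparable are all illustrative.

```python
import heapq
from itertools import islice


def merge_shard_results(shard_results, top_k):
    """Merge per-shard hit lists into a single list of the top_k best hits.

    Each shard's list must already be sorted by descending score, and each
    hit is a (score, doc_id) tuple (a hypothetical format for this sketch).
    """
    merged = heapq.merge(*shard_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, top_k))


# Hypothetical results from two backends.
shard_a = [(0.9, "a1"), (0.4, "a2")]
shard_b = [(0.7, "b1"), (0.5, "b2")]
top3 = merge_shard_results([shard_a, shard_b], top_k=3)
```

Every extra shard adds another network round trip and another list to merge, which is one way to see why more backends are not automatically better.
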
> What would be the bottleneck (I think disk I/O)?
It's likely to be.  Note that there's scope for improving matters with
enhancements to Xapian here - there are some obvious things to improve
(which I'm working my way through), and profiling should reveal more.
For a large operation, it's worth investing some time in such fine
tuning as it can seriously reduce the amount of hardware you need to buy
and house!
> Is xapian-tcpsrv the best way? Can anyone tell me something about
> such a project?
Webtop used xapian-tcpsrv to spread searches over a number of boxes
(10 or so IIRC).  The index size was around 500 million documents, but
with modern hardware that's much less of a challenge than it was more
than 6 years ago.

Also the remote backend has been completely rewritten since then, and
the local backend Webtop used was the legacy "muscat36 da" one, which
flint should outperform by some margin.
> One other question: "similar results from one domain".
> How can we achieve that goal? Should a MatchDecider watch the values
> holding the domain name and accept only two documents from each domain?
> Is that the way?
If you just want two documents from any one domain, it wouldn't be hard
to extend the collapse feature to leave N documents behind instead of
just one.
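
The idea of leaving N documents behind per domain can be sketched outside Xapian as a single pass over an already ranked result list. This is plain Python for illustration, not Xapian's collapse code or API; the function name, the (domain, url) hit format, and the example URLs are assumptions.

```python
from collections import Counter


def collapse_by_key(ranked_hits, key_of, max_per_key=2):
    """Keep at most max_per_key hits per key, preserving the ranking order."""
    kept = []
    seen = Counter()
    for hit in ranked_hits:
        key = key_of(hit)
        if seen[key] < max_per_key:
            kept.append(hit)
            seen[key] += 1
        # Further hits with the same key are dropped (collapsed).
    return kept


# Hypothetical ranked hits as (domain, url) pairs, best first.
hits = [
    ("example.org", "http://example.org/a"),
    ("example.org", "http://example.org/b"),
    ("example.org", "http://example.org/c"),
    ("example.com", "http://example.com/x"),
]
collapsed = collapse_by_key(hits, key_of=lambda hit: hit[0], max_per_key=2)
```

Here the third example.org hit is dropped while the example.com hit survives. Doing this inside the matcher rather than afterwards matters in practice, because a post-filter can leave fewer results than the page size asked for.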

Only collapsing "similar" results is harder - first you need to decide
how to define "similar" I guess.

Cheers,
    Olly
