Bart van Bragt
2005-Sep-08 11:44 UTC
[Xapian-discuss] Xapian and 10M (small) documents. What to expect?
I've been thinking about integrating phpBB with Xapian for quite some time now and I guess I really should start to get things rolling. I haven't had a decent search on my site (www.bokt.nl) for ages now and the users are getting pretty annoyed by that fact :)

I'm currently trying to figure out how I'm going to set this up, I probably also first need to get some new hardware to facilitate search. Does anyone have an idea what kind of hardware I would need to search 10 million documents (approx 4GB of text) with approx 10,000 new postings per day? Real-time indexing is nice but I can also batch this up so we can do this during the night (servers are mostly idle during the night: http://status.bokt.nl/ ).

Do I need a dedicated machine for searching? The site isn't exactly generating huge amounts of money so it would be very nice if we could use a (beefy) server to do both webserving and searching. Or combine the database and the search but I don't think those two combine really well, I'm guessing that the main bottleneck is going to be I/O?

Does anyone have experience with integrating Xapian with (PHP) forums? I know Arjen has plenty of experience with gathering.tweakers.net :D

Talking about which... I'd very much prefer to index individual postings instead of combining all posts in a topic to one document. The main reason for this is that combining large topics results in lots of hits on those large topics because they contain a LOT of search terms. This is my main grief when searching on gathering.tweakers.net, you have to wade through lots of 300 page topics that do contain your searchwords but in quite separate postings on separate pages. Most of the time those 300 page topics have no link at all with the subject that I'm searching for. IMO searching in postings instead of topics should solve that problem. The main drawback is a (very significant?) performance loss I guess... Indexing topics would result in only 500k documents instead of 10M.

There seems to be a fairly large resemblance between gmane and phpBB indexing (both are about indexing topics/threads and lots of small postings). Is the gmane setup going to be public? Is it already known what hardware this system is going to need?

Thanks in advance!

Bart van Bragt
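
As a rough sanity check on the figures in this post, the averages they imply can be worked out directly. This is only a back-of-envelope sketch: "4GB" is assumed to mean 4e9 bytes, and the variable names are illustrative.

```python
# Figures quoted in the post above; names are illustrative only.
DOCUMENTS = 10_000_000        # individual postings
TEXT_BYTES = 4 * 1000**3      # "approx 4GB of text", assumed to be 4e9 bytes
NEW_POSTS_PER_DAY = 10_000
TOPICS = 500_000              # document count if whole topics are indexed

avg_bytes_per_post = TEXT_BYTES / DOCUMENTS
avg_posts_per_topic = DOCUMENTS / TOPICS
daily_growth_pct = 100 * NEW_POSTS_PER_DAY / DOCUMENTS
new_text_per_day_mb = NEW_POSTS_PER_DAY * avg_bytes_per_post / 1000**2

print(f"average posting size : {avg_bytes_per_post:.0f} bytes")
print(f"postings per topic   : {avg_posts_per_topic:.0f}")
print(f"daily growth         : {daily_growth_pct:.1f}% of the collection")
print(f"new text per day     : {new_text_per_day_mb:.0f} MB")
```

On these assumptions a posting averages about 400 bytes, a topic about 20 postings, and a day's new postings add roughly 4 MB of text (0.1% of the collection), which is why a nightly batch is a plausible alternative to real-time indexing.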
Olly Betts
2005-Sep-08 14:09 UTC
[Xapian-discuss] Xapian and 10M (small) documents. What to expect?
On Thu, Sep 08, 2005 at 12:44:14PM +0200, Bart van Bragt wrote:
> Does anyone have an idea what kind of hardware I would need to
> search 10 million documents (approx 4GB of text) with approx 10,000 new
> postings per day?

It rather depends on the search load, and the character of the data matters too. If you can get the system running on an existing box you can get an idea how these factors apply to your situation.

As general advice, I/O speed is likely to be the limiting factor for a database of this sort of size, so lots of RAM and fast disks are best. I'd probably go for SATA rather than SCSI if buying new these days. RAID will probably help.

> Do I need a dedicated machine for searching? The site isn't exactly
> generating huge amounts of money so it would be very nice if we could
> use a (beefy) server to do both webserving and searching. Or
> combine the database and the search but I don't think those two combine
> really well, I'm guessing that the main bottleneck is going to be I/O?

Almost certainly. I'd be tempted to consider two less beefy servers, which gives you more scope for adding more RAM and sharing the I/O load, though it may need more rackspace and increase hosting costs.

> The main drawback is a (very significant?) performance
> loss I guess... Indexing topics would result in only 500k documents
> instead of 10M.

Though with 10M documents, each will be indexed by fewer terms, and the raw position list data size will be much the same in both cases.

> There seems to be a fairly large resemblance between gmane and phpBB
> indexing (both are about indexing topics/threads and lots of small
> postings). Is the gmane setup going to be public?

It should be going live as the main search in the next week or two.
The recent (and current) obstacles are all down to a disk crash and the machine having to be reinstalled - I keep finding things which aren't installed or aren't running and have to ask Lars to fix them.

> Is it already known what hardware this system is going to need?

The current hardware is:

Athlon64 3000+
3GB RAM
mixture of SCSI and SATA disks (not RAID AFAIK)

That's rather overspecified for the search load at present, but gmane is growing fairly rapidly so it's good to have room for expansion. It also means a full rebuild of the database from scratch takes about 2 days.

It's running Debian woody x86, I think just because that was easy to install. I've not had a chance to compare x86 vs x86_64 for Xapian yet on the same hardware.

Cheers,
Olly
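
The remark above that per-posting documents are each indexed by fewer terms, while the raw position data stays about the same, can be illustrated with a toy count. This is only a sketch: the forum data is invented, terms are split on whitespace, and nothing here reflects Xapian's actual on-disk format.

```python
from collections import defaultdict

# Invented toy forum: topic id -> list of postings.
TOPICS = {
    "t1": ["xapian index speed", "index size and disk speed"],
    "t2": ["disk crash", "raid and disk speed", "xapian raid"],
}

def index_stats(documents):
    """Return (document count, posting entries, position entries)."""
    postings = defaultdict(set)  # term -> ids of documents containing it
    positions = 0
    for doc_id, text in enumerate(documents):
        for term in text.split():
            postings[term].add(doc_id)
            positions += 1       # one position entry per term occurrence
    posting_entries = sum(len(ids) for ids in postings.values())
    return len(documents), posting_entries, positions

# Same text, indexed at two granularities.
per_post = index_stats([p for posts in TOPICS.values() for p in posts])
per_topic = index_stats([" ".join(posts) for posts in TOPICS.values()])

print("per posting:", per_post)
print("per topic  :", per_topic)
```

In this toy case both granularities record 16 term positions. The per-posting index has more documents (5 against 2) and more posting entries (16 against 12), but each document carries fewer distinct terms on average.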
Arjen van der Meijden
2005-Sep-09 06:58 UTC
[Xapian-discuss] Xapian and 10M (small) documents. What to expect?
On 8-9-2005 12:44, Bart van Bragt wrote:
> I'm currently trying to figure out how I'm going to set this up, I
> probably also first need to get some new hardware to facilitate search.
> Does anyone have an idea what kind of hardware I would need to
> search 10 million documents (approx 4GB of text) with approx 10,000 new
> postings per day?
> Real-time indexing is nice but I can also batch this up so we can do
> this during the night (servers are mostly idle during the night:
> http://status.bokt.nl/ ).

Real-time indexing will not allow you to use the faster-to-search compacted databases. Database compaction takes an hour or so with our database, which goes down from about 15G "working" to 11G "compacted" in the Flint format. So that is another reason not to index in real time.

> Do I need a dedicated machine for searching? The site isn't exactly
> generating huge amounts of money so it would be very nice if we could
> use a (beefy) server to do both webserving and searching. Or
> combine the database and the search but I don't think those two combine
> really well, I'm guessing that the main bottleneck is going to be I/O?

I'd suggest a dedicated machine. We ran it on a webserver with 2G of memory a while back, but the phrase searches especially were very slow. With your per-posting set-up, the data to sift through per phrase search will probably be smaller though. The more memory the better; CPUs aren't very interesting, but your disk system is.

In our recent .plan you can read what our next search machine will be: http://www.tweakers.net/plan/292
It is probably overspecified for now, but it will be used to host another search database and is expected to cope with the growth in size and features for at least three years.

> Does anyone have experience with integrating Xapian with (PHP) forums? I
> know Arjen has plenty of experience with gathering.tweakers.net :D

We don't use the php-bindings.
In the beginning we'd convert the GET-parameters to Omega-compatible ones and then just call the Omega application to do the hard work for us. The result of Omega was formatted with a nicely fitted query template, allowing us to easily interpret it in PHP. Currently we have one machine with Omega running behind a xinetd superdaemon, and our webservers interface with that using TCP/IP, but it basically is the same as calling the local application.

In my experience that easily beats the old "remote database" in terms of performance, since that used to send all result data over the line, expecting the client to sort the results. Whether it still beats the current remote set-up I don't know, but we're not just going to change a working set-up to figure that out ;)

> Talking about which... I'd very much prefer to index individual postings
> instead of combining all posts in a topic to one document. The main
> reason for this is that combining large topics results in lots of hits
> on those large topics because they contain a LOT of search terms. This
> is my main grief when searching on gathering.tweakers.net, you have to
> wade through lots of 300 page topics that do contain your searchwords
> but in quite separate postings on separate pages. Most of the time
> those 300 page topics have no link at all with the subject that
> I'm searching for. IMO searching in postings instead of topics should solve
> that problem. The main drawback is a (very significant?) performance
> loss I guess... Indexing topics would result in only 500k documents
> instead of 10M.

In my personal experience those large topics indeed aren't that useful as search results, which is why the within-document frequency will likely push them down the search result list if they are really useless. Then again, you can search within that topic when you want to be sure it really does(n't) contain your terms.
My main concern with per-posting searching is that you'll end up with lots of small fragments of a document, which may result in not being able to find a certain topic because the terms you specified were scattered over the separate postings.

Good luck.

Arjen
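
The trade-off discussed in this thread (large topics matching queries by accident, against terms scattered over separate postings) can be shown with a toy AND query. This is a minimal sketch in plain Python with invented data and a hypothetical `match_all` helper. It does not use Xapian, and it ignores ranking entirely.

```python
# Invented example data: topic id -> list of postings.
TOPICS = {
    "huge-chat-topic": ["my horse likes apples",
                        "anyone tried the new saddle shop"],
    "saddle-advice": ["which saddle fits a young horse",
                      "try a treeless model"],
}

def terms(text):
    """Split text into a set of lowercase terms (no stemming)."""
    return set(text.lower().split())

def match_all(query, documents):
    """Return sorted ids of documents that contain every query term."""
    wanted = terms(query)
    return sorted(doc_id for doc_id, text in documents.items()
                  if wanted <= terms(text))

# One document per topic versus one document per posting.
per_topic = {tid: " ".join(posts) for tid, posts in TOPICS.items()}
per_post = {f"{tid}#{n}": post
            for tid, posts in TOPICS.items()
            for n, post in enumerate(posts)}

for query in ("horse saddle", "treeless saddle"):
    print(query, "| per topic:", match_all(query, per_topic),
          "| per posting:", match_all(query, per_post))
```

For "horse saddle" the per-topic index also returns the unrelated chat topic, while the per-posting index returns only the one relevant posting. For "treeless saddle" the per-topic index finds the advice topic, but the per-posting index finds nothing, because the two terms sit in different postings.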