Greetings all,

I'm about to evaluate Xapian for a future project and would appreciate a few comments from those in the know:

Indexing

1. Is Xapian similar to Lucene in the sense that you can define as many fields as you want, and assign various weights (which influence search result sorting) to these fields? I gather from the docs that you can, but I just need confirmation.

2. Let's say you're indexing websites; can you then merge/combine many smaller indexes into larger ones for later searching?

Searching

1. I gather from the docs that you can sort results according to your own field/s, followed by the default document scoring (think "page-rank"). Correct?

2. ~/docs/remote.htm mentions distributed searching - we want to spread the search load around our cluster by splitting the index into many manageable-sized indexes (to ensure sub-second performance), with a "master" node which combines the search results that end-users see. Is my understanding correct and are there any pitfalls/bottlenecks?

3. Removing duplicates: this can be done programmatically I know (but is slow on our chosen platform - Perl), but does Xapian provide this mechanism built-in? For example: a search result might return several pages from a web site, but we want to remove these dups and only provide a single result (highest ranking) per website (eg, with a link for "More from this site..." - à la Google, which will be a separate search displaying all the site-duplicates).

4. If the mechanism to remove duplicates exists, will this still work cluster-wide in distributed searching?

5. Does Xapian provide a mechanism for identifying the actual field in a search result which triggered the hit? eg, let's say you have TITLE, BODY, OTHER as fields in your index. If a search found your term in the BODY field, does Xapian provide this as feedback?

5. This is difficult I know: how does Xapian compare performance-wise? Has anyone done any basic benchmarking?

Thanks for any information you can provide.

Regards
Henry
Felix Antonius Wilhelm Ostmann
2008-Nov-03 15:53 UTC
[Xapian-discuss] A few questions wrt Xapian
That sounds exactly like our first Xapian project (also in Perl) :)

We don't have the cluster problem, though. Everything else works very well:

- merging many indexes
- unique results (with a MatchDecider; I don't know how that works with a cluster)

You should have a look at Omega.

I see my answers are not very useful; I hope someone else will answer all the questions precisely :-/

-- 
Kind regards
Felix Antonius Wilhelm Ostmann
Websuche Search Technology GmbH & Co. KG
Henka wrote:

> 1. Is Xapian similar to Lucene in the sense that you can define as
> many fields as you want, and assign various weights (which influence
> search result sorting) to these fields? I gather from the docs that
> you can, but I just need confirmation.

Yes, you can do this.

> 2. Let's say you're indexing websites; can you then merge/combine
> many smaller indexes into larger ones for later searching?

Yes (use the xapian-compact tool to do this). You can also search across several indexes without merging them together first - the results of searches performed this way are essentially identical to those across a merged index.

> Searching
>
> 1. I gather from the docs that you can sort results according to your
> own field/s, followed by the default document scoring (think
> "page-rank"). Correct?

Yes, you can store arbitrary extra info with each document and sort by it.

> 2. ~/docs/remote.htm mentions distributed searching - we want to
> spread the search load around our cluster by splitting the index into
> many manageable-sized indexes (to ensure sub-second performance), with
> a "master" node which combines search results and end-users see. Is
> my understanding correct and are there any pitfalls/bottlenecks?

Yes, that exists. There are probably several pitfalls/bottlenecks, but I can't think of any particularly significant ones, and it is quite usable anyway.

> 3. Removing duplicates: this can be done programmatically I know
> (but is slow on our chosen platform - Perl), but does Xapian provide
> this mechanism built-in? For example: a search result might return
> several pages from a web site, but we want to remove these dups and
> only provide a single result (highest ranking) per website (eg, with a
> link for "More from this site..." - à la Google, which will be a
> separate search displaying all the site-duplicates).

Yes - this is called "collapsing" in Xapian.

> 4. If the mechanism to remove duplicates exists, will this still work
> cluster-wide in distributed searching?

Yes.

> 5. Does Xapian provide a mechanism for identifying the actual field
> in a search result which triggered the hit? eg, let's say you have
> TITLE, BODY, OTHER as fields in your index. If a search found your
> term in the BODY field, does Xapian provide this as feedback?

You can identify the terms which matched a query, and hence determine the fields relating to it, yes.

> 5. This is difficult I know: how does Xapian compare
> performance-wise? Has anyone done any basic benchmarking?

I have no useful figures to hand. Please share any you create. ;-)

-- 
Richard
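
For readers unfamiliar with the term, here is a rough sketch of what "collapsing" means when results come from several indexes. It is plain Python and does not use Xapian at all; the `Hit` type, its fields, the example URLs and the scores are all made up for illustration. In Xapian itself the library does this work for you (as far as I know, via a collapse key set on the Enquire object with `set_collapse_key`), so treat this only as a picture of the behaviour, not as how Xapian implements it.

```python
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One search result. All field names here are hypothetical."""
    url: str
    site: str      # the collapse key, e.g. the page's hostname
    score: float   # relevance score, higher is better


def collapse_by_site(hits: Iterable[Hit]) -> List[Hit]:
    """Keep only the highest-scoring hit per site, best first."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        current = best.get(hit.site)
        if current is None or hit.score > current.score:
            best[hit.site] = hit
    return sorted(best.values(), key=lambda h: h.score, reverse=True)


# Made-up results as they might come back from two separate indexes.
shard_a = [Hit("http://a.example/1", "a.example", 0.9),
           Hit("http://b.example/1", "b.example", 0.4)]
shard_b = [Hit("http://a.example/2", "a.example", 0.7),
           Hit("http://c.example/1", "c.example", 0.6)]

# Collapse after combining, so duplicates are removed across indexes too.
collapsed = collapse_by_site(shard_a + shard_b)
for hit in collapsed:
    print(hit.site, hit.url, hit.score)
```

The point of collapsing on the combined list rather than per index is that two pages from the same site may live in different indexes; only the combined view can pick the single best one.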
Quoting "Olly Betts" <olly at survex.com>:

> I could, but Richard has answered already.
>
> Apologies for being less responsive than usual - I'm back in the UK for a few
> weeks and currently my internet access is the local library which
> firewalls ssh.

No problem - thanks for responding.

I'm busy studying the docs and once I've wrapped my mind around how Xapian does things, I'll be coding some simple comparative tests. You're right, of course, benchmarks can be terribly misleading and seldom accurate.

Cheers
Henry