Andrey Kong
2006-Dec-09 01:59 UTC
[Xapian-discuss] Question: Query weights, Rset usage, lowercase
Hi

First of all, thank you for Xapian, it's fast and stable. I use it to develop a search for Chinese data, and it works well.

I've been playing with Xapian for 1 week (PHP) and am able to insert terms, data and values into the database, and also to search them out. I add prefixes manually (PT, PD, PP, PM...) (title, domain, URL path, meta...), and values(0) = timestamp. Currently I use Xapian to search, which returns the IDs for a MySQL database, and then retrieve the Descriptions etc. from MySQL using the unique key IDs.

Here are my questions:

1) How much does it cost if I put the Descriptions inside the Xapian.document.data field? (Assume the Descriptions are the unHTML contents of web pages.) Will the Xapian DB become very big and affect the performance? (I have 1M docs when testing.)

2) Since I am now able to search the Title (prefix PT, weight=20) and Descriptions (no prefix, weight=1) of the database, I began wondering how to assign different weights within the Query. How do I achieve:

Query using "OR" (Microsoft, Keyboard, Mouse)

in which the term "Microsoft" = weight 5 | "Keyboard" = weight 1 | "Mouse" = weight 1

Because it's normal that people type in the most important terms first and the less important terms later, I want to build the query with the same approach.

3) Since I add my own prefixes manually, I wonder: does Xapian change all terms into lowercase automatically? Or do I need to do it manually?

4) When I query ("search engine"), if I add 3 docs to the RSet, does this "RSet related to -search engine-" remain in the database? So that next time I have the same query "search engine", the 3 docs in the RSet can be retrieved from the database? How do I do that?

Finally, once again, Xapian is very fast, thank you for the great project.

I think it would be even greater if there were 2-5 lines of example usage in the API documentation. If every function had 3-5 lines of example code, we could understand the function and its usage in 5 seconds. Without the examples, I'd say I spent 3-5 hours testing things out myself, and some I just gave up on...

Thanks
Andrey K.
Olly Betts
2006-Dec-09 03:54 UTC
[Xapian-discuss] Question: Query weights, Rset usage, lowercase
On Sat, Dec 09, 2006 at 09:55:11AM +0800, Andrey Kong wrote:
> 1)How much cost if I put the Descriptions inside the
> Xapian.document.data field? (assume the Descriptions are unHTML
> contents of web pages), will the Xapian DB become very big and
> affects the performance? (i have 1M docs when testing)

Assuming the usual pattern of searching for 10 or so matches, this shouldn't be a problem at all. The document data is stored in a separate file, so there should be no effect on matching, aside from competing for disk cache. You'll have similar competition for disk cache anyway if you're pulling the same data from an SQL database hosted on the same machine.

> 2)Since now i am able to search the Title(prefix PT, weight=20) and
> Descriptions(no prefix, weight=1) of the database, I begin wondering
> how to assign different weights to the Query. How to achieve:
>
> Query using "OR" (Microsoft , Keyboard , Mouse)
>
> which the term "Microsoft" =weight 5 | "Keyboard" = weight 1 | "Mouse" = weight 1

Just set the within-query frequency (wqf) - e.g. Query("microsoft", 5).

> Because it's normal that ppl will type in the most important terms
> first and then the less important terms later, so i want to make the
> query in the same approach.

I have my doubts about this idea. The risk is that you'll improve results for some queries while making others worse.

I think people tend to enter queries with the natural word order. Sometimes the more important terms will be first, and sometimes they won't. In this case, "Microsoft" is performing an adjectival role by defining a narrower scope for the words which follow, which is why it's perhaps more important. But this varies between languages - in Spanish it would probably be "mouse de Microsoft" not "Microsoft mouse".

> 3)Since I add my own prefixes manually, I wonder does Xapian change
> all Terms into lowercase automatically? Or I need to do it manually?

Xapian treats terms as opaque pieces of data, so you'll need to lowercase them yourself if that's what you want. Otherwise it wouldn't be possible to implement a case-sensitive search.

> 4)when i query ("search engine") , if I add 3 docs to the Rset, does
> this "Rset related to -search engine-" remains in the database? So
> next time I have the same query "search engine", the 3 docs in the
> Rset can be retrieved from the database? how to do that?

The RSet isn't stored in the database. The RSet represents a set of relevance judgements which a user has made pertaining to a particular query (or more generally to a particular "information need"). If you want to store it, it almost certainly needs to be per user and probably per query too. In a web application, I'd suggest storing it in a cookie.

> I think it will be even more great, if there are 2-5 lines of example
> of usage in the API document.

Yes, that would be good (though I think many would need a larger example to be useful). However, it would be a substantial amount of work and we're all already very busy. Patches are welcome of course (if anyone wants to work on this, please add examples to the doxygen comments in the headers, not the HTML documentation, which is automatically generated from them!)

> If every function has a 3-5 lines of codes of example of usage, we can
> understand the function and usage in 5secs. Without the example, I say
> I used 3-5 Hours to test it out myself, some just gave up...

I'd suggest you simply search the examples (or failing that, Omega) for the particular method you want to see in context. For most methods, that will find you an actual working piece of code using the method.

Cheers,
Olly
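
For readers following answer 3, here is a minimal sketch of lowercasing terms yourself before they reach Xapian, written in Python rather than the PHP used in the thread. The `PREFIXES` map and the `make_terms` helper are hypothetical names for illustration, not part of the Xapian API; only the PT/PD/PP/PM prefixes come from the original message. Splitting on whitespace is an assumption made to keep the sketch short, and it will not segment Chinese text.

```python
# Hypothetical field-to-prefix map, using the prefixes named in the thread.
PREFIXES = {"title": "PT", "domain": "PD", "path": "PP", "meta": "PM"}


def make_terms(text, field=None):
    """Return lowercased terms for `text`, prefixed for `field` if given.

    Xapian stores terms as opaque data, so any case folding has to be
    done by the caller, and done the same way when indexing and when
    building queries.
    """
    prefix = PREFIXES[field] if field else ""
    # Assumption: words are whitespace-separated. str.lower() leaves
    # characters without case (such as CJK) unchanged.
    return [prefix + word.lower() for word in text.split()]


title_terms = make_terms("Microsoft Keyboard", "title")
body_terms = make_terms("Search Engine")
print(title_terms)  # ['PTmicrosoft', 'PTkeyboard']
print(body_terms)   # ['search', 'engine']
```

Running the same helper over both the indexed text and the user's query is what keeps the two sides matching; skipping the `lower()` call on both sides would give a case-sensitive search instead.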