ath
2006-Oct-04 21:35 UTC
[Xapian-discuss] How to really make use of omega/xapian? (for omega with PHP Mysql)
For the past two weeks I have been experimenting with xapian and omega (omega in particular), and there are still some things I'm not clear about.

I have some PHP/mysql-driven websites. All the content is in the database, so I wanted to extract that content and put it in xapian/omega (XO) for searches. I managed to write a script that extracts the right information from the mysql db, using the dbi2omega script as inspiration. There are a few things I'm not sure I've done correctly, or don't know how to do:

1) My indexing scheme

I was planning to migrate a forum's search functionality to XO first as a trial. A forum has forums, topics and posts. I figure it'd be best if I create 2 indexes: one for the topics as documents (title, description, topicstarter and the first post) and another for the posts themselves (topicname, postcontent, author, etc).

I started with the topic script and came up with the following indexscript:

    topic_id : unique=Q boolean=Q field=tid
    topic_title : unhtml weight=4 index=S field=title
    forum_id : index=G field=forum_id
    starter_name: unhtml index field=author
    starter_id: field=authorid
    last_poster_name: unhtml index=A field=last_poster_name
    last_poster_id: field=last_poster_id
    starttime: field=modtime
    url : weight=2 index field=url
    state : field=state
    first-post: unhtml truncate=300 field=sample
    first-post: unhtml weight=3 index field=body

In the end I want to be able to search on topictitle, author and forum. Is this indexscript suitable for that?

2) How can I search on the indexes with the given indexscheme?

If, let's say, I want to search for topics started by a certain author (testwriter), I'd assume I only have to do a search like this in omega:

    omega?A=testwriter&DEFAULTOP=or&DB=default&FMT=query&xDB=default&xFILTERS=--O

However, I'm not getting any results back if I do so.

3) How can I safely integrate omega on my site?

I have a group permission scheme on my site. You need to be in a certain group to search for content in certain forums. I found this post, http://thread.gmane.org/gmane.comp.search.xapian.devel/112/focus=113, but it didn't really help. How can I put this

    (<QUERY>) AND (XWORLD:yes OR XUSER:bill OR XGROUP:users OR XGROUP:wheel)

into use with omega? Where do I put the XWORLD, XUSER and XGROUP things in the index? And doesn't that mean that a user only has to put XGROUP:wheel in the query and still gets to see everything?

4) How can I make sure that illegal characters are filtered out in omega?

I sometimes have multilingual characters in the content, and these have caused the xml output of omega to go haywire. How can I make sure that these kinds of characters are filtered out? I've already used unhtml; what else can I do?

Thanks for your time.
Olly Betts
2006-Oct-05 09:17 UTC
[Xapian-discuss] How to really make use of omega/xapian? (for omega with PHP Mysql)
On Wed, Oct 04, 2006 at 01:35:09PM -0700, ath wrote:
> first-post: unhtml truncate=300 field=sample
> first-post: unhtml weight=3 index field=body

It would be more efficient to only "unhtml" once:

    first-post: unhtml weight=3 index field=body truncate=300 field=sample

Do you actually want to store the whole field in Xapian? You can, but it's not required in order to index it, and it's potentially large...

> In the end I want to be able to search on topictitle, author and forum. Is
> this indexscript suitable for that?

It looks plausible, though I don't know exactly what's in each field.

> 2) How can I search on the indexes with the given indexscheme?
>
> If, let's say, I want to search for topics started by a certain author
> (testwriter), I'd assume I only have to do a search like this in omega:
> omega?A=testwriter&DEFAULTOP=or&DB=default&FMT=query&xDB=default&xFILTERS=--O
> However, I'm not getting any results back if I do so.

Term prefixes aren't the same as CGI parameters. See the Omega documentation "docs/termprefixes.txt", in particular the last section on "Probabilistic Fields".

If you want separate form fields for "author" and "body" queries, you can't quite achieve this using Omega unmodified at present. That really should be possible: file a wishlist bug and I'll take a look when I'm not in the middle of a release. Or if you want to work on a patch, I can point you in the right direction.

> 3) How can I safely integrate omega on my site?
>
> I have a group permission scheme on my site. You need to be in a
> certain group to search for content in certain forums.
> I found this post
> http://thread.gmane.org/gmane.comp.search.xapian.devel/112/focus=113 but it
> didn't really help. How can I put these (<QUERY>) AND (XWORLD:yes OR
> XUSER:bill OR XGROUP:users OR XGROUP:wheel) into use with omega?
> Where do I put the XWORLD, XUSER, XGROUP things in the index?
> And doesn't that mean that a user only has to put XGROUP:wheel in the query
> and still gets to see everything?

You'll need to modify Omega for this. The query string is passed to the Xapian::QueryParser object, which returns a Xapian::Query object (function set_probabilistic in query.cc). You then just need to combine this with your permissions filter, something like this:

    Xapian::Query permissions("XGROUP:squirrels"); // Or whatever
    query = qp.parse_query(query_string);
    query = Xapian::Query(Xapian::Query::OP_FILTER, query, permissions);

> 4) How can I make sure that illegal characters are filtered out in omega?
> I sometimes have multilingual characters in the content, and these have
> caused the xml output of omega to go haywire. How can I make sure that these
> kinds of characters are filtered out? I've already used unhtml; what else can
> I do?

Bear in mind that the released versions of Omega assume iso-8859-1 (utf-8 support will be in the 1.0 release), so wide and multibyte characters won't currently be handled correctly.

Using $html{} in your query template should escape characters which are problematic in HTML and XML. If you're not using that, you really need to, so as to avoid potential cross-site scripting attacks. If you're already using that, can you give an example of this problem?

Cheers,
Olly
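On the permissions filter from question 3, here is a rough sketch of how the list of filter terms might be assembled on the server side before being combined with the parsed query. It is not taken from Omega: `SessionUser`, `permission_terms` and `describe_filter` are hypothetical names, and the `XWORLD:`/`XUSER:`/`XGROUP:` term spellings simply follow the example quoted in this thread, so they would have to match whatever terms the indexer really adds to each document. The idea being illustrated is that the terms come from the site's own login session rather than from the text typed into the search box.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of the logged-in user. It is assumed to be filled
// in by the site's own session handling, never from the search form.
struct SessionUser {
    std::string name;                 // empty for an anonymous visitor
    std::vector<std::string> groups;  // group names the user belongs to
};

// Build the filter terms; a document is visible if it carries at least one.
// The term spellings are an assumption copied from the example in the thread.
std::vector<std::string> permission_terms(const SessionUser& user) {
    std::vector<std::string> terms;
    terms.push_back("XWORLD:yes");  // world-readable documents
    if (!user.name.empty()) {
        terms.push_back("XUSER:" + user.name);
    }
    for (const std::string& group : user.groups) {
        terms.push_back("XGROUP:" + group);
    }
    return terms;
}

// Render the combined query in the "(<QUERY>) AND (a OR b ...)" notation used
// in the thread. This is for logging or debugging only, not for parsing.
std::string describe_filter(const std::string& user_query,
                            const std::vector<std::string>& terms) {
    assert(!terms.empty());  // permission_terms() always adds XWORLD:yes
    std::string out = "(" + user_query + ") AND (";
    for (std::size_t i = 0; i < terms.size(); ++i) {
        if (i != 0) out += " OR ";
        out += terms[i];
    }
    out += ")";
    return out;
}
```

In a modified Omega, a term list like this could presumably be OR-ed together into a single Xapian::Query and then applied with OP_FILTER as in the snippet above, so the restriction is added after the user's text has been parsed rather than being part of it.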