Edwin Smulders
2007-Jul-24 03:04 UTC
[Xapian-discuss] Practical example/explanation using an existing database
Hi, I'm reading up on the usage of Xapian to find out if we can use it for Wine's Application Database, and I'm having a bit of trouble seeing the general picture. I could use some practical information (through words or code) on how to search an existing (mysql) database. As far as I can tell it can be used with a mysql db, and I read that Xapian first makes an index (in it's own database/tables ?) and then searches through that index. Now a few questions come to mind and I couldn't find the answers in the documentation. Firstly, how exactly does the indexing work in regard to telling Xapian what to search through? Do we write an SQL query returning all the data we want indexed? or maybe do we tell it what tables/columns to index (ie. does it generate queries?) And how is the index updated, a regular rescan or an update whenever data in our system updates? The other question that came to mind is, once everything is indexed, how is the data returned on a search? This is best explained in an example: If a user would be entering a a search term and I (the programmer) want to search the database, can i specifically tell Xapian to search in for example the application names, or the descriptions, or both? I hope somebody can clarify this for me, right now it all looks quite difficult to implement. Edwin Smulders
Alexander Lind
2007-Jul-24 04:20 UTC
[Xapian-discuss] Practical example/explanation using an existing database
Edwin Smulders wrote:> Hi, > > I'm reading up on the usage of Xapian to find out if we can use it for > Wine's Application Database, and I'm having a bit of trouble seeing > the general picture. I could use some practical information (through > words or code) on how to search an existing (mysql) database. > > As far as I can tell it can be used with a mysql db, and I read that > Xapian first makes an index (in it's own database/tables ?)Yes, its own database.> and then > searches through that index. Now a few questions come to mind and I > couldn't find the answers in the documentation. > > Firstly, how exactly does the indexing work in regard to telling > Xapian what to search through? Do we write an SQL query returning all > the data we want indexed?Thats pretty much how I do it myself.> or maybe do we tell it what tables/columns > to index (ie. does it generate queries?)Core Xapian does not come with any such functionality, but Omega (which is a packaged search engine app that you can get from xapian.org) I think may have something that can digest a table in a db by itself. Not sure about this, I don't use Omega, but should be easy to find out.> And how is the index updated, a regular rescan or an update whenever > data in our system updates?If you don't use Omega its up to yourself how you want to maintain the index, Omega has the functionality to do it for you I think. Personally I flag entries in the db table when they change, and then I update them in Xapian in a batch.> > The other question that came to mind is, once everything is indexed, > how is the data returned on a search? This is best explained in an > example: > If a user would be entering a a search term and I (the programmer) > want to search the database, can i specifically tell Xapian to search > in for example the application names, or the descriptions, or both?You can achieve this by using prefixes in the index, but would it not be better to just index terms in application names with more weight, and terms from descriptions with less weight and search among all terms always?> > I hope somebody can clarify this for me, right now it all looks quite > difficult to implement.I had a hard time grasping Xapian also in the beginning, the manual is pretty technical, but the example code provided helped a lot. But the time it took to learn it was time well invested, Xapian is absolutely awesome. Fast, stable, and actively developed. And the authors are very helpful. Alec> > > Edwin Smulders > > _______________________________________________ > Xapian-discuss mailing list > Xapian-discuss@lists.xapian.org > http://lists.xapian.org/mailman/listinfo/xapian-discuss >
Jim
2007-Jul-24 12:52 UTC
[Xapian-discuss] Practical example/explanation using an existing database
Edwin Smulders wrote:> Hi, > > I'm reading up on the usage of Xapian to find out if we can use it for > Wine's Application Database, and I'm having a bit of trouble seeing > the general picture. I could use some practical information (through > words or code) on how to search an existing (mysql) database. > > As far as I can tell it can be used with a mysql db, and I read that > Xapian first makes an index (in it's own database/tables ?) and then > searches through that index. Now a few questions come to mind and I > couldn't find the answers in the documentation.As Alexander said, Xapian is just a library that allows you to build and search. Scriptindex is a program using Xapian that takes input from a file as a set of text fields and an index file telling what to do with the fields and builds an Xapian indexed database from the text. Each set of fields in the input would represent a row in the database (presumably) and one of the fields should contain a unique value (like an ID) that would be used to fetch the row during a search. For example, if we had a Mysql db with the following schema: id bigint lastname varchar(40) firstname varchar(40) address varchar(80) You could write a simple script/program to extract the data from the db and create an input file for scriptindex that looks like: id=0 lastname=Brown-White firstname=John address=123 Main Street =anywhere, =VA =22222 =USA id=1 lastname=Johnson firstname=Jack =aroo address=234 Story Lane =somewhere, =WA =09876-0988 =USA Etc. Each block represents a record fetched from the db. The index file might look like: id : field boolean=Q unique=Q lastname: text firsname: text address: text Scriptindex would then read both files (a data file and an index file) and create a searchable database that omega could read. Omega would then be called via something like: omega?P=lastname:Johnson%20ANDfirstname:Roger or omega?P=address:23123 (to find all the people in zip 23123) What omega returns will probably have to be interpreted by a program that actually goes to the mysql db and fetches the row and formats it in the way you want the data presented. This is not the best way to index the data, but for simplicity I left off a lot since you wanted a concept not the details. For simplicity I used the same names for the fields in Scriptindex as in the database but that is not necessary. Jim.> > Firstly, how exactly does the indexing work in regard to telling > Xapian what to search through? Do we write an SQL query returning all > the data we want indexed? or maybe do we tell it what tables/columns > to index (ie. does it generate queries?) > And how is the index updated, a regular rescan or an update whenever > data in our system updates? > > The other question that came to mind is, once everything is indexed, > how is the data returned on a search? This is best explained in an > example: > If a user would be entering a a search term and I (the programmer) > want to search the database, can i specifically tell Xapian to search > in for example the application names, or the descriptions, or both?By using the "field:" syntax you may search any field(s) that you want.> > I hope somebody can clarify this for me, right now it all looks quite > difficult to implement.Not at all.> > > Edwin Smulders > > _______________________________________________ > Xapian-discuss mailing list > Xapian-discuss@lists.xapian.org > http://lists.xapian.org/mailman/listinfo/xapian-discuss >
James Aylett
2007-Jul-24 15:20 UTC
[Xapian-discuss] Practical example/explanation using an existing database
(I'm probably repeating some information already given by others - hopefully this helps by being all in one place!) On Tue, Jul 24, 2007 at 04:04:53AM +0200, Edwin Smulders wrote:> Firstly, how exactly does the indexing work in regard to telling > Xapian what to search through? Do we write an SQL query returning all > the data we want indexed? or maybe do we tell it what tables/columns > to index (ie. does it generate queries?) > And how is the index updated, a regular rescan or an update whenever > data in our system updates?Hi, Edwin. You should use Omega on top of Xapian, which gives you most of the search engine you'll need. Omega comes with a script dbi2omega which dumps an SQL database into something suitable for running scriptindex over. You'll need to figure out how to update the index yourself. If it's small, just rebuild the entire thing; if not, and you can detect changed entries from the SQL database, modifying dbi2omega to include a WHERE clause wouldn't be difficult.> The other question that came to mind is, once everything is indexed, > how is the data returned on a search?A lot of this depends on how your index plan works. Basically, scriptindex takes an input file (the data, produced in this case by dbi2omega) and an index file, which describes how the Xapian database is built. Say you have a table: ---------------------------------------------------------------------- CREATE TABLE `t` ( id INT NOT NULL, name VARCHAR(50) NOT NULL DEFAULT "", description VARCHAR(255) NOT NULL DEFAULT "", content TEXT NOT NULL DEFAULT "", PRIMARY KEY(id) ); ---------------------------------------------------------------------- You might decide that you want people to be able to search on contents of names, contents of descriptions, or across any text (name, description or content). The way you do searching within particular fields in Xapian is to use term prefixes. Look at docs/termprefixes.txt in the Omega distribution for a background on this. So you'd want to have: * name indexed with no prefix * name indexed with a prefix of 'S' (subject) * description indexed with no prefix * description indexed with a prefix of 'S' [1] * content indexed with no prefix Your index file might look like: ---------------------------------------------------------------------- id: field=id unique=Q name: field=name index index=S description: index index=S truncate=200 field=sample content: index ---------------------------------------------------------------------- The field= bits put fields into the document data, so that you can extract them later. (See the omegascript documentation.) You read the lines from left to right, so the 'description' one (for instance) says: * first, index the text * then index the text again, with a prefix of 'S' * then truncate the text to 200 characters at most, but avoiding truncating a word * then put the truncated text into the ``sample'' field Then you need to set up a suitable template in omega. I'd recommend using the default to start off with until you have a feeling for what's going on, and then start use the xml template to get data into the rest of your system, which can hook up against the database as needed. (Or you may not need this. Depends what you're trying to do.) J -- /--------------------------------------------------------------------------\ James Aylett xapian.org james@tartarus.org uncertaintydivision.org
John Wards
2007-Jul-25 16:52 UTC
[Xapian-discuss] Re: Practical example/explanation using an existing database
Hi Edwin, Here is some sample PHP code to build an index. Not perfect but works and should give you a kick start.... ----------- //Define some constants for ease of use define("XapianClientID",1); define("XapianInfo",2); define("XapianName",3); define("XapianCounty",4); define("XapianCountyID",5); $result = db_query("SELECT clientid,info,name,county,countyid FROM business"); include('xapian.php'); try { // Create a database, overwrite if exists. $database = new XapianWritableDatabase("/home/mydirectory/xapiantest/testdatabase", Xapian::DB_CREATE_OR_OVERWRITE); $indexer = new XapianTermGenerator(); $stemmer = new XapianStem("english"); $indexer->set_stemmer($stemmer); while ($data = db_fetch_assoc($result)) { $doc = new XapianDocument(); $doc->add_value(XapianClientID,$data["clientid"]); $doc->add_value(XapianInfo,$data["info"]); $doc->add_value(XapianName,$data["name"]); $doc->add_value(XapianCounty,$data["county"]); $doc->add_value(XapianCountyID,$data["countyid"]); //Adds a boolean term $doc->add_term("XCO".$data["countyid"]); //Assign the document to the TermGenerator which will generate the terms used for searching $indexer->set_document($doc); $indexer->index_text("$data["info"]") //Give this a higher weight $indexer->index_text("$data["name"]",2); . $indexer->index_text("$data["county"]"); // Add the document to the database. $database->add_document($doc); } // Set the database handle to Null to ensure that it gets closed // down cleanly or unflushed changes may be lost. $database = Null; } catch (Exception $e) { print $e->getMessage() . "\n"; exit(1); } ----------- Here is some code to search ---------- //Define some constants for ease of use define("XapianClientID",1); define("XapianInfo",2); define("XapianName",3); define("XapianCounty",4); define("XapianCountyID",5); if(isset($_GET["page"])){ $page = intval($_GET["page"]); }else{ $page =0; } include('xapian.php'); // Open the database for searching try { $database = new XapianDatabase("/home/mydirectory/xapiantest/testdatabase"); // Start an enquire session. $enquire = new XapianEnquire($database); $query_string = $_GET["q"]; $qp = new XapianQueryParser(); $stemmer = new XapianStem("english"); $qp->set_stemmer($stemmer); $qp->set_database($database); $qp->set_stemming_strategy(XapianQueryParser::STEM_ALL); $qp->set_default_op(XapianQuery::OP_AND); //Returns a XapianQuery Object $query = $qp->parse_query($query_string); if($_GET["inpcountyid"]){ $query = new XapianQuery(XapianQuery::OP_FILTER,$query,new XapianQuery('XCO'.$_GET["inpcountyid"])); } $enquire->set_query($query); //Returns a XapianMSet object $matches = $enquire->get_mset($page, 10); print "Parsed query is: {$query->get_description()}\n<br \>"; print "{$matches->get_matches_estimated()} results found:\n <br \>"; //Returns a XapianMSetIterator object $i = $matches->begin(); while (!$i->equals($matches->end())) { //Returns XapianDocument object $odoc = $i->get_document(); echo "Score: ".$i->get_percent()."%<br />"; echo "Client ID: ".$odoc->get_value(XapianClientID)."<br />"; echo "Bus Name: ".$odoc->get_value(XapianName)."<br />"; echo "Additional Info: ".$odoc->get_value(XapianInfo)."<br />"; echo "County: ".$odoc->get_value(XapianCounty)."<br />"; $i->next(); } } catch (Exception $e) { print $e->getMessage() . "\n"; exit(1); } ----------
James Aylett
2007-Jul-25 17:53 UTC
[Xapian-discuss] Re: Practical example/explanation using an existing database
On Wed, Jul 25, 2007 at 04:52:37PM +0100, John Wards wrote:> Here is some sample PHP code to build an index. Not perfect but works > and should give you a kick start....Can I suggest that you use document data as a field store, rather than document values? That way you can give them textual names, instead of mapping to numbers. J -- /--------------------------------------------------------------------------\ James Aylett xapian.org james@tartarus.org uncertaintydivision.org
Edwin Smulders
2007-Jul-25 19:10 UTC
[Xapian-discuss] Re: Practical example/explanation using an existing database
Well thanks everyone for the help, the amount of reactions is really great. I'll have to give everything a closer look, if I still have problems I'll come back here :) On 7/24/07, Edwin Smulders <edwin.smulders@gmail.com> wrote:> Hi, > > I'm reading up on the usage of Xapian to find out if we can use it for > Wine's Application Database, and I'm having a bit of trouble seeing > the general picture. I could use some practical information (through > words or code) on how to search an existing (mysql) database. > > As far as I can tell it can be used with a mysql db, and I read that > Xapian first makes an index (in it's own database/tables ?) and then > searches through that index. Now a few questions come to mind and I > couldn't find the answers in the documentation. > > Firstly, how exactly does the indexing work in regard to telling > Xapian what to search through? Do we write an SQL query returning all > the data we want indexed? or maybe do we tell it what tables/columns > to index (ie. does it generate queries?) > And how is the index updated, a regular rescan or an update whenever > data in our system updates? > > The other question that came to mind is, once everything is indexed, > how is the data returned on a search? This is best explained in an > example: > If a user would be entering a a search term and I (the programmer) > want to search the database, can i specifically tell Xapian to search > in for example the application names, or the descriptions, or both? > > I hope somebody can clarify this for me, right now it all looks quite > difficult to implement. > > > Edwin Smulders >
John Wards
2007-Jul-26 09:36 UTC
[Xapian-discuss] Re: Practical example/explanation using an existing database
On Wed, 2007-07-25 at 17:53 +0100, James Aylett wrote:> Can I suggest that you use document data as a field store, rather than > document values? That way you can give them textual names, instead of > mapping to numbers.How do you add data as a field store? In the API, which has taken me a few days to get my head around, all I see is the ability to add values and add data. As far as I can tell add data only takes one argument which is the data. Yes I could do some regular expression work when I get data but with defining my constants in an include file which I use on indexing and searching I have usable names to get/set values. Cheers John