john.alveris at Safe-mail.net
2015-Jul-26 18:39 UTC
[Xapian-discuss] Get term from document by position
mple (see attachment).
>
> Attachments get stripped out by the mailing list, so I've made a private gist of the two files here: <https://gist.github.com/jaylett/ce8455b37e2b84422346>.
>
> Actually, when I run it I get 0 matches, which would explain why you're just getting the start of the document. However if I adjust things (match the stemming strategy for TermGenerator to that for QueryParser), it still gives me the opening rather than a useful snippet.

Sorry, my mistake. The modified test.cpp file should be this (I just added indexer.set_stemming_strategy(Xapian::TermGenerator::STEM_ALL_Z), line 34):

============= Begin of the modified test.cpp file========
#include <xapian.h>
#include <iostream>
#include <string>
#include <cstdlib> // For exit().
#include <cstring>
#include <fstream>

class MyText {
public:
    std::string text_str;
    void set_string();
};

std::string database_dir="db_dir";
std::string query_string="extracellular microbe";

int main(int argc, char **argv)
{
    // indexing
    Xapian::WritableDatabase db_w(database_dir, Xapian::DB_CREATE_OR_OVERWRITE);

    MyText text_to_index;
    text_to_index.set_string();

    Xapian::TermGenerator indexer;
    Xapian::Stem stemmer("english");
    indexer.set_stemmer(stemmer);

    Xapian::Document doc;
    indexer.set_document(doc);
    indexer.set_stemming_strategy(Xapian::TermGenerator::STEM_ALL_Z);
    indexer.index_text(text_to_index.text_str);

    db_w.add_document(doc);
    db_w.commit();
    db_w.close();

    //searching
    Xapian::Database db(database_dir);
    Xapian::Enquire enquire(db);

    Xapian::QueryParser qp;
    qp.set_stemmer(stemmer);
    qp.set_database(db);
    qp.set_default_op(Xapian::Query::OP_NEAR);
    qp.set_stemming_strategy(Xapian::QueryParser::STEM_ALL_Z);

    std::cout << "\n###################################################\n";
    std::cout << "query string: " << query_string << "\n";
    std::cout << "\n###################################################\n";

    Xapian::Query query = qp.parse_query(query_string);
    std::cout << "\nParsed query is: " << query.get_description() << "\n\n\n";

    // Find the top 10 results for the query.
    enquire.set_query(query);
    Xapian::MSet matches = enquire.get_mset(0, 10);

    // Display the results.
    std::cout << matches.get_matches_estimated() << " results found.\n";

    Xapian::Snipper snippet_generator;
    snippet_generator.set_stemmer(stemmer);
    snippet_generator.set_mset(matches);
    std::string snippet=snippet_generator.generate_snippet(text_to_index.text_str);

    std::cout << "\n###################################################\n";
    std::cout << "snippet:\n" << snippet << "\n";
    std::cout << "\n###################################################\n";

    //cout << "Matches 1-" << matches.size() << ":\n" << endl;
    //for (Xapian::MSetIterator i = matches.begin(); i != matches.end(); ++i) {
    //    cout << i.get_rank() + 1 << ": " << i.get_weight() << " docid=" << *i
    //         << " [" << i.get_document().get_data() << "]\n\n";
}

//saves content of text.txt to text_str
//
void MyText::set_string()
{
    text_str="";
    std::ifstream myfile ("text.txt");
    std::string line;
    if (myfile.is_open()) {
        while ( std::getline (myfile,line) ) {
            text_str=text_str+" "+line;
        }
        myfile.close();
    } else {
        std::cout << "Unable to open file text.txt";
        exit(1);
    }
}
On 26 Jul 2015, at 19:39, john.alveris at Safe-mail.net wrote:

>> Actually, when I run it I get 0 matches, which would explain why you're just getting the start of the document. However if I adjust things (match the stemming strategy for TermGenerator to that for QueryParser), it still gives me the opening rather than a useful snippet.
>
> Sorry, my mistake. The modified test.cpp file should [have]
> indexer.set_stemming_strategy(Xapian::TermGenerator::STEM_ALL_Z)

John, that gave a single match, as expected. I played around with the Snipper (under Python, but that won't make a difference), indexing each page as a separate document, and it does give query-aware snippets, however:

1. It only provides one, where your approach can provide "first instance...second instance..." kinds of snippet (which in some circumstances is considerably more useful).

2. It didn't reliably find what I'd consider the "best" single snippet.

I don't understand the approach that's being used in Snipper, so I don't know if it's a question of tuning the approach, making some algorithmic part of it more flexible or swappable, or if we need multiple ways of attacking the problem dependent on details of the data and queries. From looking at the code, though, it does share some of the things that you're doing. If you haven't looked at the source, it's probably worth doing so to see how it works with term positions (even though it may turn out to be no more efficient than what you're doing).

J

-- 
James Aylett, occasional trouble-maker
xapian.org
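
To make the "first instance...second instance..." idea concrete, here is a minimal sketch in plain C++ of stitching several context windows into one snippet. It is neither what Snipper does nor necessarily John's approach. The function name multi_snippet and its parameters are made up for illustration. It assumes the document has already been split into words and that the matching positions are given as sorted 0-based word indexes. With Xapian these could perhaps be derived from a term's position list, for example via Xapian::Database::positionlist_begin(), after converting the positions to indexes into the word list.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not part of Xapian: join the context windows around
// each matching word into a single "first...second..." style snippet.
//   words   - the document text, already split into words
//   hits    - sorted 0-based indexes into `words` of the matching words
//   context - number of words to keep on each side of a hit
std::string multi_snippet(const std::vector<std::string>& words,
                          const std::vector<std::size_t>& hits,
                          std::size_t context)
{
    // Build half-open [begin, end) windows, merging any that overlap or touch.
    std::vector<std::pair<std::size_t, std::size_t>> windows;
    for (std::size_t pos : hits) {
        if (pos >= words.size()) continue;  // ignore out-of-range positions
        std::size_t begin = pos > context ? pos - context : 0;
        std::size_t end = std::min(words.size(), pos + context + 1);
        if (!windows.empty() && begin <= windows.back().second) {
            windows.back().second = std::max(windows.back().second, end);
        } else {
            windows.emplace_back(begin, end);
        }
    }

    // Emit the windows, putting "..." wherever text has been skipped.
    std::string out;
    for (std::size_t k = 0; k < windows.size(); ++k) {
        if (k > 0 || windows[k].first > 0) out += "...";
        for (std::size_t i = windows[k].first; i < windows[k].second; ++i) {
            if (i > windows[k].first) out += ' ';
            out += words[i];
        }
    }
    if (!windows.empty() && windows.back().second < words.size()) out += "...";
    return out;
}
```

For example, with the ten words a b c d e f g h i j, hits at indexes 1 and 8 and one word of context, this gives "a b c...h i j". Hits at 4 and 5 merge into a single window and give "...d e f g...". Choosing which windows are "best" when there are too many to show is a separate ranking problem that this sketch does not attempt.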