Hi,

I am using Xapian to index a text file of 268M. Each line is treated as a document, and each line has two fields, like:

    13445511 | 111115151

There are 10000000 records, and XAPIAN_FLUSH_THRESHOLD is set to 1000000. Indexing the file takes 1026544ms, which is much slower than Lucene. Lucene manages about 40000 records per second.

Code:

    try {
        Xapian::WritableDatabase database("testindex", Xapian::DB_CREATE_OR_OPEN);
        mybase::Timeval now;  // timer, started here
        std::string line;
        while (getline(fin, line)) {
            // find() returns an unsigned size_type, so do not store it in an int
            std::string::size_type pos = line.find('|');
            if (pos != std::string::npos) {
                std::string imsi = line.substr(0, pos);
                std::string msisdn = line.substr(pos + 1);
                // one document per line, with each field added as a term
                Xapian::Document doc;
                doc.add_term(imsi);
                doc.add_term(msisdn);
                database.add_document(doc);
            }
        }
        database.close();
        std::cout << now.elapsed() << std::endl;
    } catch (const Xapian::Error& error) {
        std::cout << error.get_msg() << std::endl;
    }

The following is the resulting index:

    total 1.9G
    -rw-rw-r-- 1 warren warren    0 11-21 17:07 flintlock
    -rw-rw-r-- 1 warren warren   28 11-21 17:07 iamchert
    -rw-rw-r-- 1 warren warren  22K 11-21 17:24 postlist.baseA
    -rw-rw-r-- 1 warren warren  20K 11-21 17:22 postlist.baseB
    -rw-rw-r-- 1 warren warren 1.4G 11-21 17:24 postlist.DB
    -rw-rw-r-- 1 warren warren 2.0K 11-21 17:24 record.baseA
    -rw-rw-r-- 1 warren warren 1.8K 11-21 17:22 record.baseB
    -rw-rw-r-- 1 warren warren 121M 11-21 17:24 record.DB
    -rw-rw-r-- 1 warren warren 6.7K 11-21 17:24 termlist.baseA
    -rw-rw-r-- 1 warren warren 6.1K 11-21 17:22 termlist.baseB
    -rw-rw-r-- 1 warren warren 428M 11-21 17:24 termlist.DB

That is too big! Is there a problem with my code, and is there any way to improve indexing speed?

Thank you
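To put the figures from the post on the same scale, here is a small sketch of the arithmetic. The helper names `records_per_second` and `size_ratio` are made up for illustration and are not part of Xapian or Lucene.

```cpp
#include <cassert>
#include <cstdint>

// Records indexed per second, given a record count and the elapsed
// time in milliseconds (the unit reported in the post).
double records_per_second(std::uint64_t records, std::uint64_t elapsed_ms) {
    assert(elapsed_ms > 0);
    return static_cast<double>(records) * 1000.0 /
           static_cast<double>(elapsed_ms);
}

// Ratio of index size to input size; both must use the same unit.
double size_ratio(double index_size, double input_size) {
    assert(input_size > 0);
    return index_size / input_size;
}
```

With the numbers above, `records_per_second(10000000, 1026544)` comes to about 9,740 records per second, roughly a quarter of the 40000 records per second quoted for Lucene. Taking the listing's 1.9G as about 1945M, `size_ratio(1945.0, 268.0)` is a little over 7.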
On Wed, Nov 21, 2012 at 05:46:26PM +0800, superthread wrote:
> i use xapian to index a txt file, it's size is 268M. i take each line
> as a document, and each line has two field like 13445511 | 111115151.
> the recored size is 10000000. the XAPIAN_FLUSH_THRESHOLD set 1000000.

How did you pick that XAPIAN_FLUSH_THRESHOLD setting? It could be it's not as high as you could set it, or it could be it's high enough that you're creating VM pressure and a lower setting would actually be faster.

Also, what version of Xapian are you using, and with which database backend? One of the changes in brass over chert is:

  + Batched posting list changes during indexing use significantly less
    memory.

So using brass should at least allow you to set XAPIAN_FLUSH_THRESHOLD higher, and the reduced memory usage might make it faster even for the same setting.

These are very small documents, which isn't a case I think anyone has looked at closely, so it would be interesting to profile it. There are some tips here:

http://trac.xapian.org/wiki/ProfilingXapian

Cheers,
    Olly
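Since the threshold could be either too low or too high, one way to find out is to time the same indexing run at several settings. The following is only a sketch under some assumptions: `sweep_flush_threshold` and the `index_once` callable are hypothetical names, `setenv` is POSIX rather than standard C++, and it assumes XAPIAN_FLUSH_THRESHOLD is read from the environment when the writable database is opened, so the callable should open a fresh database itself.

```cpp
#include <cassert>
#include <chrono>
#include <cstdlib>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Runs index_once once per candidate threshold and returns
// (threshold, elapsed milliseconds) pairs.
// index_once is a hypothetical callable: it should create a fresh
// database, index the whole input, and close the database.
std::vector<std::pair<long, long long>> sweep_flush_threshold(
        const std::vector<long>& candidates,
        const std::function<void()>& index_once) {
    assert(index_once);  // an empty std::function cannot be called
    std::vector<std::pair<long, long long>> timings;
    for (long threshold : candidates) {
        // Set the variable before the callable opens the database
        // (setenv is POSIX, not standard C++).
        setenv("XAPIAN_FLUSH_THRESHOLD",
               std::to_string(threshold).c_str(), 1);
        auto start = std::chrono::steady_clock::now();
        index_once();
        auto stop = std::chrono::steady_clock::now();
        timings.emplace_back(
            threshold,
            std::chrono::duration_cast<std::chrono::milliseconds>(
                stop - start).count());
    }
    return timings;
}
```

The callable would wrap the indexing loop from the original post, writing to a new database directory each time so that the runs stay comparable. Watching memory use alongside the timings should show whether a given setting is creating VM pressure.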