Hi,
I am using Xapian to index a text file whose size is 268M. I treat each line
as a document, and each line has two fields, like 13445511 | 111115151. There
are 10000000 records. XAPIAN_FLUSH_THRESHOLD is set to 1000000. It takes
1026544ms to index the file, which is much slower than Lucene. Lucene's speed
is about 40000 records per second.
Code:
try
{
    Xapian::WritableDatabase database("testindex",
                                      Xapian::DB_CREATE_OR_OPEN);
    mybase::Timeval now;
    std::string line;
    while (getline(fin, line))
    {
        // find() returns std::string::size_type, so don't store it in an int.
        std::string::size_type pos = line.find('|');
        if (pos != std::string::npos)
        {
            // One document per line, with each field added as a term.
            std::string imsi = line.substr(0, pos);
            std::string msisdn = line.substr(pos + 1);
            Xapian::Document doc;
            doc.add_term(imsi);
            doc.add_term(msisdn);
            database.add_document(doc);
        }
    }
    database.close();
    std::cout << now.elapsed() << std::endl;
}
catch (const Xapian::Error& error)
{
    std::cout << error.get_msg() << std::endl;
}
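A side note on the parsing: if the file really has spaces around the '|' as in the sample line, the two substr() calls leave that whitespace inside the terms. The following is only a minimal sketch of one way to split and trim a line using the standard library; the helper names trim and split_record are hypothetical and not part of Xapian or of the code above.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: strip leading and trailing whitespace from s.
inline std::string trim(const std::string& s)
{
    const char* ws = " \t\r\n";
    const std::string::size_type first = s.find_first_not_of(ws);
    if (first == std::string::npos)
        return std::string();  // s is empty or all whitespace
    const std::string::size_type last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: split "imsi | msisdn" at the first '|'.
// Returns false if there is no '|' or if either field is empty after trimming.
inline bool split_record(const std::string& line,
                         std::string& imsi,
                         std::string& msisdn)
{
    const std::string::size_type pos = line.find('|');
    if (pos == std::string::npos)
        return false;
    imsi = trim(line.substr(0, pos));
    msisdn = trim(line.substr(pos + 1));
    return !imsi.empty() && !msisdn.empty();
}

// Small self-check using the sample line from the post.
inline void check_sample_line()
{
    std::string imsi, msisdn;
    assert(split_record("13445511 | 111115151", imsi, msisdn));
    assert(imsi == "13445511");
    assert(msisdn == "111115151");
}
```

In the indexing loop, the trimmed imsi and msisdn would then be passed to doc.add_term() in place of the raw substrings.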
The following is the resulting index:
total 1.9G
-rw-rw-r-- 1 warren warren 0 11-21 17:07 flintlock
-rw-rw-r-- 1 warren warren 28 11-21 17:07 iamchert
-rw-rw-r-- 1 warren warren 22K 11-21 17:24 postlist.baseA
-rw-rw-r-- 1 warren warren 20K 11-21 17:22 postlist.baseB
-rw-rw-r-- 1 warren warren 1.4G 11-21 17:24 postlist.DB
-rw-rw-r-- 1 warren warren 2.0K 11-21 17:24 record.baseA
-rw-rw-r-- 1 warren warren 1.8K 11-21 17:22 record.baseB
-rw-rw-r-- 1 warren warren 121M 11-21 17:24 record.DB
-rw-rw-r-- 1 warren warren 6.7K 11-21 17:24 termlist.baseA
-rw-rw-r-- 1 warren warren 6.1K 11-21 17:22 termlist.baseB
-rw-rw-r-- 1 warren warren 428M 11-21 17:24 termlist.DB
Too big!
Is there any problem with my code, and is there any way to improve indexing speed?
Thank you

On Wed, Nov 21, 2012 at 05:46:26PM +0800, superthread wrote:
> i use xapian to index a txt file, it's size is 268M. i take each line
> as a document, and each line has two field like 13445511 | 111115151.
> the recored size is 10000000. the XAPIAN_FLUSH_THRESHOLD set 1000000.

How did you pick that XAPIAN_FLUSH_THRESHOLD setting? It could be it's not
as high as you could set it, or it could be it's high enough that you're
creating VM pressure and a lower setting would actually be faster.

Also, what version of Xapian are you using, and with which database
backend? One of the changes in brass over chert is:

  + Batched posting list changes during indexing use significantly less
    memory.

So using brass should at least allow you to set XAPIAN_FLUSH_THRESHOLD
higher, and the reduced memory usage might make it faster even for the same
setting.

These are very small documents, which isn't a case I think anyone has looked
at closely, so it would be interesting to profile it. There are some tips
here:

http://trac.xapian.org/wiki/ProfilingXapian

Cheers,
    Olly
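
When comparing runs with different XAPIAN_FLUSH_THRESHOLD settings or backends, it may help to convert each elapsed time into records per second so it can be set against the Lucene figure directly. With the numbers from the original post, 10000000 records in 1026544ms is roughly 9741 records per second, or about 4.1 times slower than the quoted 40000 records per second. A minimal sketch of that arithmetic follows; the function names are hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: records indexed per second, given the number of
// records and the elapsed wall-clock time in milliseconds.
inline double records_per_second(double records, double elapsed_ms)
{
    assert(elapsed_ms > 0.0);
    return records * 1000.0 / elapsed_ms;
}

// Hypothetical helper: how many times faster rate_a is than rate_b.
inline double speedup(double rate_a, double rate_b)
{
    assert(rate_b > 0.0);
    return rate_a / rate_b;
}
```

For example, speedup(40000.0, records_per_second(10000000.0, 1026544.0)) gives the ratio between the quoted Lucene rate and the measured Xapian rate.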