I'm having performance problems with document.get_data() and enquire.get_mset(). enquire.get_mset(0, 200) takes 35 seconds, and iterating over all the documents in the matches to call get_data() takes another 3 seconds. Is that normal? My index contains 30 million documents. I'm using the Python bindings for Xapian.

Below is my index structure:

    # value: 0:date, 1:site
    # data: JSON message which contains: author, url, message (30 words)

Do you have any ideas for improving search performance, especially doc.get_data()?

My code snippet:

    database = xapian.Database("%s/athena" % DATA_PATH)
    enquire = xapian.Enquire(database)
    enquire.set_weighting_scheme(xapian.BM25Weight())
    query = parse(keywords)
    enquire.set_query(query)
    matches = enquire.get_mset(start, 200)
    matches.fetch()
    result = [json.loads(match.document.get_data()) for match in matches]
On Wed, Oct 23, 2013 at 01:30:51PM +0800, Tong Liu wrote:
> I'm having performance problems with document.get_data() and
> enquire.get_mset(). enquire.get_mset(0, 200) takes 35 seconds, and
> iterating over all the documents in the matches to call get_data()
> takes another 3 seconds. Is that normal? My index contains 30 million
> documents. I'm using the Python bindings for Xapian.
>
> Below is my index structure:
> # value: 0:date, 1:site
> # data: JSON message which contains: author, url, message (30 words)

That sounds much slower than I'd expect. Is that the cold cache time?
If so, does rerunning the same query take much less time?

> Do you have any ideas for improving search performance, especially
> doc.get_data()?
>
> My code snippet:
>
> database = xapian.Database("%s/athena" % DATA_PATH)
> enquire = xapian.Enquire(database)
> enquire.set_weighting_scheme(xapian.BM25Weight())
> query = parse(keywords)

What are you passing in for keywords here?

> enquire.set_query(query)
> matches = enquire.get_mset(start, 200)

Is start 0 here?

> matches.fetch()

With a local database, it probably won't help to call fetch().

> result = [json.loads(match.document.get_data()) for match in matches]

So your time includes parsing the JSON - try changing that to this to
focus on the time actually taken by Xapian and its Python bindings:

    result = [match.document.get_data() for match in matches]

Also, what Xapian version are you using?

Cheers,
    Olly
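The suggestion above to separate the time spent in Xapian from the time spent parsing JSON could be sketched as a small timing helper. This is a minimal sketch, not code from the thread: `time_stages`, its parameters, and the stand-in data are hypothetical names chosen for illustration, and the stand-in data replaces a real database so the block runs without Xapian installed. With the setup from the original post, one would presumably pass `lambda: enquire.get_mset(0, 200)` as `run_query` and `lambda m: m.document.get_data()` as `get_raw`.

```python
import json
import time


def time_stages(run_query, get_raw, decode=json.loads):
    """Time the match, data-fetch and decode stages separately.

    run_query()    -> iterable of matches (hypothetical callable)
    get_raw(match) -> raw document data for one match
    decode(raw)    -> parsed record (json.loads by default)
    """
    t0 = time.perf_counter()
    matches = run_query()                    # stage 1: run the query
    t1 = time.perf_counter()
    raw = [get_raw(m) for m in matches]      # stage 2: fetch document data
    t2 = time.perf_counter()
    decoded = [decode(r) for r in raw]       # stage 3: parse the JSON
    t3 = time.perf_counter()
    timings = {"match": t1 - t0, "fetch": t2 - t1, "decode": t3 - t2}
    return decoded, timings


# Stand-in data (an assumption for illustration, not a real index).
fake_docs = ['{"author": "a", "url": "u", "message": "m"}'] * 3
decoded, timings = time_stages(lambda: fake_docs, lambda m: m)

for stage, seconds in timings.items():
    print("%-6s %.6f s" % (stage, seconds))
```

Reporting the three stages separately shows whether the JSON decoding contributes meaningfully to the total, which is the question the reply raises.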
On 2013/10/30, Olly Betts <olly at survex.com> wrote:

1.
> That sounds much slower than I'd expect. Is that the cold cache time?
> If so, does rerunning the same query take much less time?

Yes, it's the cold cache time. Running the same query again is faster.

2.
> > query = parse(keywords)
>
> What are you passing in for keywords here?

Terms such as: BMW, iPhone5s...

3.
> > matches = enquire.get_mset(start, 200)
>
> Is start 0 here?

Yes, start=0.

4.
> So your time includes parsing the JSON - try changing that to this to
> focus on the time actually taken by Xapian and its python bindings:
>
> result = [match.document.get_data() for match in matches]

I tried running it without decoding the JSON, but there is no difference.

The Xapian version is 1.2.15.
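The cold-versus-warm cache difference discussed in this thread could be measured by timing the same query twice in a row. The following is a minimal sketch under stated assumptions: `time_repeated` and `fake_query` are hypothetical names, and `fake_query` stands in for a real call such as `lambda: enquire.get_mset(0, 200)` so that the block runs without a Xapian database.

```python
import time


def time_repeated(fn, runs=2):
    """Call fn() `runs` times and return the wall-clock time of each call."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in for a real query (an assumption for illustration only).
calls = []


def fake_query():
    calls.append(1)
    return sum(range(1000))


durations = time_repeated(fake_query, runs=2)
print(["%.6f s" % d for d in durations])
```

With a real database, the first duration reflects a cold cache only if the database files are not already held in the operating system's file cache; a large gap between the first and second run would suggest that disk I/O dominates the first query.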