Dulitha Kularathne
2015-Feb-23 17:26 UTC
[Xapian-devel] GSOC 2015 Performance Test Suite Project
Hi all,

In the following path it seems like some performance tests are already defined:

    xapian/xapian-core/tests/perftest/

So can you give me some explanation regarding those tests? To what extent are they complete, and what more is expected? For example, which areas are already tested for each of the performance requirements (speed, memory, disk space, etc.)?

Please also enlighten me about the aspects that are expected to be tested. For example, if the focus is performance, what kind of data is expected to be processed, and are there any specific operations that are performance critical?

I hope that the data population for a performance test would take a significant part.

Thanks.
Hi Dulitha,

On Mon, Feb 23, 2015 at 10:56:21PM +0530, Dulitha Kularathne wrote:
> In the following path it seems like some performance tests are already
> defined:
>
>     xapian/xapian-core/tests/perftest/
>
> So can you give me some explanation regarding those tests? To what
> extent are they complete, and what more is expected?

Quoting from http://trac.xapian.org/wiki/GSoCProjectIdeas#Project:PerformanceTestSuite :

| xapian-core/tests/perftest/ contains some "performance tests", but
| they use randomly generated data, so the results may not reflect what
| users will see

We don't want to test on randomly generated data unless we can somehow be sure that its characteristics are representative of real data in the ways which affect what we're trying to measure (which is hard to ensure).

As a particular example, one benchmark using randomly generated queries which I saw many years ago showed Xapian as slower than the system it was being compared to. But if you actually looked at the queries it was slower on, they were cases where the query didn't match anything, and the randomly generated queries exercised that case far more than real-world queries would - words used in real-world queries don't occur anything like independently. If you excluded the non-matching queries, Xapian was dramatically faster than the other system.

While that obviously suggests there was scope for improving Xapian's handling of cases where there are no matches, I would say the main lesson to take away is that randomly generating test data for performance tests can easily lead to bogus results. Hence:

| The tests should really use real-world data for both the documents
| being indexed and the queries being run.

It's not hard to find freely licensed document sets (Wikipedia, for example). Finding one with suitable corresponding query logs is rather harder, mostly because query logs tend to end up including sensitive data (addresses, credit card numbers, phone numbers, etc.), and so there's the cost of sanitising them.

> For example, which areas are already tested for each of the
> performance requirements (speed, memory, disk space, etc.)?

I would suggest you study the existing code to determine that. You'll want to be familiar with it before writing your proposal - if you're using it, you'll need to know what it does; if you aren't using it, you'll need to be able to explain clearly why you're not planning to use it.

> Please also enlighten me about the aspects that are expected to be
> tested. For example, if the focus is performance, what kind of data is
> expected to be processed, and are there any specific operations that
> are performance critical?

Searching is particularly speed sensitive (as users are usually waiting for results while we perform the search), but indexing speed is also important.

> I hope that the data population for a performance test would take a
> significant part.

Sorry, I don't understand what you are trying to say here.

Cheers,
    Olly
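To give a concrete feel for the kind of measurement discussed above, here is a minimal sketch using Xapian's Python bindings (the existing perftest harness is C++; Python is used here purely for brevity). It times each query from a sanitised query log against a pre-built database; "./testdb" and "queries.txt" are hypothetical placeholder paths, not part of the existing test suite.

    import time
    import xapian

    # Open a pre-built index (placeholder path).
    db = xapian.Database("./testdb")

    # Parse queries the same way a real front end would.
    qp = xapian.QueryParser()
    qp.set_database(db)
    qp.set_stemmer(xapian.Stem("en"))
    qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)

    enquire = xapian.Enquire(db)

    with open("queries.txt") as f:   # one real-world query per line (placeholder log)
        for line in f:
            query = qp.parse_query(line.strip())
            enquire.set_query(query)
            start = time.perf_counter()
            mset = enquire.get_mset(0, 10)   # fetch the top 10 matches
            elapsed = time.perf_counter() - start
            # Report per-query time together with how many documents matched,
            # so non-matching queries can be analysed separately (see the
            # random-query anecdote above).
            print("%.6fs  ~%d matches  %s"
                  % (elapsed, mset.get_matches_estimated(), line.strip()))

A real suite would also need repeatable runs, warm/cold cache control, and indexing timings, but per-query wall-clock time over a realistic query log is the core of the search-speed measurement.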