Sanat Jain
2016-Mar-14  18:59 UTC
query regarding matcher optimisation and proposal submission
Hello Sir/Ma'am
I would be really grateful if you could clear my following queries:
1) In ticket #215 (Boolean OR could be optimised further)
https://trac.xapian.org/ticket/215
 (i) Is there a predefined function to sort the posting lists in order of
term frequency? If yes, where can I find it?
 (ii) What does the following paragraph from the above link mean:
 "We'd need to keep track of which sub-postlists have been moved up to the
 current position, and which haven't. When next() is called, we'd call
 next() on any sub-postlists which are up-to-date, but we would need to call
 skip_to() on any other sub-postlists which are further behind."
 (iii) And can you please tell me what the difference is between next() and
skip_to()?
2) Is there any explanation for ticket #394 (Speed up phrase queries with a
"settling pond")?
3) Also, can you please tell me where I can find some explanation of
OP_SYNONYM, as required in
ticket #400 (Optimise AND_MAYBE when the RHS has a maxweight of 0)
https://trac.xapian.org/ticket/400
4) I am new to GSoC, so can you please guide me on where I should submit my
first draft proposal for your feedback? Should it be this email, or
should I submit it on the main GSoC website and then edit it later?
5) I am planning to take ticket #215 before the mid-term evaluation and ticket
#400 or #394 after it. Please guide me if this is an acceptable approach or
suggest any changes.
Thank you so much for your kind help.
Regards,
Sanat Kumar Jain
B.E. Computer Engineering
Thapar University, Patiala
India (GMT +5hr 30min)
Olly Betts
2016-Mar-15  00:34 UTC
query regarding matcher optimisation and proposal submission
Hi Sanat,

On Tue, Mar 15, 2016 at 12:29:41AM +0530, Sanat Jain wrote:
> 1) In the ticket #215 (Boolean OR could be optimised further)
> https://trac.xapian.org/ticket/215
>
> (i) Is there a predefined function to sort the posting lists in order of
> term frequency? If yes then where can I find it?

You'll need to learn to find your own way around the code. We aim to
answer questions promptly, but we aren't available 24/7 - if you stop, ask
a question and have to wait for an answer every time you hit an unknown,
that's going to seriously reduce the amount you can get done.

Tools like "git grep" are very useful for finding answers to such
questions:

    $ git grep -i postlist|grep -i sort|grep -i freq
    api/queryinternal.cc: // Sort the postlists so that the postlist with the greatest term frequency
    api/queryinternal.cc: sort(pls.begin(), pls.end(), ComparePostListTermFreqAscending());

Admittedly I have the advantage of having some idea of what the answer
looks like, but even just a search for all the places that call sort()
would give you a manageable list of places to look:

    $ git grep '\<sort('

> (ii) What does the following paragraph means as given in the above link:
>
> "We'd need to keep track of which sub-postlists have been moved up to the
> current position, and which haven't. When next() is called, we'd call
> next() on any sub-postlists which are up-to-date, but we would need to call
> skip_to() on any other sub-postlists which are further behind."
>
> (iii) And can you please tell me what is the difference between next() and
> skip_to()?

Look at the header where the class you're interested in is defined, and in
most cases the methods have documentation comments - in this case, see
api/postlist.h. It would be hard to explain the quoted paragraph if you
don't understand the purpose of these methods.

> 2) is there any explanation for ticket #394 (Speed up phrase queries with a
> "settling pond")

There's an explanation in the ticket's description...
I assume you've read that and it wasn't what you wanted, but it's hard for
me to know what you are after. It's generally very hard to give a helpful
answer to vague or general questions. Try to ask precise questions if you
want helpful answers.

> 3) Also can you please tell me where can I find some explanation of
> OP_SYNONYM as required in
>
> Ticket #400 (Optimise AND_MAYBE when the RHS has a maxweight of 0)
>
> https://trac.xapian.org/ticket/400

In the API docs for the Xapian::Query class (or you can find everywhere it
is used with "git grep OP_SYNONYM").

> 4) I am new to GSOC so can you please guide me, where should I submit my
> first draft proposal to you for your feedback, should it be this email or
> should I submit it on GSOC main website and then edit it later?

The GSoC website has been completely redone this year, and I'm not yet
sure what the new workflow for this is. I know the final proposals are
submitted as PDFs this year, and I can see "Draft Proposals" and "PDF
Proposals" in the dashboard, so I guess you can submit a "Draft Proposal"
as something other than PDF, but we don't yet have any proposals
submitted, so I can't see what they look like.

PDF isn't the most helpful format for review, as it's not simple for us to
"diff" two versions of a proposal in PDF form to see what's changed, and
having to reread a whole proposal every time you make some changes is
inefficient. If you're working in some text source format (LaTeX,
reStructuredText, etc) then showing us the source is more helpful. You
might want to just stick the source in a git repo, which means you can't
lose it if your computer crashes, and we can easily diff versions, etc.

You're welcome to send drafts to the mailing list. Please don't send them
by private email to mentors though, as it's impossible for us to track
what's going on then.
You don't need to be concerned about plagiarism - it is very obvious if we
get two proposals with text in common, and it'll be clear who actually
wrote the text in question. Passing off other people's work as your own is
particularly taboo in the FOSS world.

> 5) i am planning to take ticket #215 before mid term evaluation and ticket
> #400 or # 394 after it, please guide me if this is acceptable approach or
> suggest any changes.

You don't say much about your level of experience, but two tickets in
three months seems a little unambitious on the face of it.

We expect a proposal to analyse the work to be done and break it down into
small enough jobs that you can sensibly reason about how long they might
take. I'd suggest taking each ticket in turn and doing that until you
think you have 3 months' worth of work. It's also a good idea to include
some "stretch goals", so you have a plan for how to fill the time if
things go faster than expected.

Cheers,
    Olly
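
For anyone following question 1(iii), here is a minimal sketch of the usual
distinction between next() and skip_to() on a posting list, plus a sort by
term frequency in the spirit of the git grep output quoted above. It
assumes a toy in-memory list of document ids. ToyPostList,
sort_by_termfreq() and their signatures are hypothetical and are not
Xapian's PostList interface, whose documented behaviour lives in
api/postlist.h.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical toy posting list: the sorted ids of documents containing a
// term. This is NOT Xapian's PostList class; names are illustrative only.
class ToyPostList {
    std::vector<unsigned> docids_;  // ascending, no duplicates
    std::size_t pos_ = 0;           // index of the current entry
    bool started_ = false;          // false until next() or skip_to() is called

  public:
    explicit ToyPostList(std::vector<unsigned> docids)
        : docids_(std::move(docids)) {}

    // Number of documents the term occurs in.
    std::size_t get_termfreq() const { return docids_.size(); }

    // True once the list has been started and has run past its last entry.
    bool at_end() const { return started_ && pos_ >= docids_.size(); }

    // Document id at the current position (only valid when not at_end()).
    unsigned get_docid() const {
        assert(started_ && pos_ < docids_.size());
        return docids_[pos_];
    }

    // Step exactly one entry forward (or onto the first entry when called
    // for the first time).
    void next() {
        if (!started_) {
            started_ = true;
        } else if (pos_ < docids_.size()) {
            ++pos_;
        }
    }

    // Jump forward to the first entry whose docid is >= target. The search
    // starts at the current position, so this never moves backwards.
    void skip_to(unsigned target) {
        started_ = true;
        auto first = docids_.begin() + static_cast<std::ptrdiff_t>(pos_);
        auto it = std::lower_bound(first, docids_.end(), target);
        pos_ = static_cast<std::size_t>(it - docids_.begin());
    }
};

// Order posting lists by ascending term frequency. The lambda is an
// assumption made for this sketch, not Xapian's actual comparator.
void sort_by_termfreq(std::vector<ToyPostList>& pls) {
    std::sort(pls.begin(), pls.end(),
              [](const ToyPostList& a, const ToyPostList& b) {
                  return a.get_termfreq() < b.get_termfreq();
              });
}
```

In this toy model, a sub-postlist already sitting at the current position
only needs next() to advance, whereas one that has fallen behind can use
skip_to() to jump straight to the docid of interest instead of stepping
through every entry in between.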