Chi Liu
2014-Mar-16 22:13 UTC
[Xapian-devel] [GSoC 2014] About "Clustering of Search Results"
Hello,

I have submitted my GSoC proposal, but I am unsure about the timeline. Many things are difficult to determine in advance.

Cheers,
Liu Chi

2014-03-11 21:33 GMT+08:00 Olly Betts <olly at survex.com>:
> On Tue, Mar 11, 2014 at 10:11:31AM +0800, Chi Liu wrote:
> > Thank you for your patient explanation about the project. My
> > understanding about the project "Clustering of Search Results" is that
> > we mainly focus on processing speed of the existing code.
>
> We need something which can cluster larger result sets faster than the
> current code. Speeding up the existing code might be the best way to do
> that, but we could start again. If we start again, I'd suggest it would
> be prudent to try to understand why the previous attempt didn't succeed.
> We don't want to end up repeating that.
>
> > By "find new approaches" I mean trying other known clustering algorithms.
>
> OK - that's fine then.
>
> > What I am concerned about is whether the low efficiency is caused by
> > an improper algorithm. I am reading the existing clustering branch code
> > and have not completely finished yet. I might be able to talk more
> > about the existing code in my GSoC application. But for now, I really
> > cannot comment before fully understanding the existing code.
>
> Sure.
>
> > My idea for measuring clustering effectiveness is that when we try
> > other known clustering algorithms, we can use the old clustering
> > result as a baseline. If the difference between the clustering results
> > is acceptable and the new clustering algorithm is more efficient, we
> > may have found a better approach. I will give more details about this
> > in my GSoC application.
>
> Great.
>
> Cheers,
> Olly

-- 
Chi Liu
+86-15210624786
Undergraduate Student
Team of Search Engine and Web Mining
School of Electronic Engineering and Computer Science
Peking University, Beijing, 100871, P.R.China
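The baseline idea quoted above (treat the old clustering result as a reference and check how far a new algorithm's result differs from it) can be made concrete with a pairwise agreement score. The following is only a minimal sketch using the Rand index, which the thread does not name; the function name `rand_index`, the docids, and the cluster labels are all hypothetical, and nothing here comes from the Xapian clustering branch.

```python
from itertools import combinations


def rand_index(baseline, candidate):
    """Fraction of document pairs on which two clusterings agree.

    Both arguments map a document id to a cluster label. A pair counts
    as an agreement when both clusterings put the two documents in the
    same cluster, or both put them in different clusters. Label names
    themselves do not matter, only the grouping.
    """
    if baseline.keys() != candidate.keys():
        raise ValueError("both clusterings must cover the same documents")
    pairs = list(combinations(sorted(baseline), 2))
    if not pairs:
        return 1.0  # zero or one document: nothing to disagree about
    agree = sum(
        (baseline[a] == baseline[b]) == (candidate[a] == candidate[b])
        for a, b in pairs
    )
    return agree / len(pairs)


# Hypothetical docids and labels, for illustration only.
old_result = {1: "A", 2: "A", 3: "B", 4: "B"}
new_result = {1: "x", 2: "x", 3: "y", 4: "x"}
print(rand_index(old_result, new_result))  # 0.5: agrees on 3 of 6 pairs
```

What counts as an "acceptable" difference is left open in the thread; a score like this only gives a number to set a threshold on, and it would be read alongside timing measurements for the two implementations.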
Chi Liu
2014-Mar-18 06:36 UTC
[Xapian-devel] [GSoC 2014] About "Clustering of Search Results"
*Olly Betts* March 18, 2014, 5:07 a.m. <http://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/liuchi/5629499534213120#c5707702298738688> wrote:
> Thanks for your proposal, and sorry for not getting to it sooner. Overall
> I like the proposal.
>
> One thing I'm not clear on - are you intending to base your implementation
> on the existing branch, or to start afresh? And why are you taking that
> approach?

I intend to base my implementation on the existing branch. Doing so will let us understand more clearly what caused the disappointing efficiency of the old code. If there is time left at the end, we could implement a different clustering algorithm afresh, using that experience to avoid the same inefficiency. I have added this to my proposal.

> Do you have a plan for how to get human judgements to compare the
> algorithms? We've tried this before (with the snippets project in 2012)
> but sadly sending out a request to our mailing lists asking people to run
> through a comparison in a simple web UI resulted in hardly any uptake.
> This makes me a bit concerned about the number of "human judge
> effectiveness" entries in your project plan - I think we either need a
> plan for motivating people better, or a way to compare which doesn't
> require many judgements.

Large-scale human judgement is not necessary. What I mean by human judgement is creating several test cases by hand to help me check whether the clustering algorithm assigns documents to the correct groups. A test case includes a query, a list of search results, and a group assignment for those search results. With these we can measure effectiveness and modify the code promptly. The set of test cases doesn't need to be very large, and I can create it myself. I have added this to my proposal.

> The overlap with courses and exams isn't a big problem. A common
> workaround is to start coding early during the community bonding period -
> in your case that would mean you could even take 2 weeks off GSoC for your
> final exams, and still actually have made good progress by the mid-term.
> I don't know how busy you are with courses, etc during the community
> bonding period though.

Yes, I could start coding early.

> We ask students to submit a patch for Xapian so we can get a better feel
> for what their skills and aptitudes are. If you've already submitted a
> patch, could you give us a URL? If not, it's better if the patch is
> something in an area related to your project, but that's not a firm
> requirement - you can either find a bug in the tracker, take a look at
> http://trac.xapian.org/wiki/ProjectIdeas or work on a first step towards
> your project.

I regret that I did not notice this before. I will start on this right now.
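The hand-made test cases described in this message (a query, a list of search results, and a group assignment for those results) could be represented and scored roughly as follows. This is a sketch under stated assumptions: the class name `ClusterTestCase`, the choice of purity as the measure, and the example query and docids are all hypothetical and do not come from the proposal or from Xapian.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ClusterTestCase:
    query: str      # the search query the results came from
    results: list   # docids in the search result list
    expected: dict  # docid -> hand-assigned group label


def purity(case, predicted):
    """Share of results that carry the majority hand label of their cluster.

    `predicted` maps each docid to the cluster the algorithm chose.
    """
    hand_labels_by_cluster = {}
    for docid in case.results:
        cluster = predicted[docid]
        hand_labels_by_cluster.setdefault(cluster, []).append(case.expected[docid])
    # For each predicted cluster, count the documents with its most common hand label.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1]
        for labels in hand_labels_by_cluster.values()
    )
    return majority_total / len(case.results)


# Hypothetical test case: an ambiguous query with two hand-assigned groups.
case = ClusterTestCase(
    query="jaguar",
    results=[1, 2, 3, 4, 5],
    expected={1: "animal", 2: "animal", 3: "car", 4: "car", 5: "car"},
)
predicted = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1}  # docid 3 landed in the wrong cluster
print(purity(case, predicted))  # 0.8
```

Purity reaches 1.0 trivially when every document gets its own cluster, so a small test set like this would probably need a second check, such as comparing the number of clusters produced with the number of hand-assigned groups.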
Chi Liu
2014-Mar-24 05:57 UTC
[Xapian-devel] [GSoC 2014] About "Clustering of Search Results"
Olly Betts wrote:
> Thanks for the patch, but could you generate a diff for it (that's what
> people usually mean when they talk about a patch)? Just sending the
> changed files makes it harder to apply, since we have to know the exact
> version of the code you started from, and we can't read the patch first to
> check the changes look good. A diff is also typically much smaller. You
> can also put your changes to the Xapian git repo on a branch, push them to
> github, and open a "pull request".

I have generated the patch with git diff and uploaded it here:
https://github.com/incredibleliuchi/xapian_patch_earlyenglishstemmer

I have also opened a "pull request" on github.

Regards,
Liuchi
Olly Betts
2014-Mar-27 05:49 UTC
[Xapian-devel] [GSoC 2014] About "Clustering of Search Results"
On Mon, Mar 24, 2014 at 01:57:09PM +0800, Chi Liu wrote:
> I have generated the patch by git diff and upload here
> https://github.com/incredibleliuchi/xapian_patch_earlyenglishstemmer
>
> And I have also open a "pull request" on github.

Thanks, looks good. Now merged.

Cheers,
Olly