nijil yes
2011-Mar-28 08:26 UTC
[Xapian-devel] Draft Application for GSoC 11 - Text extraction libraries - please review
Proposal for Google Summer of Code 2011 (draft)

Applying organisation: Xapian
Name: Nijil.Y
E-mail address: nijil.y at gmail.com
IRC nickname: laserbled

Biography

I am a 4th-year Computer Science and Engineering undergraduate student at CUSAT (Cochin University of Science and Technology), India. I am interested in open source and search engines; cluster computing, HPC and AI are my other areas of interest.

* Analytical and detail-oriented, with strong programming skills; I work diligently on long, challenging assignments.
* Excellent interpersonal communication, time management, and problem resolution skills.
* Quick to get accustomed to new technologies.

Eligibility

I am fully eligible under Google's rules and will be available throughout the GSoC 2011 period.

Background Information

1: Have you taken part in GSoC and/or GHOP and/or GCI before? If so, in what role(s)? Tell us about how it went, and any areas you would have liked more help with.

Ans: I applied for GSoC 2010 to the Berkman Center, which is part of Harvard University. The application was for a news indexing project called Mediacloud. I was not selected, as there were more deserving candidates.

2: Please tell us about any previous experience you have with Xapian, or other systems for indexed text search.

Ans: I have long been interested in web search engines, indexing and crawling. I have written a small indexing and search application in Java which could index from multiple servers and extract content from HTML, PDF, office formats, PPT files, etc. I wanted to extend it to distributed indexing but was not able to continue. I have also built a small content extraction tool with Perl and CPAN modules. I have been reading and dissecting the Xapian code for the past two weeks and feel comfortable with it, as it is neat and organised. I have concentrated mostly on the Omega tool provided with the Xapian package and am looking forward to working with it.
I have been reading omindex.cc for a while now and am trying to learn more about the system.

3: Do you have previous experience with Free Software and Open Source other than Xapian?

Ans: I joined the Dreamwidth community a while back, which helped me learn a bit of Perl. Dreamwidth is a journaling tool. I have not been active there, though; I think I submitted one small patch. I am now part of LinuxPMI (Linux Process Migration Infrastructure), a community that maintains kernel patches for process migration. The group has barely 10 members, is known as the new openMosix, and was dormant when I joined a couple of months ago. I plan to work with them, as it is in line with my interests.

4: Do you have any other relevant prior experience (courses taken at college, hobbies, holiday jobs, etc)?

Ans: I have taken all the usual Computer Science courses. I have built an office automation system that maintains fees, student details and admission logs, as well as faculty details, for my university, which has around 3000 students. The platform was Visual Studio, ASP.NET and MySQL. My major project is a study of SSI virtualisation using code extraction, so that multithreaded applications can be parallelised and run on cluster nodes. The platform is C and QEMU, so thread migration and similar features are possible implementations. Completing the whole project will take some time, though. I also tinker with my system hardware and do the tweaking myself.

5: What development platforms, tools and methods do you prefer to use?

Ans: I have used Linux exclusively for the past year and a half; before that I was a partial user. I work on an Ubuntu 10.10 box. My tools are vim, grep, Geany, etc.

6: Have you previously been responsible (as an employee/volunteer/student/etc) for a project of a similar size?

Ans: The projects I have done so far are listed at the end of this document.

7: What timezone will you be in during the coding period?

Ans: UTC/GMT +5:30.
I will be in India during that time. My working hours can be flexible, as I am mostly a night owl, so this won't be an issue.

8: Will your Summer of Code project be the main focus of your time during the program?

Ans: Yes, absolutely.

9: How many hours a week will you realistically be able to devote to your project?

Ans: I would work on the project for at least 8-9 hours a day, which comes to roughly 48-54 hours over six days. I plan to take Sundays off, but if I fall behind schedule I am more than willing to work on Sundays too.

10: Are you applying for other projects in GSoC 2011? If so, with which organisations? (We don't regard you applying for other projects negatively, but we like to know so that we can plan for possible scenarios when assigning mentors).

Ans: No. I debated this a lot. Since I would not be able to do any preliminary work on other organisations' projects, I could not call myself committed to them, and I believe it takes more than just writing a proposal and submitting it. So I thought it better to stick to one organisation, work on that, and try my luck.

Project Title: Text-Extraction Libraries to index file-formats

Summary

Currently Omega has built-in support for HTML, plain text, and uncompressed AbiWord documents. Text from other files is extracted using external programs, which causes overhead. I plan to use libraries to replace these external programs, preserving and improving the list of supported formats. I also plan to add support for new file types: audio, email and, if possible, 3D file formats, archive formats, packaging formats and database files. Another nifty feature I plan to develop is a thumbnail generation system, which will store an entry for the thumbnail generated from a file so that it can be shown at retrieval time. The main aim is to stop external programs eating up CPU time, and to reduce that cost with the help of library services.

Why have you chosen this project?
I am a supporter of open source and am interested in indexing and search-engine-based services. I found Xapian interesting because it is something I could put to use daily, and working on it would mean doing something I love. Since I have conceptual knowledge of these areas, I am hoping that will come to my aid.

Benefits

The open source community would definitely benefit, and so would the Xapian user base. As of now the Omega indexer does not provide file-format support implicitly (built in); it does so explicitly, by making use of external programs, which increases its CPU footprint. For each format, one or more external filter programs need to be run. We can avoid that completely once this project is over. The user base is also likely to increase, as users will be getting an indexer with better and more robust file indexing support. It could then be fitted with a GUI and made into a desktop indexing and search application.

Project Details

A: The project I plan to undertake can be summarized as below:

1: Replace existing external filter programs with shared libraries.
2: Add new file-format support.
3: Add a thumbnail generation feature.
4: Add a testing framework.
5: Minimize the 'ignore' file list.

B: What is new or different about your approach which hasn't been done or wasn't possible before?

Currently we require external programs such as xpdf, unzip, xls2csv, catdoc, etc.,
which the user needs to have installed on their computer in order to index files in the corresponding formats. This has a problem: a new process must be started every time we come across a file in such a format, and the external program is then run to extract the contents and metadata. Each new process is extra load on the CPU and so increases the footprint of the Xapian system. The obvious option is to replace those filters and use shared libraries instead, which would take care of that issue. Adding support for new file formats would also increase the usability and flexibility of Xapian and Omega.

Another point is to build a testing framework which would test the effectiveness of the system and check that the indexing is flawless. A framework needs to be built, as currently there isn't one.

Thumbnail generation would be a nifty feature which would make the search UI much easier to use. The site could make use of JavaScript to display the thumbnail along with each search result.

The file types I plan to include are mainly zip archive formats, office formats and document formats. Secondary objectives would extend this to media formats, mainly to index their metadata, and to 3D and 2D files, as those contain metadata too. If all of the above is completed in time, the next possible option would be to extend indexing to programming languages and repositories, though that would require tinkering with other components.

C: Do you have any preliminary findings or results which suggest that your approach is possible and likely to succeed?

The approach was suggested on the ideas page and seems perfectly sound. The likelihood of success is very high, and few things could possibly go wrong.

Project Timeline

26th April - 12th May: Familiarization with the mentor and the codebase of the present Xapian system.
Study of the patterns and conventions used. Specific study of the Omega application and its inter-dependencies. Discussing the drawbacks of the current system. Improving C and Perl coding skills.

13th May - 22nd May: Finding the required C libraries, comparing them, and fixing on the best possible one for each file type, taking minimum CPU usage into account. The requirements of the testing framework are also discussed. Review of the errors that occur in the present system. Full blueprint of exactly what is to be done.

23rd May - 24th May: Current project status and goals reviewed by mentor. Timeline is corrected if needed.

25th May - 15th June: Implementation of a skeletal system, checking its integration with Omega, and testing whether it addresses the basic problems faced earlier. Support for basic formats is introduced. Errors are logged and suggestions to improve the system are taken.

17th June - 2th June: Solving bugs and errors and finishing a basic working system. Interacting with the development team to see whether it works well with the other modules and is error free. System testing and test runs done. Results evaluated.

2th June - 26th June: Adding a few extra formats and testing as mentioned above. Implementation of thumbnail generation.

26th June - 28th June: Review by mentor and suggestions by mentor on what should be added to the system. Prepare for adding tweaks and performance improvements.

30th June - 6th July: Incorporating extra changes and adding tweaks and performance improvements.

8th July - 10th July: Documentation of the work so far, complete with an analysis of the current and previous system. Getting ready for the mid-term evaluation.

12th July: Submission for mid-term evaluation.

16th July - 19th July: Discussion of advanced features that need to be implemented. Exception handling and fault tolerance issues discussed. This assumes that 85% of the work is completed and that most of the file formats are supported, including archives, office and document formats, and packages.
20th July - 6th August: Implementation of a thorough test case framework.

6th August - 10th August: Extensive testing of the new system, run under all conditions. Logging of errors, bugs and exceptions.

10th August - 13th August: Final system run; ready for upload.

13th August - 18th August: Buffer time in case something still needs to be done or I lose a few days.

18th August - 20th August: Submission of final evaluations to Google by both students and mentors. Wonderful time. Take a few days to chill out, then join the team for the remaining journey to achieve our ultimate goal.

Previous Discussion of your Project

I have discussed my project extensively with ojwb. I have also contacted Jean-Francois Dockes, who manages Recoll (an application using Xapian as a backend for searching).

Projects done till now

* Project: Office Automation
Client: School of Engineering office, CUSAT
Team Size: 4

Cochin University of Science & Technology (CUSAT) is a government-owned autonomous university in Kochi (Cochin), Kerala, India. The university awards degrees in various fields of engineering and allied subjects at the undergraduate, postgraduate and doctoral levels. Nearly 1,000 students enroll yearly in various areas of undergraduate and postgraduate study at this university (totalling 4000).

The Office Automation project was to develop a web application to maintain the day-to-day activities of the School of Engineering office, CUSAT. The web application had to enable the management of student details such as fee management, issuing of certificates, staff management, etc. through a single gateway.

Platform: ASP.NET/C#, MySQL Server and Windows Server 2003

Activities & Responsibility
* Requirement Analysis
* Prototype Development
* Database Design
* Environment Set Up
* Development
* Unit Testing

* Project: Generic LAN Search
Team Size: 2

To develop a generic LAN search engine which can crawl the local servers and index files.
Activities & Responsibility
* System Study
* Finding Best Algorithms
* Programming

Platform: Java, MySQL Server, JSP

* Project: Single System Image over Virtualization
Client: FISAT CHPC (Center for High Performance Computing)
Team Size: 2

Description: An SSI over virtualization, for implementation on a cluster or supercomputer, which provides transparency to the application running on the server. It can be used to achieve parallel processing with minimal change to the client programs, to utilize idle CPU capacity, to parallelize heavily threaded applications, etc. The project is still ongoing and is not likely to be finished any time soon.

Platform: C/C++, Assembly, QEMU (VM), Linux

Skill Set

Languages: Java, C/C++, HTML, Perl, ASP.NET/C#
Database: MySQL
Operating system: Windows, Linux
Software Packages: Any software which follows usual conventions

Personal Dossier

Date of Birth: 26 March 1990
Father's Name: Yesudasan A
Languages known: English, Malayalam, Hindi
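
The idea under Project Details (B), extracting text inside the indexer process through a library call instead of starting an external program for every file, can be illustrated with a minimal sketch. This is only an illustration under stated assumptions: it is written in Python rather than Omega's C++, every function name and the extension table are hypothetical, and the sample container is just a zip holding a content.xml member, not a valid office document. Files whose extension has no extractor are skipped, which mirrors the 'ignore' list in spirit.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def extract_plain_text(data: bytes) -> str:
    """Decode a plain-text file, replacing undecodable bytes."""
    return data.decode("utf-8", errors="replace")


def extract_zipped_xml(data: bytes, member: str = "content.xml") -> str:
    """Read one XML member of a zip container in-process and join its text nodes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read(member))
    # itertext() yields every text node in document order.
    pieces = (piece.strip() for piece in root.itertext())
    return " ".join(piece for piece in pieces if piece)


# Hypothetical dispatch table: file extension -> in-process extractor.
EXTRACTORS = {
    ".txt": extract_plain_text,
    ".odt": extract_zipped_xml,
}


def extract(filename: str, data: bytes):
    """Return the extracted text, or None when no extractor is registered."""
    dot = filename.rfind(".")
    ext = filename[dot:].lower() if dot != -1 else ""
    extractor = EXTRACTORS.get(ext)
    return extractor(data) if extractor else None


def make_sample_container(paragraphs) -> bytes:
    """Build a tiny zip with a content.xml member (demo data, not real ODF)."""
    body = "".join(f"<p>{escape(text)}</p>" for text in paragraphs)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content.xml", f"<doc>{body}</doc>")
    return buffer.getvalue()


if __name__ == "__main__":
    sample = make_sample_container(["Xapian", "Omega"])
    print(extract("report.odt", sample))
    print(extract("model.blend", b"\x00"))
```

No child process is created at any point: the archive is opened and parsed from memory. Because each extractor is an ordinary function, it can also be checked directly against known input, which is the kind of check the proposed testing framework would automate.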