I'd like to use Ferret to build an IMAP indexer and search utility, but want to check first to see if anyone else is working on this and offer my help. Anyone?

Also, if you could provide any helpful pointers on indexing directories via Ferret, it'll be very much appreciated. I'm a Lucene nuby.

Thanks!
John

--
Posted via http://www.ruby-forum.com/.

John Wells wrote:
> I'd like to use ferret to build an imap indexer and search utility, but want to check first to see if anyone else is working on this and offer my help. Anyone?

This could be really challenging if you want it to work for multiple IMAP servers. If you target a specific one, though, you might have better luck. The biggest issue I see is with message UIDs: although the IMAP RFC implies that a message's UID always stays the same, my understanding is that this doesn't hold on all implementations. Also, it may be tough to keep track of all changes to a user's inbox. If there's a way to communicate with the IMAP server via an API specific to that server, especially if there's a hook that can be called on updates to the message store, that would be ideal.

Good luck!
Jen

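For reference, RFC 3501 only promises that a UID keeps identifying the same message while the mailbox's UIDVALIDITY value is unchanged, so an indexer can store that value next to its index and discard cached UIDs when it changes. The following is a minimal sketch of that check in plain Ruby. The method name `plan_sync` and the layout of the `stored` hash are invented for illustration, and it assumes the caller has already read the current UIDVALIDITY and UID list from the server.

```ruby
# Decide what to (re)index for one mailbox.
#
# stored      - Hash remembered from the previous run (hypothetical layout),
#               e.g. { :uidvalidity => 1136900000, :uids => [1, 2, 3] }, or nil
# uidvalidity - UIDVALIDITY value the server reports now
# server_uids - UIDs currently present in the mailbox
#
# Returns a Hash with :rebuild (drop everything indexed for this mailbox
# first), :add (UIDs to index) and :delete (UIDs to remove from the index).
def plan_sync(stored, uidvalidity, server_uids)
  if stored.nil? || stored[:uidvalidity] != uidvalidity
    # First run, or the server has invalidated all previously seen UIDs.
    return { :rebuild => true, :add => server_uids.sort, :delete => [] }
  end

  known = stored[:uids]
  { :rebuild => false,
    :add     => (server_uids - known).sort,   # new since the last run
    :delete  => (known - server_uids).sort }  # expunged since the last run
end
```

The server reports UIDVALIDITY in its response to SELECT or EXAMINE. This only covers additions and deletions; flag changes would need separate handling.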
jennyw wrote:
> This could be really challenging if you want it to work for multiple IMAP servers. If you target a specific one, though, you might have better luck. The biggest issue I see is that the UID of messages, although implied to always be the same by the IMAP RFC, my understanding is that it's not always the same on all implementations. Also, it may be tough to keep track of all changes to a user's inbox. If there's a way to communicate with the IMAP server via an API specific to that server, especially if there's a hook that can be called on updates to the message store, that would be ideal.

Thanks Jen. I know Zoe (http://www.zoe.nu) uses Lucene to index IMAP dirs, but I'm uncertain how it goes about it...that might be a place to start.

Thanks!

On Jan 11, 2006, at 7:17 AM, John Wells wrote:
> jennyw wrote:
>> This could be really challenging if you want it to work for multiple IMAP servers. If you target a specific one, though, you might have better luck. The biggest issue I see is that the UID of messages, although implied to always be the same by the IMAP RFC, my understanding is that it's not always the same on all implementations. Also, it may be tough to keep track of all changes to a user's inbox. If there's a way to communicate with the IMAP server via an API specific to that server, especially if there's a hook that can be called on updates to the message store, that would be ideal.
>
> Thanks Jen. I know Zoe (http://www.zoe.nu) uses Lucene to index IMAP dirs, but I'm uncertain how it goes about it...that might be a place to start. Thanks!

ZOE uses the IMAP (and POP, and other) networking protocols to read e-mail and then indexes it in all sorts of intense and sophisticated ways. I'm not sure what Java library ZOE uses for this, but knowing its creator (we met once a couple of years ago), he probably built his own IMAP API from scratch using sockets.

net/imap is built into Ruby itself, and is probably the way to start what you're doing.

Erik

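As a rough illustration of the net/imap starting point, the sketch below walks one mailbox by UID and hands each envelope to a block, fetching in batches. The method name `each_envelope` and the batch size are made up for this example. It relies only on `examine`, `uid_search` and `uid_fetch` from `Net::IMAP`, and assumes the connection is already logged in.

```ruby
# Yield [uid, envelope] for every message in +mailbox+, fetching in batches
# so that a large mailbox is not pulled down in one huge response.
# +imap+ is expected to be a logged-in Net::IMAP connection.
#
# Usage (host name and credentials are placeholders):
#   require 'net/imap'
#   imap = Net::IMAP.new('mail.example.com')
#   imap.login('user', 'secret')
#   each_envelope(imap, 'INBOX') { |uid, env| puts "#{uid}: #{env.subject}" }
def each_envelope(imap, mailbox, batch_size = 100)
  imap.examine(mailbox)                  # read-only open, flags are left alone
  uids = imap.uid_search(["ALL"])
  uids.each_slice(batch_size) do |batch|
    # uid_fetch can return nil when nothing matches, hence the `|| []`.
    (imap.uid_fetch(batch, ["UID", "ENVELOPE"]) || []).each do |data|
      yield data.attr["UID"], data.attr["ENVELOPE"]
    end
  end
end
```

Working with UIDs rather than message sequence numbers means the stored identifier can later be used to fetch the same message again, subject to the UIDVALIDITY caveat.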
Erik Hatcher wrote:
> ZOE uses the IMAP (and POP, and others) networking protocols to read e-mail and then to index it in all sorts of intense and sophisticated ways. I'm not sure what Java library ZOE uses for this, but knowing the creator of it (we met once a couple of years ago) he probably built his own IMAP API from scratch using sockets.

I'm pretty sure ZOE downloads all e-mail from the server and into its own message store. You then point your e-mail client to ZOE as your server. Last I checked, ZOE only supported POP clients, though.

Jen

On Jan 12, 2006, at 12:49 PM, jennyw wrote:
> Erik Hatcher wrote:
>> ZOE uses the IMAP (and POP, and others) networking protocols to read e-mail and then to index it in all sorts of intense and sophisticated ways. I'm not sure what Java library ZOE uses for this, but knowing the creator of it (we met once a couple of years ago) he probably built his own IMAP API from scratch using sockets.
>
> I'm pretty sure ZOE downloads all e-mail from the server and into its own message store. You then point your e-mail client to ZOE as your server. Last I checked, ZOE only supported POP clients, though.

I guess it's a bit confusing which aspect we're talking about here. ZOE is both a client and a server. ZOE is both a POP and IMAP _client_, but also a POP server as well as an SMTP server. I think it also serves as an IMAP server, though I'm not entirely sure.

<http://guests.evectors.it/zoe/>

Pretty snazzy, and its use of Lucene is uncanny.

The main point here is that ZOE does speak IMAP and can grab mails from it.

Erik

Erik Hatcher wrote:
> The main point here is that ZOE does speak IMAP and can grab mails from it.

Yep, and using net/imap in combination with Ferret is working very well so far. What a great project...thanks!

John

John Wells wrote:
> Yep, and using net/imap in combination with Ferret is working very well so far.

Correction...was working fine. It seems to freeze up when the index directory size hits around 178 megs (I'm indexing a 2.2 GB mail account). Has anyone else experienced any problems with large indexes?

Attaching strace to the process shows no activity at all, yet CPU utilization by the process is at 97.6%.

Any ideas? Btw, the index it was able to create works great...I can't wait to have the whole 2 GB indexed!

Thanks,
John

Here's the stack trace when I control+c out of it:

/usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/analysis/tokenizers.rb:49:in `scan_until': Interrupt
    from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/analysis/tokenizers.rb:49:in `next'
    from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/analysis/token_filters.rb:21:in `next'
    from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/analysis/token_filters.rb:52:in `next'
    from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/document_writer.rb:122:in `invert_document'
    from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/document_writer.rb:88:in `each'
    from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/document_writer.rb:88:in `invert_document'
    from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/document_writer.rb:58:in `add_document'
    from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index_writer.rb:158:in `add_document'
    from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index.rb:270:in `<<'
    from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index.rb:238:in `synchronize'
    from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index.rb:238:in `<<'
    from /home/jb/ruby/fermail.rb:43:in `index_it'
    from /home/jb/ruby/fermail.rb:18:in `each'
    from /home/jb/ruby/fermail.rb:18:in `index_it'
    from /home/jb/ruby/fermail.rb:70
    from /home/jb/ruby/fermail.rb:64:in `each'
    from /home/jb/ruby/fermail.rb:64

I removed the message it was hanging on, but it's still stopping at 178 meg, no matter what I do. Any ideas what might be causing this? I have plenty of disk space...

Thanks,
John

Hi John,

I'm not exactly sure what is causing your problem. It may just be that the 178 MB mark is the point where you have 10,000 documents being merged or something. Do you know how many documents are in the index at that point?

Anyway, I don't really have time to look into it right now, as I think most of these types of problems will be sorted out when I finally release the new version of Ferret backed by cFerret. I can't say when that will be, but hopefully it won't be too far away. Sorry to keep everyone waiting.

Cheers,
Dave

On 1/14/06, John Wells <lists at sourceillustrated.com> wrote:
> I removed the message it was hanging on, but it's still stopping at 178 meg, no matter what I do. Any ideas what might be causing this? I have plenty of disk space...
>
> Thanks,
> John

David Balmain wrote:
> I'm not exactly sure what is causing your problem. It may just be that the 178 MB mark is the point where you have 10,000 documents being merged or something. Do you know how many documents are in the index at that point? Anyway, I don't really have time to look into it right now as I think most of these types of problems will be sorted out when I finally release the new version of Ferret backed by cFerret. I can't say when that will be but hopefully it won't be too far away.

Hello Dave,

It stops consistently at 2902 documents, but when I disabled fetching of the email body it went beyond this. Strange error indeed. I'm going to continue trying to figure out what's going on.

Any chance you could reenable the cFerret svn repository on your server? I tried to download it per the instructions but received "connection refused".

Thanks for your help!
John

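Since indexing gets further once the body is left out, one thing that could be tried is to cap and clean the body text before it is handed to the indexer. The sketch below is only that, a sketch: the helper name `prepare_body`, the size cap and the base64 heuristic are all assumptions, and the thread never confirms that oversized or encoded bodies are the actual cause. It is written for a current Ruby (`String#scrub` does not exist on 1.8).

```ruby
# Hypothetical helper: make a raw BODY[TEXT] string safer to index.
MAX_BODY_BYTES = 200_000   # arbitrary cap, tune to taste

def prepare_body(raw, max_bytes = MAX_BODY_BYTES)
  return "" if raw.nil?

  # Cap the size first so one huge message cannot dominate the run.
  text = raw.byteslice(0, max_bytes)

  # Treat the bytes as UTF-8 and replace anything invalid (including a
  # multibyte character cut in half by the cap) with a space.
  text = text.force_encoding("UTF-8").scrub(" ")

  # BODY[TEXT] includes encoded attachments; drop lines that look like
  # base64 (60 or more characters drawn only from the base64 alphabet).
  text.gsub(/^[A-Za-z0-9+\/=]{60,}\r?$/, "")
end
```

A call such as `prepare_body(msg.attr["BODY[TEXT]"])` would replace the raw body in the indexing loop. Parsing the MIME structure and indexing only the text parts would be more accurate than this line-based heuristic.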
> Any chance you could reenable the cFerret svn repository on your server?
> Tried to download per instructions but received connection refused.

Done.

David Balmain wrote:
>> Any chance you could reenable the cFerret svn repository on your server?
>> Tried to download per instructions but received connection refused.

David,

Thanks. Btw, I'm still very interested in understanding what's causing my current problem. I'd like to take a stab at it myself, but would ask for a pointer on getting started. What approach would you take in tracking this problem down? I thought about running the script in the debugger but man, the added overhead would've caused it to run forever. Is there any debug logging I can enable in Ferret? Anything else you could suggest?

Thanks for the great work and the help!
John

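Without a debugger, a cheap way to narrow this kind of hang down is to time each `index << doc` call and log the message it belonged to. The wrapper below is a sketch using only Ruby's standard `timeout` library. The name `timed_add` and the thresholds are made up, and it is not a Ferret feature. Whether the timeout can actually interrupt a scan stuck inside `scan_until` depends on the Ruby version, so treat the "STUCK" branch as best effort.

```ruby
require 'timeout'

# Run the block (typically one `index << doc`), and report on +log+ when it
# is slow or does not finish. Returns true if the block completed.
def timed_add(label, log = $stderr, warn_after = 2.0, give_up_after = 60)
  started = Time.now
  Timeout.timeout(give_up_after) { yield }
  elapsed = Time.now - started
  log.puts format("SLOW  %s took %.2fs", label, elapsed) if elapsed > warn_after
  true
rescue Timeout::Error
  log.puts "STUCK #{label}: gave up after #{give_up_after}s"
  false
end
```

In the indexing loop this might be called as `timed_add("uid #{uid}, #{size} bytes") { index << doc }`, so that the log names the message (and its size) whenever a document takes unusually long.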
John,

I am very interested in the Ruby-Ferret IMAP search tool. Did you already manage to index 2 GB of emails? Are you willing to share your code so I can also search through my email? It's not yet 2 GB but keeps on growing :)

Joost

Joost wrote:
> John, I am very interested in the Ruby-Ferret IMAP search tool. Did you already manage to index 2Gb of emails? Are you willing to share your code so I can also search thru my email? It's not yet 2Gb but keeps on growing :)

Hi Joost,

Well, it's certainly not perfect code...more of a dirty hack to try it out. And, as noted, if I try to index the body it doesn't fare very well. That said, I'd be happy to share it. I'll post it later tonight when I have access to it.

Thanks,
John

Ok...it's neither pretty nor clean nor idiomatic Ruby (I'm a nuby ;), but as a dirty hack it works (unless you fetch the body, that is). Let me know if you have any questions:

#!/usr/bin/env ruby
require 'rubygems'
require 'ferret'
include Ferret
include Ferret::Document
require 'net/imap'

index = Index::Index.new(:path => "/path/to/index/goes/here")
$count = 0
$retry = 0   # start at 0, otherwise the first "1 + $retry" below fails on nil

$imap = Net::IMAP.new('server_ip_address_goes_here', 143, false)
$imap.login('username_goes_here', 'password_goes_here')
print $imap.examine("INBOX")

# Index every message in the currently selected mailbox.
def index_it(imapobj, index, box)
  imapobj.search(["ALL"]).each do |message_id|
    begin
      msg = imapobj.fetch(message_id, "(UID RFC822.SIZE ENVELOPE BODY[TEXT])")[0]
      envelope = msg.attr["ENVELOPE"]
      body = msg.attr["BODY[TEXT]"]
      uid = msg.attr["UID"]
      size = msg.attr["RFC822.SIZE"]
      date = envelope.date
      subject = envelope.subject
      if envelope.from != nil and envelope.from.size > 0
        from = envelope.from[0].name
      end
      sender = envelope.sender
      to = envelope.to
      in_reply_to = envelope.in_reply_to

      # One Ferret document per message; only the body is not stored.
      doc = Document.new
      doc << Field.new("id", message_id, Field::Store::YES, Field::Index::TOKENIZED)
      doc << Field.new("body", body, Field::Store::NO, Field::Index::TOKENIZED)
      doc << Field.new("from", from, Field::Store::YES, Field::Index::TOKENIZED)
      doc << Field.new("subject", subject, Field::Store::YES, Field::Index::TOKENIZED)
      doc << Field.new("date", date, Field::Store::YES, Field::Index::TOKENIZED)
      doc << Field.new("uid", uid, Field::Store::YES, Field::Index::TOKENIZED)
      doc << Field.new("size", size, Field::Store::YES, Field::Index::TOKENIZED)
      doc << Field.new("sender", sender, Field::Store::YES, Field::Index::TOKENIZED)
      doc << Field.new("in_reply_to", in_reply_to, Field::Store::YES, Field::Index::TOKENIZED)
      doc << Field.new("mailbox", box, Field::Store::YES, Field::Index::UNTOKENIZED)
      index << doc

      $count = $count + 1
      print "#{$count} : #{from} <==> #{subject}\n"
      $retry = 0   # success, so reset the retry counter
    rescue => detail
      print detail
      print detail.backtrace.join("\n")
      print "\nRetrying\n"
      $retry = 1 + $retry
      if $retry < 20
        retry   # re-run the begin block for the same message
      else
        print "Retry threshold reached. Exiting...\n"
        exit!(99)
      end
    end
  end
end

$imap.examine("INBOX")
# Walk every mailbox on the server, skipping the customflags folder.
$imap.list("", "*").each do |box|
  name = box.name
  print "NAME: #{name}:#{box.class}\n"
  if name and name != "" and name !~ /customflags/
    begin
      $imap.select(name)
      index_it($imap, index, name)
    rescue => detail
      print "ERROR: " + detail.message + "\n"
    end
  end
end

Hi John,

Thanks for the quick reaction. I'm a nuby too :) At the moment I haven't got the time to look at the code, but when I have, I certainly will. I hope there is a new version of Ferret out by then, so it'll work completely and fast.

Thanks,
Joost

Joost wrote:
> Hi John,
>
> Thanks for the quick reaction. I'm a nuby too :) At the moment I haven't got the time to look at the code.. when I have I'll certainly do. I hope there is a new version of Ferret out by then..so it'll work completely & fast.

Ok... ;)

Btw, that code only creates the index. You'll then have to implement code to search it, and you'll probably want it to dig out the UID for you. Here's a sample of a search:

############################################
#!/usr/bin/env ruby
require 'rubygems'
require 'ferret'
include Ferret
require 'net/imap'

50.times { print "-" }; print "\n"
index = Index::Index.new(:path => "/path/to/index/goes/here")
# Phrase search on the body field; the phrase is the first command-line argument.
index.search_each('body:"' + ARGV[0] + '"') do |doc, score|
  puts "Document #{doc} found with a score of #{score}"
  print index[doc]["from"] + " <--> " + index[doc]["subject"] + " <--> " + index[doc]["uid"] + "\n"
end
50.times { print "-" }; print "\n"
############################################

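Once a search has returned the stored "mailbox" and "uid" fields, those can be used to pull the full messages back from the server. The sketch below shows one way to do that with `Net::IMAP#uid_fetch`. The method name `fetch_hits` and the shape of the `hits` array are assumptions made for this example, and it presumes the fields were stored as in the indexing script and that the UIDs are still valid on the server.

```ruby
# hits: array of hashes such as { "mailbox" => "INBOX", "uid" => "4711" },
#       i.e. the stored "mailbox" and "uid" fields of each search hit.
# imap: a logged-in Net::IMAP connection.
#
# Returns a Hash mapping [mailbox, uid] to the raw RFC 822 message text.
def fetch_hits(imap, hits)
  messages = {}
  hits.group_by { |hit| hit["mailbox"] }.each do |mailbox, group|
    imap.examine(mailbox)                       # read-only, flags stay untouched
    uids = group.map { |hit| hit["uid"].to_i }  # stored fields come back as strings
    # uid_fetch can return nil when none of the UIDs exist any more.
    (imap.uid_fetch(uids, "RFC822") || []).each do |data|
      messages[[mailbox, data.attr["UID"]]] = data.attr["RFC822"]
    end
  end
  messages
end
```

Grouping by mailbox keeps the number of EXAMINE commands down to one per mailbox rather than one per hit.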