thr3ads.net - Ferret talk - [Ferret-talk] Ferret with IMAP dirs [Jan 2006]

John Wells

2006-Jan-10 13:55 UTC

[Ferret-talk] Ferret with IMAP dirs

I'd like to use Ferret to build an IMAP indexer and search utility, but
I want to check first whether anyone else is working on this and offer
my help. Anyone?

Also, if you could provide any helpful pointers on indexing directories
via Ferret, it would be very much appreciated. I'm a Lucene nuby.

Thanks!
John

-- 
Posted via http://www.ruby-forum.com/.

jennyw

2006-Jan-10 20:27 UTC

John Wells wrote:
> I'd like to use ferret to build an imap indexer and search utility, but
> want to check first to see if anyone else is working on this and offer
> my help. Anyone?

This could be really challenging if you want it to work with multiple
IMAP servers. If you target a specific one, though, you might have
better luck. The biggest issue I see is message UIDs: the IMAP RFC
implies that a message's UID always stays the same, but my understanding
is that this does not hold on all implementations. Also, it may be
tough to keep track of all changes to a user's inbox. If there's a way
to communicate with the IMAP server via an API specific to that server,
especially if there's a hook that can be called on updates to the
message store, that would be ideal.

Good luck!

Jen
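
A note on the UID concern above: RFC 3501 only promises that a message's UID stays fixed while the mailbox's UIDVALIDITY value is unchanged, so an indexer may want to record both. The sketch below is one possible way to key index entries under that assumption. `MessageKey` and `uids_reusable?` are hypothetical names, not part of net/imap or Ferret, and reading the current UIDVALIDITY from the server is left out.

```ruby
# Hypothetical key for one indexed message. The UID alone is not enough:
# it is only meaningful together with the mailbox name and the mailbox's
# UIDVALIDITY value at the time the message was indexed.
MessageKey = Struct.new(:mailbox, :uidvalidity, :uid) do
  def to_s
    "#{mailbox}/#{uidvalidity}/#{uid}"
  end
end

# True when UIDs recorded under `stored_uidvalidity` can still be trusted.
# When this returns false, the mailbox should be re-indexed from scratch.
def uids_reusable?(stored_uidvalidity, current_uidvalidity)
  !stored_uidvalidity.nil? && stored_uidvalidity == current_uidvalidity
end
```

Storing `MessageKey#to_s` in an untokenized field would let a later run look a message up again, or notice that the server has renumbered the mailbox.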

John Wells

2006-Jan-11 12:17 UTC

jennyw wrote:
> This could be really challenging if you want it to work for multiple
> IMAP servers. If you target a specific one, though, you might have
> better luck. The biggest issue I see is that the UID of messages,
> although implied to always be the same by the IMAP RFC, my understanding
> is that it's not always the same on all implementations. Also, it may be
> tough to keep track of all changes to a user's inbox. If there's a way
> to communicate with the IMAP server via an API specific to that server,
> especially if there's a hook that can be called on updates to the
> message store, that would be ideal.

Thanks Jen. I know Zoe (http://www.zoe.nu) uses Lucene to index IMAP
dirs, but I'm uncertain how it goes about it... that might be a place to
start. Thanks!

Erik Hatcher

2006-Jan-12 13:37 UTC

On Jan 11, 2006, at 7:17 AM, John Wells wrote:
> jennyw wrote:
>> This could be really challenging if you want it to work for multiple
>> IMAP servers. If you target a specific one, though, you might have
>> better luck. The biggest issue I see is that the UID of messages,
>> although implied to always be the same by the IMAP RFC, my
>> understanding is that it's not always the same on all implementations.
>> Also, it may be tough to keep track of all changes to a user's inbox.
>> If there's a way to communicate with the IMAP server via an API
>> specific to that server, especially if there's a hook that can be
>> called on updates to the message store, that would be ideal.
>
> Thanks Jen. I know Zoe (http://www.zoe.nu) uses Lucene to index IMAP
> dirs, but I'm uncertain how it goes about it...that might be a
> place to start. Thanks!

ZOE uses the IMAP (and POP, and others) networking protocols to read
e-mail and then to index it in all sorts of intense and sophisticated
ways. I'm not sure what Java library ZOE uses for this, but knowing
the creator of it (we met once a couple of years ago) he probably
built his own IMAP API from scratch using sockets.

net/imap is built into Ruby itself, and is probably the way to start
what you're doing.

	Erik
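
As a starting point with net/imap, messages can be pulled in batches rather than one at a time. This is only a sketch: it assumes an already connected and logged-in `Net::IMAP` object, and `each_envelope` and its batch size are hypothetical. Only `examine`, `uid_search` and `uid_fetch` are actual net/imap calls.

```ruby
# Yields [uid, envelope] for every message in `mailbox`.
# `imap` is expected to behave like a logged-in Net::IMAP instance.
def each_envelope(imap, mailbox, batch_size = 100)
  imap.examine(mailbox)  # open the mailbox read-only
  imap.uid_search(["ALL"]).each_slice(batch_size) do |uids|
    # uid_fetch can return nil when nothing matched, hence the `|| []`
    (imap.uid_fetch(uids, ["ENVELOPE"]) || []).each do |msg|
      yield msg.attr["UID"], msg.attr["ENVELOPE"]
    end
  end
end

# Usage, given a logged-in connection in `imap`:
#   each_envelope(imap, "INBOX") { |uid, env| puts "#{uid}: #{env.subject}" }
```

Working with UIDs instead of sequence numbers keeps the identifiers usable across sessions, subject to the UIDVALIDITY caveat raised earlier in the thread.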

jennyw

2006-Jan-12 17:49 UTC

Erik Hatcher wrote:
> ZOE uses the IMAP (and POP, and others) networking protocols to read
> e-mail and then to index it in all sorts of intense and sophisticated
> ways. I'm not sure what Java library ZOE uses for this, but knowing
> the creator of it (we met once a couple of years ago) he probably
> built his own IMAP API from scratch using sockets.

I'm pretty sure ZOE downloads all e-mail from the server into its
own message store. You then point your e-mail client to ZOE as your
server. Last I checked, ZOE only supported POP clients, though.

Jen

Erik Hatcher

2006-Jan-12 19:00 UTC

On Jan 12, 2006, at 12:49 PM, jennyw wrote:
> Erik Hatcher wrote:
>
>> ZOE uses the IMAP (and POP, and others) networking protocols to read
>> e-mail and then to index it in all sorts of intense and sophisticated
>> ways. I'm not sure what Java library ZOE uses for this, but knowing
>> the creator of it (we met once a couple of years ago) he probably
>> built his own IMAP API from scratch using sockets.
>
> I'm pretty sure ZOE downloads all e-mail from the server and into its
> own message store. You then point your e-mail client to ZOE as your
> server. Last I checked, ZOE only supported POP clients, though.

I guess it's a bit confusing which aspect we're talking about here.
ZOE is both a client and a server. ZOE is both a POP and IMAP
_client_, but also a POP server as well as an SMTP server. I think
it also serves as an IMAP server, though I'm not entirely sure.

	<http://guests.evectors.it/zoe/>

Pretty snazzy, and its use of Lucene is uncanny.

The main point here is that ZOE does speak IMAP and can grab mail
from it.

	Erik

John Wells

2006-Jan-13 14:31 UTC

Erik Hatcher wrote:
> The main point here is that ZOE does speak IMAP and can grab mails
> from it.

Yep, and using net/imap in combination with Ferret is working very well
so far.

What a great project...thanks!

John

John Wells

2006-Jan-14 04:11 UTC

John Wells wrote:
> Yep, and using net/imap in combination with Ferret is working very well
> so far.

Correction... it was working fine. It seems to freeze up when the index
directory size hits around 178 MB (I'm indexing a 2.2 GB mail account).

Has anyone else experienced any problems with large indexes? Attaching
strace to the process shows no activity at all, yet CPU utilization by
the process is at 97.6%.

Any ideas?

Btw, the index it was able to create works great... I can't wait to have
the whole 2 GB indexed!

Thanks,
John

John Wells

2006-Jan-14 04:41 UTC

Here's the stack trace when I Ctrl+C out of it:

/usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/analysis/tokenizers.rb:49:in `scan_until': Interrupt
        from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/analysis/tokenizers.rb:49:in `next'
        from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/analysis/token_filters.rb:21:in `next'
        from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/analysis/token_filters.rb:52:in `next'
        from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/document_writer.rb:122:in `invert_document'
        from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/document_writer.rb:88:in `each'
        from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/document_writer.rb:88:in `invert_document'
        from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/document_writer.rb:58:in `add_document'
        from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index_writer.rb:158:in `add_document'
        from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index.rb:270:in `<<'
        from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index.rb:238:in `synchronize'
        from /usr/lib/ruby/gems/1.8/gems/ferret-0.3.2/lib/ferret/index/index.rb:238:in `<<'
        from /home/jb/ruby/fermail.rb:43:in `index_it'
        from /home/jb/ruby/fermail.rb:18:in `each'
        from /home/jb/ruby/fermail.rb:18:in `index_it'
        from /home/jb/ruby/fermail.rb:70
        from /home/jb/ruby/fermail.rb:64:in `each'
        from /home/jb/ruby/fermail.rb:64


John Wells

2006-Jan-14 14:56 UTC

I removed the message it was hanging on, but it's still stopping at 178
MB, no matter what I do. Any ideas what might be causing this? I have
plenty of disk space...

Thanks,
John

David Balmain

2006-Jan-14 23:28 UTC

Hi John,

I'm not exactly sure what is causing your problem. It may just be that
the 178 MB mark is the point where you have 10,000 documents being
merged or something. Do you know how many documents are in the index at
that point? Anyway, I don't really have time to look into it right now,
as I think most of these types of problems will be sorted out when I
finally release the new version of Ferret backed by cFerret. I can't
say when that will be, but hopefully it won't be too far away.

Sorry to keep everyone waiting.

Cheers,
Dave

On 1/14/06, John Wells <lists at sourceillustrated.com> wrote:
> I removed the message it was hanging on, but it's still stopping at 178
> meg, no matter what I do. Any ideas what might be causing this? I have
> plenty of disk space...
>
> Thanks,
> John
>
> --
> Posted via http://www.ruby-forum.com/.
> _______________________________________________
> Ferret-talk mailing list
> Ferret-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/ferret-talk
>

John Wells

2006-Jan-15 05:57 UTC

David Balmain wrote:
> I'm not exactly sure what is causing your problem. It may just be that
> the 178Mgb mark is the point where you have 10,000 documents being
> merged or something. Do you know how documents are in the index at
> that point? Anyway, I don't really have time to look into it right now
> as I think most of these types of problems will be sorted out when I
> finally release the new version of Ferret backed by cFerret. I can't
> say when that will be but hopefully it won't be too far away.

Hello Dave,

It stops consistently at 2902 documents, but when I disabled fetching of
the email body it went beyond this. Strange error indeed. I'm going to
continue trying to figure out what's going on.

Any chance you could re-enable the cFerret svn repository on your server?
I tried to download it per the instructions but received "connection refused".

Thanks for your help!

John
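
The thread never pins down the cause, but since the trace ends in the tokenizer and skipping the body avoids the hang, one thing to try is limiting what the tokenizer is given. `BODY[TEXT]` is the raw message text, which can include base64-encoded attachments. The sketch below is an assumption-laden experiment, not a confirmed fix: `sanitize_body` and both limits are hypothetical.

```ruby
MAX_BODY_CHARS  = 100_000  # hypothetical cap on indexed body length
MAX_TOKEN_CHARS = 64       # hypothetical cap on a single unbroken "word"

# Returns a trimmed copy of `text` that is cheaper to tokenize:
# the body is cut to `max_chars`, and any run of non-whitespace longer
# than `max_token` (typical of encoded attachments) becomes one space.
def sanitize_body(text, max_chars = MAX_BODY_CHARS, max_token = MAX_TOKEN_CHARS)
  return "" if text.nil?
  trimmed = text[0, max_chars]
  trimmed.gsub(/\S{#{max_token + 1},}/, " ")
end
```

In the indexing script this would mean passing `sanitize_body(body)` to the "body" field instead of the raw `body`, then checking whether indexing gets past document 2902.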

David Balmain

2006-Jan-15 11:31 UTC

> Any chance you could reenable the cFerret svn repository on your server?
> Tried to download per instructions but received connection refused.

done

John Wells

2006-Jan-16 17:01 UTC

David Balmain wrote:
>> Any chance you could reenable the cFerret svn repository on your server?
>> Tried to download per instructions but received connection refused.

David,

Thanks.

Btw, I'm still very interested in understanding what's causing my
current problem. I'd like to take a stab at it myself, but would ask for
a pointer on getting started. What approach would you take in tracking
this problem down? I thought about running the script in the debugger,
but man, the added overhead would cause it to run forever.

Is there any debug logging I can enable in Ferret? Anything else you
could suggest?

Thanks for the great work and the help!

John
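
One low-overhead way to narrow down a hang like this without a debugger is to time each document and log the ones that exceed a limit. The sketch below uses Ruby's standard `timeout` library; the `timed` helper and the limit are hypothetical, and whether a timeout can interrupt the stuck call depends on where the time is being spent.

```ruby
require 'timeout'

# Runs the block under a time limit.
# Returns [:ok, seconds] or [:timed_out, seconds].
def timed(limit_seconds)
  started = Time.now
  Timeout.timeout(limit_seconds) { yield }
  [:ok, Time.now - started]
rescue Timeout::Error
  [:timed_out, Time.now - started]
end

# Usage inside the indexing loop (names taken from the script in this thread):
#   status, secs = timed(30) { index << doc }
#   print "SLOW uid=#{uid} size=#{size} (#{secs}s)\n" if status == :timed_out
```

Logging the UID and size of each slow message makes it possible to fetch just that message afterwards and feed it to the index on its own.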

-- 
Posted via http://www.ruby-forum.com/.

Joost

2006-Jan-25 10:05 UTC

John, I am very interested in the Ruby-Ferret IMAP search tool. Did you
already manage to index 2 GB of emails? Are you willing to share your
code so I can also search through my email? It's not yet 2 GB but keeps on
growing :)

Joost

John Wells

2006-Jan-25 21:07 UTC

Joost wrote:
> John, I am very interested in the Ruby-Ferret IMAP search tool. Did you
> already manage to index 2Gb of emails? Are you willing to share your
> code so I can also search thru my email? It's not yet 2Gb but keeps on
> growing :)

Hi Joost,

Well, it's certainly not perfect code... more of a dirty hack to try it
out. And, as noted, if I try to index the body it doesn't fare very
well.

That said, I'd be happy to share it. I'll post it later tonight when I
have access to it.

Thanks,
John

John Wells

2006-Jan-26 02:14 UTC

Ok... it's neither pretty nor clean nor idiomatic Ruby (I'm a nuby ;),
but as a dirty hack it works (unless you fetch the body, that is).

Let me know if you have any questions:

#!/usr/bin/env ruby

require 'rubygems'
require 'ferret'
include Ferret
include Ferret::Document
require 'net/imap'

index = Index::Index.new(:path => "/path/to/index/goes/here")
$count = 0
$retry = 0  # initialised here; otherwise the first `1 + $retry` fails on nil
$imap = Net::IMAP.new('server_ip_address_goes_here', 143, false)

$imap.login('username_goes_here', 'password_goes_here')

print $imap.examine("INBOX")

# Index every message in the currently selected mailbox; `box` is its name.
def index_it(imapobj, index, box)
  imapobj.search(["ALL"]).each do |message_id|
    begin
      msg = imapobj.fetch(message_id, "(UID RFC822.SIZE ENVELOPE BODY[TEXT])")[0]
      envelope = msg.attr["ENVELOPE"]
      body = msg.attr["BODY[TEXT]"]
      uid = msg.attr["UID"]
      size = msg.attr["RFC822.SIZE"]
      date = envelope.date
      subject = envelope.subject
      if envelope.from != nil and envelope.from.size > 0
        from = envelope.from[0].name
      end
      sender = envelope.sender
      to = envelope.to
      in_reply_to = envelope.in_reply_to
      doc = Document.new
      doc << Field.new("id", message_id, Field::Store::YES, Field::Index::TOKENIZED)
      doc << Field.new("body", body, Field::Store::NO, Field::Index::TOKENIZED)
      doc << Field.new("from", from, Field::Store::YES, Field::Index::TOKENIZED)
      doc << Field.new("subject", subject, Field::Store::YES, Field::Index::TOKENIZED)
      doc << Field.new("date", date, Field::Store::YES, Field::Index::TOKENIZED)
      doc << Field.new("uid", uid, Field::Store::YES, Field::Index::TOKENIZED)
      doc << Field.new("size", size, Field::Store::YES, Field::Index::TOKENIZED)
      doc << Field.new("sender", sender, Field::Store::YES, Field::Index::TOKENIZED)
      doc << Field.new("in_reply_to", in_reply_to, Field::Store::YES, Field::Index::TOKENIZED)
      doc << Field.new("mailbox", box, Field::Store::YES, Field::Index::UNTOKENIZED)

      index << doc
      $count = $count + 1
      print "#{$count} : #{from} <==> #{subject}\n"
      $retry = 0
    rescue => detail
      # Retry the same message up to 20 times before giving up.
      print detail
      print detail.backtrace.join("\n")
      print "Retrying"
      $retry = 1 + $retry
      if $retry < 20
        retry
      else
        print "Retry threshold reached. Exiting..."
        exit!(99)
      end
    end
  end
end

$imap.examine("INBOX")

# Walk every mailbox on the server, skipping any whose name contains "customflags".
$imap.list("", "*").each do |box|
  name = box.name
  print "NAME: #{name}:#{box.class}\n"
  if name and name != "" and name !~ /customflags/
    begin
      $imap.select(name)
      index_it($imap, index, name)
    rescue => detail
      print "ERROR: " + detail.message + "\n"
    end
  end
end
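
One detail worth noting about the script above: in net/imap, `envelope.from`, `envelope.sender` and `envelope.to` are each either nil or an array of `Net::IMAP::Address` structs (with `name`, `route`, `mailbox` and `host` members), not plain strings. A sketch of flattening such a list into text before indexing follows; `format_addresses` is a hypothetical helper, not part of net/imap or Ferret.

```ruby
# Turns a list of address structs into "Name <mailbox@host>, ..." text.
# Accepts nil (no addresses) and entries without a display name.
def format_addresses(addresses)
  (addresses || []).map do |addr|
    email = [addr.mailbox, addr.host].compact.join("@")
    addr.name ? "#{addr.name} <#{email}>" : email
  end.join(", ")
end

# Usage in the indexing loop:
#   sender = format_addresses(envelope.sender)
#   to     = format_addresses(envelope.to)
```

Indexing the mailbox and host as well as the display name also makes it possible to search by e-mail address, which the name-only "from" field does not allow.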

Joost

2006-Jan-26 13:38 UTC

Hi John,

Thanks for the quick reaction. I'm a nuby too :) At the moment I haven't
got the time to look at the code... when I have, I certainly will. I hope
there is a new version of Ferret out by then, so it'll work completely and
fast.

Thanks, Joost

John Wells

2006-Jan-26 13:53 UTC

Joost wrote:
> Hi John,
>
> Thanks for the quick reaction. I'm a nuby too :) At the moment I haven't
> got the time to look at the code.. when I have I'll certainly do. I hope
> there is a new version of Ferret out by then..so it'll work completely &
> fast.

Ok... ;)

Btw, that code only creates the index. You'll then have to implement
code to search it, and you'll probably want it to dig out the UID for
you. Here's a sample of a search:
############################################
#!/usr/bin/env ruby
require 'rubygems'
require 'ferret'
include Ferret
require 'net/imap'

50.times { print "-" }; print "\n"

index = Index::Index.new(:path => "/path/to/index/goes/here")

# Phrase-search the body field for the first command-line argument.
index.search_each('body:"' + ARGV[0] + '"') do |doc, score|
  puts "Document #{doc} found with a score of #{score}"
  # to_s guards against nil fields; the " " before the UID is an assumed
  # separator, since the original text between the two "+" signs is missing.
  print index[doc]["from"].to_s + " <--> " +
        index[doc]["subject"].to_s + " " + index[doc]["uid"].to_s + "\n"
end

50.times { print "-" }; print "\n"
############################################
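
The search sample builds its query by pasting `ARGV[0]` between double quotes, so a missing argument or a quote character in the input will break it. A small guard could look like the sketch below. `phrase_query` is a hypothetical helper; it simply drops double quotes rather than relying on any escaping rules of the query parser.

```ruby
# Builds a phrase query such as body:"some words".
# Double quotes in the user's text are dropped so they cannot end the
# phrase early; returns nil when there is nothing left to search for.
def phrase_query(field, text)
  cleaned = text.to_s.delete('"').strip
  return nil if cleaned.empty?
  %(#{field}:"#{cleaned}")
end

# Usage in the search script:
#   query = phrase_query("body", ARGV[0]) or abort("usage: search.rb WORDS")
#   index.search_each(query) { |doc, score| ... }
```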
