thr3ads.net - Ferret talk - [Ferret-talk] Thinking of using aaf- looking for advice [Apr 2007]

Danny Burkes

2007-Apr-21 04:13 UTC

[Ferret-talk] Thinking of using aaf- looking for advice

Hi-

I'm technical lead at Lingr (http://www.lingr.com), a chatroom-based
social networking site.  We've currently got several million user
utterances stored in MySQL, and we're looking to build a local search
functionality.  I've played around with aaf and I really like it, but I
have some questions.


1.  Is anyone out there using aaf to index a corpus of this size?  If
so, how has your scaling experience been?

2.  We would be running one central aaf server instance, talking to it
over drb from our many application servers.  We add tens of thousands of
utterances per day- anyone out there indexing this many items on a daily
basis over drb?  If so, how has your experience been in terms of
stability?

3.  All of our utterance data is in UTF8, but we don't know what
language a particular utterance is in.  It's common to have both Latin
and non-Latin text even in the same room.  How can I index both types of
strings effectively within the same model field index?

4.  Any suggestions on how to build the initial index in an offline way?
I suspect it will probably take many hours to build the initial index.

5.  I suspect we will have to disable_ferret(:always) on our utterance
model, then update the index manually on some periodic basis (cron job,
backgroundrb worker, etc.).  The reason for this is that we don't want
to introduce any delay into the process of storing a new utterance,
which occurs in realtime during a chat session.  Anyone have experience
doing this?

Any advice is appreciated!

Best Regards,

Danny Burkes

-- 
Posted via http://www.ruby-forum.com/.

Ryan King

2007-Apr-22 19:26 UTC

On 4/20/07, Danny Burkes <dburkes at netable.com> wrote:
> Hi-
>
> I'm technical lead at Lingr (http://www.lingr.com), a chatroom-based
> social networking site.  We've currently got several million user
> utterances stored in MySQL, and we're looking to build a local search
> functionality.  I've played around with aaf and I really like it, but I
> have some questions.
>
>
> 1.  Is anyone out there using aaf to index a corpus of this size?  If
> so, how has your scaling experience been?

Yes. I have several models with more than 4M rows, all indexed with AAF.
My experience has been that AAF is very stable. Most of my challenges
have been with Ferret upgrades breaking the index format.

> 2.  We would be running one central aaf server instance, talking to it
> over drb from our many application servers.  We add tens of thousands of
> utterances per day- anyone out there indexing this many items on a daily
> basis over drb?  If so, how has your experience been in terms of
> stability?

Yes. Rock solid.

> 3.  All of our utterance data is in UTF8, but we don't know what
> language a particular utterance is in.  It's common to have both Latin
> and non-Latin text even in the same room.  How can I index both types of
> strings effectively within the same model field index?

Why not just use UTF8?

> 4.  Any suggestions on how to build the initial index in an offline way?
> I suspect it will probably take many hours to build the initial index.

Jens has talked about developing a better rebuild_index for AAF that does this.

However, if your search system isn't online (i.e., the feature isn't
enabled in the front end), why would you need anything special? The
AAF DRb server can serve requests while you're running a rebuild (as
long as you don't use the current rebuild_index method).

> 5.  I suspect we will have to disable_ferret(:always) on our utterance
> model, then update the index manually on some periodic basis (cron job,
> backgroundrb worker, etc.).  The reason for this is that we don't want
> to introduce any delay into the process of storing a new utterance,
> which occurs in realtime during a chat session.  Anyone have experience
> doing this?

It's pretty fast. The only time you'd see a slowdown is when you
encounter a lock in the DRb server.

-ryan

Danny Burkes

2007-Apr-23 01:52 UTC

Ryan King wrote:
> Yes. I have several models with more than 4M rows, all indexed with AAF.
> My experience has been that AAF is very stable. Most of my challenges
> have been with Ferret upgrades breaking the index format.
>
> Yes. Rock solid.

Great to know- thanks very much!

>> 3.  All of our utterance data is in UTF8, but we don't know what
>> language a particular utterance is in.  It's common to have both Latin
>> and non-Latin text even in the same room.  How can I index both types of
>> strings effectively within the same model field index?
>
> Why not just use UTF8?

Sorry, I should have been more clear- what I was referring to was not
storage, but rather tokenization.  My understanding is that many people
use a simple Regex-based one-token-per-character tokenizer for non-Latin
languages, but, since our languages are mixed, I wasn't sure what type
of approach to tokenization would be best.  Clearly we can't use that
one-token-per-character analyzer on Latin text, right?

> However, if your search system isn't online (i.e., the feature isn't
> enabled in the front end), why would you need anything special? The
> AAF DRb server can serve requests while you're running a rebuild (as
> long as you don't use the current rebuild_index method).

Perhaps I'm remembering incorrectly, but my recollection was that, the
first time I created a new record for a model that uses aaf, the whole
instance blocked while aaf was creating the index.  Did I remember that
wrong?

If that is the way that it works, then, clearly, I need to start the
rebuild from outside of the application, before any users can create new
model objects.

Further, are you saying that model creations during the rebuild won't
block (I guess they realize that a rebuild is already happening and just
return immediately)?

>> 5.  I suspect we will have to disable_ferret(:always) on our utterance
>> model, then update the index manually on some periodic basis (cron job,
>> backgroundrb worker, etc.).  The reason for this is that we don't want
>> to introduce any delay into the process of storing a new utterance,
>> which occurs in realtime during a chat session.  Anyone have experience
>> doing this?
>
> It's pretty fast. The only time you'd see a slowdown is when you
> encounter a lock in the DRb server.

And what would cause that?  Do normal model creates cause a lock?

Thanks so much for your info. so far and for any further advice you can
give me.

Best Regards,

Danny


Jens Kraemer

2007-Apr-23 13:17 UTC

On Mon, Apr 23, 2007 at 03:52:51AM +0200, Danny Burkes wrote:
[..]
> Sorry, I should have been more clear- what I was referring to was not
> storage, but rather tokenization.  My understanding is that many
> people use a simple Regex-based one-token-per-character tokenizer for
> non-Latin languages, but, since our languages are mixed, I wasn't sure
> what type of approach to tokenization would be best.  Clearly we can't
> use that one-token-per-character analyzer on Latin text, right?

Right :-)
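
To make the problem concrete, here is a minimal plain-Ruby sketch of a one-token-per-character tokenizer. The method name `per_character_tokens` is made up for illustration and is not part of Ferret or aaf. It shows that, as a bag of tokens, two unrelated Latin words can become indistinguishable.

```ruby
# Hypothetical helper, not Ferret/aaf API.
# Emits one token per letter or digit, skipping whitespace and punctuation.
def per_character_tokens(text)
  text.scan(/[[:alnum:]]/)
end

cjk_tokens   = per_character_tokens("\u691C\u7D22") # one token per character
latin_tokens = per_character_tokens("search")       # one token per letter

# "search" and "arches" are anagrams, so their token sets are identical.
same_bag = latin_tokens.sort == per_character_tokens("arches").sort
```

For scripts without word separators, per-character tokens are a workable unit. For Latin text they throw away the word boundaries that make a query selective.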

Some heuristics to get an idea about which language you're working on
right now might be a good idea to select a proper analyzing algorithm.

The Nutch search engine (Java, Lucene-based) seems to have something
like that, possibly we could port this:
http://wiki.apache.org/nutch/MultiLingualSupport

> > However, if your search system isn't online (i.e., the feature isn't
> > enabled in the front end), why would you need anything special? The
> > AAF DRb server can serve requests while you're running a rebuild (as
> > long as you don't use the current rebuild_index method).
>
> Perhaps I'm remembering incorrectly, but my recollection was that, the
> first time I created a new record for a model that uses aaf, the whole
> instance blocked while aaf was creating the index.  Did I remember that
> wrong?

No, that's correct. You can force a rebuild by calling
Model.rebuild_index from the console.

> If that is the way that it works, then, clearly, I need to start the
> rebuild from outside of the application, before any users can create new
> model objects.
>
> Further, are you saying that model creations during the rebuild won't
> block (I guess they realize that a rebuild is already happening and just
> return immediately)?

Unfortunately the DRb server doesn't realize this, yet. As Ryan wrote, I
plan to rework the re-indexing stuff in the near future, most likely
then there will be some kind of index rotation and a queue remembering
model updates that occurred while a rebuild is going on.

> >> 5.  I suspect we will have to disable_ferret(:always) on our utterance
> >> model, then update the index manually on some periodic basis (cron job,
> >> backgroundrb worker, etc.).  The reason for this is that we don't want
> >> to introduce any delay into the process of storing a new utterance,
> >> which occurs in realtime during a chat session.  Anyone have experience
> >> doing this?
> >
> > It's pretty fast. The only time you'd see a slowdown is when you
> > encounter a lock in the DRb server.
>
> And what would cause that?  Do normal model creates cause a lock?

Index updates are synchronized as there only may be one thread writing
to the index at a time. In case immediate indexing of new or updated
records is not needed, I see no problem in doing this later from cron or
backgroundrb based on some flag or timestamp. 

Ferret *is* fast, but you also have to take into account the DRb round
trip time, so this really could make sense for a chat application.
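
A minimal sketch of the deferred, flag-based indexing described above, in plain Ruby. Everything here is a stand-in: `Utterance`, its `indexed_at` column, and `BatchIndexer` are hypothetical names, and a Hash plays the role of the index. In a real setup the write inside the loop would be whatever index update call your aaf version provides.

```ruby
# Hypothetical stand-in for the ActiveRecord model; indexed_at is an
# assumed column that stays nil until the record has been indexed.
Utterance = Struct.new(:id, :text, :indexed_at)

class BatchIndexer
  def initialize(index)
    @index = index # anything that responds to []=, a Hash in this sketch
  end

  # Indexes up to +limit+ records whose flag is unset, stamps them,
  # and returns how many were processed in this run.
  def run(records, limit: 1000, now: Time.now)
    batch = records.select { |r| r.indexed_at.nil? }.first(limit)
    batch.each do |r|
      @index[r.id] = r.text # stand-in for the real index write
      r.indexed_at = now
    end
    batch.size
  end
end
```

A cron job or backgroundrb worker would call `run` every few minutes. Saving an utterance then never waits on the index, at the cost of a short delay before new text becomes searchable.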

cheers,
Jens


-- 
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
kraemer at webit.de | www.webit.de
 
Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa

Danny Burkes

2007-Apr-23 14:42 UTC

> Some heuristics to get an idea about which language you're working on
> right now might be a good idea to select a proper analyzing algorithm.

I think that's probably the only way to do this effectively, but how can
I specify a particular analyzer on a per-model-instance basis?  I only
recall seeing that aaf allows analyzer specification on a per-model
basis.

Or perhaps you were thinking of a single analyzer that does heuristics
itself and decides how to tokenize based on the input string?

> No, that's correct. You can force a rebuild by calling
> Model.rebuild_index from the console.

OK, fair enough.

>> Further, are you saying that model creations during the rebuild won't
>> block (I guess they realize that a rebuild is already happening and just
>> return immediately)?
>
> Unfortunately the DRb server doesn't realize this, yet. As Ryan wrote, I
> plan to rework the re-indexing stuff in the near future, most likely
> then there will be some kind of index rotation and a queue remembering
> model updates that occurred while a rebuild is going on.

So how would you suggest that we ever get the index "caught up"?  The first
rebuild_index will probably take many hours, and, while that's building,
thousands of new model instances will be created.  Since we can't turn
on automatic indexing (at least until the index is up to date), how do
we get the index up to date?

> Index updates are synchronized as there only may be one thread writing
> to the index at a time. In case immediate indexing of new or updated
> records is not needed, I see no problem in doing this later from cron or
> backgroundrb based on some flag or timestamp.
> 
> Ferret *is* fast, but you also have to take into account the DRb round
> trip time, so this really could make sense for a chat application.
>
Yeah, sounds like we would have to do it periodically, and tell users to
expect a few minutes of index latency.  That's fine- it certainly beats
the typical latency that we already get from Google crawlers, which are
currently our only local search functionality.

Best Regards,

Danny


Jens Kraemer

2007-Apr-23 15:08 UTC

On Mon, Apr 23, 2007 at 04:42:50PM +0200, Danny Burkes wrote:
> > Some heuristics to get an idea about which language you're working on
> > right now might be a good idea to select a proper analyzing algorithm.
>
> I think that's probably the only way to do this effectively, but how can
> I specify a particular analyzer on a per-model-instance basis?  I only
> recall seeing that aaf allows analyzer specification on a per-model
> basis.
>
> Or perhaps you were thinking of a single analyzer that does heuristics
> itself and decides how to tokenize based on the input string?

Yeah, that's what I was thinking of.

That uber-analyzer could determine the language/type of language used in
a document and then delegate to a specialized analyzer. The same would
have to be done for query analysis; there, because of the small text
size, it would be good if the application could supply a hint (e.g. user
profile, UI language used).
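
The delegation idea can be sketched in plain Ruby as follows. This is only an illustration under assumptions: the method names, the `:cjk`/`:latin` labels, and the "at least half the letters" threshold are all invented here, and none of it is Ferret's analyzer API. A real version would wrap the same logic in whatever analyzer interface the index expects.

```ruby
# Hypothetical sketch in plain Ruby; none of this is Ferret or aaf API.

# Word-style analysis: lowercased runs of letters and digits.
def word_tokens(text)
  text.downcase.scan(/[[:alnum:]]+/)
end

# Character-style analysis: one token per Han/Hiragana/Katakana character,
# while ASCII letter and digit runs are still kept together as words.
def char_tokens(text)
  text.downcase.scan(/[\p{Han}\p{Hiragana}\p{Katakana}]|[a-z0-9]+/)
end

# Crude heuristic: call the text :cjk when at least half of its letters
# are Han, Hiragana, or Katakana; otherwise :latin.
def detect_script(text)
  letters = text.scan(/\p{L}/).size
  return :latin if letters.zero?
  cjk = text.scan(/[\p{Han}\p{Hiragana}\p{Katakana}]/).size
  cjk * 2 >= letters ? :cjk : :latin
end

# The delegating analyzer. A caller-supplied hint wins over detection,
# which matters for short queries where detection is unreliable.
def analyze(text, hint: nil)
  case hint || detect_script(text)
  when :cjk then char_tokens(text)
  else word_tokens(text)
  end
end
```

The same `analyze` method has to be used at index time and at query time, otherwise the tokens will not line up. The `hint:` argument is where the application could pass in the user's profile or UI language.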

[..]
> > Unfortunately the DRb server doesn't realize this, yet. As Ryan wrote, I
> > plan to rework the re-indexing stuff in the near future, most likely
> > then there will be some kind of index rotation and a queue remembering
> > model updates that occurred while a rebuild is going on.
>
> So how would you suggest that we ever get the index "caught up"?  The first
> rebuild_index will probably take many hours, and, while that's building,
> thousands of new model instances will be created.  Since we can't turn
> on automatic indexing (at least until the index is up to date), how do
> we get the index up to date?

I'd just remember the time I started rebuild_index, and after it's
finished, index all records that have been created afterwards, not doing
a rebuild_index but reading/indexing them one by one.

Or even better, let your background indexer (that later will handle the
regular index updates) do this by marking all records older than the
rebuild timestamp as 'already indexed' and then starting it to handle
these not yet indexed records.
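
The first variant (remember the rebuild start time, then index the stragglers one by one) could look roughly like this. It is a plain-Ruby sketch: `Message`, `catch_up`, and the Hash used as the index are hypothetical stand-ins, not aaf calls.

```ruby
# Hypothetical stand-in for the model; only created_at matters here.
Message = Struct.new(:id, :text, :created_at)

# Indexes every record created at or after +rebuild_started_at+, one by
# one, and returns the ids it touched. Using >= rather than > means a
# record saved right at the cut-off may be written twice instead of being
# missed; that is harmless because the write is keyed by id.
def catch_up(index, records, rebuild_started_at)
  missed = records.select { |r| r.created_at >= rebuild_started_at }
  missed.each { |r| index[r.id] = r.text } # stand-in for the index write
  missed.map(&:id)
end
```

The timestamp must be taken before the rebuild starts reading rows, not when it finishes, so that the rebuild and the catch-up pass overlap rather than leave a gap.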

regards,
Jens


