Danny Burkes
2007-Apr-21 04:13 UTC
[Ferret-talk] Thinking of using aaf- looking for advice
Hi-

I'm technical lead at Lingr (http://www.lingr.com), a chatroom-based social networking site. We currently have several million user utterances stored in MySQL, and we're looking to build local search functionality. I've played around with aaf and I really like it, but I have some questions.

1. Is anyone out there using aaf to index a corpus of this size? If so, how has your scaling experience been?

2. We would be running one central aaf server instance, talking to it over DRb from our many application servers. We add tens of thousands of utterances per day- is anyone out there indexing this many items on a daily basis over DRb? If so, how has your experience been in terms of stability?

3. All of our utterance data is in UTF8, but we don't know what language a particular utterance is in. It's common to have both Latin and non-Latin text even in the same room. How can I index both types of strings effectively within the same model field index?

4. Any suggestions on how to build the initial index in an offline way? I suspect it will take many hours to build the initial index.

5. I suspect we will have to disable_ferret(:always) on our utterance model, then update the index manually on some periodic basis (cron job, backgroundrb worker, etc.). The reason for this is that we don't want to introduce any delay into the process of storing a new utterance, which occurs in realtime during a chat session. Anyone have experience doing this?

Any advice is appreciated!

Best Regards,
Danny Burkes

-- Posted via http://www.ruby-forum.com/.
On 4/20/07, Danny Burkes <dburkes at netable.com> wrote:
> Hi-
>
> I'm technical lead at Lingr (http://www.lingr.com), a chatroom-based
> social networking site. We've currently got several million user
> utterances stored in MySQL, and we're looking to build a local search
> functionality. I've played around with aaf and I really like it, but I
> have some questions.
>
> 1. Is anyone out there using aaf to index a corpus of this size? If
> so, how has your scaling experience been?

Yes. I have several models with more than 4M rows, all indexed with AAF. My experience has been that AAF is very stable. Most of my challenges have been with Ferret upgrades breaking the index format.

> 2. We would be running one central aaf server instance, talking to it
> over drb from our many application servers. We add tens of thousands of
> utterances per day- anyone out there indexing this many items on a daily
> basis over drb? If so, how has your experience been in terms of
> stability?

Yes. Rock solid.

> 3. All of our utterance data is in UTF8, but we don't know what
> language a particular utterance is in. It's common to have both latin
> and non-latin text even in the same room. How can I index both types of
> strings effectively within the same model field index?

Why not just use UTF8?

> 4. Any suggestions on how to build the initial index in an offline way?
> I suspect it will probably take many hours to build the initial index.

Jens has talked about developing a better rebuild_index for AAF that does this.

However, if your search system isn't online (i.e., the feature isn't enabled in the front end), why would you need anything special? The AAF DRb server can serve requests while you're running a rebuild (as long as you don't use the current rebuild_index method).

> 5. I suspect we will have to disable_ferret(:always) on our utterance
> model, then update the index manually on some periodic basis (cron job,
> backgroundrb worker, etc.). The reason for this is that we don't want
> to introduce any delay into the process of storing a new utterance,
> which occurs in realtime during a chat session. Anyone have experience
> doing this?

It's pretty fast. The only time you'd see a slowdown is when you encounter a lock in the DRb server.

-ryan
Danny Burkes
2007-Apr-23 01:52 UTC
[Ferret-talk] Thinking of using aaf- looking for advice
Ryan King wrote:
> Yes. I have several models with more than 4M rows, all indexed with AAF.
> My experience has been that AAF is very stable. Most of my challenges
> have been with ferret upgrades breaking index format.
>
> Yes. Rock solid.

Great to know- thanks very much!

>> 3. All of our utterance data is in UTF8, but we don't know what
>> language a particular utterance is in. It's common to have both latin
>> and non-latin text even in the same room. How can I index both types of
>> strings effectively within the same model field index?
>
> Why not just use UTF8?

Sorry, I should have been clearer- what I was referring to was not storage, but rather tokenization. My understanding is that many people use a simple regex-based one-token-per-character tokenizer for non-Latin languages, but, since our languages are mixed, I wasn't sure what approach to tokenization would be best. Clearly we can't use that one-token-per-character analyzer on Latin text, right?

> However, if your search system isn't online (ie, the feature isn't
> enabled in the front end), why would you need anything special? The
> AAF DRb server can serve requests while you're running a rebuild (as
> long as you don't use the current rebuild_index method).

Perhaps I'm remembering incorrectly, but my recollection was that, the first time I created a new record for a model that uses aaf, the whole instance blocked while aaf was creating the index. Did I remember that wrong?

If that is the way that it works, then, clearly, I need to start the rebuild from outside of the application, before any users can create new model objects.

Further, are you saying that model creations during the rebuild won't block (I guess they realize that a rebuild is already happening and just return immediately)?

>> 5. I suspect we will have to disable_ferret(:always) on our utterance
>> model, then update the index manually on some periodic basis (cron job,
>> backgroundrb worker, etc.). The reason for this is that we don't want
>> to introduce any delay into the process of storing a new utterance,
>> which occurs in realtime during a chat session. Anyone have experience
>> doing this?
>
> It's pretty fast. The only time you'd see a slowdown is when you
> encounter a lock in the DRb server.

And what would cause that? Do normal model creates cause a lock?

Thanks so much for your info so far and for any further advice you can give me.

Best Regards,
Danny
Jens Kraemer
2007-Apr-23 13:17 UTC
[Ferret-talk] Thinking of using aaf- looking for advice
On Mon, Apr 23, 2007 at 03:52:51AM +0200, Danny Burkes wrote:
[..]
> Sorry, I should have been more clear- what I was referring to was not
> storage, but rather tokenization. My understanding is that many
> people use a simple Regex-based one-token-per-character tokenizer for
> non-Latin languages, but, since our languages are mixed, I wasn't sure
> what type of approach to tokenization would be best. Clearly we can't
> use that one-token-per-character analyzer on latin text, right?

Right :-)

Some heuristics to detect which language you're currently working with might be a good way to select a proper analyzing algorithm. The Nutch search engine (Java, Lucene-based) seems to have something like that; possibly we could port it:
http://wiki.apache.org/nutch/MultiLingualSupport

> > However, if your search system isn't online (ie, the feature isn't
> > enabled in the front end), why would you need anything special? The
> > AAF DRb server can serve requests while you're running a rebuild (as
> > long as you don't use the current rebuild_index method).
>
> Perhaps I'm remembering incorrectly, but my recollection was that, the
> first time I created a new record for a model that uses aaf, the whole
> instance blocked while aaf was creating the index. Did I remember that
> wrong?

No, that's correct. You can force a rebuild by calling Model.rebuild_index from the console.

> If that is the way that it works, then, clearly, I need to start the
> rebuild from outside of the application, before any users can create new
> model objects.
>
> Further, are you saying that model creations during the rebuild won't
> block (I guess they realize that a rebuild is already happening and just
> return immediately)?

Unfortunately the DRb server doesn't realize this yet. As Ryan wrote, I plan to rework the re-indexing stuff in the near future. Most likely there will then be some kind of index rotation and a queue remembering model updates that occurred while a rebuild is going on.

> >> 5. I suspect we will have to disable_ferret(:always) on our utterance
> >> model, then update the index manually on some periodic basis (cron job,
> >> backgroundrb worker, etc.). The reason for this is that we don't want
> >> to introduce any delay into the process of storing a new utterance,
> >> which occurs in realtime during a chat session. Anyone have experience
> >> doing this?
> >
> > It's pretty fast. The only time you'd see a slowdown is when you
> > encounter a lock in the DRb server.
>
> And what would cause that? Do normal model creates cause a lock?

Index updates are synchronized, as only one thread may write to the index at a time. If immediate indexing of new or updated records is not needed, I see no problem in doing this later from cron or backgroundrb based on some flag or timestamp.

Ferret *is* fast, but you also have to take into account the DRb round trip time, so this really could make sense for a chat application.

cheers,
Jens

--
Jens Krämer
webit! Gesellschaft für neue Medien mbH
Schnorrstraße 76 | 01069 Dresden
Telefon +49 351 46766-0 | Telefax +49 351 46766-66
kraemer at webit.de | www.webit.de
Amtsgericht Dresden | HRB 15422
GF Sven Haubold, Hagen Malessa
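The deferred approach described above (index later from cron or backgroundrb, driven by a flag, with a single writer at a time) might look roughly like the sketch below. It is pure Ruby with in-memory stand-ins so it runs on its own: `Utterance`, `FakeIndex`, and `index_pending` are hypothetical names for illustration, not part of Ferret or acts_as_ferret, and the `indexed` flag is an assumed extra column on the model.

```ruby
# Hypothetical stand-in for a database row; `indexed` is the assumed flag.
Utterance = Struct.new(:id, :text, :indexed)

# Hypothetical stand-in for the search index. The Mutex mirrors the rule
# that only one thread may write to the index at a time.
class FakeIndex
  def initialize
    @docs = {}
    @lock = Mutex.new
  end

  # Keyed by id, so indexing the same record twice does not duplicate it.
  def add(record)
    @lock.synchronize { @docs[record.id] = record.text }
  end

  def size
    @lock.synchronize { @docs.size }
  end
end

# What a cron job or background worker could call every few minutes:
# index every record not yet flagged, flag it, and report how many were done.
def index_pending(records, index)
  pending = records.reject(&:indexed)
  pending.each do |record|
    index.add(record)
    record.indexed = true
  end
  pending.size
end
```

Because saving an utterance only writes the row with `indexed` set to false, the chat request never waits on the index; the cost is that search results lag by up to one worker interval.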
Danny Burkes
2007-Apr-23 14:42 UTC
[Ferret-talk] Thinking of using aaf- looking for advice
> Some heuristics to get an idea about which language you're working on
> right now might be a good idea to select a proper analyzing algorithm.

I think that's probably the only way to do this effectively, but how can I specify a particular analyzer on a per-model-instance basis? I only recall seeing that aaf allows analyzer specification on a per-model basis.

Or perhaps you were thinking of a single analyzer that does heuristics itself and decides how to tokenize based on the input string?

> No, that's correct. You can force a rebuild by calling
> Model.rebuild_index from the console.

OK, fair enough.

>> Further, are you saying that model creations during the rebuild won't
>> block (I guess they realize that a rebuild is already happening and just
>> return immediately)?
>
> Unfortunately the DRb server doesn't realize this, yet. As Ryan wrote, I
> plan to rework the re-indexing stuff in the near future, most likely
> then there will be some kind of index rotation and a queue remembering
> model updates that occurred while a rebuild is going on.

So how would you suggest that we ever get the index "caught up"? The first rebuild_index will probably take many hours, and, while that's building, thousands of new model instances will be created. Since we can't turn on automatic indexing (at least until the index is up to date), how do we get the index up to date?

> Index updates are synchronized as there only may be one thread writing
> to the index at a time. In case immediate indexing of new or updated
> records is not needed, I see no problem in doing this later from cron or
> backgroundrb based on some flag or timestamp.
>
> Ferret *is* fast, but you also have to take into account the DRb round
> trip time, so this really could make sense for a chat application.

Yeah, sounds like we would have to do it periodically, and tell users to expect a few minutes of index latency. That's fine- it certainly beats the typical latency that we already get from Google crawlers, which are currently our only local search functionality.

Best Regards,
Danny
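As a rough illustration of the "single analyzer that decides how to tokenize based on the input string" idea, here is a minimal pure-Ruby sketch. It swaps in a simpler technique than per-document language detection: it looks at the script of each character, emitting one token per CJK character and keeping runs of other letters and digits together as words. `MixedScriptTokenizer` is a hypothetical name, it is not wired into Ferret's analyzer API, and it assumes a Ruby version whose regular expressions support Unicode script properties such as `\p{Han}`.

```ruby
# Hypothetical helper, not part of Ferret or acts_as_ferret.
module MixedScriptTokenizer
  # Scripts tokenized one character at a time (an assumption; adjust to
  # whatever scripts actually appear in the data).
  CJK = '[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]'

  # Either a single CJK character, or a run of letters/digits that are
  # not CJK (the negative lookahead stops a word at the first CJK char).
  TOKEN = /#{CJK}|(?:(?!#{CJK})[\p{L}\p{N}])+/

  # Returns an array of lowercased tokens for a UTF-8 string.
  def self.tokenize(text)
    text.scan(TOKEN).map(&:downcase)
  end
end
```

The same function would have to be applied to queries as well as to indexed text, otherwise the tokens on the two sides will not line up.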
Jens Kraemer
2007-Apr-23 15:08 UTC
[Ferret-talk] Thinking of using aaf- looking for advice
On Mon, Apr 23, 2007 at 04:42:50PM +0200, Danny Burkes wrote:
> > Some heuristics to get an idea about which language you're working on
> > right now might be a good idea to select a proper analyzing algorithm.
>
> I think that's probably the only way to do this effectively, but how can
> I specify a particular analyzer on a per-model-instance basis? I only
> recall seeing that aaf allows analyzer specification on a per-model
> basis.
>
> Or perhaps you were thinking of a single analyzer that does heuristics
> itself and decides how to tokenize based on the input string?

Yeah, that's what I was thinking of. That uber-analyzer could determine the language (or type of language) used in a document and then delegate to a specialized analyzer. The same would have to be done for query analysis. Because query text is so short, it would help there if the application could supply a hint (e.g. user profile, UI language used).

[..]
> > Unfortunately the DRb server doesn't realize this, yet. As Ryan wrote, I
> > plan to rework the re-indexing stuff in the near future, most likely
> > then there will be some kind of index rotation and a queue remembering
> > model updates that occurred while a rebuild is going on.
>
> So how would you suggest that ever get the index "caught up"? The first
> rebuild_index will probably take many hours, and, while that's building,
> thousands of new model instances will be created. Since we can't turn
> on automatic indexing (at least until the index is up to date), how do
> we get the index up to date?

I'd just remember the time I started rebuild_index and, after it has finished, index all records that were created afterwards, not with another rebuild_index but by reading and indexing them one by one.

Or even better, let your background indexer (the one that will later handle the regular index updates) do this: mark all records older than the rebuild timestamp as 'already indexed', then start the indexer so it picks up the records that are not yet indexed.

regards,
Jens
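The catch-up procedure above can be sketched as follows. This is a pure-Ruby simulation under stated assumptions, not acts_as_ferret code: `ChatLine`, `catch_up`, and the `indexed` flag are hypothetical, the index is modelled as a `Set` of record ids, and the rebuild is assumed to have picked up every record created before it started.

```ruby
require "set"

# Hypothetical stand-in for a model row with an assumed `indexed` flag.
ChatLine = Struct.new(:id, :created_at, :indexed)

# records            - all rows, including those created during the rebuild
# indexed_ids        - Set of ids already in the index after the rebuild
# rebuild_started_at - Time noted just before the rebuild was kicked off
#
# Returns the ids that had to be indexed one by one.
def catch_up(records, indexed_ids, rebuild_started_at)
  # Step 1: everything older than the rebuild start counts as indexed.
  records.each do |record|
    record.indexed = true if record.created_at < rebuild_started_at
  end

  # Step 2: the regular background pass handles whatever is left.
  pending = records.reject(&:indexed)
  pending.each do |record|
    indexed_ids << record.id # a Set, so a record seen twice is harmless
    record.indexed = true
  end
  pending.map(&:id)
end
```

Records created at exactly the rebuild timestamp are treated as not yet indexed, which errs on the side of indexing a record twice rather than missing it.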