I'm trying to add a way to query across associations for a model in acts_as_ferret. Say I have a model A and it has a relationship with model B; say a Book has many pages. I want to search across the pages of the Book and produce a list of unique books whose pages match the terms. So if I have a page that hits, I will add that book to my list of results. Right now multi_search returns all pages and books that match the query.

This gets difficult with pagination because you can't just keep track of it yourself. Also, the total hits will include all hits on both pages and books. The way pagination works today with Ferret is that you hand it :offset and :limit params. But these are fixed-width params: I could end up with hundreds of pages that all belong to the same book, so I have to skip all of those.

This seems like a different kind of search. Not a multi_search or find_by_contents, but a find_by_association, where a hit on the association returns an object of the associated type.

Is there something in Ferret that allows me to scroll through the results one by one and stop when I've reached my limit?

Charlie

--
Posted via http://www.ruby-forum.com/.
On 10/15/06, Charlie Hubbard <charlie.hubbard at gmail.com> wrote:
> I'm trying to add a way to query across associations for a model in acts_as_ferret. Say I have a model A and it has a relationship with model B; say a Book has many pages. I want to search across the pages of the Book and produce a list of unique books whose pages match the terms. So if I have a page that hits, I will add that book to my list of results. Right now multi_search returns all pages and books that match the query.
>
> This gets difficult with pagination because you can't just keep track of it yourself. Also, the total hits will include all hits on both pages and books. The way pagination works today with Ferret is that you hand it :offset and :limit params. But these are fixed-width params: I could end up with hundreds of pages that all belong to the same book, so I have to skip all of those.
>
> This seems like a different kind of search. Not a multi_search or find_by_contents, but a find_by_association, where a hit on the association returns an object of the associated type.

If I manage to implement the Ferret object database [1] this will be simple. Currently, though, there are two ways to do this. You can index all of the Page data in the Book document, presumably in a :page field. Or you can store the Book ids in the Pages and create a Book id set by scanning through all matching pages.

[1] http://www.ruby-forum.com/topic/82086#142613

> Is there something in Ferret that allows me to scroll through the results one by one and stop when I've reached my limit?

Sure. Set :limit => :all and call search_each, then break when you reach your limit.
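[Editor's note: the break-on-limit pattern suggested above can be sketched in plain Ruby. The hit list here is a stand-in for what `search_each(query, :limit => :all)` would yield (a book_id stored on each matching Page document); the method name and data are illustrative, not from the thread.]

```ruby
# Stand-in for the Ferret hit stream: the book_id stored on each
# matching Page, in score order. In real code these values would come
# from index.search_each(query, :limit => :all) { |doc_id, score| ... }.
def unique_books(page_hits, limit)
  books = []
  seen  = {}
  page_hits.each do |book_id|
    next if seen[book_id]        # this book is already in the results
    seen[book_id] = true
    books << book_id
    break if books.size >= limit # stop early, like breaking out of search_each
  end
  books
end

unique_books([1, 1, 2, 1, 3, 2, 4, 5, 3, 6], 3)  # => [1, 2, 3]
```

This gives the first page of unique books cheaply, but note Charlie's later objection: it says nothing about the total number of unique books.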
David Balmain wrote:
> If I manage to implement the Ferret object database [1] this will be simple. Currently, though, there are two ways to do this. You can index all of the Page data in the Book document, presumably in a :page field. Or you can store the Book ids in the Pages and create a Book id set by scanning through all matching pages.
>
> [1] http://www.ruby-forum.com/topic/82086#142613

The first option has problems because a book's content will be too large for a single field; it would overrun Ferret's maximum field length.

I'm pretty much doing the second option now, but its drawback is that pagination gets tough. I'm not sure how the Ferret object database would actually solve this problem. How would your queries express what the user intends? How would it know I want to include all the Page objects as part of a search on Books? It seems like you'd have to specify that sort of thing as options to the search, like we specify eager loading with the :include option to find.

>> Is there something in Ferret that allows me to scroll through the results one by one and stop when I've reached my limit?
>
> Sure. Set :limit => :all and call search_each, then break when you reach your limit.

That will work for creating a list of Books and making sure I show, say, 10 unique books per page. But I won't be able to tell what the total number of hits was. Any ideas?

It also gets hard to do pagination because you can't compute where the next window starts and ends. So how do you know what the offset parameter for the previous page is, or the offset for the 9th page?

Charlie
On 10/15/06, Charlie Hubbard <charlie.hubbard at gmail.com> wrote:
> David Balmain wrote:
> > If I manage to implement the Ferret object database [1] this will be simple. Currently, though, there are two ways to do this. You can index all of the Page data in the Book document, presumably in a :page field. Or you can store the Book ids in the Pages and create a Book id set by scanning through all matching pages.
> >
> > [1] http://www.ruby-forum.com/topic/82086#142613
>
> The first option has problems because a book's content will be too large for a single field. It would overrun Ferret's maximum field length.

Then change the maximum field length. IndexWriter has a :max_field_length parameter.

> I'm pretty much doing the second option now, but its drawback is that pagination gets tough. I'm not sure how the Ferret object database would actually solve this problem. How would your queries express what the user intends? How would it know I want to include all the Page objects as part of a search on Books? It seems like you'd have to specify that sort of thing as options to the search, like we specify eager loading with the :include option to find.

Well, the user would just type their query as usual but you'd write the query something like:

    Books.find("pages match '#{query}'", :limit => 10)

Or something like that. I haven't worked out the details yet. And you would be able to specify whether you wanted lazy or eager loading too.

> >> Is there something in Ferret that allows me to scroll through the results one by one and stop when I've reached my limit?
> >
> > Sure. Set :limit => :all and call search_each, then break when you reach your limit.
>
> That will work for creating a list of Books and making sure I show, say, 10 unique books per page. But I won't be able to tell what the total number of hits was. Any ideas?
>
> It also gets hard to do pagination because you can't compute where the next window starts and ends. So how do you know what the offset parameter for the previous page is, or the offset for the 9th page?

Scroll through all matches or use option 1.

Cheers,
Dave
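[Editor's note: "scroll through all matches" does answer both of Charlie's questions, at the cost of walking every hit. A plain-Ruby sketch, where the hit list again stands in for the full `search_each(query, :limit => :all)` stream:]

```ruby
# The complete hit stream: book_id stored on each matching Page, in
# score order, as search_each with :limit => :all would yield them.
page_hits = [1, 1, 2, 1, 3, 2, 4, 5, 3, 6, 6, 7]

# Deduplicate the whole stream once. The result gives both the true
# total of unique books and the window for any page number, so
# previous/next offsets fall out naturally.
unique_books = page_hits.uniq

per_page   = 3
total_hits = unique_books.size                 # total unique books: 7
page_two   = unique_books[per_page, per_page]  # second page: [4, 5, 6]
```

The trade-off is that every page render scans the full result set, which is exactly why Dave keeps steering toward option 1 (indexing the pages inside the Book document).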
David Balmain wrote:
> Well, the user would just type their query as usual but you'd write the query something like:
>
>     Books.find("pages match '#{query}'", :limit => 10)
>
> Or something like that. I haven't worked out the details yet. And you would be able to specify whether you wanted lazy or eager loading too.

That's what I guessed you'd have to do: change the query language to support this concept. I was actually working on adding a new method to acts_as_ferret where you could pass these association matches in, like:

    Book.find_by_association(query, [:pages], { :limit => 20 })

since I can't change the query language but can express the same sort of behavior. This would result in a multi_index query across the Book and Page indexes. But tracking total_hits and paging just don't work with this approach; the only option you have is to iterate over all the matches.

When we do Ferret queries, does Ferret actually go over the entire search space to calculate all the possible documents that match the query, and then just return the ones within the offset and limit? If that's the case then it's doable to create this type of search, but it would make more sense to modify Ferret to support this type of query.

I'm interested in your database approach. It could help simplify this problem. It seems doable to add this to acts_as_ferret without needing a separate project. Not to mention it's really needed in Rails apps as well.

Charlie
On 10/16/06, Charlie Hubbard <charlie.hubbard at gmail.com> wrote:
> David Balmain wrote:
> > Well, the user would just type their query as usual but you'd write the query something like:
> >
> >     Books.find("pages match '#{query}'", :limit => 10)
> >
> > Or something like that. I haven't worked out the details yet. And you would be able to specify whether you wanted lazy or eager loading too.
>
> That's what I guessed you'd have to do: change the query language to support this concept. I was actually working on adding a new method to acts_as_ferret where you could pass these association matches in, like:
>
>     Book.find_by_association(query, [:pages], { :limit => 20 })
>
> since I can't change the query language but can express the same sort of behavior. This would result in a multi_index query across the Book and Page indexes. But tracking total_hits and paging just don't work with this approach; the only option you have is to iterate over all the matches.
>
> When we do Ferret queries, does Ferret actually go over the entire search space to calculate all the possible documents that match the query, and then just return the ones within the offset and limit?

Yes, that's exactly how it works.

> If that's the case then it's doable to create this type of search, but it would make more sense to modify Ferret to support this type of query.

I don't see a way to add this feature cleanly. It is just as easy for you to iterate through all the results yourself. Besides, you still haven't explained why you can't add all Pages to each Book document. As I said, the field length limit isn't an issue. This would be the best way to solve this problem.

> I'm interested in your database approach. It could help simplify this problem. It seems doable to add this to acts_as_ferret without needing a separate project. Not to mention it's really needed in Rails apps as well.

In my suggested database approach the search would be the equivalent of a simple SQL join query. By adding a feature like this to acts_as_ferret you'd need to pull all the matching page ids out of the index and perform a much slower SQL query for all books that include those page ids. I'm not sure it is feasible, but I'll leave that decision to the acts_as_ferret developers. The best solution is definitely to index all the pages with the book document, even if it means indexing each page twice.

Cheers,
Dave
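[Editor's note: the denormalization Dave recommends might look something like this at indexing time. The field names, record shapes, and the final `index << book_doc` step are illustrative assumptions; only the idea of concatenating page content into the Book document comes from the thread.]

```ruby
# Build a single Book "document" whose :content field carries every
# page, so one query against the book index also covers the page text.
book  = { :id => 42, :title => "Lucene in Action" }
pages = [
  { :book_id => 42, :number => 1, :content => "Meet Lucene" },
  { :book_id => 42, :number => 2, :content => "Indexing with Java" },
]

book_doc = {
  :id      => book[:id],
  :title   => book[:title],
  # each page's text is indexed here *and* in its own Page document,
  # which is the "indexing each page twice" Dave mentions
  :content => pages.map { |p| p[:content] }.join("\n"),
}

book_doc[:content]  # => "Meet Lucene\nIndexing with Java"
# in real code, something like: index << book_doc
# (with :max_field_length raised so large books are not truncated)
```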
Hi!

On Mon, Oct 16, 2006 at 04:23:28PM +0900, David Balmain wrote:
> On 10/16/06, Charlie Hubbard <charlie.hubbard at gmail.com> wrote:
[..]
> > I'm interested in your database approach. It could help simplify this problem. It seems doable to add this to acts_as_ferret without needing a separate project. Not to mention it's really needed in Rails apps as well.
>
> In my suggested database approach the search would be the equivalent of a simple SQL join query. By adding a feature like this to acts_as_ferret you'd need to pull all the matching page ids out of the index and perform a much slower SQL query for all books that include those page ids. I'm not sure it is feasible, but I'll leave that decision to the acts_as_ferret developers. The best solution is definitely to index all the pages with the book document, even if it means indexing each page twice.

I'd suggest going that route, too.

An IMHO interesting question around this is how much the size of the value for that pages field, containing all pages of a book, would really influence the total index size (when not storing the contents and not storing term vectors). I.e. will the index size grow in a linear way, or will it grow more slowly over time, since with a bigger field value more terms occur more than once?

Jens

--
webit! Gesellschaft für neue Medien mbH          www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer           kraemer at webit.de
Schnorrstraße 76                                 Tel +49 351 46766 0
D-01069 Dresden                                  Fax +49 351 46766 66
On 10/16/06, Jens Kraemer <kraemer at webit.de> wrote:
> Hi!
>
> On Mon, Oct 16, 2006 at 04:23:28PM +0900, David Balmain wrote:
[..]
> > In my suggested database approach the search would be the equivalent of a simple SQL join query. [..] The best solution is definitely to index all the pages with the book document, even if it means indexing each page twice.
>
> I'd suggest going that route, too.
>
> An IMHO interesting question around this is how much the size of the value for that pages field, containing all pages of a book, would really influence the total index size (when not storing the contents and not storing term vectors). I.e. will the index size grow in a linear way, or will it grow more slowly over time, since with a bigger field value more terms occur more than once?
>
> Jens

That is an interesting question. I haven't done any tests to back this up, but I would guess you are correct. Indexing the content as a single field in Book will take up a lot less space than it would separated into multiple documents as pages. So indexing the field twice as I suggested shouldn't double the size of your index. In fact, if you give the fields the same name (i.e. :content for both Page and Book) then the increase in index size will be negligible.

There will, however, be a noticeable difference in indexing time, but again, it shouldn't be double. As far as search goes, this solution will probably be orders of magnitude better.

Dave
David Balmain wrote:
>> If that's the case then it's doable to create this type of search, but it would make more sense to modify Ferret to support this type of query.
>
> I don't see a way to add this feature cleanly. It is just as easy for you to iterate through all the results yourself. Besides, you still haven't explained why you can't add all Pages to each Book document. As I said, the field length limit isn't an issue. This would be the best way to solve this problem.

There is no reason why I couldn't; I was just trying to figure out a way to avoid it. The big drawback to indexing all the pages into a single field in Book is that I'd have to pick a maximum size for the field up front. I don't have a lot of data yet, but I tried running some tests: a 94-chapter book came out somewhere around 100,000, and that's a smaller book. It's just something you have to watch closely, which is what I was trying to avoid. Right now you're right that the best approach is to store it twice.

> In my suggested database approach the search would be the equivalent of a simple SQL join query. By adding a feature like this to acts_as_ferret you'd need to pull all the matching page ids out of the index and perform a much slower SQL query for all books that include those page ids. I'm not sure it is feasible, but I'll leave that decision to the acts_as_ferret developers. The best solution is definitely to index all the pages with the book document, even if it means indexing each page twice.

I was thinking it would be more like a SQL union. In other words, the query doesn't have to match the Book document in order to be included; it just has to match a Page object. For example, say I have a book titled Lucene in Action. You'd expect a query for "java" to pull that one back, since Java is probably mentioned in the text of that book.

I sort of saw it as a multi_index query, since aaf maps the objects that way, where you'd first query the Book documents, then query the Page documents. But instead of adding those Page documents to the resulting array, they would only add a new entry if their Book wasn't already there. I suppose I could do that in Ruby, but it seems like it might be more optimized if Ferret understood this type of relationship, since it is already iterating over everything anyway.
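[Editor's note: the union Charlie says he "could do in Ruby" is indeed short to write outside Ferret. A sketch, with the two hit lists standing in for the results of the Book-index and Page-index queries; all names and data are hypothetical.]

```ruby
# Results of the two index queries, best-scoring first.
book_hits = [42, 7]                 # Book ids that matched directly
page_hits = [                       # Pages that matched, with their owners
  { :book_id => 7 },
  { :book_id => 99 },
  { :book_id => 42 },
  { :book_id => 13 },
]

# SQL-union semantics: direct Book hits first, then Books discovered
# through their Pages, adding a Book only if it isn't already present.
results = book_hits.dup
page_hits.each do |hit|
  results << hit[:book_id] unless results.include?(hit[:book_id])
end

results  # => [42, 7, 99, 13]
```

As the thread notes, the catch is not this merge itself but that total_hits and offsets now refer to unique books, which Ferret's own :offset/:limit cannot express.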
On 10/16/06, Charlie Hubbard <charlie.hubbard at gmail.com> wrote:
> David Balmain wrote:
> > I don't see a way to add this feature cleanly. It is just as easy for you to iterate through all the results yourself. Besides, you still haven't explained why you can't add all Pages to each Book document. As I said, the field length limit isn't an issue. This would be the best way to solve this problem.
>
> There is no reason why I couldn't; I was just trying to figure out a way to avoid it. The big drawback to indexing all the pages into a single field in Book is that I'd have to pick a maximum size for the field up front. [..] It's just something you have to watch closely, which is what I was trying to avoid. Right now you're right that the best approach is to store it twice.

Set it to Ferret::FIX_INT_MAX. This is the largest number that you can set any of these properties to, and it effectively puts no limit on the field length. I'll add :all as an option at some point.

> I was thinking it would be more like a SQL union. In other words, the query doesn't have to match the Book document in order to be included; it just has to match a Page object. [..] I suppose I could do that in Ruby, but it seems like it might be more optimized if Ferret understood this type of relationship, since it is already iterating over everything anyway.

Trust me, Ferret is complex enough as it is without having to understand relationships between different documents. I need to draw the line somewhere. If I want to add features like this I need to design Ferret from the ground up to be more like a database, which is exactly what I intend to do with the Ferret object database. I hope that makes sense.

Dave
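[Editor's note: what :max_field_length does can be illustrated without Ferret. The indexer simply stops consuming tokens past the limit, so a very large cap never truncates in practice. In this plain-Ruby sketch the constant's value is an assumption (a 32-bit signed max), not something stated in the thread.]

```ruby
FIX_INT_MAX = 2**31 - 1  # assumed stand-in for Ferret::FIX_INT_MAX

# A field capped at max_field_length tokens: everything past the cap
# is simply never indexed.
def tokens_indexed(text, max_field_length)
  text.split.first(max_field_length)
end

content = "page one text " * 5  # pretend this is a whole book (15 tokens)
tokens_indexed(content, 10).size           # => 10 (a small cap truncates)
tokens_indexed(content, FIX_INT_MAX).size  # => 15 (effectively unlimited)
```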