I'm trying to add a way to query across associations for a model in acts_as_ferret. Say I have a model A and it has a relationship with model B; say a Book has many pages. I want to search across the pages of the Book and produce a list of unique books whose pages match the terms. So if I have a page that hits, I will add that book to my list of results. Right now multi_search returns all pages and books that match the query.

This gets difficult with pagination because you can't just keep track of it yourself. Also, the total hits will include all hits on both pages and books. The way pagination works today with Ferret is that you hand it :offset and :limit params. But these are fixed-width params: I could end up with hundreds of pages that all belong to the same book, so I have to skip all of those.

This seems like a different kind of search. Not a multi_search or find_by_contents, but a find_by_association, where a hit on the association returns an object of the associated type.

Is there something in Ferret that allows me to scroll through the results one by one and stop when I've reached my limit?

Charlie

--
Posted via http://www.ruby-forum.com/.
On 10/15/06, Charlie Hubbard <charlie.hubbard at gmail.com> wrote:
> I'm trying to add a way to query across associations for a model in acts_as_ferret. Say I have a model A and it has a relationship with model B; say a Book has many pages. I want to search across the pages of the Book and produce a list of unique books whose pages match the terms. So if I have a page that hits, I will add that book to my list of results. Right now multi_search returns all pages and books that match the query.
>
> This gets difficult with pagination because you can't just keep track of it yourself. Also, the total hits will include all hits on both pages and books. The way pagination works today with Ferret is that you hand it :offset and :limit params. But these are fixed-width params: I could end up with hundreds of pages that all belong to the same book, so I have to skip all of those.
>
> This seems like a different kind of search. Not a multi_search or find_by_contents, but a find_by_association, where a hit on the association returns an object of the associated type.

If I manage to implement the Ferret object database [1] this will be simple. Currently, though, there are two ways to do this. You can index all of the Page data in the Book document, presumably in a :page field. Or you can store the Book ids in the Pages and create a Book id set by scanning through all matching pages.

[1] http://www.ruby-forum.com/topic/82086#142613

> Is there something in Ferret that allows me to scroll through the results one by one and stop when I've reached my limit?

Sure. Set :limit => :all and call search_each, then break when you reach your limit.
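[Editor's note: the break-on-limit pattern suggested above can be sketched in plain Ruby. The hit list here is a stand-in for what `search_each(query, :limit => :all)` would yield (a book_id stored on each matching Page document); the method name and data are illustrative, not from the thread.]

```ruby
# Stand-in for the Ferret hit stream: the book_id stored on each
# matching Page, in score order. In real code these values would come
# from index.search_each(query, :limit => :all) { |doc_id, score| ... }.
def unique_books(page_hits, limit)
  books = []
  seen  = {}
  page_hits.each do |book_id|
    next if seen[book_id]        # this book is already in the results
    seen[book_id] = true
    books << book_id
    break if books.size >= limit # stop early, like breaking out of search_each
  end
  books
end

unique_books([1, 1, 2, 1, 3, 2, 4, 5, 3, 6], 3)  # => [1, 2, 3]
```

This gives the first page of unique books cheaply, but note Charlie's later objection: it says nothing about the total number of unique books.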
David Balmain wrote:
> If I manage to implement the Ferret object database [1] this will be simple. Currently, though, there are two ways to do this. You can index all of the Page data in the Book document, presumably in a :page field. Or you can store the Book ids in the Pages and create a Book id set by scanning through all matching pages.
>
> [1] http://www.ruby-forum.com/topic/82086#142613

The first option has problems because a book's content will be too large for a single field; it would overrun Ferret's maximum field length.

I'm pretty much doing the second option now, but its drawback is that pagination gets tough. I'm not sure how the Ferret object database would actually solve this problem. How would your queries express what the user intends? How would it know I want to include all the Page objects as part of a search on Books? It seems like you'd have to specify that sort of thing as options to the search, like we specify eager loading with the :include option to find.

>> Is there something in Ferret that allows me to scroll through the results one by one and stop when I've reached my limit?
>
> Sure. Set :limit => :all and call search_each, then break when you reach your limit.

That will work for creating a list of Books and making sure I show, say, 10 unique books per page. But I won't be able to tell what the total number of hits was. Any ideas?

It also gets hard to do pagination because you can't compute where the next window starts and ends. So how do you know what the offset parameter for the previous page is, or the offset for the 9th page?

Charlie
On 10/15/06, Charlie Hubbard <charlie.hubbard at gmail.com> wrote:
> David Balmain wrote:
> > If I manage to implement the Ferret object database [1] this will be simple. Currently, though, there are two ways to do this. You can index all of the Page data in the Book document, presumably in a :page field. Or you can store the Book ids in the Pages and create a Book id set by scanning through all matching pages.
> >
> > [1] http://www.ruby-forum.com/topic/82086#142613
>
> The first option has problems because a book's content will be too large for a single field. It would overrun Ferret's maximum field length.

Then change the maximum field length. IndexWriter has a :max_field_length parameter.

> I'm pretty much doing the second option now, but its drawback is that pagination gets tough. I'm not sure how the Ferret object database would actually solve this problem. How would your queries express what the user intends? How would it know I want to include all the Page objects as part of a search on Books? It seems like you'd have to specify that sort of thing as options to the search, like we specify eager loading with the :include option to find.

Well, the user would just type their query as usual but you'd write the query something like:

    Books.find("pages match '#{query}'", :limit => 10)

Or something like that. I haven't worked out the details yet. And you would be able to specify whether you wanted lazy or eager loading too.

> >> Is there something in Ferret that allows me to scroll through the results one by one and stop when I've reached my limit?
> >
> > Sure. Set :limit => :all and call search_each, then break when you reach your limit.
>
> That will work for creating a list of Books and making sure I show, say, 10 unique books per page. But I won't be able to tell what the total number of hits was. Any ideas?
>
> It also gets hard to do pagination because you can't compute where the next window starts and ends. So how do you know what the offset parameter for the previous page is, or the offset for the 9th page?

Scroll through all matches or use option 1.

Cheers,
Dave
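[Editor's note: "scroll through all matches" does answer both of Charlie's questions, at the cost of walking every hit. A plain-Ruby sketch, where the hit list again stands in for the full `search_each(query, :limit => :all)` stream:]

```ruby
# The complete hit stream: book_id stored on each matching Page, in
# score order, as search_each with :limit => :all would yield them.
page_hits = [1, 1, 2, 1, 3, 2, 4, 5, 3, 6, 6, 7]

# Deduplicate the whole stream once. The result gives both the true
# total of unique books and the window for any page number, so
# previous/next offsets fall out naturally.
unique_books = page_hits.uniq

per_page   = 3
total_hits = unique_books.size                 # total unique books: 7
page_two   = unique_books[per_page, per_page]  # second page: [4, 5, 6]
```

The trade-off is that every page render scans the full result set, which is exactly why Dave keeps steering toward option 1 (indexing the pages inside the Book document).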
David Balmain wrote:
> Well, the user would just type their query as usual but you'd write the query something like:
>
>     Books.find("pages match '#{query}'", :limit => 10)
>
> Or something like that. I haven't worked out the details yet. And you would be able to specify whether you wanted lazy or eager loading too.

That's what I guessed you'd have to do: change the query language to support this concept. I was actually working on adding a new method to acts_as_ferret where you could pass these association matches in, like:

    Book.find_by_association(query, [:pages], { :limit => 20 })

since I can't change the query language but can express the same sort of behavior. This would result in a multi_index query across the Book and Page indexes. But tracking total_hits and paging just don't work with this approach; the only option you have is to iterate over all the matches.

When we do Ferret queries, does Ferret actually go over the entire search space to calculate all the possible documents that match the query, and then just return the ones within the offset and limit? If that's the case then it's doable to create this type of search, but it would make more sense to modify Ferret to support this type of query.

I'm interested in your database approach. It could help simplify this problem. It seems doable to add this to acts_as_ferret without needing a separate project. Not to mention it's really needed in Rails apps as well.

Charlie
On 10/16/06, Charlie Hubbard <charlie.hubbard at gmail.com> wrote:
> David Balmain wrote:
> > Well, the user would just type their query as usual but you'd write the query something like:
> >
> >     Books.find("pages match '#{query}'", :limit => 10)
> >
> > Or something like that. I haven't worked out the details yet. And you would be able to specify whether you wanted lazy or eager loading too.
>
> That's what I guessed you'd have to do: change the query language to support this concept. I was actually working on adding a new method to acts_as_ferret where you could pass these association matches in, like:
>
>     Book.find_by_association(query, [:pages], { :limit => 20 })
>
> since I can't change the query language but can express the same sort of behavior. This would result in a multi_index query across the Book and Page indexes. But tracking total_hits and paging just don't work with this approach; the only option you have is to iterate over all the matches.
>
> When we do Ferret queries, does Ferret actually go over the entire search space to calculate all the possible documents that match the query, and then just return the ones within the offset and limit?

Yes, that's exactly how it works.

> If that's the case then it's doable to create this type of search, but it would make more sense to modify Ferret to support this type of query.

I don't see a way to add this feature cleanly. It is just as easy for you to iterate through all the results yourself. Besides, you still haven't explained why you can't add all Pages to each Book document. As I said, the field length limit isn't an issue. This would be the best way to solve this problem.

> I'm interested in your database approach. It could help simplify this problem. It seems doable to add this to acts_as_ferret without needing a separate project. Not to mention it's really needed in Rails apps as well.

In my suggested database approach the search would be the equivalent of a simple SQL join query. By adding a feature like this to acts_as_ferret you'd need to pull all the matching page ids out of the index and perform a much slower SQL query for all books that include those page ids. I'm not sure it is feasible, but I'll leave that decision to the acts_as_ferret developers. The best solution is definitely to index all the pages with the book document, even if it means indexing each page twice.

Cheers,
Dave
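[Editor's note: the denormalization Dave recommends might look something like this at indexing time. The field names, record shapes, and the final `index << book_doc` step are illustrative assumptions; only the idea of concatenating page content into the Book document comes from the thread.]

```ruby
# Build a single Book "document" whose :content field carries every
# page, so one query against the book index also covers the page text.
book  = { :id => 42, :title => "Lucene in Action" }
pages = [
  { :book_id => 42, :number => 1, :content => "Meet Lucene" },
  { :book_id => 42, :number => 2, :content => "Indexing with Java" },
]

book_doc = {
  :id      => book[:id],
  :title   => book[:title],
  # each page's text is indexed here *and* in its own Page document,
  # which is the "indexing each page twice" Dave mentions
  :content => pages.map { |p| p[:content] }.join("\n"),
}

book_doc[:content]  # => "Meet Lucene\nIndexing with Java"
# in real code, something like: index << book_doc
# (with :max_field_length raised so large books are not truncated)
```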
Hi!

On Mon, Oct 16, 2006 at 04:23:28PM +0900, David Balmain wrote:
> On 10/16/06, Charlie Hubbard <charlie.hubbard at gmail.com> wrote:
[..]
> > I'm interested in your database approach. It could help simplify this problem. It seems doable to add this to acts_as_ferret without needing a separate project. Not to mention it's really needed in Rails apps as well.
>
> In my suggested database approach the search would be the equivalent of a simple SQL join query. By adding a feature like this to acts_as_ferret you'd need to pull all the matching page ids out of the index and perform a much slower SQL query for all books that include those page ids. I'm not sure it is feasible, but I'll leave that decision to the acts_as_ferret developers. The best solution is definitely to index all the pages with the book document, even if it means indexing each page twice.

I'd suggest going that route, too.

An IMHO interesting question around this is how much the size of the value for that pages field, containing all pages of a book, would really influence the total index size (when not storing the contents and not storing term vectors). I.e. will the index size grow in a linear way, or will it grow more slowly over time, since with a bigger field value more terms occur more than once?

Jens

--
webit! Gesellschaft für neue Medien mbH          www.webit.de
Dipl.-Wirtschaftsingenieur Jens Krämer           kraemer at webit.de
Schnorrstraße 76                                 Tel +49 351 46766 0
D-01069 Dresden                                  Fax +49 351 46766 66
On 10/16/06, Jens Kraemer <kraemer at webit.de> wrote:
> Hi!
>
> On Mon, Oct 16, 2006 at 04:23:28PM +0900, David Balmain wrote:
[..]
> > In my suggested database approach the search would be the equivalent of a simple SQL join query. [..] The best solution is definitely to index all the pages with the book document, even if it means indexing each page twice.
>
> I'd suggest going that route, too.
>
> An IMHO interesting question around this is how much the size of the value for that pages field, containing all pages of a book, would really influence the total index size (when not storing the contents and not storing term vectors). I.e. will the index size grow in a linear way, or will it grow more slowly over time, since with a bigger field value more terms occur more than once?
>
> Jens

That is an interesting question. I haven't done any tests to back this up, but I would guess you are correct. Indexing the content as a single field in Book will take up a lot less space than it would separated into multiple documents as pages. So indexing the field twice as I suggested shouldn't double the size of your index. In fact, if you give the fields the same name (i.e. :content for both Page and Book) then the increase in index size will be negligible.

There will, however, be a noticeable difference in indexing time, but again, it shouldn't be double. As far as search goes, this solution will probably be orders of magnitude better.

Dave
David Balmain wrote:
>> If that's the case then it's doable to create this type of search, but it would make more sense to modify Ferret to support this type of query.
>
> I don't see a way to add this feature cleanly. It is just as easy for you to iterate through all the results yourself. Besides, you still haven't explained why you can't add all Pages to each Book document. As I said, the field length limit isn't an issue. This would be the best way to solve this problem.

There is no reason why I couldn't; I was just trying to figure out a way to avoid it. The big drawback to indexing all the pages into a single field in Book is that I'd have to pick a maximum size for the field up front. I don't have a lot of data yet, but I tried running some tests: a 94-chapter book came out somewhere around 100,000, and that's a smaller book. It's just something you have to watch closely, which is what I was trying to avoid. Right now you're right that the best approach is to store it twice.

> In my suggested database approach the search would be the equivalent of a simple SQL join query. By adding a feature like this to acts_as_ferret you'd need to pull all the matching page ids out of the index and perform a much slower SQL query for all books that include those page ids. I'm not sure it is feasible, but I'll leave that decision to the acts_as_ferret developers. The best solution is definitely to index all the pages with the book document, even if it means indexing each page twice.

I was thinking it would be more like a SQL union. In other words, the query doesn't have to match the Book document in order to be included; it just has to match a Page object. For example, say I have a book titled Lucene in Action. You'd expect a query for "java" to pull that one back, since Java is probably mentioned in the text of that book.

I sort of saw it as a multi_index query, since aaf maps the objects that way, where you'd first query the Book documents, then query the Page documents. But instead of adding those Page documents to the resulting array, they would only add a new entry if their Book wasn't already there. I suppose I could do that in Ruby, but it seems like it might be more optimized if Ferret understood this type of relationship, since it is already iterating over everything anyway.
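[Editor's note: the union Charlie says he "could do in Ruby" is indeed short to write outside Ferret. A sketch, with the two hit lists standing in for the results of the Book-index and Page-index queries; all names and data are hypothetical.]

```ruby
# Results of the two index queries, best-scoring first.
book_hits = [42, 7]                 # Book ids that matched directly
page_hits = [                       # Pages that matched, with their owners
  { :book_id => 7 },
  { :book_id => 99 },
  { :book_id => 42 },
  { :book_id => 13 },
]

# SQL-union semantics: direct Book hits first, then Books discovered
# through their Pages, adding a Book only if it isn't already present.
results = book_hits.dup
page_hits.each do |hit|
  results << hit[:book_id] unless results.include?(hit[:book_id])
end

results  # => [42, 7, 99, 13]
```

As the thread notes, the catch is not this merge itself but that total_hits and offsets now refer to unique books, which Ferret's own :offset/:limit cannot express.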
On 10/16/06, Charlie Hubbard <charlie.hubbard at gmail.com> wrote:
> David Balmain wrote:
> > I don't see a way to add this feature cleanly. It is just as easy for you to iterate through all the results yourself. Besides, you still haven't explained why you can't add all Pages to each Book document. As I said, the field length limit isn't an issue. This would be the best way to solve this problem.
>
> There is no reason why I couldn't; I was just trying to figure out a way to avoid it. The big drawback to indexing all the pages into a single field in Book is that I'd have to pick a maximum size for the field up front. [..] It's just something you have to watch closely, which is what I was trying to avoid. Right now you're right that the best approach is to store it twice.

Set it to Ferret::FIX_INT_MAX. This is the largest number that you can set any of these properties to, and it effectively puts no limit on the field length. I'll add :all as an option at some point.

> I was thinking it would be more like a SQL union. In other words, the query doesn't have to match the Book document in order to be included; it just has to match a Page object. [..] I suppose I could do that in Ruby, but it seems like it might be more optimized if Ferret understood this type of relationship, since it is already iterating over everything anyway.

Trust me, Ferret is complex enough as it is without having to understand relationships between different documents. I need to draw the line somewhere. If I want to add features like this I need to design Ferret from the ground up to be more like a database, which is exactly what I intend to do with the Ferret object database. I hope that makes sense.

Dave
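[Editor's note: what :max_field_length does can be illustrated without Ferret. The indexer simply stops consuming tokens past the limit, so a very large cap never truncates in practice. In this plain-Ruby sketch the constant's value is an assumption (a 32-bit signed max), not something stated in the thread.]

```ruby
FIX_INT_MAX = 2**31 - 1  # assumed stand-in for Ferret::FIX_INT_MAX

# A field capped at max_field_length tokens: everything past the cap
# is simply never indexed.
def tokens_indexed(text, max_field_length)
  text.split.first(max_field_length)
end

content = "page one text " * 5  # pretend this is a whole book (15 tokens)
tokens_indexed(content, 10).size           # => 10 (a small cap truncates)
tokens_indexed(content, FIX_INT_MAX).size  # => 15 (effectively unlimited)
```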