So, I''m working on a search engine of sorts that restricts results to your local area. I can successfully return all entries within 15 miles of a particular point, and I can successfully return all entries that match a search query, but I''m having trouble combining the two together and doing pagination on them. Basically, for the range query, you do a SQL query that returns all results within +/- 1 latitude/longitude from the point in question, and then you do some spherical trig on each of those results to get only the entries within X number of miles of the point. And for the search query, so far, I''ve been using acts_as_ferret''s find_by_contents method. But now I need to figure out how to take an array of results from the range query, and only do the find_by_contents magic on just the entries in that Array. So far, everything method I''ve thought of looks like it''s going to have performance problems. Any suggestions? -- Posted via http://www.ruby-forum.com/.
On Wed, Jan 31, 2007 at 03:21:31PM +0100, Bob Aman wrote:> So, I''m working on a search engine of sorts that restricts results to > your local area. I can successfully return all entries within 15 miles > of a particular point, and I can successfully return all entries that > match a search query, but I''m having trouble combining the two together > and doing pagination on them. > > Basically, for the range query, you do a SQL query that returns all > results within +/- 1 latitude/longitude from the point in question, and > then you do some spherical trig on each of those results to get only the > entries within X number of miles of the point.wouldn''t it be possible to query the ferret index for all points withing the +/- 1 long/lat range? then you could combine the user''s search terms with that query, and afterwards do the calculations to filter out those points outside the x miles radius.> And for the search query, so far, I''ve been using acts_as_ferret''s > find_by_contents method. But now I need to figure out how to take an > array of results from the range query, and only do the find_by_contents > magic on just the entries in that Array. So far, everything method I''ve > thought of looks like it''s going to have performance problems.In any case I''d first try what works, and then look at the performance of different approaches with a realistical amount of data. That aside, my first guess is that narrowing down the result set with Ferret before doing any spatial calculations would be a good idea. Jens -- webit! Gesellschaft f?r neue Medien mbH www.webit.de Dipl.-Wirtschaftsingenieur Jens Kr?mer kraemer at webit.de Schnorrstra?e 76 Tel +49 351 46766 0 D-01069 Dresden Fax +49 351 46766 66
On Jan 31, 2007, at 6:21 AM, Bob Aman wrote:> And for the search query, so far, I''ve been using acts_as_ferret''s > find_by_contents method. But now I need to figure out how to take an > array of results from the range query, and only do the > find_by_contents > magic on just the entries in that Array. So far, everything method > I''ve > thought of looks like it''s going to have performance problems. > > Any suggestions?Bob- I don''t know what would be involved in using this method with aaf, but I am using this method for a bounding box geo query. If you need a more strict radial search you can use a custom filter with 0.10.x. During index population I''m doing: doc << Field.new("latitude", latitude.to_f + 1000, Field::Store::NO, Field::Index::UNTOKENIZED) doc << Field.new("longitude", longitude.to_f + 1000, Field::Store::NO, Field::Index::UNTOKENIZED) and during the query I''m doing: query << "latitude:[#{box[:lat_min] + 1000} #{box[:lat_max] + 1000}] AND " query << "longitude:[#{box[:lon_min] + 1000} #{box[:lon_max] + 1000}] AND " I have helper methods outside of this scope to handle the min/max that I''m searching in. I also have a complete (yet untested) GeoFilter, but we aren''t using the .10.x Ferret yet so I have no idea if it actually works. Michael-
I could go on for hours about this, but lemme see if I can summarize what I know. Search engines aren''t really optimized to geographic queries (unless they have a built-in geographic index feature). You can make them work, but it''s gonna be a hack. Several things to try: - What you described already, the lat / lon box idea. The issue with that is that it selects the lat first, the lon second. By doing so, it''s like taking a vertical or horizontal stripe of the country and sticking it in a temporary result set. That temporary result set can be HUGE, especially on the US East Coast. Which is why this is slow. - You could, instead, select all the zip codes or postal codes in the area, first, then do your precise calculations . That will be fast if and only if Ferret can handle that many query terms at once. Most search engines can''t, really, but still, this is usually a bit faster than the first solution. The hard part here is computing the set of zip codes you want in the first place. - You could limit individual queries to greater metro areas, first, then do your precise calculations. Two issues here: one, getting that data; two, those areas have borders, so getting coverage is difficult at best. Faster, still, is doing the zip code coverage area solution in a database. The reason this''d be faster is that you can take your set of zip codes covering your search area, and join them to the table in question. Joins are much faster than a list of search query terms. Like I said, the true solution is geographic indexes. Unfortunately, Ferret doesn''t have them. Maybe Lucene does? It''s possible to fake geographic indexes in a non-geographic engine, but it''s really nasty math; I''d only recommend that if you need to bleed every last ounce of speed out of it. Schnitz On 1/31/07, Michael Moen <mi-ferret at moensolutions.com> wrote:> > > On Jan 31, 2007, at 6:21 AM, Bob Aman wrote: > > > And for the search query, so far, I''ve been using acts_as_ferret''s > > find_by_contents method. But now I need to figure out how to take an > > array of results from the range query, and only do the > > find_by_contents > > magic on just the entries in that Array. So far, everything method > > I''ve > > thought of looks like it''s going to have performance problems. > > > > Any suggestions? > > Bob- I don''t know what would be involved in using this method with > aaf, but I am using this method for a bounding box geo query. If you > need a more strict radial search you can use a custom filter with > 0.10.x. > > During index population I''m doing: > > doc << Field.new("latitude", latitude.to_f + 1000, > Field::Store::NO, Field::Index::UNTOKENIZED) > doc << Field.new("longitude", longitude.to_f + 1000, > Field::Store::NO, Field::Index::UNTOKENIZED) > > and during the query I''m doing: > > query << "latitude:[#{box[:lat_min] + 1000} #{box[:lat_max] + 1000}] > AND " > query << "longitude:[#{box[:lon_min] + 1000} #{box[:lon_max] + 1000}] > AND " > > I have helper methods outside of this scope to handle the min/max > that I''m searching in. I also have a complete (yet untested) > GeoFilter, but we aren''t using the .10.x Ferret yet so I have no idea > if it actually works. > > Michael- > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk >-------------- next part -------------- An HTML attachment was scrubbed... URL: http://rubyforge.org/pipermail/ferret-talk/attachments/20070131/61b99233/attachment-0001.html
Oh, and one bit of advice on lat/lon boxes - The order the engine performs the lat & lon query - either lat then lon, or lon then lat - matters. Try swapping the terms in the queries. You want the engine to select the vertical stripe (lon between x and y) first in most of the US, because that has the least chance of picking up other major metro areas. If you''re really slick, though, you''ll switch it up depending on the area of the country, because that vertical stripe is murder on certain parts of the East Coast. (This all assumes you''re in the US, but the principle is the same wherever you are - check your map!) Schnitz On 1/31/07, Matt Schnitz <matt at mattschnitz.com> wrote:> > I could go on for hours about this, but lemme see if I can summarize what > I know. > > Search engines aren''t really optimized to geographic queries (unless they > have a built-in geographic index feature). You can make them work, but it''s > gonna be a hack. > > Several things to try: > - What you described already, the lat / lon box idea. The issue with > that is that it selects the lat first, the lon second. By doing so, it''s > like taking a vertical or horizontal stripe of the country and sticking it > in a temporary result set. That temporary result set can be HUGE, > especially on the US East Coast. Which is why this is slow. > - You could, instead, select all the zip codes or postal codes in the > area, first, then do your precise calculations . That will be fast if and > only if Ferret can handle that many query terms at once. Most search > engines can''t, really, but still, this is usually a bit faster than the > first solution. The hard part here is computing the set of zip codes you > want in the first place. > - You could limit individual queries to greater metro areas, first, then > do your precise calculations. Two issues here: one, getting that data; two, > those areas have borders, so getting coverage is difficult at best. > > Faster, still, is doing the zip code coverage area solution in a > database. The reason this''d be faster is that you can take your set of zip > codes covering your search area, and join them to the table in question. > Joins are much faster than a list of search query terms. > > Like I said, the true solution is geographic indexes. Unfortunately, > Ferret doesn''t have them. Maybe Lucene does? It''s possible to fake > geographic indexes in a non-geographic engine, but it''s really nasty math; > I''d only recommend that if you need to bleed every last ounce of speed out > of it. > > > Schnitz > > On 1/31/07, Michael Moen <mi-ferret at moensolutions.com> wrote: > > > > > > On Jan 31, 2007, at 6:21 AM, Bob Aman wrote: > > > > > And for the search query, so far, I''ve been using acts_as_ferret''s > > > find_by_contents method. But now I need to figure out how to take an > > > array of results from the range query, and only do the > > > find_by_contents > > > magic on just the entries in that Array. So far, everything method > > > I''ve > > > thought of looks like it''s going to have performance problems. > > > > > > Any suggestions? > > > > Bob- I don''t know what would be involved in using this method with > > aaf, but I am using this method for a bounding box geo query. If you > > need a more strict radial search you can use a custom filter with > > 0.10.x. > > > > During index population I''m doing: > > > > doc << Field.new("latitude", latitude.to_f + 1000, > > Field::Store::NO, Field::Index::UNTOKENIZED) > > doc << Field.new("longitude", longitude.to_f + 1000, > > Field::Store::NO, Field::Index::UNTOKENIZED) > > > > and during the query I''m doing: > > > > query << "latitude:[#{box[:lat_min] + 1000} #{box[:lat_max] + 1000}] > > AND " > > query << "longitude:[#{box[:lon_min] + 1000} #{box[:lon_max] + 1000}] > > AND " > > > > I have helper methods outside of this scope to handle the min/max > > that I''m searching in. I also have a complete (yet untested) > > GeoFilter, but we aren''t using the .10.x Ferret yet so I have no idea > > if it actually works. > > > > Michael- > > _______________________________________________ > > Ferret-talk mailing list > > Ferret-talk at rubyforge.org > > http://rubyforge.org/mailman/listinfo/ferret-talk > > > >-------------- next part -------------- An HTML attachment was scrubbed... URL: http://rubyforge.org/pipermail/ferret-talk/attachments/20070131/c3c121f0/attachment.html
> wouldn''t it be possible to query the ferret index for all points withing > the > +/- 1 long/lat range? then you could combine the user''s search terms > with that > query, and afterwards do the calculations to filter out those points > outside > the x miles radius.Hadn''t actually thought of having ferret index the lat/long points. However, I''m pretty sure I don''t want to go down that route. However, while I was waiting for an answer, I think I figured out how to do this query in the fastest way possible. In this case, it would be optimized specifically for my app, and for this one query, but that''s fine by me because the app only has one thing you ever search against. The way things are set up, you have a single location in the DB for each subscriber. Each location can have multiple entries. The geoquery is actually done against the locations, not the entries. So there''s a LOT fewer of them. You might have 10-15 locations in a city, max, but each location could have hundreds of entries. So you do the geoquery only against the locations, and you get an array of ids for locations within range of the point back. Now you just change the indexed entries to also include the ids of the locations they''re associated with. When you do the search, you include the list of ids that it can match against as part of the query. That should be possible, correct? I haven''t sat down and worked out exactly what the ferret query syntax would look like for that though. -- Posted via http://www.ruby-forum.com/.
That''s a good plan. Beware the number of search terms, though. Typically search engines slow down pretty quick the more you add search terms onto the query. It creates a lot of merging work on the back-end of the query execution. Schnitz On 1/31/07, Bob Aman <bob at sporkmonger.com> wrote:> > > wouldn''t it be possible to query the ferret index for all points withing > > the > > +/- 1 long/lat range? then you could combine the user''s search terms > > with that > > query, and afterwards do the calculations to filter out those points > > outside > > the x miles radius. > > Hadn''t actually thought of having ferret index the lat/long points. > However, I''m pretty sure I don''t want to go down that route. > > However, while I was waiting for an answer, I think I figured out how to > do this query in the fastest way possible. In this case, it would be > optimized specifically for my app, and for this one query, but that''s > fine by me because the app only has one thing you ever search against. > > The way things are set up, you have a single location in the DB for each > subscriber. Each location can have multiple entries. The geoquery is > actually done against the locations, not the entries. So there''s a LOT > fewer of them. You might have 10-15 locations in a city, max, but each > location could have hundreds of entries. So you do the geoquery only > against the locations, and you get an array of ids for locations within > range of the point back. Now you just change the indexed entries to > also include the ids of the locations they''re associated with. When you > do the search, you include the list of ids that it can match against as > part of the query. > > That should be possible, correct? > > I haven''t sat down and worked out exactly what the ferret query syntax > would look like for that though. > > -- > Posted via http://www.ruby-forum.com/. > _______________________________________________ > Ferret-talk mailing list > Ferret-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/ferret-talk >-------------- next part -------------- An HTML attachment was scrubbed... URL: http://rubyforge.org/pipermail/ferret-talk/attachments/20070131/144680f0/attachment.html