David Balmain
2007-Apr-06 07:45 UTC
[Ferret-talk] [VOTE] Should stop-words be filtered by default?
Hey folks,

A lot of confusion has been caused by having stop-words filtered by the default analyzer. There have been a few suggestions to remove this feature so I thought I'd put it to a vote. Making this change would not be backwards compatible and would require users to either rebuild their indexes or change their code to keep the same stop-words settings. However, it would save a lot of confusion for people starting out with Ferret.

So what do people think? Should stop-words be filtered by default?

--
Dave Balmain
http://www.davebalmain.com/
Benjamin Krause
2007-Apr-06 08:23 UTC
[Ferret-talk] [VOTE] Should stop-words be filtered by default?
On Apr 6, 2007, at 09:45, David Balmain wrote:
> So what do people think? Should stop-words be filtered by default?

Yes, but I guess the problem is that most people don't know about analyzers and therefore will not see the relation between stop-words and the standard analyzer.

AFAIC, the default behavior should be with stop-word filtering, because searching is about full-text search in the first place. As soon as you want to do special things like searching for names, you should understand Ferret's fields and analyzers, because if you don't, any result will be coincidental anyway.

Ben
Florian Gilcher
2007-Apr-06 10:06 UTC
[Ferret-talk] [VOTE] Should stop-words be filtered by default?
+1

Mostly it's the typical 'it cannot be so hard, there is a plugin for it' problem - if you do text searches, you should know the basics (and stop-words). This problem even has the first place in the gotchas list. I hate to answer with 'RTFM' but this is one of the cases where it applies. We are talking about removing a sane default just because no one gets it.

Greetings
Florian

Benjamin Krause wrote:
> On Apr 6, 2007, at 09:45, David Balmain wrote:
>
>> So what do people think? Should stop-words be filtered by default?
>
> yes.. but i guess the problem is, that most people doesn't know about
> analyzers and therefore will not see the relation between stop-words
> and the standard analyzer.
>
> afaic, the default behavior should be with stop-word-filtering, because
> searching is about full-text-search in the first place. as soon as
> you want to do special things like searching for names, you should
> understand ferrets fields and analyzers, because if you don't any
> result will be coincidental anyway.
>
> Ben
François Beausoleil
2007-Apr-06 12:52 UTC
[Ferret-talk] [VOTE] Should stop-words be filtered by default?
Hello David,

2007/4/6, David Balmain <dbalmain.ml at gmail.com>:
> So what do people think? Should stop-words be filtered by default?

I too would prefer that stop words be filtered by default. If you don't know what you're doing, then it's just normal that you would be bitten by your work.

Bye !
--
François Beausoleil
http://blog.teksol.info/
http://piston.rubyforge.org/
Marvin Humphrey
2007-Apr-06 13:14 UTC
[Ferret-talk] [VOTE] Should stop-words be filtered by default?
On Apr 6, 2007, at 12:45 AM, David Balmain wrote:
> Should stop-words be filtered by default?

-1

Queries return more relevant documents on average if you don't filter stopwords. The default setting should be the one that produces the best search results. Adding a stop filter should be part of performance tuning.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
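One way filtering can cost matches is when a query consists entirely of stop words. The following is a minimal sketch of that case in plain Python, not Ferret's API: the `analyze` helper and the `STOP_WORDS` set are hypothetical, and the stop list is hand-picked for illustration rather than taken from Ferret. It shows a single failure mode only and does not measure average relevance.

```python
# Illustrative only: a tiny hand-picked stop list, not Ferret's actual list.
STOP_WORDS = {"a", "and", "be", "is", "not", "or", "the", "to"}


def analyze(text, stop_words=frozenset()):
    """Lowercase, split on whitespace, and drop any stop words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


query = "to be or not to be"

unfiltered = analyze(query)            # no stop list: every term survives
filtered = analyze(query, STOP_WORDS)  # with the stop list: nothing is left

print(unfiltered)
print(filtered)
```

With the stop list applied, the analyzed query is empty, so there is nothing left to match against the index.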
Erik Hatcher
2007-Apr-06 13:39 UTC
[Ferret-talk] [VOTE] Should stop-words be filtered by default?
I concur with Marvin on this point. It is very often confusing even for me when using Lucene and the StandardAnalyzer with stop words removed by default. The list is English and thus biased and often wrong anyway. Less magic by default :)

Erik

On Apr 6, 2007, at 9:14 AM, Marvin Humphrey wrote:
>
> On Apr 6, 2007, at 12:45 AM, David Balmain wrote:
>
>> Should stop-words be filtered by default?
>
> -1
>
> Queries return more relevant documents on average if you don't filter
> stopwords. The default setting should be the one that produces the
> best search results. Adding a stop filter should be part of
> performance tuning.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
Jonathan Weiss
2007-Apr-06 13:54 UTC
[Ferret-talk] [VOTE] Should stop-words be filtered by default?
> The list is English and thus biased and often
> wrong anyway. Less magic by default :)

+1

Jonathan
--
Jonathan Weiss
http://blog.innerewut.de
Alex Young
2007-Apr-06 16:41 UTC
[Ferret-talk] [VOTE] Should stop-words be filtered by default?
Erik Hatcher wrote:
> I concur with Marvin on this point. It is very often confusing even
> for me when using Lucene and the StandardAnalyzer with stop words
> removed by default. The list is English and thus biased and often
> wrong anyway. Less magic by default :)

Stop-words are a form of optimisation. Premature optimisation is evil. Therefore applying stopwords by default (that is, before one knows anything about the performance and space constraints of the specific application context) is evil. Having them available, should one need to reduce the index size, is extremely useful, but my vote goes to switching them off by default.

--
Alex
William Morgan
2007-Apr-06 16:57 UTC
[Ferret-talk] [VOTE] Should stop-words be filtered by default?
Excerpts from David Balmain's message of Fri Apr 06 00:45:42 -0700 2007:
> So what do people think? Should stop-words be filtered by default?

I also vote to turn them off by default. Their usefulness to retrieval performance is limited to specific and uncommon situations, whereas their ability to confuse people is not.

--
William <wmorgan-ferret at masanjin.net>
Tiago Macedo
2007-Apr-06 17:26 UTC
[Ferret-talk] [VOTE] Should stop-words be filtered by default?
I also believe the default behaviour should be filtering the stop words. It's not like it's hard to change it.

Tiago Macedo

Florian Gilcher wrote:
> +1
>
> Mostly it's the typical 'it cannot be so hard, there is a plugin for it'
> problem - if you do text searches, you should know the basics (and
> stop-words). This problem even has the first place in the gotchas list.
> I hate to answer with 'RTFM' but this is one of the cases where it
> applies. We are talking about removing a sane default just because no
> one gets it.
>
> Greetings
> Florian
Andreas Korth
2007-Apr-07 10:11 UTC
[Ferret-talk] [VOTE] Should stop-words be filtered by default?
On Apr 6, 2007, at 6:57 PM, William Morgan wrote:
> Excerpts from David Balmain's message of Fri Apr 06 00:45:42 -0700 2007:
>> So what do people think? Should stop-words be filtered by default?
>
> I also vote to turn them off by default. Their usefulness to retrieval
> performance is limited to specific and uncommon situations, whereas
> their ability to confuse people is not.

Very well put. I, too, vote for off.

(gee, wasn't I the one who started this? =)
Florian Gilcher
2007-Apr-08 13:06 UTC
[Ferret-talk] [VOTE] Should stop-words be filtered by default?
William Morgan wrote:
> Excerpts from David Balmain's message of Fri Apr 06 00:45:42 -0700 2007:
>> So what do people think? Should stop-words be filtered by default?
>
> I also vote to turn them off by default. Their usefulness to retrieval
> performance is limited to specific and uncommon situations, whereas
> their ability to confuse people is not.

Do you have any proof for this assumption? Every fulltext search I use has a stopword list by default. MySQL FULLTEXT, for example, even needs to be recompiled if you want to change them.

I also want to argue that the use of stopwords is very common. For example, if I have an index of 1,000 English documents and search for 'and', chances are high that I get a result set of 1,000 hits - which is unusable. I am unable to see the corner case in this scenario.

We are not talking about performance here - we are talking about sane results. Stopwords are more of a result optimization than a performance optimization. If you want to query phrases, it would be wise to use Ferret's phrase query instead of killing the stopwords.

I cannot find it at the moment, but there was the point that 'premature' optimization is bad. This may be wise for your own application, but the libraries in use should be a) mature and b) optimized.

Greetings
Florian
Alex Young
2007-Apr-08 16:24 UTC
[Ferret-talk] [VOTE] Should stop-words be filtered by default?
Florian Gilcher wrote:
> Every fulltext search I use
> has a stopword-list by default. Mysql FULLTEXT for example even needs to
> be recompiled if you want to change them.

This is a massive, massive drawback. For web-apps on shared hosts, in the past I've had to resort to appending characters to each word to evade stop-word and minimum-length filtering, precisely because of this inane default, and you can imagine what that does to performance.

> I also want to argue that the
> use of stopwords is very common.

That doesn't make it correct. I see enough queries on this list alone from people surprised by the stop-word behaviour, or needing to change it because they need to support a language other than English, to believe that they should be dropped by default.

> For example, if I have an index of
> 1.000 english documents and search for 'and', chances are high that I
> get a result set of 1000 hits - which is unusable.

So what? The inverse isn't usable either - if 'and' is a stop-word, and you only search for 'and', you'll get no results at all.

> Stopwords are more of a result than an performance optimization.

That's just not the case - stop-words exist primarily to reduce the index size. Their effect on the result set is a product of the way you construct a stop-word list - by picking the words which impart the smallest amounts of information.

> I cannot find it at the moment, but there was the point that 'premature'
> optimization is bad. This may be wise for your own application, but the
> libraries in use should be a) mature and b) optimized.

I believe that point was mine. However, I was not referring to performance - traditionally stop-words have been used as a storage space reduction strategy, with typical results being a reduction in index size of between 20 and 30 percent. There may well be a correlated performance bump, but that's tangential.

I'm not arguing that stop-words should not be available if you want them. I'm not even arguing against supplying a decent set of stop-words for as many different languages as possible. I am trying to argue that they should not be turned on by default.

--
Alex
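Both sides of the 'and' example, and the index-size argument, can be seen on a toy inverted index. This is a rough sketch in plain Python under stated assumptions, not Ferret's API: `build_index`, `search`, `posting_count`, the three sample documents, and the stop list are all made up for illustration, and the size difference on three short strings says nothing about the 20 to 30 percent figure for real collections.

```python
from collections import defaultdict

# Illustrative only: a tiny hand-picked stop list, not Ferret's actual list.
STOP_WORDS = {"a", "and", "the", "of", "to"}

# Hypothetical sample documents.
DOCS = {
    1: "the cat and the dog",
    2: "a history of rock and roll",
    3: "salt and pepper",
}


def build_index(docs, stop_words=frozenset()):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            if term not in stop_words:
                index[term].add(doc_id)
    return index


def search(index, term):
    """Return the sorted ids of documents containing a single term."""
    return sorted(index.get(term.lower(), set()))


def posting_count(index):
    """Total number of (term, document) entries stored in the index."""
    return sum(len(ids) for ids in index.values())


full = build_index(DOCS)
filtered = build_index(DOCS, STOP_WORDS)

print(search(full, "and"))      # every document matches
print(search(filtered, "and"))  # no matches: the term was never indexed
print(posting_count(full), posting_count(filtered))
```

Without filtering, 'and' matches every document; with filtering, it matches none, and the filtered index stores fewer postings. Which of those trade-offs should be the default is the question being voted on.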
Chad Thatcher
2007-Apr-08 20:22 UTC
[Ferret-talk] [VOTE] Should stop-words be filtered by default?
I vote to take it out as a default. I will surely continue to use it but only in certain places and it would be nicer to switch it on rather than have to remember to switch it off.

--
Posted via http://www.ruby-forum.com/.
David Balmain wrote:
> Hey folks,
>
> A lot of confusion has been caused by having stop-words filtered by
> the default analyzer. There have been a few suggestions to remove this
> feature so I thought I'd put it to a vote. Making this change would
> not be backwards compatible and would require users to either rebuild
> their indexes or change their code to keep the same stop-words
> settings. However, it would save a lot of confusion for people
> starting out with Ferret.
>
> So what do people think? Should stop-words be filtered by default?

I suggest that we keep stop-word filtering as the default. Users who try this plugin in their app will want to experiment with it, and they will learn about the feature through self-learning or the FAQ here :). If you turn filtering off by default, some of them may not realize the feature exists and will implement their own filtering to clean up the query.

--
Posted via http://www.ruby-forum.com/.
David Balmain
2007-Apr-09 04:39 UTC
[Ferret-talk] [VOTE] Should stop-words be filtered by default?
On 4/9/07, Alex Young <alex at blackkettle.org> wrote:
> I'm not arguing that stop-words should not be available if you want
> them. I'm not even arguing against supplying a decent set of stop-words
> for as many different languages as possible. I am trying to argue that
> they should not be turned on by default.

Well, it looks like the people who want stop-word filtering off by default have the slight edge (at 7 to 5 by my count). I will probably change this default one day. However, I don't think it is important enough to change now, as it would force a lot of Ferret users to rebuild their indexes. I'll wait until there is a more important update already forcing users to rebuild their indexes. Perhaps Ferret 2.0? ;-)

Thanks for your input guys.

Dave
--
Dave Balmain
http://www.davebalmain.com/
Andreas Korth
2007-Apr-09 15:26 UTC
[Ferret-talk] [VOTE] Should stop-words be filtered by default?
On Apr 8, 2007, at 6:24 PM, Alex Young wrote:
> I'm not arguing that stop-words should not be available if you want
> them. I'm not even arguing against supplying a decent set of
> stop-words for as many different languages as possible. I am trying
> to argue that they should not be turned on by default.

Full f****** ack! :P

--
Andy