Hey all,

I went through the docs on Ferret's page, plus a quick search through the mailing list (thread titles), and I couldn't find any info on how to have Ferret store its data using UTF-8.

In the scenario where I would use it, nothing is being stored outside (like in external databases). So it's just how Ferret itself would do it that I'm interested in knowing.

The reason I ask is that I'm deploying a search engine for an application that will probably be searching for text content in Japanese/Chinese *apart* from English. I mention it in case someone has done this before and knows of any pitfalls.

Thanks in advance.

--
Julio C. Ody
http://rootshell.be/~julioody

On 7/12/06, Julio Cesar Ody <julioody at gmail.com> wrote:
> I couldn't find any info on how to have Ferret storing its data
> using UTF-8.
> [...]
> I'm deploying a search engine for an application that will probably
> be searching for text content in Japanese/Chinese *apart* from English.

The core of Ferret is character-encoding agnostic. It treats every string as an array of bytes, so it doesn't matter what you put in. You could store JPEGs in the index if you wanted to.

The analysis section of Ferret is another matter. There are two sets of analyzers: the ASCII analyzers (AsciiWhiteSpaceAnalyzer, AsciiStandardAnalyzer), which are the most robust (no encoding errors raised), and the other analyzers (WhiteSpaceAnalyzer, StandardAnalyzer), which are based on whichever locale you have set. So if your operating system's locale is set to UTF-8, that is how the analyzer will treat any strings you pass through it.

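A minimal sketch of the distinction described above, using only Ruby's standard library rather than Ferret itself. The helper names (`byte_count`, `ascii_tokens`, `unicode_tokens`) and the sample string are made up for illustration, and the regular expressions are only a rough stand-in for what an analyzer does; this is not how Ferret's analyzers are implemented. Also note that in current Ruby each string carries its own encoding, so the sketch does not depend on the OS locale the way the message above describes.

```ruby
# Illustration only: plain Ruby standard library, not Ferret's API or internals.

# Three Japanese characters followed by an ASCII word.
# The \u escapes always produce a UTF-8 encoded string.
SAMPLE = "\u65E5\u672C\u8A9E text"

# Byte view: roughly what an encoding-agnostic store sees.
def byte_count(str)
  str.bytesize
end

# ASCII-only tokenizer (hypothetical helper): treats the input as raw
# bytes and keeps runs of ASCII letters and digits. Bytes outside that
# range are simply skipped, so no encoding is ever interpreted.
def ascii_tokens(str)
  str.b.scan(/[A-Za-z0-9]+/)
end

# Encoding-aware tokenizer (hypothetical helper): interprets the string
# in its own encoding (UTF-8 here), so non-ASCII letters form tokens too.
def unicode_tokens(str)
  str.scan(/[[:alnum:]]+/)
end

puts byte_count(SAMPLE)             # number of bytes
puts SAMPLE.length                  # number of characters
p ascii_tokens(SAMPLE)
p unicode_tokens(SAMPLE)
```

With this sample, the byte-oriented tokenizer only finds the ASCII word, while the encoding-aware one also returns the three-character Japanese run as a token. Splitting Japanese or Chinese text into meaningful words is a separate problem that neither helper attempts.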
David Balmain wrote:
> The core of ferret is character encoding agnostic. It treats all
> strings as an array of bytes so it doesn't matter what you put in. You
> could store JPEGs in the index if you wanted to.

On which subject, I happen to have chucked some bmp files into my index, and was really quite amazed to see them being returned in search results. Not only that, but the results were accurate.

For example, if I have a bmp which contains the word "Sheep" (when viewed as an image) and I search the index for "Sheep", the bmp is returned.

I am adding documents using the standard analyser and file.readlines to add the contents. If I open the bmp in a text editor and search for "Sheep", that word is not contained within the file.

So how come ferret can read the bmp?

Cheers,
Steven

--
Posted via http://www.ruby-forum.com/.

> So how come ferret can read the bmp?

OK, please ignore what must rank as the stupidest question for some time.

"Sheep" was in the file path, and the path is one of the Ferret document fields.

For a minute there, I was excited. :)

Cheers,
Steven

On Wed, 2006-07-12 at 17:23 +0200, steven shingler wrote:
> "Sheep" was in the file path, and the path is one of the Ferret document
> fields.
>
> For a minute there, I was excited. :)

And David was probably scared that ferret had become conscious. :)

Pedro.

Cool. For a minute I wondered whether I should ask if the file was maybe named 'sheep', but then decided that this might offend you ;-)

Great one!

Nonetheless, I've got a question on this subject too. Does anyone have experience with a task like this: a search engine that takes an uploaded image as the query instead of words? Is there something like this already available on the net? A little Google research of mine didn't yield any results. It should also be able to find resized copies of the same image. Background: finding images that are used without the copyright owner's authorization but that won't be found by Google Images or the like because they were renamed.

Cheers,
Jan

On Wed, 2006-07-12 at 17:30 +0200, Jan Prill wrote:
> [...]

See this:

On Wed, 2006-07-12 at 17:30 +0200, Jan Prill wrote:
> Nonetheless I've got a question on this subject too. Has anyone
> experience with a task like this: A searchengine that doesn't use
> words as query objects but an uploaded image?
> [...]
> This should be able to also find resized images of the same kind.

See this:

http://www.imgseek.net/

I've never tried it myself, but it looks like what you meant. It's a desktop app, though.

Pedro.

PS: Sorry for the other empty email. I pressed send by mistake before I was done.

Something like this, but on the net, is what I'm searching for. Thanks for the pointer!

Jan

On 7/13/06, steven shingler <shingler at gmail.com> wrote:
> > So how come ferret can read the bmp?
>
> "Sheep" was in the file path, and the path is one of the Ferret document
> fields.
>
> For a minute there, I was excited. :)

This functionality isn't due until version Ferret-4.0.