I do not know how to make Ruby recognize \w for non-ASCII characters. Let's assume I have some strings in the cp1250 encoding. In Python I can convert them to unicode objects and use regexps on them. Let's compare.

Python:

    >>> txt = 'abc łąka def'  # cp1250 text
    >>> txt
    'abc \xb3\xb9ka def'
    >>> txt = unicode(txt, 'cp1250')  # change to unicode object
    >>> txt
    u'abc \u0142\u0105ka def'
    >>> import re
    >>> re.compile('(\w+)', re.U).findall(txt)
    [u'abc', u'\u0142\u0105ka', u'def']

BINGO! It could find the middle word.

Ruby:

    require 'iconv'
    => true
    txt = 'abc łąka def'
    => "abc \263\271ka def"
    txt = Iconv.conv('utf-8', 'cp1250', txt)
    => "abc \305\202\304\205ka def"
    pattern = Regexp.new(/(\w+)/, true, 'U')
    txt.scan(pattern)
    => [["abc"], ["ka"], ["def"]]

Failed! :( Ruby cannot find non-ASCII characters. It cannot see them at all (sic!)

Is there any solution for Ruby?

_______________________________________________
Rails mailing list
Rails-1W37MKcQCpIf0INCOvqR/iCwEArCW2h5@public.gmane.org
http://lists.rubyonrails.org/mailman/listinfo/rails
Do this first:

    $KCODE='u'
    require 'jcode'

then

> require 'iconv'
> => true
> txt = 'abc łąka def'
> => "abc \263\271ka def"
> txt = Iconv.conv('utf-8', 'cp1250', txt)
> => "abc \305\202\304\205ka def"
> pattern = Regexp.new(/(\w+)/, true, 'U')
> txt.scan(pattern)

    => [["abc"], ["┼é─àka"], ["def"]]

Was that what you were after?

--
Alex
On 10-Oct-2005, at 22:32, Jaroslaw Zabiello wrote:

> Failed! :( Ruby cannot find non-ASCII characters. It cannot see them
> at all (sic!)
>
> Is there any solution for Ruby?

1. Convert your strings once and for all. Get rid of them. Go Unicode. 8-bit is EVIL. Let it die in flames in your backyard. Usually it's 3 commands in the terminal.

2. $KCODE = 'u'
   require 'jcode'
   Some advise that jcode is bad, but I can't even anticipate all the hoops you have to jump through if you don't use it.

3. "stuff" =~ /uff/u
   => 2
   Don't forget the /u key. It works quite well, and I ported an HTML beautification library from PHP using that.

N.B. Don't expect anything to work with old 8-bit charsets. I hope they will all break one day so that nobody dares to use them anymore.

--
Julian "Julik" Tarkhanov
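For readers on a newer interpreter, the "convert once, then match as Unicode" advice above can be sketched as follows. This is only a sketch, and it assumes Ruby 2.0 or later, which postdates this thread: there each string carries its own encoding, so `$KCODE`, `jcode` and `Iconv` are not used and `String#encode` does the transcoding. The helper name `words_from_cp1250` is hypothetical. `\p{L}` (any Unicode letter) is used instead of `\w`, because the Regexp documentation for those versions describes `\w` as `[a-zA-Z0-9_]`.

```ruby
# Sketch only: assumes Ruby 2.0+ (not the Ruby 1.8 used in this thread).
# words_from_cp1250 is a hypothetical helper name.

# Takes raw cp1250 bytes, transcodes them to UTF-8 once,
# then extracts runs of Unicode letters.
def words_from_cp1250(bytes)
  # dup so the caller's string is not relabelled in place
  utf8 = bytes.dup.force_encoding('Windows-1250').encode('UTF-8')
  utf8.scan(/\p{L}+/)
end

raw = "abc \xB3\xB9ka def".b   # the cp1250 bytes shown in the question
p words_from_cp1250(raw)       # => ["abc", "łąka", "def"]
```

The transcoding happens in one place, at the boundary where the 8-bit data enters the program; everything after that works on UTF-8 strings only.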
2005/10/10, Alex Young <alex@blackkettle.org>:
...
> Was that what you were after?

My real problems begin when I want to find words with mixed lower and upper case. Once again.

Python:

    import re
    s = 'abcd łąka def'
    u = unicode(s, 'cp1250')
    re.compile(unicode('łąka', 'cp1250'), re.U|re.I).findall(u)
    [u'\u0142\u0105ka']

and

    re.compile(unicode('łĄka', 'cp1250'), re.U|re.I).findall(u)
    [u'\u0142\u0105ka']

Lower or upper case does not matter. This is what I wanted. I need that feature for my text searcher: I want to display searched phrases in the text even if they differ in the upper/lower case of some characters.

Ruby:

    $KCODE='u'
    require 'jcode'
    require 'iconv'
    s = 'abc łąka def'
    u = Iconv.conv('utf-8', 'cp1250', s)
    pattern = Regexp.new(Iconv.conv('utf-8', 'cp1250', 'łąka'), true, 'U')
    u.scan(pattern)
    => ["Ĺ‚Ä…ka"]
    pattern = Regexp.new(Iconv.conv('utf-8', 'cp1250', 'łĄka'), true, 'U')
    u.scan(pattern)
    => []

Failed :( It could not find the phrase when I changed one non-ASCII letter from lower to upper case.

--
JZ
On 11-Oct-2005, at 1:00, Jaroslaw Zabiello wrote:

> 2005/10/10, Alex Young <alex-qV/boFbD8Meu8LGVeLuP/g@public.gmane.org>:
> ...
>
>> Was that what you were after?
>
> My real problems begin when I want to find words with mixed lower and
> upper case. Once again.

    $KCODE = 'u'
    require 'jcode'

    "SomeWicKEDШтучКА" =~ /учк/ui

Please read Pickaxe. If you need to compile a regex with strings in it, you can use the literal form of /foo#{method_call_or_whatever}/ui

--
Julian "Julik" Tarkhanov
2005/10/11, Julian 'Julik' Tarkhanov <listbox@julik.nl>:
> $KCODE = 'u'
> require 'jcode'
>
> "SomeWicKEDШтучКА" =~ /учк/ui
>
> Please read Pickaxe.
> If you need to compile a regex with strings in it you can use the
> literal form of
>
> /foo#{method_call_or_whatever}/ui

That would not be enough, I am afraid. I checked your example:

    $KCODE = 'u'
    require 'jcode'
    if "SomeWicKEDШтучКА" =~ /учк/ui
      puts 'FOUND'
    else
      puts 'NOT FOUND'
    end

The result is "NOT FOUND" :(

--
JZ
On 11-Oct-2005, at 9:37, Jaroslaw Zabiello wrote:

> $KCODE = 'u'
> require 'jcode'
>
> if "SomeWicKEDШтучКА" =~ /учк/ui
>   puts 'FOUND'
> else
>   puts 'NOT FOUND'
> end

Oh crap, I am drowning in my own arrogance. Indeed, the /i key does not work for Unicode (not in Ruby and not in - surprise - PHP). I wonder how many ways exist for a scripting language to break Unicode - for now only Perl and Python at least seem to do it properly. I hope (sigh) that this is top-priority for YARV instead of (sigh) reimplementing block scopes and whatnot.

What you really need is this:

1. Get the Unicode gem. http://www.yoshidam.net/Ruby.html#unicode
2. require 'unicode'; Unicode::downcase(string).sub(/your_pattern/u) etc - insert replacements into the text at the positions where your pattern is found.

Unicode::capitalize(), Unicode::upcase() and the rest are not to be missed.

If you intend to distribute your application, you can make a stub for those who don't have Unicode available:

    begin
      require 'unicode'
    rescue LoadError
      RAILS_DEFAULT_LOGGER.error "You don't have Unicode gem installed, capitalizations will be single-byte and case conversion might be broken"
      module Unicode
        def self.method_missing(meth, *args)
          args[0].send(meth)
        end
      end
    end

Unicode support in Ruby is almost non-existent.

--
Julian "Julik" Tarkhanov
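The downcase-then-match workaround above can be sketched like this. It is an illustration under stated assumptions, not the Unicode gem's API: it assumes Ruby 2.4 or later, where the built-in `String#downcase` handles non-ASCII letters and is used here in place of `Unicode::downcase`, and `find_phrase` is a hypothetical helper name. It also assumes that downcasing does not change the number of characters, which holds for the Polish and Russian samples in this thread but not for every Unicode character.

```ruby
# Sketch only: assumes Ruby 2.4+, whose String#downcase is Unicode-aware.
# On Ruby 1.8 the thread's suggestion is Unicode::downcase from the Unicode gem.
# find_phrase is a hypothetical helper name.

# Returns the character offsets at which phrase occurs in text, ignoring case.
# Offsets are found in a downcased copy, so they are valid for the original
# text only while downcasing keeps the character count unchanged.
def find_phrase(text, phrase)
  haystack = text.downcase
  needle = phrase.downcase
  offsets = []
  pos = 0
  while (idx = haystack.index(needle, pos))
    offsets << idx
    pos = idx + 1
  end
  offsets
end

text = 'abc łąka def'
query = 'łĄka'
find_phrase(text, query).each do |i|
  # Slice the original text so the match keeps its original case.
  puts "found #{text[i, query.length].inspect} at #{i}"
end
# prints: found "łąka" at 4
```

Searching a downcased copy while slicing the original is what lets a text searcher display the phrase exactly as it was written, whatever case the user typed.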
2005/10/11, Julian 'Julik' Tarkhanov <listbox-RY+snkucC20@public.gmane.org>:
> What you really need is this..
> 1. Get the Unicode gem. http://www.yoshidam.net/Ruby.html#unicode

But I am using a WinXP box and I would need a binary library... :(

> Unicode support in Ruby is almost non-existent

That is very bad news, because I have started building an application where Unicode plays a very important role.

--
JZ
On 11-Oct-2005, at 17:56, Jaroslaw Zabiello wrote:

> 2005/10/11, Julian 'Julik' Tarkhanov <listbox-RY+snkucC20@public.gmane.org>:
>
>> What you really need is this..
>> 1. Get the Unicode gem. http://www.yoshidam.net/Ruby.html#unicode
>
> But I am using WinXP box and I would need binary library... :(
>
>> Unicode support in Ruby is almost non-existent
>
> It's a very bad news because I started to make application where
> unicode plays very important role.

Well, I guess it is. But I am using the Unicode gem and I am happily deploying applications that use Russian strings. Everywhere. You also have the Locale module, which gives you localized date and currency formatting etc. Together with the Unicode gem and Iconv, there has been nothing yet that I wasn't able to tackle.

If you still want to go the Rails route, I think you can find folks on ruby-lang who will help you compile the needed libs. Or go UNIX.

--
Julian "Julik" Tarkhanov