Hello!

I use MySQL and have made sure it is set to UTF-8, and the character set in my view is also UTF-8. But when I display, in a textarea, text that came from either antiword.exe or the WIN32OLE output of an MS Word document, the text stops showing immediately after a strange character that appears in the Rails console as \267. I went back to Word to see what it is (I looked it up by its position): it is a dot floating in the middle of the line, like the one used when citing chapters of the Bible, e.g. 12-7[dot]Matthew.

For example:

Rails console:
    >> doc = "This is a pipe, but \267 this is not a pipe"

HTML:
    <p>
    This is a pipe, but
    </p>

It just STOPS rendering the rest of the text.

I can't possibly ask my clients to remove that character for my convenience. I have been on a 38-hour hunt for a solution.

Some say to remove everything that matches [^[:print:]]. I can do that and find a way to at least preserve the \n and \r characters. But I also want to preserve as much of the original document as possible. I mean, what if they use umlauts, like the o with two dots on top?

Any ideas?

Thank You!
You could try...

    require 'iconv'

    clean_str = Iconv.new('UTF-8//IGNORE', 'UTF-8').iconv(messy_str)

It doesn't always work, though; you might need to rescue Iconv::InvalidCharacter. Still, it's worth a try, and it has gotten me out of some of this mess with bad source data.

On Jun 23, 2009, at 1:03 PM, Nik wrote:
> Text fail to show immediately after a strange character that
> shows up in rails console as \267.
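For readers on a newer Ruby, where Iconv is no longer bundled, the same "drop the bad bytes" idea can be sketched with String#scrub. This is only an illustration, assuming Ruby 2.1 or later; the sample string is made up to mimic the one in the question, and dropping bytes loses the dot rather than preserving it.

```ruby
# Simulate raw bytes as they might come back from an external program:
# plain ASCII plus one stray byte 0xB7 (octal \267).
messy_str = "This is a pipe, but \xB7 this is not a pipe".b

# Tell Ruby to treat the bytes as UTF-8; this call changes no bytes.
messy_str.force_encoding(Encoding::UTF_8)
puts messy_str.valid_encoding?   # the lone 0xB7 makes the string invalid UTF-8

# String#scrub replaces every invalid byte sequence. An empty replacement
# simply drops them, similar in spirit to iconv's //IGNORE.
clean_str = messy_str.scrub("")
puts clean_str.valid_encoding?
puts clean_str
```

Note that the text after the bad byte survives, but the character itself is gone, so this is a last resort when the real source encoding is unknown.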
Thanks, Philip, for your help!

I just tried it and it works great! It displays that dot thing. But because none of my regular expressions account for these characters, some of them now fail where these characters appear.

1 - I hardly know what the right question to ask is, but what do you call \267? Is it hex, octal, or decimal?

2 - How can I match that character, \267 (the 'dot', as I call it)? Does it belong to a named character class?

3 - Lastly, by what char code or other means can I systematically identify accented characters, such as the grave accent in French?

Thank You!

On Jun 23, 4:21 pm, Philip Hallstrom wrote:
> clean_str = Iconv.new('UTF-8//Ignore', 'UTF-8').iconv(messy_str)
You really need to translate the character encoding of that data. Rails is assuming that it's UTF-8 when, from your description of the character, it's either Windows-1252 or (possibly) ISO-8859-1. Your previous problem was the default UTF-8 parser giving up: \267 (B7 hex) is a continuation byte, so in UTF-8 it is only valid as a trailing byte of a multibyte sequence, never on its own.

--Matt Jones

On Jun 23, 6:46 pm, Nik wrote:
> I just tried it and it works great! It display that dot thing. But
> then because all of my regular expressions did not account for these
> characters and some fail at where these characters appear.
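The point about \267 being valid only inside a multibyte sequence can be checked directly. The following is a small sketch, assuming Ruby 1.9 or later (where strings carry an encoding); the variable names are just for illustration.

```ruby
byte = 0xB7
puts byte.to_s(8)   # octal digits, i.e. the \267 shown by the console
puts byte.to_s(2)   # bit pattern of the byte

# UTF-8 continuation bytes always have the form 10xxxxxx.
continuation = (byte & 0b1100_0000) == 0b1000_0000
puts continuation

# The same byte on its own, and as the second byte of a two-byte sequence.
alone   = "\xB7".b.force_encoding(Encoding::UTF_8)
in_pair = "\xC2\xB7".b.force_encoding(Encoding::UTF_8)

puts alone.valid_encoding?     # a lone continuation byte is not valid UTF-8
puts in_pair.valid_encoding?   # C2 B7 is the UTF-8 encoding of U+00B7
puts in_pair == "\u00B7"
```

So the data is not "slightly broken UTF-8"; it is well-formed text in a single-byte encoding that merely looks broken when read as UTF-8.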
On Jun 23, 2009, at 3:46 PM, Nik wrote:
> 1 - What do I know even what the right question to ask is... But what
> do you call \267 Is this that hex character business or octal,
> decimal?

It's an octal escape: \267 is the single byte 183 decimal, B7 hex. In Windows-1252 and ISO-8859-1 that byte is the middle dot. In UTF-8 the same character is multi-byte (\302\267) but still a single character.

> And 2 - Just like that character \267 or 'dot' as I call it, how can I
> match it? And does it have a class name?

By matching \267 in your regular expression yourself (or \302\267 once the text has been converted to UTF-8). This might give some insight:

http://interglacial.com/~sburke/tpj/as_html/tpj22.html

> Lastly, 3 - and what charcode or other means can I systematically
> identify the accentuated characters as in the accent grave in French.

If the char code is over 127, then it's not plain ASCII.

You might also find this plugin useful: http://github.com/rsl/stringex/tree. It will try to turn all that stuff into plain ASCII. You'll lose the accents, etc., but that might be okay for what you're doing.
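The three answers above can be tried out in a few lines. This is a sketch assuming Ruby 1.9 or later and text that has already been converted to UTF-8; the sample sentence is invented for the example.

```ruby
# 1. "\267" is an octal escape for exactly one byte.
raw = "\267".b
puts raw.bytes.inspect
puts 0o267 == 0xB7

# A made-up UTF-8 sample containing accented letters and the middle dot.
text = "caf\u00E9 \u00B7 \u00E0 la carte"

# 2. In UTF-8 text, match the middle dot by its code point.
dot_index = text =~ /\u00B7/

# 3. Anything outside ASCII (char code above 127) ...
non_ascii = text.scan(/[^[:ascii:]]/)

# ... and, of those, only the ones Unicode classifies as letters.
accented_letters = non_ascii.select { |c| c =~ /\p{L}/ }

puts dot_index
puts non_ascii.inspect
puts accented_letters.inspect
```

The letter test keeps accented characters while leaving out punctuation such as the middle dot, which is one way to tell the two apart without listing every accent by hand.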
Hey Matt, thanks for your help!

Here's what I do:

    work\ruby script/console
    >> doc = `c:\\antiword.exe c:\\test.doc`
    => "\n This is a pipe \267 but this is not a pipe.\n\r"
    >> Bakery.create(:description => doc)
    => #<Bakery id: 55, created_at: "2009-06-24 18:01:03", updated_at: "2009-06-24 18:01:03", description: "\n This is a pipe \267 but this is not a pipe.\n\r">

Then I go to http://localhost:3000/bakeries/55, where show.html.erb is simply

    <p>
    <%= @bakery.description %>
    </p>

with @bakery = Bakery.find(params[:id]) in Bakeries_Controller.

HTML output:

    <p>
    This is a pipe
    </p>

That's it, the entire process. I would like to try your solution of translating the character encoding. Is it the same method that Philip suggested above, using Iconv? If so, do I convert UTF-8 to LATIN1, or something else?

Thanks!

On Jun 24, 11:48 am, Matt Jones wrote:
> You really need to translate the character encoding on that data
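On the direction of the conversion: the bytes are already in a single-byte Windows encoding, so the translation goes from that encoding to UTF-8, not the other way round. Below is a minimal sketch, assuming Ruby 1.9 or later; `to_utf8` is a hypothetical helper name, and Windows-1252 as the fallback is the guess made earlier in the thread, not something the data itself proves.

```ruby
# Hypothetical helper (not from the thread): return `raw` as valid UTF-8.
# If the bytes are not already valid UTF-8, assume they are in `fallback`
# (Windows-1252 by default, an assumption) and transcode them.
def to_utf8(raw, fallback = Encoding::Windows_1252)
  candidate = raw.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  # A few byte values are unassigned in Windows-1252; replace those
  # instead of raising a conversion error.
  raw.dup.force_encoding(fallback)
     .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

# The string from the console session, as raw bytes.
doc = "\n This is a pipe \xB7 but this is not a pipe.\n\r".b
fixed = to_utf8(doc)

puts fixed.valid_encoding?
puts fixed.include?("\u00B7")   # the middle dot survives the conversion
```

Calling something like this before `Bakery.create` would store the dot as a real UTF-8 character instead of a stray byte, so nothing needs to be stripped.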
Actually, after some more digging: you should first try the -m switch to antiword. The docs claim that

    antiword.exe -m utf-8 c:\test.doc

should convert the character set correctly. If nothing else, it should be easy to try out...

--Matt Jones

On Jun 24, 6:07 pm, Nik wrote:
> >>doc = `c:\\antiword.exe c:\\test.doc`
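From Ruby, the same call can be wrapped so that the text arrives already tagged as UTF-8. This is only a sketch: `antiword_command` and `word_to_text` are hypothetical names, the `-m utf-8` argument is taken from the message above, and the exact mapping name accepted may depend on the antiword installation.

```ruby
require "open3"

# Build the argument list separately so it can be inspected without
# actually running antiword. `exe` and `mapping` defaults are assumptions.
def antiword_command(path, exe: "antiword", mapping: "utf-8")
  [exe, "-m", mapping, path]
end

# Run antiword and return its output as a UTF-8 string.
def word_to_text(path, **opts)
  out, status = Open3.capture2(*antiword_command(path, **opts))
  raise "antiword failed (exit #{status.exitstatus})" unless status.success?

  text = out.force_encoding(Encoding::UTF_8)
  raise "antiword output is not valid UTF-8" unless text.valid_encoding?
  text
end

puts antiword_command("c:\\test.doc", exe: "c:\\antiword.exe").inspect
```

Passing the arguments as an array avoids shell quoting problems with paths that contain spaces, and the final validity check catches the case where the switch did not take effect.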
Hey Matt!

That saved the day for me.

I am terribly sorry to have brought this trouble up here. I did look for the documentation, reference, or manual on Google, but only some obscure links turned up. If I have learned anything at all from you all, it is to look first in the application's own directory from now on.

Thank You! Case closed.

On Jun 25, 12:17 pm, Matt Jones wrote:
> antiword.exe -m utf-8 c:\test.doc
Wow!

Someone else dealing with the exact same thing as me!

Matt: your suggestion to use the "-m utf-8" flag for antiword was exactly the right solution. Conceptually it makes the most sense, too, i.e. "convert this Word doc to UTF-8 and parse it into text" as the first step. Much, much nicer!

It's good to know that Iconv could probably do the same thing later in the process, but it's nice to just handle it up front so that the resulting String object is already UTF-8. Whee!

Thank you! (My solution took much less than 38 hours, primarily thanks to this thread.)

-Danimal
Hey, I am glad that my little post helped!!

On Jul 7, 5:45 pm, Danimal wrote:
> Thank you! (my solution was much less than 38 hours, primarily thanks
> to this thread)