thr3ads.net - Rails - Won't display characters following '\267' [Jun 2009]

Nik

2009-Jun-23 20:03 UTC

Won't display characters following '\267'

Hello!

I use MySQL and have made sure it is UTF-8, and the character set in
my view is also UTF-8. But when I display, in a textarea, text that
came from either antiword.exe or WIN32OLE output of an MS Word
document, the text fails to show immediately after a strange character
that shows up in the Rails console as \267. I went back to Word to see
what this is (looked it up by its position), and it is a dot sort of
floating in the middle of the line. Sort of like how they display
chapters, or whatever they call them, of the Bible, like
12-7[dot]Matthew.

For example:
  Rails Console:
  >>doc="This is a pipe, but \267 this is not a pipe"
  HTML:
  <p>
    This is a pipe, but
  </p>
It just sort of STOPS rendering the rest of the text.

I can't possibly ask my clients to remove that for my convenience. I
have been on a 38-hour hunt to try to find a solution to it.

Some say to remove all [^[:print:]] matches, which I can do, and find
a way to at least preserve the \n\r's. But then again, I also want to
preserve as much of the original document as possible. I mean, what if
they use umlauts, the o with " on top?

Any ideas?

Thank You!

Philip Hallstrom

2009-Jun-23 20:21 UTC

Re: Won't display characters following '\267'

You could try...

require 'iconv'

clean_str = Iconv.new('UTF-8//Ignore', 'UTF-8').iconv(messy_str)

It doesn't always work though... you might need to catch
Iconv::InvalidCharacter...

Worth a try though and has gotten me out of some of this mess with bad  
source data.
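
For anyone reading this on a newer Ruby, where the iconv library is no longer bundled, here is a rough sketch of the same "drop the bad bytes" idea using String#scrub. It assumes Ruby 2.1 or later, and messy_str here is just a made-up sample, not data from this thread. Note that it throws the dot away rather than preserving it.

```ruby
# Hypothetical sample: the lone byte 0xB7 is not valid UTF-8 by itself.
messy_str = "This is a pipe, but \xB7 this is not a pipe"

# String#scrub replaces every invalid byte sequence with the given string.
# An empty replacement simply drops the offending bytes.
clean_str = messy_str.scrub("")

puts messy_str.valid_encoding?  # false
puts clean_str.valid_encoding?  # true
puts clean_str
```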

Nik

2009-Jun-23 22:46 UTC

Re: Won't display characters following '\267'

Thanks Philip for your help!

I just tried it and it works great! It displays that dot thing. But
because my regular expressions did not account for these characters,
some of them fail where these characters appear.

1 - I don't even know what the right question to ask is... But what
do you call \267? Is this that hex character business, or octal, or
decimal?

And 2 - How can I match that character, \267 or 'dot' as I call it?
And does it have a class name?

Lastly, 3 - By what charcode or other means can I systematically
identify accented characters, such as the accent grave in French?

Thank You!

Matt Jones

2009-Jun-24 15:48 UTC

Re: Won't display characters following '\267'

You really need to translate the character encoding on that data -
Rails is assuming that it's UTF-8, when (from your description of the
character) it's either Windows-1252 or (possibly) ISO8859-1. Your
previous problem was the default UTF-8 parser giving up, as \267 (B7
hex) is only valid in UTF-8 as a continuation byte inside a multibyte
sequence.

--Matt Jones
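
A small sketch of what "translating the encoding" could look like on a Ruby with built-in encoding support (1.9 or later). The direction is from the source encoding to UTF-8, not the other way round. It assumes the bytes really are Windows-1252, which is only a guess from the description above; the sample string is made up.

```ruby
# Raw bytes as they might come back from a tool that emits Windows-1252.
# (Hypothetical sample; "\267" is the octal escape for the byte 0xB7.)
raw = "This is a pipe, but \267 this is not a pipe".b

# Tag the bytes with the encoding they are assumed to be in,
# then transcode them to UTF-8.
utf8 = raw.force_encoding("Windows-1252").encode("UTF-8")

puts utf8.valid_encoding?   # true
puts utf8                   # This is a pipe, but · this is not a pipe
```

Unlike dropping invalid bytes, this keeps the dot: it comes out as U+00B7, which UTF-8 stores as two bytes.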


Philip Hallstrom

2009-Jun-24 15:52 UTC

Re: Won't display characters following '\267'

On Jun 23, 2009, at 3:46 PM, Nik wrote:
> 1 - What do I know even what the right question to ask is... But what
> do you call \267 Is this that hex character business or octal,
> decimal?

It's an octal escape for a single byte: \267 is B7 in hex, 183 in
decimal. In Windows-1252 or ISO-8859-1 that byte is the middle dot; in
UTF-8 the same character takes two bytes.

> And 2 - Just like that character \267 or 'dot' as I call it, how can I
> match it? And does it have a class name?

By matching the \267 yourself. This might give some
insight... http://interglacial.com/~sburke/tpj/as_html/tpj22.html

> Lastly, 3 - and what charcode or other means can I systematically
> identify the accentuated characters as in the accent grave in French.

If the charcode is over 127 then it's not simple ASCII...

You might also find this plugin useful - http://github.com/rsl/stringex/tree
- It will try and turn all that stuff into simple ASCII. You'll lose
the accents, etc, but that might be okay for what you're doing.
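
To make the three answers above a bit more concrete, here is a minimal sketch, assuming a Ruby with encoding support (1.9 or later) and text that has already been converted to UTF-8. The sample text is invented for illustration.

```ruby
# The same byte written two ways: octal escape and hex escape.
octal = "\267".b
hex   = "\xB7".b
puts octal == hex         # true
puts octal.bytes.first    # 183 (the decimal value)

# Once the text is real UTF-8, the dot is U+00B7 (MIDDLE DOT) and can be
# matched by its code point.
text = "12-7\u00B7Matthew, voil\u00E0"
puts text =~ /\u00B7/     # 4 (character index of the dot)

# Characters outside plain ASCII (code point above 127), accents included.
non_ascii = text.scan(/[^\x00-\x7F]/)
p non_ascii               # ["·", "à"]
```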

Nik

2009-Jun-24 22:06 UTC

Re: Won't display characters following '\267'

Hey Matt, thanks for your help!

Here's what I do
work\ruby script/console
>>doc = `c:\\antiword.exe c:\\test.doc`
=>"\n       This is a pipe \267 but this is not a pipe.\n\r"
>>Bakery.create(:description=>doc)
=> #<Bakery id: 55, created_at: "2009-06-24 18:01:03", updated_at: "2009-06-24 18:01:03", description: "\n       This is a pipe \267 but this is not a pipe.\n\r">

Then go to http://localhost:3000/bakeries/55, where show.html.erb is
simply
<p>
<%= @bakery.description %>
</p>
 with @bakery = Bakery.find(params[:id]) in Bakeries_Controller

HTML output:
<p>

       This is a pipe

</p>

That's it, the entire process of what I do. I would like to try out
your solution of translating the character encoding. Could it be the
same method that Philip suggested above, using Iconv? If so, do I
convert UTF-8 to LATIN1? Or something else?

Thanks!


Matt Jones

2009-Jun-25 16:17 UTC

Re: Won't display characters following '\267'

Actually, doing some more digging, you should first try using the -m
switch with antiword - the docs claim that:

antiword.exe -m utf-8 c:\test.doc

should convert the character set correctly. If nothing else, it should
be easy to try out...

--Matt Jones
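
A rough sketch of calling antiword with that switch from Ruby. The helper names antiword_command and extract_text are made up for illustration, the paths are placeholders, and it assumes a Ruby with encoding support and that antiword behaves as the docs quoted above claim.

```ruby
# Hypothetical helper: builds the argument list for antiword using the
# -m switch mentioned above. Executable name and paths are placeholders.
def antiword_command(doc_path, exe: "antiword", mapping: "utf-8")
  [exe, "-m", mapping, doc_path]
end

# Hypothetical helper: runs the command without a shell and tags the
# output as UTF-8. Assumes antiword is installed and emits UTF-8 here.
def extract_text(doc_path, exe: "antiword")
  out = IO.popen(antiword_command(doc_path, exe: exe), &:read)
  out.force_encoding("UTF-8")
end

cmd = antiword_command("c:\\test.doc", exe: "c:\\antiword.exe")
p cmd
```

Passing the arguments as an array avoids shell quoting problems with paths that contain spaces. It is still worth checking valid_encoding? on the result before saving it.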

Nik

2009-Jun-26 08:38 UTC

Re: Won't display characters following '\267'

Hey Matt!

That saved the day for me. I am terribly sorry to have brought this
trouble up here. I did look for the documentation/reference/manual/
instructions on Google, but some obscure links turned up instead. If I
learned anything at all from you all, it'd be to look first in the
app's directory from now on.

Thank You! Case closed

Danimal

2009-Jul-07 21:45 UTC

Re: Won't display characters following '\267'

Wow!

Someone else dealing with the exact same thing as me!

Matt: your suggestion to use the "-m utf-8" flag for antiword was
exactly the right solution. Conceptually it makes the most sense, too.
I.e.: "Convert this Word doc to UTF-8 and parse it into text" as the
first step. Much much nicer!

It's good to know that Iconv could probably do the same thing later in
the process, but it's nice to just handle it up-front and the
resulting String object is already UTF-8. Whee!

Thank you! (my solution was much less than 38 hours, primarily thanks
to this thread)

-Danimal

Nik

2009-Jul-27 02:39 UTC

Re: Won't display characters following '\267'

Hey, I am glad that my little post helped!!
