rovin varshney
2012-Sep-13  11:35 UTC
How to read Microsoft document file in ruby on rails ?
Hello Everyone,
     I am looking for a way to parse doc/docx files in Ruby on Rails.
     I have tried File.open('filename', 'r'),
but it shows special characters
instead of the content of the file.
Thanks.
-- 
You received this message because you are subscribed to the Google Groups
"Ruby on Rails: Talk" group.
To post to this group, send email to
rubyonrails-talk-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To unsubscribe from this group, send email to
rubyonrails-talk+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To view this discussion on the web visit
https://groups.google.com/d/msg/rubyonrails-talk/-/O5fkWF3a1ecJ.
For more options, visit https://groups.google.com/groups/opt_out.
Walter Lee Davis
2012-Sep-13  13:59 UTC
Re: How to read Microsoft document file in ruby on rails ?
On Sep 13, 2012, at 7:35 AM, rovin varshney wrote:
> Hello Everyone,
> I m looking for parsing doc/docx file in ruby on rails.
> I have use File.open('filename','r'), but it shows special character instead of the content of file .

If all you want is the text content of the files, you can try the ancient Unix utility catdoc to do that. Just back-tick to that command (and make sure it's installed in your Web server's path). The result will not be pretty, but it will have all of the words in it.

Walter
Paul
Re: How to read Microsoft document file in ruby on rails ?
The docx format is actually pretty simple: it is a zipped set of files. If you upload it to the server and unzip it, you'll see a set of XML files. You can poke around and figure out the format, or you can find a spec online.

On Thu, Sep 13, 2012 at 9:59 AM, Walter Lee Davis <waltd-HQgmohHLjDZWk0Htik3J/w@public.gmane.org> wrote:
> If all you want is the text content of the files, you can try the ancient Unix utility catdoc to do that. Just back-tick to that command (and make sure it's installed in your Web server's path). The result will not be pretty, but it will have all of the words in it.
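The "zipped set of XML files" description can be sketched in Ruby roughly as follows. This is only an illustration under stated assumptions: the method names paragraphs_from_document_xml and docx_paragraphs and the constant W_NS are made up for this example, the unzip step assumes a recent version of the rubyzip gem (Zip::File), and the XML step uses REXML. It collects only the text inside <w:t> elements of word/document.xml and ignores tabs, line breaks, headers, footnotes and all formatting.

```ruby
require 'rexml/document'

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

# Returns one String per <w:p> paragraph, built by joining the text of
# every <w:t> element found inside that paragraph.
def paragraphs_from_document_xml(xml)
  doc = REXML::Document.new(xml)
  paragraphs = []
  REXML::XPath.each(doc, '//w:p', 'w' => W_NS) do |para|
    texts = REXML::XPath.match(para, './/w:t', 'w' => W_NS)
    # Element#text is nil for an empty <w:t/>; Array#join treats nil as "".
    paragraphs << texts.map(&:text).join
  end
  paragraphs
end

# Reads word/document.xml out of the .docx archive and extracts its text.
# Assumption: the rubyzip gem is installed; it is required lazily so the
# XML half of the sketch can be used without it.
def docx_paragraphs(path)
  require 'zip'
  xml = Zip::File.open(path) { |zip| zip.read('word/document.xml') }
  paragraphs_from_document_xml(xml)
end
```

With rubyzip available, docx_paragraphs('example.docx') would return an array of paragraph strings; for anything beyond plain text, a dedicated gem is likely the better choice.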
Scott Ribe
2012-Sep-15  14:07 UTC
Re: How to read Microsoft document file in ruby on rails ?
On Sep 15, 2012, at 7:27 AM, Paul wrote:
> The docx format is actually pretty simple...

You are really cruel to toy with him like that ;-)

-- 
Scott Ribe
scott_ribe-ZCQMRMivIIdUL8GK/JU1Wg@public.gmane.org
http://www.elevated-dev.com/
(303) 722-0567 voice
rovin varshney
2012-Sep-16  10:16 UTC
Re: How to read Microsoft document file in ruby on rails ?
Hi Walter Lee Davis, Paul,
         Could you please give a code snippet or some more clarification
about parsing a doc file?
Dheeraj Kumar
2012-Sep-16  12:28 UTC
Re: How to read Microsoft document file in ruby on rails ?
Did you try googling? This was the third link I found.
http://deepakprasanna.blogspot.in/2011/06/parsing-pdfdocdocx-content-with-apache.html

Dheeraj Kumar
On Sunday 16 September 2012 05:58 PM, Dheeraj Kumar wrote:
> Did you try googling? This was the third link I found.
>
> http://deepakprasanna.blogspot.in/2011/06/parsing-pdfdocdocx-content-with-apache.html

PDFTron may be useful. Google for "PDFTron Ruby Integration".
Walter Lee Davis
2012-Sep-16  16:12 UTC
Re: How to read Microsoft document file in ruby on rails ?
For a start, here's the man page for catdoc, which you will need to
install.
http://linux.die.net/man/1/catdoc
Then, read up on using system() or the backtick operator in a Ruby script to
invoke it. You'll need to have a path to the file you want to process,
which is highly dependent on the system you're using to store the
files. In Paperclip, I made this processor to extract text from PDF files
(pdftotext is part of the same collection of utilities as catdoc, I believe):
# lib/paperclip_processors/text.rb
module Paperclip
  # Handles extracting plain text from PDF file attachments
  class Text < Processor
    attr_accessor :whiny

    # Creates a Text extract from PDF
    def make
      src = @file
      dst = Tempfile.new([@basename, 'txt'].compact.join("."))
      command = <<-end_command
        "#{ File.expand_path(src.path) }"
        "#{ File.expand_path(dst.path) }"
      end_command
      begin
        success = Paperclip.run("/usr/bin/pdftotext -nopgbrk", command.gsub(/\s+/, " "))
        Rails.logger.info "Processing #{src.path} to #{dst.path} in the text processor."
      rescue PaperclipCommandLineError
        raise PaperclipError, "There was an error processing the text for #{@basename}" if @whiny
      end
      dst
    end
  end
end
Depending on how you are uploading your files, your mileage may vary. At the
very simplest, the command would be
text_contents = `/usr/bin/catdoc /root/relative/path/to/file.doc`
(Use backticks here rather than system(): system() returns only true, false
or nil, not the command's output.)
But that's hopelessly naive and does nothing to handle errors.
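A slightly more defensive way to shell out might look like the sketch below. It is not the Paperclip processor above, just a plain Ruby illustration using Open3 from the standard library. The names extract_text and TextExtractionError are made up for this example, and it assumes the converter writes the extracted text to standard output when given a file name, as catdoc does. How catdoc signals failure through its exit status is not checked here, so treat the status test as a generic precaution.

```ruby
require 'open3'

# Raised when the external converter cannot be run or reports failure.
# (Hypothetical class name, introduced only for this sketch.)
class TextExtractionError < StandardError; end

# Runs an external converter (catdoc by default) on +path+ and returns
# whatever it wrote to standard output. Passing the command and the file
# name as separate arguments avoids sending the file name through a shell.
def extract_text(path, command = 'catdoc')
  raise TextExtractionError, "no such file: #{path}" unless File.file?(path)

  begin
    stdout, stderr, status = Open3.capture3(command, path)
  rescue Errno::ENOENT
    raise TextExtractionError, "#{command} is not installed or not in PATH"
  end

  unless status.success?
    raise TextExtractionError, "#{command} failed: #{stderr.strip}"
  end

  stdout
end
```

In a Rails app this could be called as extract_text(uploaded_file_path), rescuing TextExtractionError where the upload is processed.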
Walter
rovin varshney
2012-Sep-18  05:40 UTC
Re: How to read Microsoft document file in ruby on rails ?
Hello Everyone,
   Thanks, everyone. I finally found a solution while searching for the things
you all had explained.
   There is a docx gem for parsing docx files, and docx-html for converting
them to HTML.
   require 'docx'
   d = Docx::Document.open('example.docx')
   d.each_paragraph do |p|
     puts p
   end
And for a docx file stored on Amazon S3:
   require 'open-uri'  # lets open() fetch a URL
   Docx::Document.open(open('http://S3-URL/original.docx', :ssl_verify_mode => OpenSSL::SSL::VERIFY_NONE))
A big thanks to all.