thr3ads.net - Rails - Extract text from PDF file [Jan 2011]

Tushar Gandhi

2011-Jan-31 17:12 UTC

Extract text from PDF file

Hi,
In my upcoming application, users will upload PDF files.
After a PDF is uploaded, I have to extract its text and
display it to the user.
Can anyone tell me how to extract text from a PDF file?
Is there a plugin or gem for this?
Thanks,
Tushar

-- 
Posted via http://www.ruby-forum.com/.

-- 
You received this message because you are subscribed to the Google Groups
"Ruby on Rails: Talk" group.
To post to this group, send email to
rubyonrails-talk-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To unsubscribe from this group, send email to
rubyonrails-talk+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
For more options, visit this group at
http://groups.google.com/group/rubyonrails-talk?hl=en.

Walter Lee Davis

2011-Jan-31 17:32 UTC

Re: Extract text from PDF file

I did this using Paperclip and defining a processor for Paperclip as
follows:

#lib/paperclip_processors/text.rb
module Paperclip
  # Handles extracting plain text from PDF file attachments
  class Text < Processor

    attr_accessor :whiny

    # Creates a Text extract from PDF
    def make
      src = @file
      dst = Tempfile.new([@basename, 'txt'].compact.join("."))
      command = <<-end_command
        "#{ File.expand_path(src.path) }"
        "#{ File.expand_path(dst.path) }"
      end_command

      begin
        success = Paperclip.run("/usr/bin/pdftotext -nopgbrk", command.gsub(/\s+/, " "))
        Rails.logger.info "Processing #{src.path} to #{dst.path} in the text processor."
      rescue PaperclipCommandLineError
        raise PaperclipError, "There was an error processing the text for #{@basename}" if @whiny
      end
      dst
    end
  end
end

#app/models/document.rb
  has_attached_file :pdf, :styles => { :text => { :fake => 'variable' } }, :processors => [:text]
  after_post_process :extract_text

  private
  def extract_text
    plain_text = ""
    # The block form closes the file handle once reading is done
    File.open(pdf.queued_for_write[:text].path, "r") do |file|
      while (line = file.gets)
        plain_text << Iconv.conv('ASCII//IGNORE', 'UTF8', line)
      end
    end
    self.plain_text = plain_text #text column to hold the extracted text for searching
  end

I had to find and install the creaky-old pdftotext library on my
server (happily, there was an apt-get bundle for it) and configure the
path correctly. When Paperclip accepts a PDF upload, it creates a text
extraction of that file and saves it in
system/pdfs/:id/text/filename.pdf. Note that while it has a .pdf
extension, the file itself is actually just the plain text extracted
from the original PDF. After quite a lot of googling and begging my
local Ruby group, I got the recipe for ripping open that text file and
reading it into a variable to store on the record. The text you get
out of pdftotext will vary wildly in quality and comprehensiveness,
but since all I needed was a way to get a simple search system fed, it
works fine for my needs. I never show this text to anyone, just use it
as the "keywords" for search. You may want/need to present an editing
field for the administrator to clean up these extracted texts.

Walter
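
For anyone who wants to try the same idea outside Paperclip, here is a
minimal sketch, not the processor above. It assumes pdftotext is
installed and on the PATH, and the helper names (pdftotext_command,
extract_pdf_text, ascii_only) are made up for illustration. It passes
"-" as the output file so pdftotext writes to stdout, and it uses
String#encode in place of Iconv, which newer Rubies no longer ship in
the standard library.

```ruby
require "open3"

# Hypothetical helper: builds the argument list for pdftotext.
# "-nopgbrk" suppresses page-break characters; "-" sends the text to stdout.
def pdftotext_command(pdf_path, binary: "pdftotext")
  [binary, "-nopgbrk", "-enc", "UTF-8", File.expand_path(pdf_path), "-"]
end

# Runs pdftotext and returns the extracted text; raises if the tool fails.
# Passing the arguments as an array avoids any shell quoting of the path.
def extract_pdf_text(pdf_path, binary: "pdftotext")
  stdout, stderr, status = Open3.capture3(*pdftotext_command(pdf_path, binary: binary))
  raise "pdftotext failed: #{stderr.strip}" unless status.success?
  stdout.force_encoding(Encoding::UTF_8)
end

# Drops every character that cannot be represented in ASCII, which is
# roughly what Iconv.conv('ASCII//IGNORE', 'UTF8', line) is used for above.
def ascii_only(text)
  text.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")
end
```

Usage would look something like
plain_text = ascii_only(extract_pdf_text("upload.pdf")), where
upload.pdf is a placeholder file name. Stripping to ASCII only makes
sense when the text feeds a simple search index; to display the text to
users, keeping the UTF-8 output is probably the better choice.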


Garrett Lancaster

2011-Jan-31 17:36 UTC

Re: Extract text from PDF file

pdftk, pdfbox (java), pdfkit

Garrett Lancaster

Walter McGinnis

2011-Jan-31 18:19 UTC

Re: Extract text from PDF file

I wrote a plugin that requires attachment_fu and some unixy utilities behind
the scenes for this several years back:

https://github.com/kete/convert_attachment_to

It works reliably in Rails 2.x apps. I haven't tried it with Rails 3
yet.
You could fork it and update it (make it work with Paperclip or Rails 3) if you
like, or just have a gander for example code.

Cheers,
Walter




Walter Lee Davis

2011-Jan-31 19:58 UTC

Re: Extract text from PDF file

I don't see how these relate to the question -- they are apparently
designed to generate PDFs rather than to extract text from existing
PDF documents. Can you point to an example where these libraries can
be used in that fashion? I'd love to use something more professionally
developed than my own system.

Walter

On Jan 31, 2011, at 12:36 PM, Garrett Lancaster wrote:
> pdftk, pdfbox (java), pdfkit
>
> Garrett Lancaster

Garrett Lancaster

2011-Jan-31 20:21 UTC

Re: Extract text from PDF file

PDFBox is the library I'm using on a current project:
http://pdfbox.apache.org/
There is a link to "Extract Text" under Command Line Utilities. There
is also a section called "Text Extraction" under Tutorials.

There is a Ruby command-line utility that wraps PDFBox called Docsplit:
http://documentcloud.github.com/docsplit/ that might be worth looking into.

For pdftk: http://pdf-toolkit.rubyforge.org/classes/PDF/Toolkit.html#M000003


Hope this helps,
Garrett Lancaster
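
To make the PDFBox command-line route concrete, here is a rough sketch
of calling its ExtractText utility from Ruby. It assumes Java is
installed and that a PDFBox "app" jar has been downloaded. The jar path
and the helper names are placeholders, the argument order follows the
PDFBox 1.x/2.x command-line tools (newer releases may differ), and
reading the result as UTF-8 is also an assumption.

```ruby
require "open3"

# Placeholder location of the PDFBox app jar; point this at your own copy.
PDFBOX_JAR = ENV.fetch("PDFBOX_JAR", "/opt/pdfbox/pdfbox-app.jar")

# Hypothetical helper: builds
#   java -jar <jar> ExtractText <input.pdf> <output.txt>
def pdfbox_extract_command(pdf_path, txt_path, jar: PDFBOX_JAR)
  ["java", "-jar", jar, "ExtractText", pdf_path, txt_path]
end

# Runs ExtractText and returns the text it wrote to txt_path.
def pdfbox_extract_text(pdf_path, txt_path, jar: PDFBOX_JAR)
  _out, err, status = Open3.capture3(*pdfbox_extract_command(pdf_path, txt_path, jar: jar))
  raise "PDFBox ExtractText failed: #{err.strip}" unless status.success?
  File.read(txt_path, encoding: "UTF-8")
end
```

Starting a JVM for every upload is slow, so in a Rails app this would
probably belong in a background job rather than in the request cycle.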
