Hi,

In my upcoming application we will be uploading PDF files. After a PDF file is uploaded, I have to extract the text from it and display it to the user. Can anyone tell me how to extract text from a PDF file? Is there any plugin or gem for this?

Thanks,
Tushar

--
Posted via http://www.ruby-forum.com/.

--
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Talk" group.
To post to this group, send email to rubyonrails-talk-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To unsubscribe from this group, send email to rubyonrails-talk+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
For more options, visit this group at http://groups.google.com/group/rubyonrails-talk?hl=en.
On Jan 31, 2011, at 12:12 PM, Tushar Gandhi wrote:

> can anyone tell me how to extract text from pdf file?
> Is there any plugin or gem present for this?

I did this using Paperclip and defining a processor for Paperclip as follows:

    #lib/paperclip_processors/text.rb
    module Paperclip
      # Handles extracting plain text from PDF file attachments
      class Text < Processor

        attr_accessor :whiny

        # Creates a Text extract from PDF
        def make
          src = @file
          dst = Tempfile.new([@basename, 'txt'].compact.join("."))
          command = <<-end_command
            "#{ File.expand_path(src.path) }"
            "#{ File.expand_path(dst.path) }"
          end_command

          begin
            success = Paperclip.run("/usr/bin/pdftotext -nopgbrk", command.gsub(/\s+/, " "))
            Rails.logger.info "Processing #{src.path} to #{dst.path} in the text processor."
          rescue PaperclipCommandLineError
            raise PaperclipError, "There was an error processing the text for #{@basename}" if @whiny
          end
          dst
        end
      end
    end

    #app/models/document.rb
    has_attached_file :pdf, :styles => { :text => { :fake => 'variable' } }, :processors => [:text]
    after_post_process :extract_text

    private
    def extract_text
      file = File.open("#{pdf.queued_for_write[:text].path}","r")
      plain_text = ""
      while (line = file.gets)
        plain_text << Iconv.conv('ASCII//IGNORE', 'UTF8', line)
      end
      self.plain_text = plain_text #text column to hold the extracted text for searching
    end

I had to find and install the creaky-old pdftotext library on my server (happily, there was an apt-get bundle for it) and configure the path correctly. When Paperclip accepts a PDF upload, it creates a text extraction of that file and saves it in system/pdfs/:id/text/filename.pdf. Note that while it has a .pdf extension, the file itself is actually just the plain text extracted from the original pdf.
After quite a lot of googling and begging my local Ruby group, I got the recipe for ripping open that text file and reading it into a variable to store on the record.

The text you get out of pdftotext will vary wildly in quality and comprehensiveness, but since all I needed was a way to get a simple search system fed, it works fine for my needs. I never show this text to anyone, just use it as the "keywords" for search. You may want/need to present an editing field for the administrator to clean up these extracted texts.

Walter
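For readers who only need the extraction step, outside of Paperclip, the same idea can be sketched in plain Ruby. This is a minimal sketch under stated assumptions, not the processor from the message above: it assumes a `pdftotext` binary (xpdf or poppler) is on the PATH, uses `-` as the output file so the text arrives on stdout, and swaps `Iconv` (no longer part of Ruby's standard library since 2.0) for `String#encode`. The helper names `pdftotext_command`, `to_ascii`, and `extract_pdf_text` are hypothetical.

```ruby
require "open3"

# Hypothetical helper: argument list for pdftotext.
# "-nopgbrk" suppresses page-break characters, "-enc UTF-8" asks for UTF-8
# output, and "-" as the output file sends the text to stdout.
def pdftotext_command(pdf_path, binary: "pdftotext")
  [binary, "-nopgbrk", "-enc", "UTF-8", pdf_path, "-"]
end

# Drops every character that has no US-ASCII equivalent, roughly what
# Iconv.conv('ASCII//IGNORE', 'UTF8', line) did in the original recipe.
def to_ascii(text)
  text.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")
end

# Runs pdftotext and returns the extracted text as an ASCII-only string.
# Raises Errno::ENOENT if the binary is missing, RuntimeError on failure.
def extract_pdf_text(pdf_path, binary: "pdftotext")
  stdout, stderr, status = Open3.capture3(*pdftotext_command(pdf_path, binary: binary))
  raise "pdftotext failed: #{stderr.strip}" unless status.success?
  to_ascii(stdout.force_encoding("UTF-8"))
end
```

Passing the arguments as an array avoids shell quoting problems with file names that contain spaces. Whether to strip non-ASCII characters at all is a design choice: it suited a keyword index here, but text shown to users would probably be better left as UTF-8.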
pdftk, pdfbox (java), pdfkit

Garrett Lancaster
I wrote a plugin that requires attachment_fu and some unixy utilities behind the scenes for this several years back:

https://github.com/kete/convert_attachment_to

It works reliably in Rails 2.x apps. I haven't tried it with Rails 3 yet. You could fork it and update it (make it work with Paperclip or Rails 3) if you like, or just have a gander for example code.

Cheers,
Walter

On Tue, Feb 1, 2011 at 6:36 AM, Garrett Lancaster wrote:

> pdftk, pdfbox (java), pdfkit
On Jan 31, 2011, at 12:36 PM, Garrett Lancaster wrote:

> pdftk, pdfbox (java), pdfkit

I don't see how these relate to the question; they are apparently designed to generate PDFs rather than to extract text from existing PDF documents. Can you point to an example where these libraries can be used in that fashion? I'd love to use something more professionally developed than my own system.

Walter
PDFBox is the library I'm using on a current project: http://pdfbox.apache.org/

There is a link to "Extract Text" under Command Line Utilities. There is also a section called "Text Extraction" under Tutorials.

There is a ruby command line utility that wraps PDFBox called Docsplit: http://documentcloud.github.com/docsplit/ that might be worth looking into.

For pdftk: http://pdf-toolkit.rubyforge.org/classes/PDF/Toolkit.html#M000003

Hope this helps,
Garrett Lancaster
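As a rough illustration of the "Extract Text" command line utility mentioned above, PDFBox can be driven from Ruby by shelling out to the JVM. This is only a sketch under assumptions: it assumes a PDFBox 1.x or 2.x "app" jar, where the tool is invoked as `ExtractText` and `-console` sends the text to stdout (later major versions may use a different subcommand syntax, so check the documentation for the version you install). The jar path and the helper names are hypothetical.

```ruby
require "open3"

# Hypothetical location of the PDFBox "app" jar; adjust for your install.
PDFBOX_JAR = "/opt/pdfbox/pdfbox-app.jar"

# Hypothetical helper: argument list for the PDFBox ExtractText tool
# (1.x/2.x style invocation; "-console" writes the text to stdout).
def pdfbox_extract_command(pdf_path, jar: PDFBOX_JAR)
  ["java", "-jar", jar, "ExtractText", "-console", pdf_path]
end

# Runs ExtractText and returns whatever text PDFBox prints.
# Raises Errno::ENOENT if java is missing, RuntimeError on failure.
def pdfbox_extract_text(pdf_path, jar: PDFBOX_JAR)
  stdout, stderr, status = Open3.capture3(*pdfbox_extract_command(pdf_path, jar: jar))
  raise "PDFBox ExtractText failed: #{stderr.strip}" unless status.success?
  stdout
end
```

Starting a JVM for every upload is slow, so in a Rails app this kind of call would more likely belong in a background job than in the request cycle.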