thr3ads.net - Rails - Read pdf files [Oct 2005]

Nick Stuart

2005-Oct-25 17:33 UTC

Read pdf files

Ok, so I know there is a library out there to write pdfs, but is there
one to read the text of pdfs!?

I guess this is more of a general ruby question, but I am wondering about
this for the search functionality of our site. Right now all we can
really do is search off a description or keywords, but it would be
really cool to be able to index the actual PDF files themselves.

Any pointers on this? There must be something out there for the web that
will work, as google has the handy "pdf -> html" option.

-Nick

p.s. We generate most, if not all, of the pdfs we have/will host on our
own. So it's not really going to be a question about security or
anything of the like.

Brandon Philips

2005-Oct-25 17:38 UTC

Re: Read pdf files

Nick-

There are utilities such as pdftotext that will do the conversion for
you for easy indexing.

http://www.glyphandcog.com/XpdfText.html
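
A minimal Ruby sketch of that approach, assuming pdftotext is installed
and on the PATH. The helper names are made up for illustration; the only
pdftotext feature relied on is that "-" as the output file sends the
text to standard output.

```ruby
require "open3"

# Build the argument list for pdftotext. The trailing "-" asks pdftotext
# to write the extracted text to standard output instead of a .txt file.
def pdftotext_command(pdf_path)
  ["pdftotext", pdf_path, "-"]
end

# Run pdftotext and return the extracted text, raising if it fails.
# (Hypothetical helper; raises Errno::ENOENT if pdftotext is missing.)
def extract_pdf_text(pdf_path)
  text, status = Open3.capture2(*pdftotext_command(pdf_path))
  raise "pdftotext failed for #{pdf_path}" unless status.success?
  text
end
```

The returned string can then be handed to whatever indexer the site
uses, e.g. text = extract_pdf_text("manual.pdf").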

	Brandon

--
http://ifup.org


> _______________________________________________
> Rails mailing list
> Rails-1W37MKcQCpIf0INCOvqR/iCwEArCW2h5@public.gmane.org
> http://lists.rubyonrails.org/mailman/listinfo/rails

Austin Ziegler

2005-Oct-25 17:39 UTC

Re: Read pdf files

On 10/25/05, Nick Stuart
<nicholas.stuart-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Ok, so I know there is a library out there to write pdfs, but is there
> one to read the text of pdfs!?

Not yet. Sometime in 2006 I expect to have PDF::Reader available.
Sooner if I can get volunteers to help out. Right now, you'd be best
off using another solution to index your PDF contents.

It wouldn't be too hard to pull out the raw text -- you're not after a
manipulatable document. You wouldn't, however, have any context on
what any of it means.

-austin
--
Austin Ziegler * halostatue-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
               * Alternate: austin-/yODNl0JVVCozMbzO90S/Q@public.gmane.org

Nick Stuart

2005-Oct-25 18:52 UTC

Re: Read pdf files

Right, I'm not looking for anything fancy at this point. I'll already
know the file name/location/etc and all I need is the text to be able
to search for that document. Don't care where it is in the document,
just want to be able to find it. I'll look around for some other
libraries in some different languages. I think Java had some open
source stuff for this the last time I looked. Was trying to keep it
all ruby, but will see what I can find.

-Nick

Klaus Kaempf

2005-Oct-26 09:59 UTC

Re: Read pdf files

* Nick Stuart <nicholas.stuart-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> [Oct 25. 2005 20:52]:
> Right, I'm not looking for anything fancy at this point. I'll already
> know the file name/location/etc and all I need is the text to be able
> to search for that document. Don't care where it is in the document,
> just want to be able to find it. I'll look around for some other
> libraries in some different languages. I think Java had some open
> source stuff for this the last time I looked. Was trying to keep it
> all ruby, but will see what I can find.

KOffice (KWord, to be precise) is quite good at reading PDF documents.
Maybe you can re-use code from there?

Klaus

Nick Stuart

2005-Oct-26 12:36 UTC

Re: Read pdf files

Hmm, didn't know that about KWord, might have to check it out. Right
now though I've found a Java library that seems like it might do the
trick if I can figure out a way to have it run periodically (shouldn't
be that tough). If anyone's interested:
http://www.pdfbox.org

I'm thinking I should be able to get this, Lucene, and ferret all
working together to get what I need.

-Nick

Patrick Gundlach

2005-Oct-26 12:46 UTC

Re: Read pdf files

Hi,

> this for the search functionality of our site. Right now all we can
> really do is search off a description or keywords, but it would be
> really cool to be able to index the actual PDF files themselves.

I have just done this on <http://articles.contextgarden.net/>. I am
using swish-e, pdftotext and ruby/rails. To try it out, use keywords like
'layer', 'starttext', 'latex'. This is a work in progress.


Patrick

BTW: swish-e is very quick.

Erik Hatcher

2005-Oct-26 12:50 UTC

Re: Read pdf files

Another plug.... you can find out how Lucene and PDFBox (also via
TextMining.org's wrapper) work together from Lucene in Action;
again, the code is freely available at http://www.lucenebook.com

PDFBox is the best that exists in the Java world for extracting text
from PDF files. My current usage of Lucene with Rails is to have a
Java search server (XML-RPC) queried from a Rails front-end. Now
that Ferret is out, I'll still index with a Java program, but use
Ferret's API and avoid the search server overhead.

     Erik


Nick Stuart

2005-Oct-26 13:27 UTC

Re: Re: Read pdf files

That looks interesting. How does swish-e handle data coming from a
database and the like that don't really have a 'document' backing them?
I would basically need to be able to keep track of the document ids
from the database so I know what goes with what. This is fairly
trivial in ferret/lucene, but then searching pdfs there ain't so easy.

Also, are you using the straight perl wrappers for it? Or do you have some
ruby bindings you might care to share? :)
I would prefer to avoid using perl if possible.

-Nick

Patrick Gundlach

2005-Oct-26 13:43 UTC

Re: Read pdf files

Hello Nick,

> That looks interesting. How does swish-e handle data coming from a
> database and the like that don't really have a 'document' backing them?

Swish-e can call external processes to get the data for indexing. So
you might need a separate ruby/active-record program that extracts
data from your database.
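
A rough sketch of what such an external program could print. It assumes
swish-e's external-program input (the -S prog mode) accepts one document
as "Path-Name:" and "Content-Length:" header lines, a blank line, and
then the content; that format should be checked against the swish-e
documentation. The records and the "article/<id>" naming are made up for
illustration and stand in for rows loaded through ActiveRecord.

```ruby
# Format one record as a document for swish-e's external program input:
# header lines, a blank line, then the content itself.
# Content-Length is counted in bytes, not characters.
def swishe_document(path_name, content)
  "Path-Name: #{path_name}\n" \
  "Content-Length: #{content.bytesize}\n" \
  "\n" \
  "#{content}"
end

# Hypothetical records standing in for rows loaded from the database.
records = [
  { id: 1, body: "First article text" },
  { id: 2, body: "Second article text" }
]

records.each do |rec|
  # "article/<id>" is an invented naming scheme that lets a search
  # result be mapped back to the database id later.
  print swishe_document("article/#{rec[:id]}", rec[:body])
end
```

Putting the database id into the path name is one way to keep track of
which row a hit belongs to.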

In my case I have a directory structure such as:

publication1/document1.pdf
            /document2.txt
publication2/document3.txt
            /document4.pdf

And swish-e uses the directory name as a meta-keyword and the
'document1' string as the document name. I can ask swish-e for these
two attributes and look inside my database to get the id of the
connected article:

article.rb (model):
--------------------------------------------------
  # Look up the Article that a swish-e result points at.
  def self.from_swishe_result(res)
    pub = Publication.find_by_filename(res.publication)
    return nil unless pub  # no publication with that directory name
    find(:first,
         :conditions => ["publication_id = ? and filename = ?",
                         pub.id, res.docpath])
  end
--------------------------------------------------

'res' is one result from the set returned by the swish-e query.
res.publication would be 'publication1' in the example above,
res.docpath would be 'document1' (because I have set up swish-e to
strip the rest).
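
The swish-e configuration that does this stripping is not shown in the
thread. As an illustration only, the same split can be done in plain
Ruby; split_indexed_path is a hypothetical helper, not part of the
binding:

```ruby
# Split an indexed path such as "publication1/document1.pdf" into the
# publication directory and the document name without its extension.
def split_indexed_path(path)
  publication = File.basename(File.dirname(path))
  docpath = File.basename(path, File.extname(path))
  [publication, docpath]
end

publication, docpath = split_indexed_path("publication1/document1.pdf")
puts "#{publication} #{docpath}"  # prints "publication1 document1"
```
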
> Also, are you using the straight perl wrappers for it? Or do you have some
> ruby bindings you might care to share? :)

Well, I have created swish-e bindings for ruby with swig. I have a
very (!!) minimalistic interface which I can share. (Private mail
please). If people tell me what kind of interface they need, I might
even create a more advanced solution. But the current one is
sufficient for me ATM.

> I would prefer to avoid using perl if possible.

Of course!

Just an example from the swish-e ruby binding that I've done:

    sw=SwishE.new(INDEXFILE)
    @results=sw.query("the search term")

and then you can say:

    @results.each do |res|
        puts res.docpath
    end

(this is all there is to my swish-e ruby binding)


Patrick

Nick Stuart

2005-Oct-26 14:04 UTC

Re: Read pdf files

Yep, that's exactly what I was going to do. The one question I'm facing
now is how to tell Rails that the index has changed so the readers
in ferret can update themselves.

Shouldn't be too hard, but being new to ruby/rails it will take a
little looking into.

-Nick

p.s. I just found that PDFBox has a LucenePDFDocument object already.
What could be easier than that?!  :)

Erik Hatcher

2005-Oct-26 14:31 UTC

Re: Read pdf files

On 26 Oct 2005, at 10:04, Nick Stuart wrote:
> Yep, that's exactly what I was going to do. The one question I'm facing
> now is how to tell Rails that the index has changed so the readers
> in ferret can update themselves.

Hmmm.... good question. I've not had the chance to roll up my
sleeves with Ferret and Rails yet myself, but you could certainly
ping the Rails app using a RESTful URL that forces the update, or
some sort of other messaging layer.
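
A minimal sketch of such a ping from the indexing side, using Ruby's
standard Net::HTTP. The /admin/search/reload_index path is invented for
illustration; the Rails app would need a matching action, and the sketch
assumes plain http without authentication.

```ruby
require "net/http"
require "uri"

# Build (but do not send) a POST request to a hypothetical admin URL
# that tells the Rails app to reopen its index reader.
def build_reindex_ping(base_url)
  uri = URI.join(base_url, "/admin/search/reload_index")
  [uri, Net::HTTP::Post.new(uri.request_uri)]
end

# Send the ping; returns true when the app answers with a 2xx status.
def ping_rails_app(base_url)
  uri, request = build_reindex_ping(base_url)
  response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
  response.is_a?(Net::HTTPSuccess)
end
```

The indexer would call ping_rails_app("http://localhost:3000") once it
has finished writing the index.
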
> p.s. I just found that PDFBox has a LucenePDFDocument object already.
> What could be easier than that?!  :)

I personally prefer to control the fields precisely, along with all
the attributes of those fields, so I don't use that convenience, but
it certainly is handy if the field(s) it creates are suitable to your
application.

     Erik

Nick Stuart

2005-Oct-26 14:42 UTC

Re: Read pdf files

On 10/26/05, Erik Hatcher
<erik-LIifS8st6VgJvtFkdXX2HpqQE7yCjDx5@public.gmane.org> wrote:
>
> Hmmm.... good question.  I've not had the chance to roll up my
> sleeves with Ferret and Rails yet myself, but you could certainly
> ping the Rails app using a RESTful URL that forces the update, or
> some sort of other messaging layer.
>
That might work. We already have an admin section of the site, so I
should be able to easily add an action to that to reset the index
reader. I wanted an automatic way of doing this, so I was thinking of
having the indexer scheduled to run at a given time, but that might be
a bit overkill because we won't be adding documents every day once the
site is set up.
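
One way the Rails side of such a reset could look, sketched in plain
Ruby. ReloadableReader is a hypothetical class, and the block passed to
it is a placeholder for whatever call actually opens the index; no
Ferret API is assumed here.

```ruby
# Holds a search index reader and reopens it on demand. The opener
# block is a placeholder for whatever call actually opens the index.
class ReloadableReader
  def initialize(&opener)
    @opener = opener
    @lock = Mutex.new
    @reader = nil
  end

  # Return the current reader, opening it on first use.
  def reader
    @lock.synchronize { @reader ||= @opener.call }
  end

  # Drop the cached reader so the next call to #reader reopens the index.
  def reset!
    @lock.synchronize { @reader = nil }
  end
end
```

The admin action would then only need to call reset! on a shared
instance; searches keep going through reader and pick up the new index
on their next call.
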
>
> I personally prefer to control the fields precisely, along with all
> the attributes of those fields, so I don't use that convenience, but
> it certainly is handy if the field(s) it creates are suitable to your
> application.
>
Ya, I think the LucenePDFDocument will work fine in our case for now.
I can always change it later, but right now we don't even have any PDF
search capability, so this will be better than what we have.  :)

I want to get the basics working to begin with and make it fancy later.

-Nick
