OK, so I know there is a library out there to write PDFs, but is there one to read the text of PDFs?

I guess this is more of a general Ruby question, but I am wondering about this for the search functionality of our site. Right now all we can really do is search on a description or keywords, but it would be really cool to be able to index the actual PDF files themselves.

Any pointers on this? There must be something out there for the web that will work, as Google has the handy "pdf -> html" option.

-Nick

p.s. We generate most, if not all, of the PDFs we host or will host ourselves, so it's not really going to be a question of security or anything of the like.
Nick-

There are utilities such as pdftotext that will do the conversion for you for easy indexing.

http://www.glyphandcog.com/XpdfText.html

Brandon

--
http://ifup.org

_______________________________________________
Rails mailing list
Rails-1W37MKcQCpIf0INCOvqR/iCwEArCW2h5@public.gmane.org
http://lists.rubyonrails.org/mailman/listinfo/rails
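A minimal sketch of calling pdftotext from Ruby, assuming the pdftotext binary from Xpdf is installed and on the PATH. The helper names (pdftotext_command, extract_pdf_text) are made up for illustration. Passing "-" as the output file asks pdftotext to write the text to standard output instead of a file.

```ruby
require "open3"

# Hypothetical helper: build the argument list for pdftotext.
# "-" as the output file means "write the text to stdout".
def pdftotext_command(pdf_path, layout: false)
  cmd = ["pdftotext"]
  cmd << "-layout" if layout # keep the physical page layout
  cmd + [pdf_path, "-"]
end

# Hypothetical helper: return the plain text of a PDF as a String.
# Raises if pdftotext exits with a non-zero status.
def extract_pdf_text(pdf_path)
  stdout, stderr, status = Open3.capture3(*pdftotext_command(pdf_path))
  raise "pdftotext failed: #{stderr.strip}" unless status.success?
  stdout
end
```

Something like extract_pdf_text("manual.pdf") (file name invented here) would then give a string that can be handed to whatever indexer is in use.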
On 10/25/05, Nick Stuart <nicholas.stuart-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Ok, so I know there is a library out there to write pdfs, but is there
> one to read the text of pdfs!?

Not yet. Sometime in 2006 I expect to have PDF::Reader available. Sooner if I can get volunteers to help out. Right now, you'd be best off using another solution to index your PDF contents.

It wouldn't be too hard to pull out the raw text -- you're not after a manipulatable document. You wouldn't, however, have any context on what any of it means.

-austin
--
Austin Ziegler * halostatue-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
               * Alternate: austin-/yODNl0JVVCozMbzO90S/Q@public.gmane.org
Right, I'm not looking for anything fancy at this point. I'll already know the file name, location, etc., and all I need is the text to be able to search for that document. I don't care where the match is in the document, I just want to be able to find it. I'll look around for some other libraries in some different languages. I think Java had some open source stuff for this the last time I looked. I was trying to keep it all Ruby, but will see what I can find.

-Nick
KOffice (KWord to be precise) is quite good at reading PDF documents. Maybe you can reuse code from there?

Klaus
Hmm, didn't know that about KWord, might have to check it out. Right now though I've found a Java library that seems like it might do the trick if I can figure out a way to have it run periodically (shouldn't be that tough). If anyone's interested: http://www.pdfbox.org

I'm thinking I should be able to get this, Lucene, and Ferret all working together to get what I need.

-Nick
Hi,

> this for the search functionality of our site. Right now all we can
> really do is search off a description or keywords, but it would be
> really cool to be able to index the actual PDF files themselves.

I have just done this on <http://articles.contextgarden.net/>. I am using swish-e, pdftotext and Ruby/Rails. To try it out, use keywords like 'layer', 'starttext', 'latex'. This is a work in progress.

Patrick

BTW: swish-e is very quick.
Another plug.... you can find out how Lucene and PDFBox (via TextMining.org's wrapper also) work together from Lucene in Action - again the code is freely available at http://www.lucenebook.com

PDFBox is the best that exists in the Java world for extracting text from PDF files. My current usage of Lucene with Rails is to have a Java search server (XML-RPC) queried from a Rails front-end. Now that Ferret is out, I'll still index with a Java program, but use Ferret's API instead of the search server overhead.

Erik
That looks interesting. How does swish-e handle data coming from a database and the like that doesn't really have a 'document' backing it? I would basically need to be able to keep track of the document IDs from the database so I know what goes with what. This is fairly trivial in Ferret/Lucene, but then searching PDFs isn't so much.

Also, are you using the straight Perl wrappers for it? Or do you have some Ruby bindings you might care to share? :) I would prefer to avoid using Perl if possible.

-Nick
Hello Nick,

> That looks interesting. How does swish-e handle data coming from a
> database and the like that dont really have a 'document' backing them.

Swish-e can call external processes to get the data for indexing. So you might need a separate Ruby/ActiveRecord program that extracts data from your database. In my case I have a directory structure such as:

publication1/document1.pdf
            /document2.txt
publication2/document3.txt
            /document4.pdf

And swish-e uses the directory name as a meta-keyword and the 'document1' string as the document name. I can ask swish-e for these two attributes and look inside my database to get the id of the connected article:

article.rb (model):
--------------------------------------------------
  # Map one swish-e result to the Article row it was indexed from.
  def self.from_swishe_result(res)
    pub = Publication.find_by_filename(res.publication)
    return nil unless pub  # no publication with that directory name
    find(:first, :conditions => ["publication_id=? and filename=?", pub.id, res.docpath])
  end
--------------------------------------------------

'res' is one result from the set returned by the swish-e query. res.publication would be 'publication1' in the example above, res.docpath would be 'document1' (because I have set up swish-e to strip the rest).

> Also, are you using the straight perl wrappers for it? Or do have some
> ruby bindings you might care to share? :)

Well, I have created swish-e bindings for Ruby with SWIG. I have a very (!!) minimalistic interface which I can share. (Private mail please). If people tell me what kind of interface they need, I might even create a more advanced solution. But the current one is sufficient for me ATM.

> I would prefer to avoid using perl if possible.

Of course! Just an example from the swish-e Ruby binding that I've done:

  sw = SwishE.new(INDEXFILE)
  @results = sw.query("the search term")

and then you can say:

  @results.each do |res|
    puts res.docpath
  end

(this is all there is to my swish-e Ruby binding)

Patrick
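To illustrate the directory-name / document-name split described above, here is a small sketch in plain Ruby. It is not Patrick's swish-e configuration (swish-e does the stripping itself in his setup); the helper split_indexed_path and the SwishPath struct are hypothetical names used only to show which part of a path ends up in which attribute.

```ruby
# Hypothetical helper: split an indexed file path into the two attributes
# described above (directory name = publication, base name = document).
SwishPath = Struct.new(:publication, :docpath)

def split_indexed_path(path)
  publication = File.basename(File.dirname(path)) # e.g. "publication1"
  docpath = File.basename(path, ".*")             # file name without extension
  SwishPath.new(publication, docpath)
end

%w[publication1/document1.pdf publication2/document3.txt].each do |path|
  parts = split_indexed_path(path)
  puts "#{parts.publication} #{parts.docpath}"
end
```

The two values are what a lookup like from_swishe_result needs: one to find the publication, the other to find the article within it.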
Yep, that's exactly what I was going to do. The one question I'm facing now is how to tell the Rails app that the index has changed so the readers in Ferret can update themselves. Shouldn't be too hard, but being new to Ruby/Rails it will take a little looking into.

-Nick

p.s. I just found that PDFBox has a LucenePDFDocument object already. What could be easier than that?! :)
On 26 Oct 2005, at 10:04, Nick Stuart wrote:
> Yep, thats excatly what I was going to do. The one question I'm facing
> now is how to tell the rails that the index has changed so the readers
> in ferret can update themselves.

Hmmm.... good question. I've not had the chance to roll up my sleeves with Ferret and Rails yet myself, but you could certainly ping the Rails app using a RESTful URL that forces the update, or some sort of other messaging layer.

> p.s. I just found that PDFBox has a LucenePDFDocument object already.
> What could be easier then that?! :)

I personally prefer to control the fields precisely, along with all the attributes of those fields, so I don't use that convenience, but it certainly is handy if the field(s) it creates are suitable to your application.

Erik
On 10/26/05, Erik Hatcher <erik-LIifS8st6VgJvtFkdXX2HpqQE7yCjDx5@public.gmane.org> wrote:
> Hmmm.... good question. I've not had the chance to roll up my
> sleeves with Ferret and Rails yet myself, but you could certainly
> ping the Rails app using a RESTful URL that forces the update, or
> some sort of other messaging layer.

That might work. We already have an admin section of the site, so I should be able to easily add an action to that to reset the index reader. I was thinking of having the indexer scheduled to run at a given time, since I wanted an automatic way of doing this, but that might be a bit overkill because we won't be adding documents every day once the site is set up.

> I personally prefer to control the fields precisely, along with all
> the attributes of those fields, so I don't use that convenience, but
> it certainly is handy if the field(s) it creates are suitable to your
> application.

Ya, I think the LucenePDFDocument will work fine in our case for now. I can always change it later, but right now we don't even have any PDF search capability, so this will be better than what we have. :) I want to get the basics working to begin with and make it fancy later.

-Nick
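One possible way to "reset the index reader" without a dedicated admin action is sketched below. This is not Ferret's API and not something anyone in the thread described: it assumes the indexer touches a stamp file after each run, and the web side reopens its reader when that file's modification time changes. The class ReloadingReader, the helper mark_index_updated, and the stamp-file idea are all hypothetical; the block passed to the constructor is where the real reader would be opened.

```ruby
require "fileutils"

# Hypothetical wrapper: reopens a reader whenever a stamp file's mtime changes.
class ReloadingReader
  def initialize(stamp_path, &opener)
    @stamp_path = stamp_path
    @opener = opener   # block that opens and returns a fresh reader
    @reader = nil
    @loaded_at = nil   # stamp mtime seen when the reader was last opened
  end

  # Returns the current reader, reopening it if the stamp file has been
  # touched (or created) since the reader was last opened.
  def reader
    stamp = File.exist?(@stamp_path) ? File.mtime(@stamp_path) : nil
    if @reader.nil? || stamp != @loaded_at
      @reader = @opener.call
      @loaded_at = stamp
    end
    @reader
  end
end

# Indexer side: call this after each successful indexing run.
def mark_index_updated(stamp_path)
  FileUtils.touch(stamp_path)
end
```

The check costs one stat call per lookup, and every server process picks up the change on its own. The caveat is that two indexing runs inside the file system's timestamp resolution would look like one.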