I am using Ferret right now, and it works great for all my regular text documents/information. My problem arises when I want to index/search all of our assets (mostly PDF files). Currently, there is no way to READ PDFs from Ruby. Because of this I have to resort to using Java to read the PDFs and then Lucene to index them. My problem here is a couple of things.

One, to index an asset I have to either fire up a completely new JVM for each asset, or have the index rebuilt each night at a set time. Each way has its own advantages/downfalls, but the biggest is that Ferret doesn't like to talk to Lucene-created indexes :( doh!

So, on to number two. So now I can go at this from a couple of angles. I could create a Java web service to do the indexing and the searching and then return the results. Or I could simply write a small utility program (with Groovy perhaps?) that uses Java just to get the content of the PDF files and use Ferret for everything. Or some combination of one or the other, or something completely different.

I'm interested in what you folks out there have to say about this. I would really, really like to avoid creating a whole web service just for searching, but if that's the most viable way then I may go that route.

-Nick "searching for a clue" S
On Tuesday, February 07, 2006, at 3:27 PM, Nick Stuart wrote:
> asset, or have the index rebuilt each night at a set time. Each way has
> their own advantages/downfalls, but the biggest is that Ferret doesn't like
> to talk to Lucene created indexes :( doh!

I could swear that I had read Ferret would use Lucene's indices and vice versa... I just can't dig up the link now. It's a true port, right?

Out of curiosity, what are you using to read the PDFs in Java?

Thanks!
John

--
Posted with http://DevLists.com. Sign up and save your time!
Also, is this a Linux or unix box? Have you considered running the pdfs through pdf2txt and indexing the results?
It does read Lucene indexes to a point, and in fact worked for my initial testing. However, once I started to really put content into the indexes it ran into some encoding issues. David knows about this and stated it was a non-trivial issue with how Lucene produces its indexes, or how Ferret reads them (can't remember exactly).

For reading the PDFs I'm using PDFBox and it's worked great so far.

-Nick

On 7 Feb 2006 20:38:29 -0000, John Wells <devlists-rubyonrails@devlists.com> wrote:
> I could swear that I had read Ferret would use Lucene's indices and vice
> versa...I just can't dig up the link now. It's a true port, right?
>
> Out of curiosity, what are you using to read the pdfs in Java?

_______________________________________________
Rails mailing list
Rails@lists.rubyonrails.org
http://lists.rubyonrails.org/mailman/listinfo/rails
This is unfortunately deployed on a Windows server. I haven't looked to see if there is a tool like pdf2text for Windows, but would like to keep this as platform agnostic as possible.

-Nick

On 7 Feb 2006 20:40:51 -0000, John Wells <devlists-rubyonrails@devlists.com> wrote:
> Also, is this a Linux or unix box? Have you considered running the pdfs
> through pdf2txt and indexing the results?
Another option is to create a Java-based index (and even search) server. There is one in incubation at Apache (http://incubator.apache.org/projects/solr.html), which is distilled from the system that drives CNET's amazing faceted search system. This option would allow you to use PDFBox (or the wrapper TextMining) to deal with PDFs quite reliably.

I currently use a simple custom Java XML-RPC search server with my Rails front-end, and it works quite well and more than acceptably fast. I index with a Java process as well.

And yeah, unfortunately there is a mismatch with Ferret and Lucene indexes when certain characters get used. This has been discussed in great detail on the java-dev@lucene e-mail list: http://marc.theaimsgroup.com/?l=lucene-user&m=112529400725721&w=2 - it is an unfortunate situation that seems to result in either making Java Lucene somewhat slower to be more standard or making the ports deal with a Java oddity in UTF. The low-level specifics have thus far gone over my head, but have been discussed by some very sharp folks in that thread and others. PyLucene does not suffer this issue because it is truly Java Lucene underneath (via GCJ and SWIG).

Erik
On 2/7/06, Erik Hatcher <erik@ehatchersolutions.com> wrote:
> Another option is to create a Java-based index (and even search)
> server. There is one in incubation at Apache
> (http://incubator.apache.org/projects/solr.html), which is distilled from the
> system that drives CNET's amazing faceted search system. This option
> would allow you to use PDFBox (or the wrapper TextMining) to deal
> with PDF's quite reliably.

Will have to take a look into that, hadn't noticed it before.

> I currently use a simple custom Java XML-RPC search server with my
> Rails front-end, and it works quite well and more than acceptably
> fast. I index with a Java process as well.

Ya, I think that is going to have to be the route I take. I have a couple of questions though, if you don't mind sharing. What RPC server do you use? I am currently thinking about using Axis, but am not sure what else is out there. And for building the indexes, do you have a task that just runs at a scheduled time to rebuild the indexes? (Do you build them from scratch or add onto them?)

> And yeah, unfortunately there is a mismatch with Ferret and Lucene
> indexes when certain characters get used. This has been discussed in
> great detail on the java-dev@lucene e-mail list:
> http://marc.theaimsgroup.com/?l=lucene-user&m=112529400725721&w=2 - it is an
> unfortunate situation that seems to result in either making Java
> Lucene somewhat slower to be more standard or make the ports have to
> deal with a Java oddity in UTF. The low-level specifics have thus
> far gone over my head, but have been discussed by some very sharp
> folks at that thread and others. PyLucene does not suffer this issue
> because it is truly Java Lucene underneath (via GCJ and SWIG).

Ya, that was what I found. The detail of the problem is not really something I'm interested in, or can even currently understand. It's unfortunate too, because everything about Ferret has been working great. The only reason I need Java at all at this point is to get stuff out of PDFs.

Thanks for the info Erik, and all, looks like I have some research to do here.

-Nick
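The thread never spells out the "Java oddity in UTF". One well-known Java quirk that fits the description is "modified UTF-8", the encoding `DataOutputStream.writeUTF` produces; that this is the exact cause of the Ferret/Lucene mismatch is an assumption here, not something the posts confirm. A minimal sketch (the class and method names are made up for illustration) that shows where the two encodings diverge:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: compares standard UTF-8 with Java's modified UTF-8.
class ModifiedUtf8Demo {

    // Bytes written by DataOutputStream.writeUTF, minus its 2-byte length prefix.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] withPrefix = buffer.toByteArray();
        return Arrays.copyOfRange(withPrefix, 2, withPrefix.length);
    }

    // Bytes a non-Java reader would normally expect.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Space-separated upper-case hex, e.g. "C0 80".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String nul = "\u0000";                                  // U+0000
        String clef = new String(Character.toChars(0x1D11E));   // outside the BMP
        for (String s : new String[] {"A", nul, clef}) {
            System.out.println(hex(standardUtf8(s)) + "  vs  " + hex(modifiedUtf8(s)));
        }
    }
}
```

Plain ASCII text encodes identically both ways, which would be consistent with an index reading fine in early testing and only failing once "certain characters" show up.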
On Feb 7, 2006, at 8:09 PM, Nick Stuart wrote:
> Ya, I think is going to have to be the route I take. I have a
> couple of questions though if you dont mind sharing.
> What rpc server do you use? Am currently thinking about using axis,
> but am not sure what else is out there.

I use Apache's XML-RPC implementation: http://ws.apache.org/xmlrpc/

My code is essentially this:

    public class SearchServer {
      private IndexReader reader;
      private IndexSearcher searcher;

      public SearchServer(Directory directory) throws IOException {
        reader = IndexReader.open(directory);
        searcher = new IndexSearcher(reader);
      }

      // Called over XML-RPC by the Rails front-end.
      public Hashtable search(Vector constraints, int start, int max)
          throws IOException, ParseException {
        // ...
      }

      public static void main(String[] args) {
        String indexPath = args[0];

        // Default port, overridable by a second command-line argument.
        int port = 8076;
        if (args.length > 1) {
          port = Integer.valueOf(args[1]).intValue();
        }

        try {
          WebServer server = new WebServer(port);
          // Register this object as the default XML-RPC handler.
          server.addHandler("$default",
              new SearchServer(FSDirectory.getDirectory(indexPath, false)));
          server.start();
        } catch (Exception e) {
          System.err.println("SearchServer: " + e.toString());
          e.printStackTrace(System.err);
        }
      }
    }

> And for building the indexes, do you have a task that just runs on
> a scheduled time to rebuild the indexes? (do you build them from
> scratch or add onto them?)

My project deals with static data. I build the index from scratch once and that is it, so there is no updating of it on the fly. When the data changes, the entire index is rebuilt.

> Ya, that was what I found. The detail of the problem is not really
> something I'm interested in, or can even currently understand. Its
> unfortunate too, because everything about ferret has been working
> great. The only reason I need Java at all at this point is to get
> stuff out of PDF's.

With your use of Ferret, I'm curious what size of data, such as how many documents, you're dealing with? What does your index size end up being?

Thanks,
Erik
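On the "from scratch or add onto them?" question: Erik rebuilds everything, which is the simplest option for static data. If assets change often, an alternative is to work out which ones need re-indexing and which should be dropped. The following is only a sketch of that bookkeeping under stated assumptions: `IndexPlan`, `between`, and the two timestamp maps are hypothetical names, and the actual Lucene add/delete calls are deliberately left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: decides what an incremental index update would touch.
class IndexPlan {
    final List<String> toIndex = new ArrayList<>();   // new or modified assets
    final List<String> toDelete = new ArrayList<>();  // indexed but no longer on disk

    // indexedAt:  asset id -> time it was last indexed (millis)
    // modifiedAt: asset id -> last-modified time of the asset on disk (millis)
    static IndexPlan between(Map<String, Long> indexedAt, Map<String, Long> modifiedAt) {
        IndexPlan plan = new IndexPlan();
        // TreeMap gives a stable, sorted order for the resulting lists.
        for (Map.Entry<String, Long> asset : new TreeMap<>(modifiedAt).entrySet()) {
            Long lastIndexed = indexedAt.get(asset.getKey());
            if (lastIndexed == null || lastIndexed < asset.getValue()) {
                plan.toIndex.add(asset.getKey());
            }
        }
        for (String id : new TreeMap<>(indexedAt).keySet()) {
            if (!modifiedAt.containsKey(id)) {
                plan.toDelete.add(id);
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        Map<String, Long> indexedAt = new HashMap<>();
        indexedAt.put("manual.pdf", 1000L);
        indexedAt.put("old-flyer.pdf", 1000L);

        Map<String, Long> modifiedAt = new HashMap<>();
        modifiedAt.put("manual.pdf", 2000L);   // changed since it was indexed
        modifiedAt.put("pricing.pdf", 500L);   // never indexed

        IndexPlan plan = between(indexedAt, modifiedAt);
        System.out.println("re-index: " + plan.toIndex);
        System.out.println("delete:   " + plan.toDelete);
    }
}
```

With only a few hundred documents a full rebuild may well be fast enough that this bookkeeping is not worth the extra moving parts.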
http://arton.no-ip.info/collabo/backyard/?RubyJavaBridge

Works like a charm on modern Linuxes. On Windows you have to (or at least I had to) fix (undo) the automatic iconv that's going on. Just make the check for whether or not iconv has been installed always return false.

--
Posted via http://www.ruby-forum.com/.
On Tuesday, February 07, 2006, at 4:48 PM, Erik Hatcher wrote:
> PyLucene does not suffer this
> issue because it is truly Java Lucene underneath (via GCJ and SWIG).

Just out of curiosity, is there anything technically preventing Ruby from going the same route (i.e., binding to a GCJ-compiled version through SWIG)?
On 2/8/06, Erik Hatcher <erik@ehatchersolutions.com> wrote:
> > I currently use a simple custom Java XML-RPC search server with my
> > Rails front-end, and it works quite well and more than acceptably
> > fast. I index with a Java process as well.

That was the kind of thing I'm looking for. I knew Tomcat or any other container was a bit overkill for this.

> I use Apache's XML-RPC implementation: http://ws.apache.org/xmlrpc/
>
> My code is essentially this:
>
> public class SearchServer {
> ...
> }

Looks to be about the amount of code I want to write for this. I already have the indexing part built, so hopefully I'll just be able to add a few more RPC calls to add/delete/modify the indexes as needed.

> My project deals with static data. I build the index from scratch
> once and that is it, so there is no updating of it on the fly. When
> the data changes, the entire index is rebuilt.
>
> With your use of Ferret, I'm curious what size of data, such as how
> many documents, you're dealing with? What does your index size end
> up being?

Currently I'm only indexing our page content, and I'm not storing anything in the index besides an id field so I know which page the content is linked to. With that, at about 100 pages the index size is only 252 KB, pretty damn small.

I know we also have a couple hundred PDFs that are going to be indexed, and again, I won't actually store anything besides an id field so I can load the asset when needed. I expect that this index won't be that sizable either.

Thanks for the info Erik!
I'll have to look into that Colin, thanks for pointing it out!

On 2/8/06, Colin <Colin@no.no> wrote:
> http://arton.no-ip.info/collabo/backyard/?RubyJavaBridge
>
> Works like a charm on modern linuxes.
On Feb 8, 2006, at 12:53 PM, Nick Stuart wrote:
> That was the kind of thing I'm looking for. I knew tomcat or any other
> container was bit overkill for this.
>
> Looks to be about the amount of code I want to write for this. I
> already have the indexing part built so hopefully I'll just be able
> to add a few more rpc calls to add/delete/modify the indexes as
> needed.

It's one of those things that's pragmatically simple enough to solve the problem at hand. Why of course I'd love to stay in Ruby all the time, but there is absolutely nothing wrong with using the best of all the computing tools at our disposal, even if it is that evil four letter word Java :)

I'm as simple as they come. I gravitate away from complexity. And Apache's XML-RPC was the simplest and most performant way I could glue the oh so incredibly powerful Lucene into a Rails front-end. Nothing wrong with being multi-lingual, I say.

> Currently I'm only indexing our page content and with this I'm not
> storing anything in the index besides an id field so I know the page
> that the stuff is linked to. With that, at about 100 pages the index
> size is only 252kb, pretty damn small.
>
> With that though, I know we have a couple hundred PDFs that are going
> to be indexed as well, and again, wont actually store anything besides
> a id field so I can load it up when needed. I expect though that this
> index wont be that sizable either.

Ok, so we're still talking < 100 documents in the Ferret index total, which is quite suitable, I think, for the pure Ruby Ferret to handle. I'm curious how it would fare with my 30k document set - ashamedly I've yet to try it actually. I sorta put Ferret on the backburner due to time constraints and needing to move my project forward using the Java-based indexing code I already had, and hit the wall when my Java-built index sadly did not jibe with Ferret. No offense to Dave whatsoever; in fact I'm still in awe of the cleanliness and internal elegance of Ferret and how it matches Java code-wise. A work of art truly.

More on this thread in a sec....

Erik
On Feb 8, 2006, at 10:47 AM, John Wells wrote:
> On Tuesday, February 07, 2006, at 4:48 PM, Erik Hatcher wrote:
>> PyLucene does not suffer this
>> issue because it is truly Java Lucene underneath (via GCJ and SWIG).
>
> Just out of curiosity, is there anything technically preventing Ruby
> from going the same route (i.e., binding to a GCJ-compiled version
> through SWIG)?

Nothing whatsoever, and it's a project I keep dreaming of but never making the time to do myself. I have, however, been tinkering with PyLucene's build process recently in a half-hearted attempt to do this. I would definitely open source it under the ASL and host it at lucene.apache.org. PyLucene has expressed interest in migrating to be a sibling of Java Lucene also.

The primary reason I think the GCJ/SWIG approach is the right way to the most amazing search engine available, even considering commercial products, is that Doug Cutting and the other information retrieval experts are spending their time improving the Java codebase constantly. I personally want the latest greatest text search engine features to be available exactly the same from both Java and Ruby, with of course the added semantic (syntactic sugar you may call it) goodies that Ruby offers.

Dave - go full steam with Ferret! It rocks! I am as much a cheerleader of your efforts as anyone possibly could be.

Erik
On 2/8/06, Erik Hatcher <erik@ehatchersolutions.com> wrote:
> It's one of those things thats pragmatically simple enough to solve
> the problem at hand. Why of course I'd love to stay in Ruby all the
> time, but there is absolutely nothing wrong with using the best of
> all the computing tools at our disposal, even if it is that evil four
> letter word Java :)
>
> I'm as simple as they come. I gravitate away from complexity. And
> Apache's XML-RPC was the simplest and most performant way I could
> glue the oh so incredibly powerful Lucene into a Rails front-end.
> Nothing wrong with being multi-lingual, I say.

No, nothing wrong with it at all. I still work on another internal project that is Java/Swing based, so I get to deal with the four-letter words on a regular basis. What I did want to avoid was having to set up a whole container app just to run my searches. That, and everything that goes with it, seemed to be a bit too much heavy lifting for this.

> Ok, so we're still talking < 100 documents in the Ferret index total,
> which is quite suitable, I think for the pure Ruby Ferret to handle.
> I'm curious how it would fair with my 30k document set - ashamedly
> I've yet to try it actually. I sorta put Ferret on the backburner
> due to time constraints and needing to move my project forward using
> the Java-based indexing code I already had, and hit the wall when my
> Java built index did not jive with Ferret sadly. No offense to Dave
> whatsoever, in fact I'm still in awe of the cleanliness and internal
> elegance to Ferret and how it matches Java code-wise. A work of art
> truly.

I agree with the comments on Ferret. And as I said before, if I had a reliable way to get the text out of these silly PDF files easily and in a pure Ruby fashion, then I would most certainly stay with it. I was never concerned with any of the perceived slowness because of my relatively low doc count, and Ferret has been great to get me started so far!
Hey Erik, just wanted to say thanks! Got the search stuff up and running following your suggestions. It works like a champ, and it is pretty quick too. The longest part is re-indexing the assets themselves. The Ruby XMLRPC client actually times out on that, so I still have to change that, but the indexing and search work great.

For anyone interested, I have two postings on my site with a little more detail on what I went through: http://blog.nicholasstuart.com/

-Nick
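One possible way around the client timeout on re-indexing (not something the thread settles on) is to have the server start the rebuild in the background and return at once, with a second call the client can poll. A minimal sketch, assuming the rebuild itself is handed in as a `Runnable`; `BackgroundReindexer`, `startReindex`, `status`, and `waitUntilFinished` are hypothetical names, and wiring them into the XML-RPC server is not shown.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a long index rebuild off the request thread.
class BackgroundReindexer {
    private final Runnable rebuild;
    private final ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "reindex-worker");
        t.setDaemon(true);  // do not keep the JVM alive just for this thread
        return t;
    });
    private Future<?> current;  // guarded by "this"

    BackgroundReindexer(Runnable rebuild) {
        this.rebuild = rebuild;
    }

    // Returns immediately: true if a rebuild was started, false if one is already running.
    synchronized boolean startReindex() {
        if (current != null && !current.isDone()) {
            return false;
        }
        current = worker.submit(rebuild);
        return true;
    }

    // One of "idle", "running", "done", "failed"; cheap enough to poll.
    synchronized String status() {
        if (current == null) {
            return "idle";
        }
        if (!current.isDone()) {
            return "running";
        }
        try {
            current.get();  // already finished, so this does not block
            return "done";
        } catch (ExecutionException e) {
            return "failed";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "running";
        }
    }

    // Blocks up to timeoutMillis; true if no rebuild is in flight afterwards.
    boolean waitUntilFinished(long timeoutMillis) {
        Future<?> job;
        synchronized (this) {
            job = current;
        }
        if (job == null) {
            return true;
        }
        try {
            job.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (ExecutionException e) {
            return true;  // finished, although with an error; see status()
        } catch (TimeoutException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        BackgroundReindexer reindexer = new BackgroundReindexer(
            () -> System.out.println("rebuilding index (placeholder)..."));
        System.out.println("started: " + reindexer.startReindex());
        reindexer.waitUntilFinished(5_000);
        System.out.println("status: " + reindexer.status());
    }
}
```

The simpler alternative is to raise the timeout on the Ruby client; the background approach mainly helps if rebuild time keeps growing with the number of assets.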