David Kahn
2010-Aug-05 15:50 UTC
Search large XML file -- REXML slower than a slug, regex instantaneous
Got a question hopefully someone can answer -

I am working on functionality to match on certain nodes of a largish (65mb) xml file. I implemented this with REXML and was 2 minutes and counting before I killed the process. After this, I just opened the console and loaded the file into a string and did a regex search for my data -- the result was almost instantaneous.

The question is, if I can get away with it, am I better off just going the regex route, or is it really worth my while to investigate a faster XML parser (I know REXML is notorious for being slow, but given how fast it was to call a regex on the file, I am thinking that this will still be faster than all parsers).

Any comments or suggestions appreciated.

David

--
You received this message because you are subscribed to the Google Groups "Ruby on Rails: Talk" group.
To post to this group, send email to rubyonrails-talk-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To unsubscribe from this group, send email to rubyonrails-talk+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
For more options, visit this group at http://groups.google.com/group/rubyonrails-talk?hl=en.
Marnen Laibow-Koser
2010-Aug-05 15:55 UTC
Re: Search large XML file -- REXML slower than a slug, regex instantaneous
David Kahn wrote:
> Got a question hopefully someone can answer -
>
> I am working on functionality to match on certain nodes of a largish (65mb)
> xml file. I implemented this with REXML and was 2 minutes and counting
> before I killed the process. After this, I just opened the console and
> loaded the file into a string and did a regex search for my data -- the
> result was almost instantaneous.
>
> The question is, if I can get away with it, am I better off just going the
> regex route, or is it really worth my while to investigate a faster XML
> parser (I know REXML is notorious for being slow,

Then why the heck are you even bringing it up in this situation? I *think* Nokogiri is supposed to be much faster.

> but given how fast it was
> to call a regex on the file, I am thinking that this will still be faster
> than all parsers).

Who cares how fast it is if it's inaccurate? Regular expressions are the wrong tool for parsing XML, because they can't cope easily (or at all) with lots of valid XML constructs. If you're parsing XML, use an actual XML parser, or you risk serious errors.

> Any comments or suggestions appreciated.
>
> David

Best,
--
Marnen Laibow-Koser
http://www.marnen.org
marnen-sbuyVjPbboAdnm+yROfE0A@public.gmane.org
--
Posted via http://www.ruby-forum.com/.
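To make the accuracy concern concrete, here is a minimal sketch, assuming Ruby with the bundled REXML library. The sample data and both helper names (`regex_first_names`, `parser_first_names`) are invented for illustration. It shows one valid construct (a CDATA section) that a naive regex misses and one (a commented-out record) that it wrongly matches.

```ruby
require "rexml/document"

# Made-up sample: a plain record, the same last name wrapped in CDATA,
# and an old record that has been commented out.
SAMPLE = <<~XML
  <Records>
    <Record><Last>Smith</Last><First>Ann</First></Record>
    <Record><Last><![CDATA[Smith]]></Last><First>Bea</First></Record>
    <!-- <Record><Last>Smith</Last><First>Old</First></Record> -->
  </Records>
XML

# Naive approach: match the literal tag text in the raw string.
def regex_first_names(xml, last)
  xml.scan(%r{<Last>#{Regexp.escape(last)}</Last><First>(.*?)</First>}).flatten
end

# Parser approach: walk real <Record> elements and compare decoded text.
def parser_first_names(xml, last)
  doc = REXML::Document.new(xml)
  names = []
  doc.elements.each("Records/Record") do |record|
    next unless record.elements["Last"]&.text == last
    names << record.elements["First"]&.text
  end
  names
end
```

With this sample, `regex_first_names(SAMPLE, "Smith")` returns `["Ann", "Old"]` (it picks up the commented-out record and misses the CDATA one), while `parser_first_names(SAMPLE, "Smith")` returns `["Ann", "Bea"]`. Whether the real file contains such constructs is not known from the thread.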
David Kahn
2010-Aug-05 18:57 UTC
Re: Re: Search large XML file -- REXML slower than a slug, regex instantaneous
Actually I and my client care how fast, even if it means more work and tests to hedge accuracy. I did try Nokogiri - which I liked getting to know, but it also plods in at ~ 150 seconds which is just unacceptable for someone waiting at a browser. That's what I was trying to get at with my original post and should have provided more data, i.e. am I wasting time with unrealistic expectations for any XML parser in this endeavor.

Unless anyone can point out a more efficient search (code and example xml below), it seems practical in absence of other ideas, to go the way of regex at least to triangulate the data before throwing it to an xml parser to get the details or put the data into a db (which I am trying to avoid).

Below, the second line is what takes forever, understandably.

    @gsa_epls_xml_doc = Nokogiri::HTML(doc_xml)
    @gsa_epls_xml_doc.xpath("//records/record[last='#{last_name}' and first='#{first_name}']").each do |possible_match_record|
      ...

File structure - <Records> with a lot (65mb) of <Record> nodes.

    <Records>
      <Record>
        <Prefix></Prefix>
        <First>Vr</First>
        <Middle>A</Middle>
        <Last>C</Last>
        <Suffix></Suffix>
        <Classification>Individual</Classification>
        <CTType>Reciprocal</CTType>
        <Addresses>
          <Address>
            <City>R</City>
            <ZIP>11576</ZIP>
            <Province/>
            <State>NY</State>
            <DUNS/>
          </Address>
        </Addresses>
        <References/>
        <Actions>
          <Action>
            <ActionDate>22-Apr-2004</ActionDate>
            <TermDate>Indef.</TermDate>
            <CTCode>Z2</CTCode>
            <AgencyComponent>OPM</AgencyComponent>
          </Action>
          <Action>
            <ActionDate>19-Feb-2004</ActionDate>
            <TermDate>Indef.</TermDate>
            <CTCode>Z1</CTCode>
            <AgencyComponent>HHS</AgencyComponent>
          </Action>
        </Actions>
        <Description/>
        <AgencyIdentifiers/>
      </Record>
      . . . n
    </Records>
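The "regex to narrow down, then parser for the details" idea described above could look roughly like the sketch below. It is only an illustration under several assumptions: `<Record>` elements are never nested and carry no attributes, the names appear in the raw text without entity encoding, and REXML stands in for whichever parser is actually used. The helper name `candidate_records` is hypothetical.

```ruby
require "rexml/document"

# Hypothetical helper: returns parsed <Record> elements whose Last and
# First children match exactly.
def candidate_records(xml, last_name, first_name)
  # Step 1: cheap textual split into <Record>...</Record> chunks
  # (assumes no nesting and no attributes on <Record>).
  xml.scan(%r{<Record>.*?</Record>}m).filter_map do |chunk|
    # Step 2: skip chunks that cannot match, without parsing them.
    next unless chunk.include?(last_name) && chunk.include?(first_name)

    # Step 3: parse only the surviving chunk and confirm with real
    # element lookups, so a name inside another field does not count.
    record = REXML::Document.new(chunk).root
    record if record.elements["Last"]&.text == last_name &&
              record.elements["First"]&.text == first_name
  end
end
```

The parser then only ever sees a handful of small fragments. The trade-off is that step 1 still relies on the file being as regular as the sample, which is exactly the accuracy risk raised earlier in the thread.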
Marnen Laibow-Koser
2010-Aug-05 19:28 UTC
Re: Re: Search large XML file -- REXML slower than a slug, regex instantaneous
Please quote when replying. It is very hard to follow the discussion if you don't.

David Kahn wrote:
> Actually I and my client care how fast, even if it means more work and
> tests to hedge accuracy.

And by the time you do that extra work for correctness, you will have developed a system equivalent to REXML or Nokogiri, and likely with similar or worse performance. You're fighting a losing battle here.

> I did try Nokogiri - which I liked getting to know, but
> it also plods in at ~ 150 seconds which is just unacceptable for someone
> waiting at a browser.

Waiting at a browser? Let me get this straight -- your app is trying to process a 65MB file in real time? That's insane. Do some of the processing in advance, or tell the user that he can expect a 2-minute wait (which is absolutely reasonable for that much data).

> That's what I was trying to get at with my original
> post and should have provided more data, i.e. am I wasting time with
> unrealistic expectations for any XML parser in this endeavor.
>
> Unless anyone can point out a more efficient search (code and example xml
> below), it seems practical in absence of other ideas, to go the way of regex
> at least to triangulate the data before throwing it to an xml parser to get
> the details or put the data into a db (which I am trying to avoid).

Why are you trying to avoid putting the data into a DB? Databases are designed for quick searches through lots of data -- in other words, exactly what you are doing. XML really is not. (You could try eXistDB, though.)

> Below, the second line is what takes forever, understandably.
> @gsa_epls_xml_doc = Nokogiri::HTML(doc_xml)
> @gsa_epls_xml_doc.xpath("//records/record[last='#{last_name}' and
> first='#{first_name}']").each do |possible_match_record| ...

I'm assuming gsa is Google Search Appliance. Can't it do the searching itself and give you back only the records you need?

Best,
--
Marnen Laibow-Koser
http://www.marnen.org
marnen-sbuyVjPbboAdnm+yROfE0A@public.gmane.org
--
Posted via http://www.ruby-forum.com/.
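One way to read "do some of the processing in advance" is to parse the file once, outside any web request, and keep a lookup structure keyed by name. The sketch below is a stand-in for a real database, not a replacement for one: it uses a plain Hash persisted with Marshal, and every helper name (`build_name_index`, `save_index`, `load_index`, `lookup`) is hypothetical. REXML is used only because it ships with Ruby; the one-off build step could use any parser.

```ruby
require "rexml/document"

# Run once, offline: group a few fields of each record by [last, first].
def build_name_index(xml)
  index = {}
  REXML::Document.new(xml).elements.each("Records/Record") do |record|
    key = [record.elements["Last"]&.text, record.elements["First"]&.text]
    summary = { "Classification" => record.elements["Classification"]&.text }
    (index[key] ||= []) << summary
  end
  index
end

# Persist the index so requests never touch the XML file.
def save_index(index, path)
  File.binwrite(path, Marshal.dump(index))
end

# Only load index files that this application wrote itself;
# Marshal.load is not safe on untrusted data.
def load_index(path)
  Marshal.load(File.binread(path))
end

# Per-request lookup: a single Hash access.
def lookup(index, last_name, first_name)
  index.fetch([last_name, first_name], [])
end
```

The slow parse then happens whenever the source file changes rather than while someone waits at a browser. A database table with an index on the two name columns would serve the same purpose and would also handle partial matches.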
Jeffrey L. Taylor
2010-Aug-05 20:12 UTC
Re: Search large XML file -- REXML slower than a slug, regex instantaneous
Quoting David Kahn <dk-rfEMNHKVqOwNic7Bib+Ti1W1rNmOCjRP@public.gmane.org>:
> Got a question hopefully someone can answer -
>
> I am working on functionality to match on certain nodes of a largish (65mb)
> xml file. I implemented this with REXML and was 2 minutes and counting
> before I killed the process. After this, I just opened the console and
> loaded the file into a string and did a regex search for my data -- the
> result was almost instantaneous.
>
> The question is, if I can get away with it, am I better off just going the
> regex route, or is it really worth my while to investigate a faster XML
> parser (I know REXML is notorious for being slow, but given how fast it was
> to call a regex on the file, I am thinking that this will still be faster
> than all parsers).

Look at using LibXML::XML::Reader

http://libxml.rubyforge.org/rdoc/index.html

What most XML parsing libraries are doing is reading the entire XML file into memory, probably storing the raw text, parsing it, and creating an even bigger data structure for the whole file, then searching over it. Nokogiri at least does some of the searching in C, instead of Ruby (it uses libxml2).

With LibXML::XML::Reader it is possible (with some not very pretty code) to make one pass thru the XML file, parsing as you go, and create data structures for just the information of interest. Enormously faster.

HTH,
Jeffrey
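The one-pass idea can be sketched as follows. Note the swap: this uses REXML's stream (SAX-style) listener, which ships with Ruby, as a stand-in for LibXML::XML::Reader, so it shows the shape of the technique and says nothing about libxml's API or speed. The class name `RecordMatcher`, the helper `stream_search`, and the choice of fields in `WANTED` are assumptions made for the example.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Collects only the fields of interest, one <Record> at a time,
# without ever building a tree for the whole file.
class RecordMatcher
  include REXML::StreamListener

  WANTED = %w[First Last Classification].freeze

  attr_reader :matches

  def initialize(last_name, first_name)
    @last_name = last_name
    @first_name = first_name
    @matches = []
    @record = nil # fields of the <Record> currently being read
    @field = nil  # wanted element currently open, if any
  end

  def tag_start(name, _attrs)
    if name == "Record"
      @record = {}
    elsif @record && WANTED.include?(name)
      @field = name
    end
  end

  def text(value)
    # Text may arrive in pieces, so append rather than assign.
    @record[@field] = @record.fetch(@field, "") + value if @record && @field
  end

  def tag_end(name)
    @field = nil if name == @field
    return unless name == "Record"

    # Keep the record only if both names match, then forget it.
    if @record["Last"] == @last_name && @record["First"] == @first_name
      @matches << @record
    end
    @record = nil
  end
end

# io can be an open File or any IO-like object.
def stream_search(io, last_name, first_name)
  matcher = RecordMatcher.new(last_name, first_name)
  REXML::Document.parse_stream(io, matcher)
  matcher.matches
end
```

Usage would be along the lines of `File.open(path) { |f| stream_search(f, "C", "Vr") }`, where `path` is whatever file holds the records. Memory stays proportional to one record plus the matches, which is the property described above.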
Marnen Laibow-Koser
2010-Aug-05 20:41 UTC
Re: Search large XML file -- REXML slower than a slug, regex instantaneous
Jeffrey L. Taylor wrote:
> Quoting David Kahn <dk-rfEMNHKVqOwNic7Bib+Ti1W1rNmOCjRP@public.gmane.org>:
>> parser (I know REXML is notorious for being slow, but given how fast it was
>> to call a regex on the file, I am thinking that this will still be faster
>> than all parsers).
>
> Look at using LibXML::XML::Reader
>
> http://libxml.rubyforge.org/rdoc/index.html
>
> What most XML parsing libraries are doing is reading the entire XML file into
> memory, probably storing the raw text, parsing it, and creating an even bigger
> data structure for the whole file, then searching over it. Nokogiri at least
> does some of the searching in C, instead of Ruby (it uses libxml2).
>
> With LibXML::XML::Reader it is possible (with some not very pretty code) to make
> one pass thru the XML file, parsing as you go, and create data structures for
> just the information of interest. Enormously faster.

Interesting; that seems worth knowing about. But wouldn't Reader still have to create a DOM tree to do the searching in the first place?

> HTH,
> Jeffrey

Best,
--
Marnen Laibow-Koser
http://www.marnen.org
marnen-sbuyVjPbboAdnm+yROfE0A@public.gmane.org
--
Posted via http://www.ruby-forum.com/.
Frederick Cheung
2010-Aug-05 20:58 UTC
Re: Search large XML file -- REXML slower than a slug, regex instantaneous
On Aug 5, 9:41 pm, Marnen Laibow-Koser <li...-fsXkhYbjdPsEEoCn2XhGlw@public.gmane.org> wrote:
> Interesting; that seems worth knowing about. But wouldn't Reader still
> have to create a DOM tree to do the searching in the first place?

Not necessarily - that's essentially the difference between a SAX type parse and a document based one.

Fred

> > HTH,
> > Jeffrey
>
> Best,
> --
> Marnen Laibow-Koser
> http://www.marnen.org
> mar...-sbuyVjPbboAdnm+yROfE0A@public.gmane.org
> --
> Posted via http://www.ruby-forum.com/.
Marnen Laibow-Koser
2010-Aug-05 21:33 UTC
Re: Search large XML file -- REXML slower than a slug, regex instantaneous
Frederick Cheung wrote:
> On Aug 5, 9:41 pm, Marnen Laibow-Koser <li...-fsXkhYbjdPsEEoCn2XhGlw@public.gmane.org> wrote:
>>
>> Interesting; that seems worth knowing about. But wouldn't Reader still
>> have to create a DOM tree to do the searching in the first place?
>
> Not necessarily - that's essentially the difference between a SAX type
> parse and a document based one.

:P I used to know that, back when I actually worked regularly with XML. Thanks for the reminder.

> Fred

Best,
--
Marnen Laibow-Koser
http://www.marnen.org
marnen-sbuyVjPbboAdnm+yROfE0A@public.gmane.org

Sent from my iPhone
--
Posted via http://www.ruby-forum.com/.