Monserrat Foster
2013-Oct-10 20:36 UTC
What's the best way to approach reading and parsing large XLSX files?
Hello, I'm developing an app that receives an XLSX file of 10MB or less with 30,000+ rows or so, and another XLSX file with about 200 rows. I have to read one row of the smallest file, look it up in the largest file, and write data from both files to a new one.

I just did a test reading a few rows from the largest file using roo (Spreadsheet doesn't support XLSX, and Creek looks good but I can't find a way to read row by row) and it basically made my computer crash. The server crashed, I tried rebooting it and it said it was already started; anyway, it was a disaster.

So, my question is: is there a gem that works best with large XLSX files, or is there another way to approach this without crashing my computer?

This is what I had (it's very possible I'm doing it wrong, help is welcome). What I was trying to do here was to process the files and create the new XLS file after both of the XLSX files were uploaded:

require 'roo'
require 'spreadsheet'
require 'creek'

class UploadFiles < ActiveRecord::Base
  after_commit :process_files
  attr_accessible :inventory, :material_list
  has_one :inventory
  has_one :material_list
  has_attached_file :inventory, :url => "/:current_user/inventory",
    :path => ":rails_root/tmp/users/uploaded_files/inventory/inventory.:extension"
  has_attached_file :material_list, :url => "/:current_user/material_list",
    :path => ":rails_root/tmp/users/uploaded_files/material_list/material_list.:extension"
  validates_attachment_presence :material_list
  accepts_nested_attributes_for :material_list, :allow_destroy => true
  accepts_nested_attributes_for :inventory, :allow_destroy => true
  validates_attachment_content_type :inventory,
    :content_type => ["application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"],
    :message => "Only .XLSX files are accepted as Inventory"
  validates_attachment_content_type :material_list,
    :content_type => ["application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"],
    :message => "Only .XLSX files are accepted as Material List"

  def process_files
    inventory = Creek::Book.new(Rails.root.to_s + "/tmp/users/uploaded_files/inventory/inventory.xlsx")
    material_list = Creek::Book.new(Rails.root.to_s + "/tmp/users/uploaded_files/material_list/material_list.xlsx")
    inventory = inventory.sheets[0]
    scl = Spreadsheet::Workbook.new
    sheet1 = scl.create_worksheet
    inventory.rows.each do |row|
      row.inspect
      sheet1.row(1).push(row)
    end

    sheet1.name = "Site Configuration List"
    scl.write(Rails.root.to_s + "/tmp/users/generated/siteconfigurationlist.xls")
  end
end
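For what it's worth, Creek's README documents a streaming row enumerator, so a minimal sketch of the same process_files idea could look like the lines below. This is an untested, assumption-laden sketch: sheet.rows is assumed to yield one hash per row keyed by cell reference ("A1" => value), and the output uses an incrementing row index (the snippet above pushes every row into row(1), which appends everything to a single output row).

require 'creek'
require 'spreadsheet'

book  = Creek::Book.new(Rails.root.join("tmp/users/uploaded_files/inventory/inventory.xlsx").to_s)
sheet = book.sheets[0]

scl = Spreadsheet::Workbook.new
out = scl.create_worksheet
out.name = "Site Configuration List"

# rows is an enumerator, so only one row needs to be held in memory at a time
sheet.rows.each_with_index do |row, index|
  out.row(index).concat(row.values)   # row is a Hash of cell reference => value
end

scl.write(Rails.root.join("tmp/users/generated/siteconfigurationlist.xls").to_s)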
Walter Lee Davis
2013-Oct-10 20:42 UTC
Re: What's the best way to approach reading and parsing large XLSX files?
On Oct 10, 2013, at 4:36 PM, Monserrat Foster wrote:

> Hello, I'm developing an app that receives an XLSX file of 10MB or less with 30,000+ rows or so, and another XLSX file with about 200 rows. I have to read one row of the smallest file, look it up in the largest file, and write data from both files to a new one.

Wow. Do you have to do all this in a single request?

You may want to look at Nokogiri and its SAX parser. SAX parsers don't care about the size of the document they operate on, because they work one node at a time, and don't load the whole thing into memory at once. There are some limitations on what kind of work a SAX parser can perform, because it isn't able to see the entire document and "know" where it is within the document at any point. But for certain kinds of problems, it can be the only way to go. Sounds like you may need something like this.

Walter
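To make the SAX suggestion concrete: an .xlsx file is a zip archive whose cell data lives in xl/worksheets/sheet1.xml, with string cells stored as indexes into xl/sharedStrings.xml. A rough sketch using Nokogiri's SAX parser plus the rubyzip gem follows; the class name, the file path, and the assumption that every string cell goes through the shared-strings table are illustrative, not from the thread.

require 'zip'        # rubyzip (older versions use require 'zip/zip' and Zip::ZipFile)
require 'nokogiri'

class SheetRowHandler < Nokogiri::XML::SAX::Document
  def initialize(shared_strings, &block)
    @shared_strings = shared_strings   # array of strings, indexed by <c t="s"> cells
    @block = block                     # called once per completed row
  end

  def start_element(name, attrs = [])
    case name
    when 'row' then @row = []
    when 'c'   then @cell_type = Hash[attrs]['t']   # t="s" means shared-string index
    when 'v'   then @value = ''
    end
  end

  def characters(text)
    @value << text if @value
  end

  def end_element(name)
    case name
    when 'v'
      @row << (@cell_type == 's' ? @shared_strings[@value.to_i] : @value)
      @value = nil
    when 'row'
      @block.call(@row)                # hand one row at a time to the caller
    end
  end
end

Zip::File.open("/path/to/inventory.xlsx") do |zip|
  shared  = Nokogiri::XML(zip.read("xl/sharedStrings.xml")).css("si").map(&:text)
  handler = SheetRowHandler.new(shared) { |row| puts row.inspect }
  Nokogiri::XML::SAX::Parser.new(handler).parse(zip.read("xl/worksheets/sheet1.xml"))
end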
Monserrat Foster
2013-Oct-10 20:50 UTC
Re: What's the best way to approach reading and parsing large XLSX files?
A coworker suggested I should use just basic OOP for this: create a class that reads the files, and then another to load the files into memory. Could you please point me in the right direction for this (where can I read about it)? I have no idea what he's talking about, as I've never done this before.

I'll look up Nokogiri and SAX.

On Thursday, October 10, 2013 4:12:33 PM UTC-4:30, Walter Lee Davis wrote:

> You may want to look at Nokogiri and its SAX parser. SAX parsers don't care about the size of the document they operate on, because they work one node at a time, and don't load the whole thing into memory at once. [...]
Martin Streicher
2013-Oct-11 11:01 UTC
Re: What's the best way to approach reading and parsing large XLSX files?
I highly recommend the RubyXL gem. It opens XLSX files and seems very reliable. I use it all the time.
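A short sketch of what using RubyXL might look like (method names follow the RubyXL README and have shifted between versions; the path is just a placeholder). Note that RubyXL parses the whole workbook into memory, so on a 30,000-row file it may run into the same memory ceiling as roo did.

require 'rubyXL'

workbook = RubyXL::Parser.parse("/path/to/inventory.xlsx")
sheet    = workbook.worksheets[0]

sheet.sheet_data.rows.each do |row|
  next if row.nil?
  values = row.cells.map { |cell| cell && cell.value }   # empty cells stay nil
  # look the row up against the material list here
end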
Walter Lee Davis
2013-Oct-11 13:14 UTC
Re: What's the best way to approach reading and parsing large XLSX files?
On Oct 10, 2013, at 4:50 PM, Monserrat Foster wrote:

> A coworker suggested I should use just basic OOP for this: create a class that reads the files, and then another to load the files into memory. Could you please point me in the right direction for this (where can I read about it)? I have no idea what he's talking about, as I've never done this before.

How many of these files are you planning to parse at any one time? Do you have the memory on your server to deal with this load? I can see this approach working, but getting slow and process-bound very quickly. Lots of edge cases to deal with when parsing big uploaded files.

Walter
Monserrat Foster
2013-Oct-11 15:30 UTC
Re: What's the best way to approach reading and parsing large XLSX files?
One 30,000+ row file and another with just over 200. How much memory should I need for this not to take forever parsing? (I'm currently using my computer as the server, and I can see Ruby taking about 1GB in the task manager when processing this, and it takes forever.)

The 30,000+ row file is about 7MB, which is not that much (I think).

On Friday, October 11, 2013 8:44:22 AM UTC-4:30, Walter Lee Davis wrote:

> How many of these files are you planning to parse at any one time? Do you have the memory on your server to deal with this load? I can see this approach working, but getting slow and process-bound very quickly. [...]
Walter Lee Davis
2013-Oct-11 15:42 UTC
Re: What's the best way to approach reading and parsing large XLSX files?
On Oct 11, 2013, at 11:30 AM, Monserrat Foster wrote:

> One 30,000+ row file and another with just over 200. How much memory should I need for this not to take forever parsing? (I'm currently using my computer as the server, and I can see Ruby taking about 1GB in the task manager when processing this, and it takes forever.)
>
> The 30,000+ row file is about 7MB, which is not that much (I think).

I have a collection of 1200 XML files, ranging in size from 3MB to 12MB each (they're books, in TEI encoding), that I parse with Nokogiri on a 2GB Joyent SmartMachine to convert them to XHTML and then on to Epub. This process takes 17 minutes for the first pass, and 24 minutes for the second pass. It does not crash, but the server is unable to do much of anything else while the loop is running.

My question here was: is this something that is a self-serve web service, or an admin-level (one-privileged-user-once-in-a-while) type thing? In my case, there's one admin who adds maybe two or three books per month to the collection, and the 40-minute do-everything loop was used only for development purposes -- it was my test cycle as I checked all of the titles against a validator to ensure that my adjustments to the transcoding process didn't result in invalid code. I would not advise putting something like this live against the world, as the potential for DOS is extremely great. Anything that can pull the kinds of loads you get when you load a huge file into memory and start fiddling with it should not be public!

Walter
Jordon Bedwell
2013-Oct-11 15:43 UTC
Re: What's the best way to approach reading and parsing large XLSX files?
On Fri, Oct 11, 2013 at 10:30 AM, Monserrat Foster wrote:

> One 30,000+ row file and another with just over 200. How much memory should I need for this not to take forever parsing? (I'm currently using my computer as the server, and I can see Ruby taking about 1GB in the task manager when processing this, and it takes forever.)
>
> The 30,000+ row file is about 7MB, which is not that much (I think).

Check for a memory leak.
Donald Ziesig
2013-Oct-11 16:04 UTC
Re: What's the best way to approach reading and parsing large XLSX files?
On 10/11/2013 11:30 AM, Monserrat Foster wrote:

> One 30,000+ row file and another with just over 200. How much memory should I need for this not to take forever parsing? (I'm currently using my computer as the server, and I can see Ruby taking about 1GB in the task manager when processing this, and it takes forever.)
>
> The 30,000+ row file is about 7MB, which is not that much (I think).

I use a rather indirect route that works fine for me with 15,000 lines and about 26 MB. I export the file from LibreOffice Calc as CSV (comma-separated values). Then, in the Rails controller I use something like:

require 'csv'

class TheControllerController # ;')

  # other controller code

  def upload
    data = CSV.parse(params[:entries].tempfile.read) # from Ruby's CSV class
    for line in data do
      logger.debug "line: #{line.inspect}"
      # each line is an array of strings containing the columns of one row of the csv file
      # I use these data to populate the appropriate db table / rails model at this point
    end
  end

end

Make sure that your routes.rb points to this:

  match 'the_controller/upload' => 'the_controller#upload'

From your client machine's command line:

  curl -F entries=@yourdata.csv localhost:3000/the_controller/upload

Note that 'entries' in the curl command matches the 'entries' in params[:entries] in the controller.

If you want to do this from a Rails GUI form, look at http://guides.rubyonrails.org/form_helpers.html#uploading-files

During testing on my 4-core, 8 GB laptop, processing the really big files takes several minutes. When I have the app on Heroku, this causes a timeout, so I break up the csv file into multiple sections such that each section takes less than 30 seconds to upload. By leaving a little 'slack' in the size, I have this automated so it occurs in the background while I am doing other work.

Hope these suggestions help.

Don Ziesig
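One possible tweak to the sketch above (not from the thread): CSV.parse(...read) pulls the whole upload into memory before iterating, while Ruby's CSV.foreach reads the file a line at a time, which keeps memory flat however large the file gets.

require 'csv'

def upload
  path = params[:entries].tempfile.path   # same upload param as in the example above
  CSV.foreach(path) do |line|             # yields one row (an array of strings) at a time
    logger.debug "line: #{line.inspect}"
    # populate the appropriate db table / model here, ideally in batches
  end
end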
Monserrat Foster
2013-Oct-11 16:08 UTC
Re: What's the best way to approach reading and parsing large XLSX files?
Hi, the files automatically download in XLSX format; I can't change them, and I can't force the users to change it in order to make my job easier. Thanks for the suggestion.

On Friday, October 11, 2013 11:34:39 AM UTC-4:30, donz wrote:

> I use a rather indirect route that works fine for me with 15,000 lines and about 26 MB. I export the file from LibreOffice Calc as CSV (comma-separated values). [...]
Monserrat Foster
2013-Oct-11 20:33 UTC
Re: What's the best way to approach reading and parsing large XLSX files?
This is an everyday thing; initially maybe a couple of people at a time uploading and parsing files to generate the new one, but eventually it will extend to other people.

I used a logger, and it does retrieve and save the files using the comparison. But it takes forever, like 30 minutes or so, to generate the file. The process starts as soon as the files are uploaded, but it seems to spend most of the time opening the file; once it's opened, it takes maybe 5 minutes at most to generate the new file.

Do you know where I can find an example of how to read an XLSX file with Nokogiri? I can't seem to find one.

On Friday, October 11, 2013 11:12:20 AM UTC-4:30, Walter Lee Davis wrote:

> My question here was: is this something that is a self-serve web service, or an admin-level (one-privileged-user-once-in-a-while) type thing? [...] I would not advise putting something like this live against the world, as the potential for DOS is extremely great. [...]
Monserrat Foster
2013-Oct-11 20:35 UTC
Re: What's the best way to approach reading and parse large XLSX files?
I forgot to say: after it reads all the rows and writes the file, it throws

   (600.1ms)  begin transaction
    (52.0ms)  commit transaction
failed to allocate memory
Redirected to http://localhost:3000/upload_files/110
Completed 406 Not Acceptable in 1207471ms (ActiveRecord: 693.1ms)
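A "failed to allocate memory" here is exactly the failure mode the SAX suggestion upthread is meant to avoid: the whole worksheet ends up as Ruby objects at once. Below is a rough sketch of streaming the worksheet XML with Nokogiri's SAX parser instead; the rubyzip usage, the xl/worksheets/sheet1.xml entry path and the xlsx_path variable are assumptions, and a complete reader would also resolve shared strings (cells with t="s" store an index into xl/sharedStrings.xml rather than the text itself).

require 'zip'        # rubyzip
require 'nokogiri'

# Sketch only: stream rows out of an .xlsx worksheet without building a DOM.
class RowStreamer < Nokogiri::XML::SAX::Document
  def initialize(&block)
    @on_row = block
    super()
  end

  def start_element(name, attrs = [])
    case name
    when 'row' then @cells = []
    when 'c'   then @cell_type = Hash[attrs]['t']   # 's' means shared string
    when 'v'   then @value = ''
    end
  end

  def characters(text)
    @value << text if @value
  end

  def end_element(name)
    case name
    when 'v'
      @cells << [@value, @cell_type]
      @value = nil
    when 'row'
      @on_row.call(@cells)
    end
  end
end

# The first worksheet usually lives at this path inside the package.
xml = Zip::File.open(xlsx_path) { |zip| zip.read('xl/worksheets/sheet1.xml') }

handler = RowStreamer.new do |cells|
  # One row at a time arrives here: look it up, write it out, move on.
  # Cells flagged 's' still need their shared-string index resolved.
  p cells
end

Nokogiri::XML::SAX::Parser.new(handler).parse(xml)

Because the handler sees one row at a time, memory use stays roughly constant no matter how many rows the sheet has.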
Walter Lee Davis
2013-Oct-11 21:38 UTC
Re: What's the best way to approach reading and parse large XLSX files?
On Oct 11, 2013, at 4:33 PM, Monserrat Foster wrote:

> Do you know where I can find an example of how to read an XLSX file with
> Nokogiri? I can't seem to find one.

XLSX is just an Excel file expressed in XML. It's no different from parsing any other XML file. First, find a good basic example of file parsing with Nokogiri.

http://nokogiri.org/tutorials/searching_a_xml_html_document.html

Next, unzip the .xlsx package and open the worksheet XML in a text editor, and look for the elements you want to access. You can use either XPath or CSS syntax to locate your elements, and Nokogiri allows you to access either the attributes or the content of any element you can locate. If you run into trouble with all the prefixes that Microsoft likes to litter its formats with, you can call remove_namespaces! to clean that right up.

Walter
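A rough illustration of that suggestion, assuming (as in the SAX sketch above) that the worksheet XML is first read out of the .xlsx package, which is just a zip archive, with rubyzip, and that the first sheet sits at the usual xl/worksheets/sheet1.xml path; xlsx_path is a placeholder:

require 'zip'
require 'nokogiri'

xml = Zip::File.open(xlsx_path) { |zip| zip.read('xl/worksheets/sheet1.xml') }

doc = Nokogiri::XML(xml)
doc.remove_namespaces!   # drop the SpreadsheetML namespace prefixes

doc.xpath('//row').each do |row|
  # c/v holds the raw cell value; cells with t="s" store an index into
  # xl/sharedStrings.xml rather than the text itself.
  values = row.xpath('c/v').map(&:text)
  p values
end

For the 30,000-row file the SAX variant sketched earlier is gentler on memory, since this version still builds the whole document tree before querying it.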