Suresh_FSFM
2009-Feb-08 17:39 UTC
[R] Tip for performance improvement while handling huge data?
Hello All,

For certain calculations, I have to handle a data frame with, say, 10 million rows and multiple columns of different data types. When I try to perform calculations on certain elements in each row, the program just goes into "busy" mode for a really long time.

To avoid this "busy" mode, I split the data frame into subsets of 10,000 rows each, and the calculation then finished within a reasonable time.

Is there any other tip to improve the performance?

Regards,
Suresh
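[A minimal sketch of the chunking approach described above. The column names (a, b) and the per-chunk computation are hypothetical, only there to make the example self-contained:

    # Toy data frame standing in for the real 10-million-row one
    # (only 1e5 rows here so the example runs quickly).
    df <- data.frame(a = runif(1e5), b = runif(1e5))

    # Assign each row to a chunk of 10,000 rows, then process chunk by chunk.
    chunk.size <- 10000
    chunk.id <- ceiling(seq_len(nrow(df)) / chunk.size)
    results <- lapply(split(df, chunk.id), function(chunk) {
        chunk$a + chunk$b    # hypothetical per-chunk computation
    })
    out <- unlist(results, use.names = FALSE)
]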
Philipp Pagel
2009-Feb-08 19:28 UTC
[R] Tip for performance improvement while handling huge data?
> For certain calculations, I have to handle a data frame with, say, 10 million
> rows and multiple columns of different data types.
> When I try to perform calculations on certain elements in each row, the
> program just goes into "busy" mode for a really long time.
> To avoid this "busy" mode, I split the data frame into subsets of 10,000 rows.
> Then the calculation was done very fast, within reasonable time.
>
> Is there any other tip to improve the performance?

Depending on what exactly it is you are doing and what causes the slowdown, there may be a number of useful strategies:

- Buy RAM (lots of it) - it's cheap
- Vectorize whatever you are doing (see the sketch below)
- Don't use all the data you have but draw a random sample of reasonable size
- ...

To be more helpful we'd have to know:

- what are the computations involved?
- how are they implemented at the moment? -> example code
- what is the range of "really long time"?

cu
Philipp

--
Dr. Philipp Pagel
Lehrstuhl für Genomorientierte Bioinformatik
Technische Universität München
Wissenschaftszentrum Weihenstephan
85350 Freising, Germany
http://mips.gsf.de/staff/pagel
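[On the vectorization point, a minimal sketch with an invented computation (the column names and the arithmetic are hypothetical). The row-by-row loop is the kind of code that typically produces the "busy" behaviour; the vectorized version does the same work in one pass over whole columns:

    # Row-by-row loop: slow, because each iteration pays R's
    # interpretation and indexing overhead.
    slow <- function(df) {
        out <- numeric(nrow(df))
        for (i in seq_len(nrow(df))) {
            out[i] <- df$a[i] * df$b[i] + 1
        }
        out
    }

    # Vectorized equivalent: same result, computed on whole columns at once.
    fast <- function(df) df$a * df$b + 1

    df <- data.frame(a = runif(1e6), b = runif(1e6))
    system.time(slow(df))   # typically orders of magnitude slower
    system.time(fast(df))
]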