R-devel,

The performance of as.Date differs by a large degree between one of my
machines with glibc 2.3.2:

> system.time(x <- as.Date(rep("01-01-2005", 100000), format = "%m-%d-%Y"))
[1] 1.17 0.00 1.18 0.00 0.00

and a comparable machine with glibc 2.3.3:

> system.time(x <- as.Date(rep("01-01-2005", 100000), format = "%m-%d-%Y"))
[1] 31.20 46.89 81.01 0.00 0.00

both with the same R version:

> R.version
         _
platform i686-pc-linux-gnu
arch     i686
os       linux-gnu
system   i686, linux-gnu
status
major    2
minor    1.0
year     2005
month    04
day      18
language R

I'm focusing on differences in glibc versions because of as.Date's use
of strptime.

Does it seem likely that the cause of this discrepancy is in fact
glibc? If so, can anyone tell me how to make the performance of the
second machine more like the first?

I have verified that using the chron package, which I don't believe
uses strptime, for the above character conversion performs equally
well on both machines.

Thanks in advance,

Jeff

On 5/4/05, Jeff Enos <jeff@kanecap.com> wrote:
> R-devel,
>
> The performance of as.Date differs by a large degree between one of my
> machines with glibc 2.3.2:
>
> > system.time(x <- as.Date(rep("01-01-2005", 100000), format = "%m-%d-%Y"))
> [1] 1.17 0.00 1.18 0.00 0.00
>
> and a comparable machine with glibc 2.3.3:
>
> > system.time(x <- as.Date(rep("01-01-2005", 100000), format = "%m-%d-%Y"))
> [1] 31.20 46.89 81.01 0.00 0.00
>
> both with the same R version:
>
> > R.version
>          _
> platform i686-pc-linux-gnu
> arch     i686
> os       linux-gnu
> system   i686, linux-gnu
> status
> major    2
> minor    1.0
> year     2005
> month    04
> day      18
> language R
>
> I'm focusing on differences in glibc versions because of as.Date's use
> of strptime.
>
> Does it seem likely that the cause of this discrepancy is in fact
> glibc? If so, can anyone tell me how to make the performance of the
> second machine more like the first?
>
> I have verified that using the chron package, which I don't believe
> uses strptime, for the above character conversion performs equally
> well on both machines.

I think it's likely that the character processing is the bottleneck.
You can speed that part up by extracting the substrings directly:

> system.time({
+     dd <- rep("01-01-2005", 10000)
+     year <- as.numeric(substr(dd, 7, 10))
+     mon <- as.numeric(substr(dd, 1, 2))
+     day <- as.numeric(substr(dd, 4, 5))
+     x <- as.Date(ISOdate(year, mon, day))
+ }, gc = TRUE)
[1] 0.42 0.00 0.51 NA NA

> system.time(x <- as.Date(rep("01-01-2005", 100000), format = "%m-%d-%Y"), gc=TRUE)
[1] 1.08 0.00 1.22 NA NA

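For reuse, the substring approach can be wrapped in a small function. This is only a sketch: the name fast_mdy is hypothetical, and it assumes every input is a well-formed, zero-padded "mm-dd-yyyy" string. The final check compares the result with the strptime-based conversion on a few values.

```r
# Hypothetical helper (not part of R): parse "mm-dd-yyyy" strings by
# fixed character positions instead of going through strptime.
fast_mdy <- function(dd) {
  year <- as.numeric(substr(dd, 7, 10))
  mon  <- as.numeric(substr(dd, 1, 2))
  day  <- as.numeric(substr(dd, 4, 5))
  # ISOdate() builds a POSIXct at 12:00 GMT; as.Date() keeps the date part.
  as.Date(ISOdate(year, mon, day))
}

dd <- c("01-01-2005", "12-31-2004", "02-29-2004")
x <- fast_mdy(dd)
y <- as.Date(dd, format = "%m-%d-%Y")

# Both routes should agree on well-formed input.
stopifnot(all(x == y))
```

Input that does not follow the fixed layout would need validating first, since substr does not check the separators.
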
One other possible difference would be locale, but this is also slow on
FC3 (glibc 2.3.4 now) in the C locale.

Almost all the time is in strptime: R profiling shows

> summaryRprof()$by.self
                    self.time self.pct total.time total.pct
"strptime"              29.58     99.7      29.58      99.7
"as.Date.character"      0.10      0.3      29.68     100.0
"as.Date"                0.00      0.0      29.68     100.0
"eval"                   0.00      0.0      29.68     100.0
"system.time"            0.00      0.0      29.68     100.0

Now, on a glibc 2.3.x system, R's internal replacement for strptime will
be used (to work around bugs), so it must be some other part of the POSIX
time-handling that has changed. The next step would be to do C-level
profiling and then retrofit the crucial code from glibc 2.3.2.

It does seem a pretty unusual application of R for 10^5 date conversions
to be needed and for 30 secs to be an appreciable part of the analysis
time on such a data set.

On Wed, 4 May 2005, Jeff Enos wrote:

> R-devel,
>
> The performance of as.Date differs by a large degree between one of my
> machines with glibc 2.3.2:
>
>> system.time(x <- as.Date(rep("01-01-2005", 100000), format = "%m-%d-%Y"))
> [1] 1.17 0.00 1.18 0.00 0.00
>
> and a comparable machine with glibc 2.3.3:
>
>> system.time(x <- as.Date(rep("01-01-2005", 100000), format = "%m-%d-%Y"))
> [1] 31.20 46.89 81.01 0.00 0.00
>
> both with the same R version:
>
>> R.version
>          _
> platform i686-pc-linux-gnu
> arch     i686
> os       linux-gnu
> system   i686, linux-gnu
> status
> major    2
> minor    1.0
> year     2005
> month    04
> day      18
> language R
>
> I'm focusing on differences in glibc versions because of as.Date's use
> of strptime.
>
> Does it seem likely that the cause of this discrepancy is in fact
> glibc? If so, can anyone tell me how to make the performance of the
> second machine more like the first?
>
> I have verified that using the chron package, which I don't believe
> uses strptime, for the above character conversion performs equally
> well on both machines.

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
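
For anyone wanting to reproduce the R-level profile quoted above, a minimal sketch follows. It assumes an R build with profiling support; the temporary file and the larger input (10^6 strings rather than the 10^5 in the thread, so that a fast machine still collects samples) are choices made for this illustration only.

```r
# Write profiling samples to a temporary file (the location is arbitrary).
prof_file <- tempfile(fileext = ".out")

Rprof(prof_file)   # start sampling the R call stack
x <- as.Date(rep("01-01-2005", 1e6), format = "%m-%d-%Y")
Rprof(NULL)        # stop sampling

# Summarise the samples; by.self ranks functions by time spent in
# the function itself rather than in the functions it calls.
prof <- summaryRprof(prof_file)
prof$by.self
```

The timings will differ from those in the thread; what matters is which function dominates the self.time column.
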