Hello list!

I have a data.frame which looks like this:

> serv
                  datum op.read op.write   read   write
1   2011-01-29 10:00:00       0        0      0       0
2   2011-01-29 10:00:01       0        0      0       0
3   2011-01-29 10:00:02       0        0      0       0
4   2011-01-29 10:00:03       0        4      0  647168
5   2011-01-29 10:00:04       0        0      0       0
6   2011-01-29 10:00:05       0       14      0 1960837
7   2011-01-29 10:00:06       0        0      0       0
...
115 2011-01-30 10:00:54       0        0      0       0
116 2011-01-30 10:00:55       0        0      0       0
117 2011-01-30 10:00:56       0        0      0       0
118 2011-01-30 10:00:57      54        0  29184       0
119 2011-01-30 10:00:58     204        0 122880       0
120 2011-01-30 10:00:59       0        0      0       0
...

I want to compare read/write from each day. I already have a solution, but it is pretty slow.

# read the data
serv <- read.delim("cut.inp")

# Reformat the dates from the file
serv$datum <- strptime(serv$datum, "%Y-%m-%d %H:%M:%S")

# select all single days
dates.serv <- unique(strptime(serv$datum, format="%Y-%m-%d"))

# create a data.frame
values <- data.frame(row.names=1, datum=numeric(0), write=numeric(0), read=numeric(0))
for(i in as.character(dates.serv)) {
    # build the start and end of the day range
    searchstart <- as.POSIXlt(paste(i, "00:00:00", sep=" "))
    searchend <- as.POSIXlt(paste(i, "23:59:59", sep=" "))
    # select all values from a specific day
    day <- serv[(serv$datum >= searchstart & serv$datum <= searchend),]
    write <- as.numeric(sum(as.numeric(day$write)))
    read <- as.numeric(sum(as.numeric(day$read)))
    # add to the data.frame
    values <- rbind(values, data.frame(datum=i, write=write, read=read))
}

This is my first try using R for statistics, so I'm sure this isn't the best solution.
The for-loop does its job, but as I said, it is really slow. My data covers 21 days with 1 line per second.
Is there a better way to select the date ranges than a for-loop? The line where I select all values for "day" seems to be the heaviest. Any idea?

Kind regards,

Benjamin

PS: I attached some sample data, in case you want to try for yourself.
-------------- next part --------------
datum op.read op.write read write
2011-01-29 10:00:00 0 0 0 0
2011-01-29 10:00:01 0 0 0 0
2011-01-29 10:00:02 0 0 0 0
2011-01-29 10:00:03 0 4 0 647168
2011-01-29 10:00:04 0 0 0 0
2011-01-29 10:00:05 0 14 0 1960837
2011-01-29 10:00:06 0 0 0 0
2011-01-29 10:00:07 0 611 0 3533701
2011-01-29 10:00:08 1 0 9728 0
2011-01-29 10:00:09 0 0 0 0
2011-01-29 10:00:10 3 0 13824 0
2011-01-29 10:00:11 1 0 1023 0
2011-01-29 10:00:12 2 1 13824 90112
2011-01-29 10:00:13 0 0 0 0
2011-01-29 10:00:14 0 0 0 0
2011-01-29 10:00:15 0 0 0 0
2011-01-29 10:00:16 0 0 0 0
2011-01-29 10:00:17 0 0 0 0
2011-01-29 10:00:18 0 0 0 0
2011-01-29 10:00:19 0 0 0 0
2011-01-29 10:00:20 0 0 0 0
2011-01-29 10:00:21 0 0 0 0
2011-01-29 10:00:22 0 0 0 0
2011-01-29 10:00:23 0 0 0 0
2011-01-29 10:00:24 0 0 0 0
2011-01-29 10:00:25 0 0 0 0
2011-01-29 10:00:26 0 0 0 0
2011-01-29 10:00:27 0 0 0 0
2011-01-29 10:00:28 0 0 0 0
2011-01-29 10:00:29 0 0 0 0
2011-01-29 10:00:30 0 0 0 0
2011-01-29 10:00:31 0 0 0 0
2011-01-29 10:00:32 0 0 0 0
2011-01-29 10:00:33 0 0 0 0
2011-01-29 10:00:34 0 0 0 0
2011-01-29 10:00:35 0 0 0 0
2011-01-29 10:00:36 0 0 0 0
2011-01-29 10:00:37 0 651 0 3397386
2011-01-29 10:00:38 0 0 0 0
2011-01-29 10:00:39 0 0 0 0
2011-01-29 10:00:40 0 0 0 0
2011-01-29 10:00:41 0 0 0 0
2011-01-29 10:00:42 0 0 0 0
2011-01-29 10:00:43 0 0 0 0
2011-01-29 10:00:44 0 0 0 0
2011-01-29 10:00:45 0 0 0 0
2011-01-29 10:00:46 0 0 0 0
2011-01-29 10:00:47 0 0 0 0
2011-01-29 10:00:48 0 0 0 0
2011-01-29 10:00:49 0 0 0 0
2011-01-29 10:00:50 0 0 0 0
2011-01-29 10:00:51 0 0 0 0
2011-01-29 10:00:52 0 0 0 0
2011-01-29 10:00:53 8 0 20480 0
2011-01-29 10:00:54 42 0 63488 0
2011-01-29 10:00:55 58 4 721920 655360
2011-01-29 10:00:56 16 3 29696 524288
2011-01-29 10:00:57 0 0 0 131072
2011-01-29 10:00:58 17 0 27648 0
2011-01-29 10:00:59 26 5 119808 786432
2011-01-30 10:00:00 0 0 0 0
2011-01-30 10:00:01 0 0 2560 0
2011-01-30 10:00:02 0 0 0 0
2011-01-30 10:00:03 0 0 0 0
2011-01-30 10:00:04 0 0 0 0
2011-01-30 10:00:05 0 0 0 0
2011-01-30 10:00:06 0 0 0 0
2011-01-30 10:00:07 0 0 0 0
2011-01-30 10:00:08 0 0 0 0
2011-01-30 10:00:09 0 0 0 0
2011-01-30 10:00:10 0 0 0 0
2011-01-30 10:00:11 0 0 0 0
2011-01-30 10:00:12 0 0 0 0
2011-01-30 10:00:13 0 433 0 1279262
2011-01-30 10:00:14 0 5 0 49152
2011-01-30 10:00:15 0 0 0 0
2011-01-30 10:00:16 0 0 0 0
2011-01-30 10:00:17 0 0 0 0
2011-01-30 10:00:18 0 0 0 0
2011-01-30 10:00:19 0 0 0 0
2011-01-30 10:00:20 0 0 0 0
2011-01-30 10:00:21 0 0 0 0
2011-01-30 10:00:22 0 0 0 0
2011-01-30 10:00:23 0 0 0 0
2011-01-30 10:00:24 0 0 0 0
2011-01-30 10:00:25 0 4 1023 327680
2011-01-30 10:00:26 10 0 36352 0
2011-01-30 10:00:27 1 0 6144 0
2011-01-30 10:00:28 21 0 52736 0
2011-01-30 10:00:29 0 0 0 0
2011-01-30 10:00:30 0 0 0 0
2011-01-30 10:00:31 0 0 0 0
2011-01-30 10:00:32 25 0 86016 0
2011-01-30 10:00:33 0 0 0 0
2011-01-30 10:00:34 0 0 0 0
2011-01-30 10:00:35 0 0 0 0
2011-01-30 10:00:36 0 0 0 0
2011-01-30 10:00:37 0 0 0 0
2011-01-30 10:00:38 0 0 0 0
2011-01-30 10:00:39 0 0 0 0
2011-01-30 10:00:40 3 0 7168 0
2011-01-30 10:00:41 0 0 0 0
2011-01-30 10:00:42 0 0 0 0
2011-01-30 10:00:43 95 204 359424 992256
2011-01-30 10:00:44 121 364 381952 1572864
2011-01-30 10:00:45 0 0 0 0
2011-01-30 10:00:46 0 0 1023 0
2011-01-30 10:00:47 0 0 0 0
2011-01-30 10:00:48 0 0 0 0
2011-01-30 10:00:49 0 0 0 0
2011-01-30 10:00:50 0 0 0 0
2011-01-30 10:00:51 0 0 0 0
2011-01-30 10:00:52 0 3 3072 413696
2011-01-30 10:00:53 0 0 0 0
2011-01-30 10:00:54 0 0 0 0
2011-01-30 10:00:55 0 0 0 0
2011-01-30 10:00:56 0 0 0 0
2011-01-30 10:00:57 54 0 29184 0
2011-01-30 10:00:58 204 0 122880 0
2011-01-30 10:00:59 0 0 0 0
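One possible reason the day-selection line is the heaviest, offered as a guess rather than a measured diagnosis: strptime() returns POSIXlt, and a comparison such as `>=` on a POSIXlt vector converts it to POSIXct first, so the loop would repeat that conversion for the whole column on every pass. The sketch below compares the two storage classes on simulated data. The helper make_serv(), the Poisson-distributed values and the UTC time zone are assumptions made for illustration and are not part of the original post.

```r
# Hypothetical helper (not from the thread): simulate per-second samples
# using column names taken from the sample file.
make_serv <- function(n_days = 2L, start = "2011-01-29 00:00:00") {
  n <- n_days * 86400L
  datum <- as.POSIXct(start, tz = "UTC") + (seq_len(n) - 1)
  data.frame(datum = datum,
             read  = rpois(n, 1000),   # invented values
             write = rpois(n, 5000))   # invented values
}

set.seed(1)
serv <- make_serv(n_days = 2L)

# The same timestamps stored as POSIXlt, which is what strptime() returns
datum_lt <- as.POSIXlt(serv$datum)

day_start <- as.POSIXct("2011-01-29 00:00:00", tz = "UTC")
day_end   <- as.POSIXct("2011-01-29 23:59:59", tz = "UTC")

# Time the same day selection against both storage classes
t_lt <- system.time(sel_lt <- datum_lt >= day_start & datum_lt <= day_end)
t_ct <- system.time(sel_ct <- serv$datum >= day_start & serv$datum <= day_end)

print(rbind(POSIXlt = t_lt, POSIXct = t_ct))
```

If the POSIXlt row turns out noticeably slower on your machine, converting once with as.POSIXct() right after strptime() may already help, even before the loop is replaced.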
Benjamin,

A more elegant "R-style" solution would be to use one of R's "apply"/aggregation routines, of which there are many. For example, the "by" function can split a data.frame by some factor/categorical variable(s) and then apply a function to each "slice". The results can then be pieced back together. See below for an example in which this factor is simply a parallel vector of pure dates:

# extract pure date component of time and date
dates <- format(serv$datum, "%Y-%m-%d")

# write auxiliary function to aggregate a "slice" of the data.frame
# x will be a "slice" of data from a single day
aggregateDf <- function(x) {
    # return a one-row data.frame
    data.frame(datum = format(x$datum[1], "%Y-%m-%d"),
               write = sum(x$write),
               read = sum(x$read))
}

# now process each "slice" of the serv data.frame using "by"
splitVals <- by(serv, dates, aggregateDf)

# bind back into a single data.frame
values <- do.call(rbind, splitVals)

The difference in execution speed is negligible on my machine, so this is a more concise solution, but I don't know whether it is much faster.

HTH,
Francisco

On Thu, Mar 10, 2011 at 1:23 PM, Benjamin Stier <benjamin.stier@ub.uni-tuebingen.de> wrote:
> I want to compare read/write from each day. I already have a solution, but
> it is pretty slow.
>
> Is there a better way to select the date-ranges instead of a for-loop? The
> line where I select all values for "day" seems to be the heaviest. Any
> idea?
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
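Since there are many apply/aggregation routines, here is a minimal sketch of two others from base R, rowsum() and tapply(), applied to the same per-day grouping. The four rows are copied from the sample data; storing datum as POSIXct in UTC is an assumption made so that the example is self-contained. This only illustrates the grouping idea and makes no claim about speed on the full data.

```r
# Four rows taken from the attached sample data
serv <- data.frame(
  datum = as.POSIXct(c("2011-01-29 10:00:03", "2011-01-29 10:00:05",
                       "2011-01-30 10:00:13", "2011-01-30 10:00:57"),
                     tz = "UTC"),
  read  = c(0, 0, 0, 29184),
  write = c(647168, 1960837, 1279262, 0)
)

# One grouping key per row: the pure date part of the timestamp
day <- format(serv$datum, "%Y-%m-%d")

# rowsum() sums every column within each group in a single call;
# the result is a data.frame whose row names are the days
totals <- rowsum(serv[, c("read", "write")], group = day)

# tapply() does the same for one column at a time and returns a named vector
write_per_day <- tapply(serv$write, day, sum)

print(totals)
print(write_per_day)
```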
On Mar 10, 2011, at 8:23 AM, Benjamin Stier wrote:

> I want to compare read/write from each day. I already have a
> solution, but it is pretty slow.

See if this is any faster:

> aggregate(serv[, c("read", "write")], list(format(serv$datum, "%Y-%m-%d")), sum)
     Group.1    read    write
1 2011-01-29 1021439 11726356
2 2011-01-30 1089534  4634910

David Winsemius, MD
West Hartford, CT
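One caveat when scaling the aggregate() call to 21 days of per-second data: if read.delim() imports the byte counts as integers, sum() on an integer vector returns NA with a warning once the total exceeds .Machine$integer.max, which is presumably why the original loop wraps the columns in as.numeric(). The sketch below keeps the same aggregate() form but converts to double before summing. The helper name sum_dbl, the group label "day" and the large write values are made up for illustration.

```r
serv <- data.frame(
  datum = as.POSIXct(c("2011-01-29 10:00:03", "2011-01-29 10:00:05",
                       "2011-01-30 10:00:13", "2011-01-30 10:00:57"),
                     tz = "UTC"),
  read  = c(0L, 0L, 0L, 29184L),
  # made-up large values, chosen so that the daily total overflows an integer
  write = c(2000000000L, 2000000000L, 1279262L, 0L)
)

# Hypothetical helper: convert to double first so that large totals do not overflow
sum_dbl <- function(x) sum(as.numeric(x))

# Naming the list element labels the grouping column "day" instead of "Group.1"
values <- aggregate(serv[, c("read", "write")],
                    list(day = format(serv$datum, "%Y-%m-%d")),
                    sum_dbl)
print(values)
```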