I have two dataframes, each with a column for timestamp. I want to merge the two dataframes such that each row from first dataframe is matched with the row in the second dataframe with most recent but preceding timestamp. Here is an example. option.trades <- data.frame(timestamp = as.POSIXct(c("2023-08-07 10:23:22", "2023-08-07 10:25:33", "2023-08-07 10:28:41")), option.price = c(2.5, 2.7, 1.8)) stock.trades <- data.frame(timestamp = as.POSIXct(c("2023-08-07 10:23:21", "2023-08-07 10:23:34", "2023-08-07 10:24:57", "2023-08-07 10:28:37", "2023-08-07 10:29:01")), stock.price = c(102.2, 102.9, 103.1, 101.8, 101.7)) stock.trades <- stock.trades[order(stock.trades$timestamp),] library(plyr) mystock.prices <- ldply(option.trades$timestamp, function(tstamp) tail(subset(stock.trades, timestamp <= tstamp), 1)) names(mystock.prices)[1] <- "stock.timestamp" myres <- cbind(option.trades, mystock.prices) This method works. But for large dataframes, it is very slow. Is there a way to speed up the merge? Thanks, Naresh
Hi Naresh, Perhaps the below is faster than your approach library(dplyr) library(tidyr) merge(option.trades, stock.trades, by="timestamp", all=TRUE) |> dplyr::arrange(timestamp) |> dplyr::mutate(stock.timestamp as.POSIXct(ifelse(is.na(option.price), timestamp, NA))) |> tidyr::fill(stock.price, stock.timestamp) |> dplyr::filter(!is.na(option.price)) |> dplyr::select(1,2,4,3) -> myres2 identical(myres, myres2) ## TRUE On Tue, Aug 8, 2023 at 5:32?AM Naresh Gurbuxani <naresh_gurbuxani at hotmail.com> wrote:> > > I have two dataframes, each with a column for timestamp. I want to > merge the two dataframes such that each row from first dataframe > is matched with the row in the second dataframe with most recent but > preceding timestamp. Here is an example. > > option.trades <- data.frame(timestamp = as.POSIXct(c("2023-08-07 10:23:22", "2023-08-07 10:25:33", "2023-08-07 10:28:41")), option.price = c(2.5, 2.7, 1.8)) > > stock.trades <- data.frame(timestamp = as.POSIXct(c("2023-08-07 10:23:21", "2023-08-07 10:23:34", "2023-08-07 10:24:57", "2023-08-07 10:28:37", "2023-08-07 10:29:01")), stock.price = c(102.2, 102.9, 103.1, 101.8, 101.7)) > > stock.trades <- stock.trades[order(stock.trades$timestamp),] > > library(plyr) > mystock.prices <- ldply(option.trades$timestamp, function(tstamp) tail(subset(stock.trades, timestamp <= tstamp), 1)) > names(mystock.prices)[1] <- "stock.timestamp" > myres <- cbind(option.trades, mystock.prices) > > This method works. But for large dataframes, it is very slow. Is there > a way to speed up the merge? > > Thanks, > Naresh > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
On Mon, 07 Aug 2023, Naresh Gurbuxani writes:> I have two dataframes, each with a column for timestamp. I want to > merge the two dataframes such that each row from first dataframe > is matched with the row in the second dataframe with most recent but > preceding timestamp. Here is an example. > > option.trades <- data.frame(timestamp = as.POSIXct(c("2023-08-07 10:23:22", "2023-08-07 10:25:33", "2023-08-07 10:28:41")), option.price = c(2.5, 2.7, 1.8)) > > stock.trades <- data.frame(timestamp > as.POSIXct(c("2023-08-07 10:23:21", "2023-08-07 > 10:23:34", "2023-08-07 10:24:57", "2023-08-07 > 10:28:37", "2023-08-07 10:29:01")), stock.price > c(102.2, 102.9, 103.1, 101.8, 101.7)) > > stock.trades <- stock.trades[order(stock.trades$timestamp),] > > library(plyr) > mystock.prices <- ldply(option.trades$timestamp, function(tstamp) tail(subset(stock.trades, timestamp <= tstamp), 1)) > names(mystock.prices)[1] <- "stock.timestamp" > myres <- cbind(option.trades, mystock.prices) > > This method works. But for large dataframes, it is very slow. Is there > a way to speed up the merge? > > Thanks, > Naresh >If the timestamps are sorted (or you can sort them), function ?findInterval might be helpful: i <- findInterval(option.trades$timestamp, stock.trades$timestamp) cbind(option.trades, stock.trades[i, ]) ## timestamp option.price timestamp stock.price ## 1 2023-08-07 10:23:22 2.5 2023-08-07 10:23:21 102.2 ## 3 2023-08-07 10:25:33 2.7 2023-08-07 10:24:57 103.1 ## 4 2023-08-07 10:28:41 1.8 2023-08-07 10:28:37 101.8 -- Enrico Schumann Lucerne, Switzerland http://enricoschumann.net
It sounds like you might want a rolling join, e.g. https://dplyr.tidyverse.org/reference/join_by.html#rolling-joins. (And data.table has similar functionality which inspired dplyr) Hadley On Mon, Aug 7, 2023 at 9:32?PM Naresh Gurbuxani <naresh_gurbuxani at hotmail.com> wrote:> > > I have two dataframes, each with a column for timestamp. I want to > merge the two dataframes such that each row from first dataframe > is matched with the row in the second dataframe with most recent but > preceding timestamp. Here is an example. > > option.trades <- data.frame(timestamp = as.POSIXct(c("2023-08-07 10:23:22", "2023-08-07 10:25:33", "2023-08-07 10:28:41")), option.price = c(2.5, 2.7, 1.8)) > > stock.trades <- data.frame(timestamp = as.POSIXct(c("2023-08-07 10:23:21", "2023-08-07 10:23:34", "2023-08-07 10:24:57", "2023-08-07 10:28:37", "2023-08-07 10:29:01")), stock.price = c(102.2, 102.9, 103.1, 101.8, 101.7)) > > stock.trades <- stock.trades[order(stock.trades$timestamp),] > > library(plyr) > mystock.prices <- ldply(option.trades$timestamp, function(tstamp) tail(subset(stock.trades, timestamp <= tstamp), 1)) > names(mystock.prices)[1] <- "stock.timestamp" > myres <- cbind(option.trades, mystock.prices) > > This method works. But for large dataframes, it is very slow. Is there > a way to speed up the merge? > > Thanks, > Naresh > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.-- http://hadley.nz