Christoph Jäckel
2011-Apr-16 14:35 UTC
[R] Matching Problem: Want to match to data.frame with inexact matching identifier (one identifier has to be in the range of the other).
Hello R-Community, I have the following matching problem: I have two data.frames, one with an observation every month (per company ID), and one with an observation every quarter (per company ID; note that quarter means fiscal quarter; therefore 1Q = Jan, Feb, Mar is not necessarily correct and also, a fiscal quarter is not necessarily 3 month long). For every month and company, I want to get the correct value of that quarter. Consequently, several months have the same value for one quarter. As an example see the code below: #Monthly data d1 <- data.frame(cbind(c(rep("A",5),rep("B",5)),c(1:5,1:5)),stringsAsFactors=FALSE) names(d1) <- c("ID", "Month") d1[,"Month"] <- as.integer(d1[,"Month"]) #Quarterly data, i.e. the value of every quarter has to be matched to several months in d1 #However, I want to match fiscal quarters, which means that one quarter is not necessarily 3 month long d2 <- data.frame(cbind(c("A","A","B","B"),c(1,3,1,4),c(2,5,3,5),c("v1","v2","v3","V4")),stringsAsFactors=FALSE) names(d2) <- c("ID", "Min_Month", "Max_Month","Value") d2[,"Min_Month"] <- as.integer(d2[,"Min_Month"]) d2[,"Max_Month"] <- as.integer(d2[,"Max_Month"]) #Possible solution: Loop through every company and month and check in which quarter (based on Min_Month and Max_Month) the current month is for (i in 1:nrow(d1)){ for (j in 1:nrow(d2)){ if ((d1[i,"ID"] == d2[j,"ID"]) & (d1[i,"Month"]>= d2[j,"Min_Month"]) & (d1[i,"Month"]<= d2[j,"Max_Month"])) {break} } d1[i,"Value"] <- d2[j,"Value"] } The solution works, but I was hoping that there would be a more efficient and elegant one. I checked "match" and "merge", but didn't figure out how I could use those to match without a unique identifier. Any hints would be highly appreciated. Christoph -- -------------------------------------------------------------------------------------------------------------------------------------------------------------------- Christoph J?ckel (Dipl.-Kfm.) -------------------------------------------------------------------------------------------------------------------------------------------------------------------- Research Assistant Chair for Financial Management and Capital Markets | Lehrstuhls f?r Finanzmanagement und Kapitalm?rkte TUM School of Management | Technische Universit?t M?nchen
Christoph Jäckel
2011-Apr-19 15:40 UTC
[R] Matching Problem: Want to match to data.frame with inexact matching identifier (one identifier has to be in the range of the other).
Hi together, I found a solution to my problem that works for me and performs reasonable well in my opinion. Instead of looping through both datasets, I decided to replicate each row in d2 from Min_Month to Max_Month and then use the match-function. Here is the code (I changed the original example to be able to compare the performance of the two solutions): #Example for performance test nobs <- 100 ?#Observation per company [26 companies] d1 <- data.frame(cbind("ID"=rep(letters,each=nobs,1),"Month"=rep(1:nobs,26)),stringsAsFactors=FALSE) d1$Month <- as.integer(d1$Month) d1[,"Unique_ID"] <- mapply(paste,d1$ID,d1$Month,sep="_") d2 <- data.frame(cbind("ID"=rep(letters,each=3,1),"Min_Month"=rep(c(1,as.integer(nobs/3),as.integer(2*nobs/3)),26), ?? ? ?"Max_Month"=rep(c(as.integer(nobs/3)-1,as.integer(2*nobs/3)-1,nobs),26),"Value1"=rep(1:3,26), "Value2"=rep(4:6,26)),stringsAsFactors=FALSE) d2$Min_Month <- as.integer(d2$Min_Month) d2$Max_Month <- as.integer(d2$Max_Month) #Possible solution; is there are more elegant one? proc1 <- proc.time() for (i in 1:nrow(d1)){ ??for (j in 1:nrow(d2)){ ?? ?if ((d1[i,"ID"] == d2[j,"ID"]) & (d1[i,"Month"]>= d2[j,"Min_Month"]) & ?? ?(d1[i,"Month"]<= d2[j,"Max_Month"])) {break} ??} ??d1[i,"Value1"] <- d2[j,"Value1"] ??d1[i,"Value2"] <- d2[j,"Value2"] } proc1 <- proc.time() - proc1 #New Solution; Create rows first, then match it proc2 <- proc.time() createRows <- function(df){ ??dates <- seq(df$Min_Month,df$Max_Month,by=1) ??rows <- data.frame(ID = df$ID,date=dates, df[,c("Value1","Value2")]) ??rows } library(plyr) d2 <- ddply(d2,.(ID,Min_Month),createRows) d2[,"Unique_ID"] <- mapply(paste,d2$ID,d2$date,sep="_") d1[,c("Value1_Alt", "Value2_Alt")] <- d2[match(d1$Unique_ID,d2$Unique_ID),c("Value1","Value2")] proc2 <- proc.time() - proc2 Hope that helps. If there is still a better solution, I would still be more than happy for an advice! Christoph On Sat, Apr 16, 2011 at 4:35 PM, Christoph J?ckel <christoph.jaeckel at wi.tum.de> wrote:> > Hello R-Community, > > I have the following matching problem: I have two data.frames, one > with an observation every month (per company ID), and one with an > observation every quarter (per company ID; note that quarter means > fiscal quarter; therefore 1Q = Jan, Feb, Mar is not necessarily > correct and also, a fiscal quarter is not necessarily 3 month long). > > For every month and company, I want to get the correct value of that > quarter. Consequently, several months have the same value for one > quarter. As an example see the code below: > > #Monthly data > d1 <- data.frame(cbind(c(rep("A",5),rep("B",5)),c(1:5,1:5)),stringsAsFactors=FALSE) > names(d1) <- c("ID", "Month") > d1[,"Month"] <- as.integer(d1[,"Month"]) > > #Quarterly data, i.e. the value of every quarter has to be matched to > several months in d1 > #However, I want to match fiscal quarters, which means that one > quarter is not necessarily 3 month long > d2 <- data.frame(cbind(c("A","A","B","B"),c(1,3,1,4),c(2,5,3,5),c("v1","v2","v3","V4")),stringsAsFactors=FALSE) > names(d2) <- c("ID", "Min_Month", "Max_Month","Value") > d2[,"Min_Month"] <- as.integer(d2[,"Min_Month"]) > d2[,"Max_Month"] <- as.integer(d2[,"Max_Month"]) > > #Possible solution: Loop through every company and month and check in > which quarter (based on Min_Month and Max_Month) the current month is > for (i in 1:nrow(d1)){ > ?for (j in 1:nrow(d2)){ > ? ?if ((d1[i,"ID"] == d2[j,"ID"]) & (d1[i,"Month"]>= d2[j,"Min_Month"]) & > ? ?(d1[i,"Month"]<= d2[j,"Max_Month"])) {break} > ?} > ?d1[i,"Value"] <- d2[j,"Value"] > } > > The solution works, but I was hoping that there would be a more > efficient and elegant one. I checked "match" and "merge", but didn't > figure out how I could use those to match without a unique identifier. > > Any hints would be highly appreciated. > > Christoph > > -- > > -------------------------------------------------------------------------------------------------------------------------------------------------------------------- > Christoph J?ckel (Dipl.-Kfm.) > -------------------------------------------------------------------------------------------------------------------------------------------------------------------- > > Research Assistant > Chair for Financial Management and Capital Markets | Lehrstuhls f?r > Finanzmanagement und Kapitalm?rkte > TUM School of Management | Technische Universit?t M?nchen