Christoph Jäckel
2011-Apr-16 14:35 UTC
[R] Matching Problem: Want to match to data.frame with inexact matching identifier (one identifier has to be in the range of the other).
Hello R-Community,
I have the following matching problem: I have two data.frames, one
with an observation every month (per company ID), and one with an
observation every quarter (per company ID; note that quarter means
fiscal quarter; therefore 1Q = Jan, Feb, Mar is not necessarily
correct and also, a fiscal quarter is not necessarily 3 month long).
For every month and company, I want to get the correct value of that
quarter. Consequently, several months have the same value for one
quarter. As an example see the code below:
#Monthly data
d1 <-
data.frame(cbind(c(rep("A",5),rep("B",5)),c(1:5,1:5)),stringsAsFactors=FALSE)
names(d1) <- c("ID", "Month")
d1[,"Month"] <- as.integer(d1[,"Month"])
#Quarterly data, i.e. the value of every quarter has to be matched to
several months in d1
#However, I want to match fiscal quarters, which means that one
quarter is not necessarily 3 month long
d2 <-
data.frame(cbind(c("A","A","B","B"),c(1,3,1,4),c(2,5,3,5),c("v1","v2","v3","V4")),stringsAsFactors=FALSE)
names(d2) <- c("ID", "Min_Month",
"Max_Month","Value")
d2[,"Min_Month"] <- as.integer(d2[,"Min_Month"])
d2[,"Max_Month"] <- as.integer(d2[,"Max_Month"])
#Possible solution: Loop through every company and month and check in
which quarter (based on Min_Month and Max_Month) the current month is
for (i in 1:nrow(d1)){
for (j in 1:nrow(d2)){
if ((d1[i,"ID"] == d2[j,"ID"]) &
(d1[i,"Month"]>= d2[j,"Min_Month"]) &
(d1[i,"Month"]<= d2[j,"Max_Month"])) {break}
}
d1[i,"Value"] <- d2[j,"Value"]
}
The solution works, but I was hoping that there would be a more
efficient and elegant one. I checked "match" and "merge",
but didn't
figure out how I could use those to match without a unique identifier.
Any hints would be highly appreciated.
Christoph
--
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Christoph J?ckel (Dipl.-Kfm.)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Research Assistant
Chair for Financial Management and Capital Markets | Lehrstuhls f?r
Finanzmanagement und Kapitalm?rkte
TUM School of Management | Technische Universit?t M?nchen
Christoph Jäckel
2011-Apr-19 15:40 UTC
[R] Matching Problem: Want to match to data.frame with inexact matching identifier (one identifier has to be in the range of the other).
Hi together,
I found a solution to my problem that works for me and performs
reasonable well in my opinion. Instead of looping through both
datasets, I decided to replicate each row in d2 from Min_Month to
Max_Month and then use the match-function. Here is the code (I changed
the original example to be able to compare the performance of the two
solutions):
#Example for performance test
nobs <- 100 ?#Observation per company [26 companies]
d1 <-
data.frame(cbind("ID"=rep(letters,each=nobs,1),"Month"=rep(1:nobs,26)),stringsAsFactors=FALSE)
d1$Month <- as.integer(d1$Month)
d1[,"Unique_ID"] <- mapply(paste,d1$ID,d1$Month,sep="_")
d2 <-
data.frame(cbind("ID"=rep(letters,each=3,1),"Min_Month"=rep(c(1,as.integer(nobs/3),as.integer(2*nobs/3)),26),
?? ?
?"Max_Month"=rep(c(as.integer(nobs/3)-1,as.integer(2*nobs/3)-1,nobs),26),"Value1"=rep(1:3,26),
"Value2"=rep(4:6,26)),stringsAsFactors=FALSE)
d2$Min_Month <- as.integer(d2$Min_Month)
d2$Max_Month <- as.integer(d2$Max_Month)
#Possible solution; is there are more elegant one?
proc1 <- proc.time()
for (i in 1:nrow(d1)){
??for (j in 1:nrow(d2)){
?? ?if ((d1[i,"ID"] == d2[j,"ID"]) &
(d1[i,"Month"]>= d2[j,"Min_Month"]) &
?? ?(d1[i,"Month"]<= d2[j,"Max_Month"])) {break}
??}
??d1[i,"Value1"] <- d2[j,"Value1"]
??d1[i,"Value2"] <- d2[j,"Value2"]
}
proc1 <- proc.time() - proc1
#New Solution; Create rows first, then match it
proc2 <- proc.time()
createRows <- function(df){
??dates <- seq(df$Min_Month,df$Max_Month,by=1)
??rows <- data.frame(ID = df$ID,date=dates,
df[,c("Value1","Value2")])
??rows
}
library(plyr)
d2 <- ddply(d2,.(ID,Min_Month),createRows)
d2[,"Unique_ID"] <- mapply(paste,d2$ID,d2$date,sep="_")
d1[,c("Value1_Alt", "Value2_Alt")] <-
d2[match(d1$Unique_ID,d2$Unique_ID),c("Value1","Value2")]
proc2 <- proc.time() - proc2
Hope that helps. If there is still a better solution, I would still be
more than happy for an advice!
Christoph
On Sat, Apr 16, 2011 at 4:35 PM, Christoph J?ckel
<christoph.jaeckel at wi.tum.de> wrote:>
> Hello R-Community,
>
> I have the following matching problem: I have two data.frames, one
> with an observation every month (per company ID), and one with an
> observation every quarter (per company ID; note that quarter means
> fiscal quarter; therefore 1Q = Jan, Feb, Mar is not necessarily
> correct and also, a fiscal quarter is not necessarily 3 month long).
>
> For every month and company, I want to get the correct value of that
> quarter. Consequently, several months have the same value for one
> quarter. As an example see the code below:
>
> #Monthly data
> d1 <-
data.frame(cbind(c(rep("A",5),rep("B",5)),c(1:5,1:5)),stringsAsFactors=FALSE)
> names(d1) <- c("ID", "Month")
> d1[,"Month"] <- as.integer(d1[,"Month"])
>
> #Quarterly data, i.e. the value of every quarter has to be matched to
> several months in d1
> #However, I want to match fiscal quarters, which means that one
> quarter is not necessarily 3 month long
> d2 <-
data.frame(cbind(c("A","A","B","B"),c(1,3,1,4),c(2,5,3,5),c("v1","v2","v3","V4")),stringsAsFactors=FALSE)
> names(d2) <- c("ID", "Min_Month",
"Max_Month","Value")
> d2[,"Min_Month"] <- as.integer(d2[,"Min_Month"])
> d2[,"Max_Month"] <- as.integer(d2[,"Max_Month"])
>
> #Possible solution: Loop through every company and month and check in
> which quarter (based on Min_Month and Max_Month) the current month is
> for (i in 1:nrow(d1)){
> ?for (j in 1:nrow(d2)){
> ? ?if ((d1[i,"ID"] == d2[j,"ID"]) &
(d1[i,"Month"]>= d2[j,"Min_Month"]) &
> ? ?(d1[i,"Month"]<= d2[j,"Max_Month"])) {break}
> ?}
> ?d1[i,"Value"] <- d2[j,"Value"]
> }
>
> The solution works, but I was hoping that there would be a more
> efficient and elegant one. I checked "match" and
"merge", but didn't
> figure out how I could use those to match without a unique identifier.
>
> Any hints would be highly appreciated.
>
> Christoph
>
> --
>
>
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Christoph J?ckel (Dipl.-Kfm.)
>
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Research Assistant
> Chair for Financial Management and Capital Markets | Lehrstuhls f?r
> Finanzmanagement und Kapitalm?rkte
> TUM School of Management | Technische Universit?t M?nchen