I'm trying to insert rows of a data.frame into a database table, or to update
existing records where the key fields of a row already exist in the table.
I've come up with a possible solution below, but would like to hear if anyone
has a better one.
# The problem demonstrated:
# Create a data.frame with test values
library(RODBC)
tbl <- data.frame(
  key1 = rep(1:3, each = 2),
  key2 = rep(LETTERS[1:2], 3),
  somevalue = rnorm(6)
)
# Create table in database using the following SQL
CREATE TABLE tbl
(
  key1 integer NOT NULL,
  key2 character varying(1) NOT NULL,
  somevalue double precision,
  CONSTRAINT pktbl PRIMARY KEY (key1, key2)
)
# Continue in R
pg <- odbcConnect("testdb")
sqlSave(pg, tbl[1:2, ], "tbl", append = TRUE, rownames = FALSE)
sqlSave(pg, tbl[3, ], "tbl", append = TRUE, rownames = FALSE)
tbl[1, 3] <- 1  # change a value so the first row now needs an update
sqlUpdate(pg, tbl[1:4, ], "tbl", index = c("key1", "key2")) # Fails: row 4 is not in the table yet
# Can replace the above sqlUpdate with an update of the existing rows
# followed by an insert of the new one:
sqlUpdate(pg, tbl[1:3, ], "tbl", index = c("key1", "key2"))
sqlSave(pg, tbl[4, ], "tbl", append = TRUE, rownames = FALSE)
# Proposed solution:
tbl[1, 3] <- 0
tmp <- tbl  # working copy: the rows still waiting to be inserted
# Fetch the keys that are already present in the database table
yes <- sqlQuery(pg, "SELECT key1, key2 FROM tbl", as.is = TRUE)
for (i in seq_along(yes$key1)) {
  hit <- tmp$key1 == yes$key1[i] & tmp$key2 == yes$key2[i]
  if (any(hit)) {
    # Key already in the table: update the row ...
    sqlUpdate(pg, tmp[hit, ], "tbl", index = c("key1", "key2"))
    # ... and drop it from the rows still to be inserted
    tmp <- tmp[!hit, ]
  }
}
# Whatever is left in tmp is new: insert it
sqlSave(pg, tmp, "tbl", append = TRUE, rownames = FALSE)
This is fine for small tables where updates are frequent and there is no risk
of anyone else doing the same thing at the same time. If the table is big and
updates are rare, it seems like quite an overhead for what would essentially
be inserts. Does anyone have a more rational way of handling big data sets
where updates are rare, e.g. only attempting the update if sqlSave fails?
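Something like this, perhaps (an untested sketch; it assumes sqlSave() signals
an R error when the INSERT is rejected for a duplicate key):
# Try the insert first; fall back to an update only for rows that
# collide with an existing key.
for (i in seq_len(nrow(tbl))) {
  tryCatch(
    sqlSave(pg, tbl[i, ], "tbl", append = TRUE, rownames = FALSE),
    error = function(e)
      sqlUpdate(pg, tbl[i, ], "tbl", index = c("key1", "key2"))
  )
}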
Is it possible to put a lock on the database while doing the updates and
inserts to avoid problems with concurrency?
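Something like the following is what I have in mind (untested; it assumes
RODBC's odbcSetAutoCommit() and odbcEndTran() behave as documented, and the
LOCK TABLE statement and its lock mode are PostgreSQL-specific):
# Hold a table lock for the duration of one transaction so that no other
# session can insert or update concurrently.
odbcSetAutoCommit(pg, autoCommit = FALSE)
sqlQuery(pg, "LOCK TABLE tbl IN SHARE ROW EXCLUSIVE MODE")
# ... do the sqlUpdate()/sqlSave() calls here ...
odbcEndTran(pg, commit = TRUE)  # COMMIT releases the lock
odbcSetAutoCommit(pg, autoCommit = TRUE)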
I'm working with PostgreSQL, but the example should be generic.
Thanks in advance
Mikkel
Rather than selecting all the keys, then having R loop through them, why not
have postgres do it for you with something like:
# go through each line in our entry table
for (i in 1:nrow(tbl)) {
  # check if the pkey already exists
  q <- paste("SELECT key1, key2 FROM tbl WHERE key1 = ", tbl[i, 1],
             " AND key2 = '", tbl[i, 2], "'", sep = "")
  yes <- sqlQuery(pg, q, as.is = TRUE)
  if (nrow(yes) == 1) {
    # update the row if it exists
    sqlUpdate(pg, tbl[i, ], "tbl", index = c("key1", "key2"))
  } else {
    # add the row if it doesn't
    sqlSave(pg, tbl[i, ], "tbl", append = TRUE, rownames = FALSE)
  }
}
This should work fine for small or large tables (especially if the large,
rarely changing table is indexed on its key columns).
Thanks Steven. It obviously makes sense to loop over the much smaller dataset
that is being added rather than over everything that might already be in the
database.
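In fact the per-row queries could be collapsed into a single round trip by
fetching all the matching keys at once. A sketch (it assumes the key2 values
contain no quote characters that would need escaping for SQL):
# Fetch only the keys from tbl that already exist in the table, in one
# query, then split the data.frame into updates and inserts.
keys <- paste("(", tbl$key1, ", '", tbl$key2, "')", sep = "", collapse = ", ")
q <- paste("SELECT key1, key2 FROM tbl WHERE (key1, key2) IN (", keys, ")")
yes <- sqlQuery(pg, q, as.is = TRUE)
exists <- paste(tbl$key1, tbl$key2) %in% paste(yes$key1, yes$key2)
if (any(exists))
  sqlUpdate(pg, tbl[exists, ], "tbl", index = c("key1", "key2"))
if (any(!exists))
  sqlSave(pg, tbl[!exists, ], "tbl", append = TRUE, rownames = FALSE)
Mikkel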