Hey everyone,
I don't have a question. Instead some helpful advice with things
I've
learned from trying to connect 'R' to databases using RODBC.
ROBDC is a very handy tool that, once you have everything fixed up
nicely, is a great way to have scripts run fairly autonomously, safe in the
knowledge that data isn't accidentally messed up. But I was fairly ignorant
about databases, and I suspect maybe some others out there are as well.
1) Connecting: the function odbcConnect() works very well, presuming you
first have things set up properly with your connection to a database. For
instance, what is the "data source name", or *dsn*, that odbcConnect()
asks
for? In my ignorant view, I see this as a "nickname" that you have
created
on your computer (i.e., *outside* of 'R') to make a connection between
your
computer & the database you wish to talk to. I have set up my database
connections through windows software, so I can only speak intelligently to
that. There is a windows program called "ODBC Data Source
Administrator"
likely hidden somewhere on your computer. If you have Windows version 7,
simply search for "ODBC" in that Start menu "search programs
& files" box.
You should find the ODBC data source administrator.
Again, this is specific to my Windows 7 computer, but here's what I then
did: click on the User DSN tab. Then click the "add" button. Choose
a
driver type (e.g., SQL Server)... this is dependent upon the database you
are connecting to. In the next window, give it a name & description, then
get the server name (the server where the database resides, your IT admin
would likely know this). I don't want to go into too many other details on
setting this up, beyond saying you will likely need to change the "default
database" to the one which you want to access. Finally, the *name* you
give
this new User DSN is the *dsn* you will use when calling odbcConnect().
Once you've set up this data sourcing junk properly, the odbcConnect() call
is trivial. (Famous last words...)
2) Querying: sqlQuery() works in a fairly straight-forward manner,
presuming 2 things: (a) you have set up the dsn correctly & connected to it
(see above), and (b) you know something of database querying language. I
don't want to get bogged down by any details here, but I find this website
to have good tutorials (although horrible search capabilities or
reference-style documentation): http://sqlzoo.net/
Final point on querying: with an accidentally typo, your query can go from
2,000 rows long to 2,000,000,000 rows long (e.g., if you perform a join
incorrectly). So it's sometimes worthwhile to make your first query a
"count(*)" query so that you know how many rows of data you'll get
back.
3) Populating the database: here is where I mostly wanted to focus. Dr.
Ripley has some great documentation on RODBC. (E.g.,
http://cran.r-project.org/web/packages/RODBC/vignettes/RODBC.pdf has great
info beyond what can be found in the help files.) But it seemed there
was *very
little* info on how to populate a database. If you're anything like me, you
are generating useful results (or so I pretend) and you want to share them
with your organization. So my details below help to *populate an existing
database table*. Most of this is useful for other cases, too.
One of the trickiest things in making sure that you populate the database
correctly is making sure that you have the right *data type*. Databases are
quite picky about data types; they are optimized for fast access & minimal
storage, so if they can do something in 2 bytes instead of 16, they will.
Therefore, there are a bunch of different data types. Instead of guessing
which data types are being used, I simply query the database and grab the
data types (this will work even if the database has columns but is empty),
then use this information when populating. In short, I do the following:
library(RODBC)
dbCon <- odbcConnect(db)
tmp <- sqlColumns(dbCon, dbTable) ## this function grabs a bunch of info
about the columns)
varTypes <- as.character(tmp$TYPE_NAME)
names(varTypes) <- as.character(tmp$COLUMN_NAME)
sqlSave(dbCon, dataSet, dbTable, append=TRUE, rownames=FALSE,
varTypes=varTypes)
The "secret ingredient" here is varTypes : this is a named character
vector
of data types, as said in many other threads. But done in this fashion
above, you're bound to have the types set up properly. E.g., varTypes might
look something like this:
>varTypes
my_primary_key_id my_foreign_key_id ID
"bigint" "bigint"
"varchar"
barcode lane type
"varchar" "int"
"varchar"
predict_value
"float"
Another thing I find handy is to pre-generate the primary key in the data
table, since this is usually some unique numerical identifier, but otherwise
just gibberish. Below I have written a wrapper function that works quite
nicely for me, and that I hope others out there might find handy.
A HUGE thanks to Dr. Ripley for making an excellent package!!
Regards,
Mike
####################################################################################
####################################################################################
db.populate <- function(dataSet=NULL, dbTable=NULL, primeKey=NULL,
db="blah", check.names=TRUE,
verbose=FALSE, safer=TRUE, fast=TRUE,
test=FALSE,
nastring=NULL) {
iAm <- "db.populate"
if (is.null(dataSet) | is.null(dbTable))
stop(paste(iAm,": Both \"dataSet\" and
\"dbtable\" variables must
be",
" provided.", sep=""))
### connect to the database & query tables.
dbCon <- odbcConnect(db)
tmp <- sqlColumns(dbCon, dbTable)
varTypes <- as.character(tmp$TYPE_NAME)
names(varTypes) <- as.character(tmp$COLUMN_NAME)
if (!is.null(primeKey)) {
myQuery <- paste("Select max(",primeKey,") from
",dbTable)
primeKeys <- seq(1,dim(dataSet)[1]) + sqlQuery(dbCon, myQuery)[1,1]
dataSet <- cbind(primeKeys,dataSet) ; names(dataSet)[1] <-
primeKey
} ## end if clause to create prime Key
if (check.names) {
if (length(setdiff(names(dataSet),names(varTypes)))!=0) {
message(paste(iAm,": column names of \"dataSet\" do
not match",
" those of \"dbTable\", ",dbTable,
sep=""))
message("\n\tNames of \"dataSet\":")
print(names(dataSet))
message(paste("\n\tNames in
\"dbTable\",",dbTable,":"))
print(names(varTypes))
stop(paste(iAm,": stopped due to this mis-match.",
sep=""))
} ## end if clause to see if "dataSet" &
"dbTable" names match
dataSet <- dataSet[,names(varTypes)]
} ## end if clause to check names & re-arrange "dataSet" as
needed
message(paste(iAm, ": populating table ",dbTable, " with
\"dataSet\"",
sep=""))
sqlSave(dbCon, dataSet, dbTable, append=TRUE, rownames=FALSE,
verbose=verbose, safer=safer, varTypes=varTypes, fast=fast,
test=test, nastring=nastring)
odbcClose(dbCon)
message(paste(iAm, ": added \"dataSet\" to table
",dbTable,
sep=""))
if (verbose) {
message(paste(iAm, ": ",dim(dataSet)[1]," rows
added.", sep=""))
if (!is.null(primeKey)) {
message(paste(iAm, ": primary key ",primeKey,"
updated."))
message(paste("\t",primeKey," values from
",primeKeys[1],
" to ",primeKeys[length(primeKeys)],
"are the newly updated data",
sep=""))
}
}
####################################################################################
####################################################################################
Finally, a few words to act as good keys if someone out there does a search
for info:
R
R-help
RODBC
ODBC
database
SQL
table
sqlSave
obdcConnect
[[alternative HTML version deleted]]