We have a PostgreSQL table with about 400000 records in it. Using either RODBC or RdbiPgSQL, what is the fastest way to update one (or a few) columns in a large collection of records? Currently we're sending SQL like

    BEGIN
    UPDATE table SET col1=value WHERE id=id
    (repeated thousands of times for different ids)
    COMMIT

and this takes hours to complete. Surely there must be a quicker way?
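Roughly, what we're doing now on the R side looks like this (just a sketch: the DSN, table, and column names are placeholders and the values are invented for illustration):

    library(RODBC)
    channel <- odbcConnect("mydsn")        # placeholder DSN

    ## 'changes' holds the new values; invented here for illustration
    changes <- data.frame(id = c(10L, 42L, 99L), col1 = c(0.1, 0.2, 0.3))

    sqlQuery(channel, "BEGIN")
    for (i in seq_len(nrow(changes))) {
        sqlQuery(channel, sprintf("UPDATE bigtable SET col1 = %f WHERE id = %d",
                                  changes$col1[i], changes$id[i]))
    }
    sqlQuery(channel, "COMMIT")
    odbcClose(channel)

Duncan Murdoch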
I was going to suggest sqlUpdate in RODBC, but it looks like that function also uses the UPDATE command repeated nrow times.

A second strategy, which I generally prefer because it relies on RODBC less and better supports transaction control, is to first create a temporary table containing the new columns and an identifier column (perhaps using sqlSave). Then you can join the two tables on the identifier column and set the old column to the new column en masse with a single UPDATE. Often the bottleneck in row-by-row updates is searching for the index of the id each time; doing the entire join up front and then updating usually speeds this up considerably. In general, if you are ever doing something that resembles a FOR loop in SQL, there's a faster way.

Something like this is what I have in mind, although you may need to tweak it for PostgreSQL syntax:

    UPDATE table AS old
    SET col1 = (SELECT new.col1
                FROM tempTable AS new
                WHERE new.idCol = old.idCol)
    WHERE EXISTS (SELECT 1 FROM tempTable AS new WHERE new.idCol = old.idCol)

You should also make sure that your table is indexed well (in particular on the identifier column) so the join and the update are fast.
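In R the whole round trip might look roughly like this (an untested sketch; the DSN is a placeholder, 'bigtable' stands for your real table, and the names and values are invented):

    library(RODBC)
    channel <- odbcConnect("mydsn")                  # placeholder DSN

    ## new values to apply, keyed by id (invented for illustration)
    newvals <- data.frame(id = c(10L, 42L, 99L), col1 = c(0.5, 1.5, 2.5))

    ## push them to the server as a table, then do one set-based UPDATE
    sqlSave(channel, newvals, tablename = "temp_updates", rownames = FALSE)
    sqlQuery(channel, "
        UPDATE bigtable AS t
        SET col1 = (SELECT u.col1 FROM temp_updates AS u WHERE u.id = t.id)
        WHERE EXISTS (SELECT 1 FROM temp_updates AS u WHERE u.id = t.id)")
    sqlQuery(channel, "DROP TABLE temp_updates")

HTH,
Robert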
Your approach seems very inefficient - it looks like you're executing thousands of UPDATE statements. Try something like this instead:

    -- build a table 'updates' (id and value)
    ...
    -- do all the updates via a single join
    UPDATE bigtable a
    JOIN updates b ON a.id = b.id
    SET a.col1 = b.value;

You may need to adjust the syntax.
Thanks to Robert McGehee and Bogdan Romocea for their responses. Putting them together, I think the following will do what I want: put the updates into a temporary table called updates, then run

    UPDATE bigtable AS a
    SET col1 = b.col1
    FROM updates AS b
    WHERE a.id = b.id

The FROM clause is a PostgreSQL extension, so this is not portable; MySQL does the same thing with a different syntax:

    UPDATE bigtable AS a, updates AS b
    SET a.col1 = b.col1
    WHERE a.id = b.id

I don't think SQLite supports updating one table from another.
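For the record, the whole sequence from R might look roughly like this with RODBC (an untested sketch; the DSN, names, and values are invented, and RdbiPgSQL users would send the same SQL through its own query functions instead):

    library(RODBC)
    channel <- odbcConnect("mydsn")                  # placeholder DSN

    ## new values keyed by id (invented for illustration)
    updates <- data.frame(id = c(10L, 42L, 99L), col1 = c(0.5, 1.5, 2.5))

    sqlSave(channel, updates, tablename = "updates", rownames = FALSE)
    sqlQuery(channel, "
        UPDATE bigtable AS a
        SET col1 = b.col1
        FROM updates AS b
        WHERE a.id = b.id")
    sqlQuery(channel, "DROP TABLE updates")
    odbcClose(channel)

Duncan Murdoch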
> I don't think this will be any faster - why would creating a new
> table be faster than updating existing rows?

Hadley,

There are several reasons that running one large load and one large update would be significantly faster than thousands of individual updates.

First, the time it takes to execute a query does not grow linearly with the size of the query. That is, the statement "SELECT TOP 100 * FROM table" takes about 1.8 times as long as "SELECT TOP 10 * FROM table", not 10 times as long (using estimated query costs on tables in my database with MS-SQL). The reason is that SQL is optimized to perform well for large queries, and many of the steps in a SQL operation are needlessly repeated when multiple UPDATE/SELECT statements are issued rather than one large UPDATE/SELECT. For instance, on most SQL UPDATEs the time is spent primarily on 1) sorting the input, 2) performing a clustered index seek, and 3) performing a clustered index update. In a toy UPDATE example, the physical operation of sorting 2000 rows takes only 6 times longer than sorting a little over 100 rows, and the clustered index seek and update take about 10 times longer. That is far less than the 20x we would expect from a linear row-by-row update. So even if it takes an additional 50% longer to first load the data into a temporary table, we still see the opportunity for large speed increases.

A second reason to expect one large query to run faster is that it is much easier to process in parallel on multiple processors: one processor can be joining the tables while a second is performing the clustered indexing. A bunch of single UPDATE statements is forced to run serially. So if the above examples were more complicated, we should expect an even larger cost saving per row.

From your example, a third reason is that with multiple updates the SQL server (at least my MS-SQL server) updates the transaction log after every query (unless you wisely run in batch mode, as Duncan did with his BEGIN/COMMIT syntax), so significantly more I/O time is spent between UPDATE statements. The RODBC sqlUpdate function, for instance, does not take advantage of transaction control, so to speed up long queries I've often resorted to sending over temporary tables (as I suggested here), stored procedures, or even data stored as XML tables. (A rough sketch of explicit transaction control from RODBC is below.)

Lastly, removing your indices before the update would likely only slow down the query. If the table is not indexed on the id key, the SQL server has to search the entire table to find each matching id before it can be updated - like searching a dictionary that isn't in alphabetical order. That said, indices can slow things down when a significant number of rows are being added, since the table has to be reindexed when the insert completes. However, Duncan isn't doing that here.
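If you are stuck sending row-by-row statements, explicit transaction control at least avoids a log write per statement. A rough sketch with RODBC (assuming a driver and RODBC version that support switching off autocommit; otherwise sending BEGIN/COMMIT as Duncan does works too; names and values are invented):

    library(RODBC)
    channel <- odbcConnect("mydsn")                   # placeholder DSN

    ## invented values for illustration
    changes <- data.frame(id = c(10L, 42L, 99L), col1 = c(0.1, 0.2, 0.3))

    odbcSetAutoCommit(channel, autoCommit = FALSE)    # open one transaction
    for (i in seq_len(nrow(changes))) {
        sqlQuery(channel, sprintf("UPDATE bigtable SET col1 = %f WHERE id = %d",
                                  changes$col1[i], changes$id[i]))
    }
    odbcEndTran(channel, commit = TRUE)               # commit all updates at once
    odbcSetAutoCommit(channel, autoCommit = TRUE)
    odbcClose(channel)

Best,
Robert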
I've never had a problem with using large numbers of SQL UPDATE statements (on the order of hundreds of thousands) to update a table and having them complete in a reasonable time (a few minutes).

How heavily indexed is the field you are updating? You may be able to get some speed improvement by turning off indices before the update and back on again afterwards (highly dependent on your database system though); a rough sketch is below.

I would strongly suspect your bottleneck lies elsewhere (e.g. when generating the statements in R, or using ODBC to send them).
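The index-toggling pattern is roughly this (PostgreSQL syntax assumed; the index, table, and column names are invented, and it only makes sense for indexes you are not relying on to locate the rows being updated):

    library(RODBC)
    channel <- odbcConnect("mydsn")     # placeholder DSN

    ## drop a secondary index, do the bulk work, then rebuild it
    sqlQuery(channel, "DROP INDEX bigtable_col2_idx")
    ## ... run the updates / inserts here ...
    sqlQuery(channel, "CREATE INDEX bigtable_col2_idx ON bigtable (col2)")
    odbcClose(channel)

Hadley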