We have a PostgreSQL table with about 400000 records in it. Using either RODBC or RdbiPgSQL, what is the fastest way to update one (or a few) columns in a large collection of records? Currently we're sending SQL like

    BEGIN
    UPDATE table SET col1=value WHERE id=id
    (repeated thousands of times for different ids)
    COMMIT

and this takes hours to complete. Surely there must be a quicker way?
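Roughly, what we're doing now on the R side looks like this (just a sketch: the DSN, table, and column names are placeholders and the values are invented for illustration):

    library(RODBC)
    channel <- odbcConnect("mydsn")        # placeholder DSN

    ## 'changes' holds the new values; invented here for illustration
    changes <- data.frame(id = c(10L, 42L, 99L), col1 = c(0.1, 0.2, 0.3))

    sqlQuery(channel, "BEGIN")
    for (i in seq_len(nrow(changes))) {
        sqlQuery(channel, sprintf("UPDATE bigtable SET col1 = %f WHERE id = %d",
                                  changes$col1[i], changes$id[i]))
    }
    sqlQuery(channel, "COMMIT")
    odbcClose(channel)

Duncan Murdoch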
I was going to suggest sqlUpdate in RODBC, but it looks like that function also uses the UPDATE command repeated nrow times.

A second strategy, which I generally prefer because it relies on RODBC less and better supports transaction control, is to first create a temporary table containing the new columns and an identifier column (perhaps using sqlSave). Then you can join the two tables on the identifier column and set the old column to the new column en masse with a single UPDATE. Often the bottleneck in row-by-row updates is searching for the index of the id each time; doing the entire join up front and then updating usually speeds this up considerably. In general, if you are ever doing something that resembles a FOR loop in SQL, there's a faster way.

Something like this is what I have in mind, although you may need to tweak it for PostgreSQL syntax:

    UPDATE table AS old
    SET col1 = (SELECT new.col1
                FROM tempTable AS new
                WHERE new.idCol = old.idCol)
    WHERE EXISTS (SELECT 1 FROM tempTable AS new WHERE new.idCol = old.idCol)

You should also make sure that your table is indexed well (in particular on the identifier column) so the join and the update are fast.
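In R the whole round trip might look roughly like this (an untested sketch; the DSN is a placeholder, 'bigtable' stands for your real table, and the names and values are invented):

    library(RODBC)
    channel <- odbcConnect("mydsn")                  # placeholder DSN

    ## new values to apply, keyed by id (invented for illustration)
    newvals <- data.frame(id = c(10L, 42L, 99L), col1 = c(0.5, 1.5, 2.5))

    ## push them to the server as a table, then do one set-based UPDATE
    sqlSave(channel, newvals, tablename = "temp_updates", rownames = FALSE)
    sqlQuery(channel, "
        UPDATE bigtable AS t
        SET col1 = (SELECT u.col1 FROM temp_updates AS u WHERE u.id = t.id)
        WHERE EXISTS (SELECT 1 FROM temp_updates AS u WHERE u.id = t.id)")
    sqlQuery(channel, "DROP TABLE temp_updates")

HTH,
Robert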
Your approach seems very inefficient - it looks like you're executing thousands of UPDATE statements. Try something like this instead:

    -- build a table 'updates' (id and value)
    ...
    -- do all the updates via a single join
    UPDATE bigtable a
    JOIN updates b ON a.id = b.id
    SET a.col1 = b.value;

You may need to adjust the syntax.
Thanks to Robert McGehee and Bogdan Romocea for their responses. Putting them together, I think the following will do what I want: put the updates into a temporary table called updates, then run

    UPDATE bigtable AS a
    SET col1 = b.col1
    FROM updates AS b
    WHERE a.id = b.id

The FROM clause is a PostgreSQL extension, so this is not portable; MySQL does the same thing with a different syntax:

    UPDATE bigtable AS a, updates AS b
    SET a.col1 = b.col1
    WHERE a.id = b.id

I don't think SQLite supports updating one table from another.
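For the record, the whole sequence from R might look roughly like this with RODBC (an untested sketch; the DSN, names, and values are invented, and RdbiPgSQL users would send the same SQL through its own query functions instead):

    library(RODBC)
    channel <- odbcConnect("mydsn")                  # placeholder DSN

    ## new values keyed by id (invented for illustration)
    updates <- data.frame(id = c(10L, 42L, 99L), col1 = c(0.5, 1.5, 2.5))

    sqlSave(channel, updates, tablename = "updates", rownames = FALSE)
    sqlQuery(channel, "
        UPDATE bigtable AS a
        SET col1 = b.col1
        FROM updates AS b
        WHERE a.id = b.id")
    sqlQuery(channel, "DROP TABLE updates")
    odbcClose(channel)

Duncan Murdoch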
> I don't think this will be any faster - why would creating a new
> table be faster than updating existing rows?

Hadley,

There are several reasons that running one large load and one large update would be significantly faster than thousands of individual updates.

First, the time it takes to execute a query does not grow linearly with the size of the query. That is, the statement "SELECT TOP 100 * FROM table" takes about 1.8 times as long as "SELECT TOP 10 * FROM table", not 10 times as long (using estimated query costs on tables in my database with MS-SQL). The reason is that SQL is optimized to perform well for large queries, and many of the steps in a SQL operation are needlessly repeated when multiple UPDATE/SELECT statements are issued rather than one large UPDATE/SELECT. For instance, on most SQL UPDATEs the time is spent primarily on 1) sorting the input, 2) performing a clustered index seek, and 3) performing a clustered index update. In a toy UPDATE example, the physical operation of sorting 2000 rows takes only 6 times longer than sorting a little over 100 rows, and the clustered index seek and update take about 10 times longer. That is far less than the 20x we would expect from a linear row-by-row update. So even if it takes an additional 50% longer to first load the data into a temporary table, we still see the opportunity for large speed increases.

A second reason to expect one large query to run faster is that it is much easier to process in parallel on multiple processors: one processor can be joining the tables while a second is performing the clustered indexing. A bunch of single UPDATE statements is forced to run serially. So if the above examples were more complicated, we should expect an even larger cost saving per row.

From your example, a third reason is that with multiple updates the SQL server (at least my MS-SQL server) updates the transaction log after every query (unless you wisely run in batch mode, as Duncan did with his BEGIN/COMMIT syntax), so significantly more I/O time is spent between UPDATE statements. The RODBC sqlUpdate function, for instance, does not take advantage of transaction control, so to speed up long queries I've often resorted to sending over temporary tables (as I suggested here), stored procedures, or even data stored as XML tables. (A rough sketch of explicit transaction control from RODBC is below.)

Lastly, removing your indices before the update would likely only slow down the query. If the table is not indexed on the id key, the SQL server has to search the entire table to find each matching id before it can be updated - like searching a dictionary that isn't in alphabetical order. That said, indices can slow things down when a significant number of rows are being added, since the table has to be reindexed when the insert completes. However, Duncan isn't doing that here.
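If you are stuck sending row-by-row statements, explicit transaction control at least avoids a log write per statement. A rough sketch with RODBC (assuming a driver and RODBC version that support switching off autocommit; otherwise sending BEGIN/COMMIT as Duncan does works too; names and values are invented):

    library(RODBC)
    channel <- odbcConnect("mydsn")                   # placeholder DSN

    ## invented values for illustration
    changes <- data.frame(id = c(10L, 42L, 99L), col1 = c(0.1, 0.2, 0.3))

    odbcSetAutoCommit(channel, autoCommit = FALSE)    # open one transaction
    for (i in seq_len(nrow(changes))) {
        sqlQuery(channel, sprintf("UPDATE bigtable SET col1 = %f WHERE id = %d",
                                  changes$col1[i], changes$id[i]))
    }
    odbcEndTran(channel, commit = TRUE)               # commit all updates at once
    odbcSetAutoCommit(channel, autoCommit = TRUE)
    odbcClose(channel)

Best,
Robert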
I've never had a problem with using large numbers of SQL UPDATE statements (on the order of hundreds of thousands) to update a table and having them complete in a reasonable time (a few minutes).

How heavily indexed is the field you are updating? You may be able to get some speed improvement by turning off indices before the update and back on again afterwards (highly dependent on your database system though); a rough sketch is below.

I would strongly suspect your bottleneck lies elsewhere (e.g. when generating the statements in R, or using ODBC to send them).
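The index-toggling pattern is roughly this (PostgreSQL syntax assumed; the index, table, and column names are invented, and it only makes sense for indexes you are not relying on to locate the rows being updated):

    library(RODBC)
    channel <- odbcConnect("mydsn")     # placeholder DSN

    ## drop a secondary index, do the bulk work, then rebuild it
    sqlQuery(channel, "DROP INDEX bigtable_col2_idx")
    ## ... run the updates / inserts here ...
    sqlQuery(channel, "CREATE INDEX bigtable_col2_idx ON bigtable (col2)")
    odbcClose(channel)

Hadley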