Thread: configuring queries for concurrent updates

From: Robert Poor

[std_disclaimer]I'm not a DBA and I'm running PostgreSQL on a
quad-core Intel laptop.  You may read on after you stop
laughing...[/std_disclaimer]

I've written a version of UPSERT designed to import a large number of
records.  It works in two steps: it UPDATEs incumbent records that
share common keys with the new records, then INSERTs new records that
do not share common keys with the incumbent records.  The gist of it
is:

SAVEPOINT ...;
-- update records from subquery that share common keys with tbl
UPDATE tbl
  SET f1 = X.f1, f2 = X.f2, ...
  FROM (subquery) AS X
 WHERE tbl.k1 = X.k1 AND tbl.k2 = X.k2 AND ...;
-- insert records from subquery that do NOT share common keys with tbl
    INSERT INTO tbl (f1, f2, ...)
         SELECT Y.f1, Y.f2, ...
           FROM (subquery) AS Y
LEFT OUTER JOIN tbl
             ON tbl.k1 = Y.k1 AND tbl.k2 = Y.k2 AND ...
          WHERE tbl.id IS NULL;
RELEASE SAVEPOINT ...;

The (subquery) typically generates around 5000 records per
update/insert batch.  Three Unix processes run the import
concurrently, all updating/inserting records into a common table.

Most of the time this works, but I've observed examples where an
import process will
* get a "PG::Error: ERROR:  current transaction is aborted, commands
ignored until end of transaction block"
* get a "PG::Error: ERROR:  deadlock detected"
* go catatonic for several minutes

This leads me to believe that my SAVEPOINT / RELEASE SAVEPOINT is not
the correct way to protect the system from multiprocessing mischief.
I've read Tom Lane's presentation on concurrency
(http://www.postgresql.org/files/developer/concurrency.pdf), but I
haven't been able to figure out the right solution.

[For what it's worth, with the data I'm importing, INSERTs will far
outnumber the UPDATEs.]

Since I someday plan to spin this code up on a big system with >> 3
import processes, I'd like to get this right.

Hints and solutions warmly accepted.  Thanks.

- rdp

Re: configuring queries for concurrent updates

From: Craig Ringer

On 06/23/2012 02:33 PM, Robert Poor wrote:
> [std_disclaimer]I'm not a DBA and I'm running PostgreSQL on a
> quad-core Intel laptop.  You may read on after you stop
> laughing...[/std_disclaimer]

Laughing?

You don't need to be a formally-qualified DBA to use Pg. That's half the
point - it's focused on safe and correct behaviour and has great
documentation so you don't need to go through an extensive training
course just to turn it on.

I run Pg on several machines, but of all of them my quad-core
SSD-equipped i7 laptop is by far the fastest. It's perfectly reasonable
to run a small Pg database on very small hardware and expect it to
perform very well.

Of course, if you're throwing millions of rows and terabytes of data
around you'll want to have hardware to match, but Pg scales down as well
as up.

> I've written a version of UPSERT designed to import a large number of
> records.  It works in two steps: it UPDATEs incumbent records that
> share common keys with the new records, then INSERTs new records that
> do not share common keys with the incumbent records.

That's incorrect; it's subject to several nasty races. The best article
I've seen on this is here:

   http://www.depesz.com/2012/06/10/why-is-upsert-so-complicated/

(thanks again depesz for your great articles).
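
To make the race concrete: under the default READ COMMITTED isolation,
nothing stops two sessions from both passing the "doesn't exist yet"
check before either has committed its insert. A sketch of the
interleaving, with made-up key values and assuming a unique index on
the key columns:

-- Session A: BEGIN;
-- Session B: BEGIN;
-- Session A: anti-join finds no row with (k1, k2) = (1, 2),
--            so it INSERTs one (not yet committed)
-- Session B: anti-join ALSO finds no row with (1, 2) - A's insert
--            isn't visible to B's snapshot - so B INSERTs the same
--            key and blocks on the unique index, waiting on A
-- Session A: COMMIT;
-- Session B: ERROR: duplicate key value violates unique constraint
--            ...and every later statement in B's transaction reports
--            "current transaction is aborted, commands ignored until
--            end of transaction block" until B rolls back
-- (With no unique index on the keys there's no error at all; you just
-- silently accumulate duplicate rows.)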

--
Craig Ringer

Re: configuring queries for concurrent updates

From: Robert Poor

Craig:

On Sun, Jun 24, 2012 at 12:06 AM, Craig Ringer <ringerc@ringerc.id.au> wrote:
> That [implementation of UPSERT] is incorrect; it's subject to several nasty races.
> The best article I've seen on this is here:
>
>  http://www.depesz.com/2012/06/10/why-is-upsert-so-complicated/

You're right -- that's a thorough and lucid note.

I take depesz's warning that advisory locks are not a GENERAL
solution, but they're appropriate for my application: my code is the
only place where data is added to this particular table.  So advisory
locks sound like the way to go -- I'll give that a shot.
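
For the record, here's the shape of what I plan to try -- just a
sketch, and assuming Pg 9.1+ for pg_advisory_xact_lock() (on older
versions it would be pg_advisory_lock() / pg_advisory_unlock()). The
key 12345 is an arbitrary constant I picked; it only matters that
every import process uses the same one:

BEGIN;
-- Take an exclusive advisory lock so only one import process runs
-- the upsert at a time; pg_advisory_xact_lock() blocks until the
-- lock is free and releases it automatically at COMMIT or ROLLBACK.
SELECT pg_advisory_xact_lock(12345);
-- update records from subquery that share common keys with tbl
UPDATE tbl
  SET f1 = X.f1, f2 = X.f2, ...
  FROM (subquery) AS X
 WHERE tbl.k1 = X.k1 AND tbl.k2 = X.k2 AND ...;
-- insert records from subquery that do NOT share common keys with tbl
    INSERT INTO tbl (f1, f2, ...)
         SELECT Y.f1, Y.f2, ...
           FROM (subquery) AS Y
LEFT OUTER JOIN tbl
             ON tbl.k1 = Y.k1 AND tbl.k2 = Y.k2 AND ...
          WHERE tbl.id IS NULL;
COMMIT;

The cost is that the upsert batches themselves run one at a time, but
the rest of each import process still runs in parallel.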

Thank you for the pointer.

- rdp

Re: configuring queries for concurrent updates

From: Craig Ringer

On 06/24/2012 03:42 PM, Robert Poor wrote:
> Craig:
>
> On Sun, Jun 24, 2012 at 12:06 AM, Craig Ringer <ringerc@ringerc.id.au> wrote:
>> That [implementation of UPSERT] is incorrect; it's subject to several nasty races.
>> The best article I've seen on this is here:
>>
>>   http://www.depesz.com/2012/06/10/why-is-upsert-so-complicated/
>
> You're right -- that's a thorough and lucid note.
>
> I take depesz's warning that advisory locks are not a GENERAL
> solution, but they're appropriate for my application: my code is the
> only place where data is added to this particular table.  So advisory
> locks sound like the way to go -- I'll give that a shot.

Yep, advisory locks sound like a good choice for that situation.

True predicate locking would solve this, allowing an app to SELECT ...
FOR UPDATE records that may not yet exist. Pg doesn't do full predicate
locking - it's slow, memory-hungry, hard to get right, causes deadlocks
all over the place, and usually isn't what users want. Pg's SERIALIZABLE
isolation does take predicate locks, but only lightweight ones used to
detect serialization failures, not to block work from proceeding.
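
So even SERIALIZABLE doesn't make retries go away - it just detects
the conflict reliably. If you went that route, the pattern is roughly
this (a sketch; the retry loop has to live in your client code):

BEGIN ISOLATION LEVEL SERIALIZABLE;
-- run the whole two-step UPDATE-then-INSERT upsert here, unchanged
COMMIT;
-- Any statement, including the COMMIT itself, may fail with SQLSTATE
-- 40001 (serialization_failure). That's not a bug: it's Pg reporting
-- that the concurrent transactions couldn't be made to look serial.
-- On 40001 the client must ROLLBACK and re-run the whole transaction
-- from the top.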


--
Craig Ringer