Thread: configuring queries for concurrent updates
[std_disclaimer]I'm not a DBA and I'm running PostgreSQL on a quad-core
Intel laptop. You may read on after you stop laughing...[/std_disclaimer]

I've written a version of UPSERT designed to import a large number of
records. It works in two steps: it UPDATEs incumbent records that share
common keys with the new records, then INSERTs new records that do not
share common keys with the incumbent records. The gist of it is:

    SAVEPOINT ...;

    -- update records from subquery that share common keys with tbl
    UPDATE tbl
       SET f1 = X.f1, f2 = X.f2, ...
      FROM (subquery) AS X
     WHERE tbl.k1 = X.k1 AND tbl.k2 = X.k2 AND ...;

    -- insert records from subquery that do NOT share common keys with tbl
    INSERT INTO tbl (f1, f2, ...)
    SELECT Y.f1, Y.f2, ...
      FROM (subquery) AS Y
      LEFT OUTER JOIN tbl
        ON tbl.k1 = Y.k1 AND tbl.k2 = Y.k2 AND ...
     WHERE tbl.id IS NULL;

    RELEASE SAVEPOINT ...;

The (subquery) frequently generates 5000 records for update/insert.
Three Unix processes are running the import process concurrently, all
updating/inserting records into a common table. Most of the time this
works, but I've observed cases where an import process will:

* get a "PG::Error: ERROR: current transaction is aborted, commands
  ignored until end of transaction block"
* get a "PG::Error: ERROR: deadlock detected"
* go catatonic for several minutes

This leads me to believe that my SAVEPOINT / RELEASE SAVEPOINT is not
the correct way to protect the system from multiprocessing mischief.
I've read Tom Lane's presentation on concurrency
(http://www.postgresql.org/files/developer/concurrency.pdf), but I
haven't been able to figure out the right solution. [For what it's
worth, with the data I'm importing, INSERTs will far outnumber the
UPDATEs.]

Since I someday plan to spin this code up on a big system with >> 3
import processes, I'd like to get this right. Hints and solutions
warmly accepted. Thanks.

- rdp
On 06/23/2012 02:33 PM, Robert Poor wrote:
> [std_disclaimer]I'm not a DBA and I'm running PostgreSQL on a
> quad-core Intel laptop. You may read on after you stop
> laughing...[/std_disclaimer]

Laughing? You don't need to be a formally-qualified DBA to use Pg.
That's half the point - it's focused on safe and correct behaviour and
has great documentation, so you don't need to go through an extensive
training course just to turn it on.

I run Pg on several machines, and of all of them my quad-core,
SSD-equipped i7 laptop is by far the fastest. It's perfectly reasonable
to run a small Pg database on very small hardware and expect it to
perform very well. Of course, if you're throwing millions of rows and
terabytes of data around you'll want hardware to match, but Pg scales
down as well as up.

> I've written a version of UPSERT designed to import a large number of
> records. It works in two steps: it UPDATEs incumbent records that
> share common keys with the new records, then INSERTs new records that
> do not share common keys with the incumbent records.

That's incorrect; it's subject to several nasty races. The best
article I've seen on this is here:

http://www.depesz.com/2012/06/10/why-is-upsert-so-complicated/

(thanks again depesz for your great articles).

--
Craig Ringer
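[The race Craig refers to: two sessions can each run the "does it exist?" check, both see no matching row, and both INSERT the same key. The retry-loop pattern from depesz's article (and the PostgreSQL manual's PL/pgSQL examples) handles this; the sketch below uses an illustrative table `db(a, b)`, not the tables from this thread:]

```sql
-- Per-row upsert with a retry loop; table/column names are illustrative.
CREATE FUNCTION merge_db(key INT, data TEXT) RETURNS VOID AS
$$
BEGIN
    LOOP
        -- First try to update an existing row with this key.
        UPDATE db SET b = data WHERE a = key;
        IF found THEN
            RETURN;
        END IF;
        -- No row to update: try to insert. If a concurrent session
        -- inserted the same key first, we get unique_violation and
        -- loop back to retry the UPDATE against the now-visible row.
        BEGIN
            INSERT INTO db(a, b) VALUES (key, data);
            RETURN;
        EXCEPTION WHEN unique_violation THEN
            -- do nothing; loop and try the UPDATE again
        END;
    END LOOP;
END;
$$
LANGUAGE plpgsql;
```

[Note this requires a UNIQUE constraint on the key column(s) so the conflict surfaces as unique_violation rather than a silent duplicate.]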
Craig:

On Sun, Jun 24, 2012 at 12:06 AM, Craig Ringer <ringerc@ringerc.id.au> wrote:
> That [implementation of UPSERT] is incorrect; it's subject to several nasty races.
> The best article I've seen on this is here:
>
> http://www.depesz.com/2012/06/10/why-is-upsert-so-complicated/

You're right -- that's a thorough and lucid note.

Heeding depesz's warning that advisory locks are not a GENERAL
solution, they're nevertheless appropriate for my application: my code
is the only place where data is added to this particular table. So
advisory locks sound like the way to go -- I'll give that a shot.

Thank you for the pointer.

- rdp
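[A minimal sketch of what the advisory-lock approach could look like for this import, assuming all importers agree on an arbitrary application-chosen lock key (42 here is hypothetical); the transaction-scoped variant releases the lock automatically at commit or rollback:]

```sql
BEGIN;

-- Blocks until no other importer holds lock 42, serializing the
-- update-then-insert pair across the concurrent import processes.
SELECT pg_advisory_xact_lock(42);

-- ... the two-step UPDATE / INSERT from the original post ...

COMMIT;  -- advisory lock released automatically at transaction end
```

[Since only these import processes touch the table, serializing them through one lock closes the update-vs-insert race, at the cost of running the batches one at a time.]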
On 06/24/2012 03:42 PM, Robert Poor wrote:
> Craig:
>
> On Sun, Jun 24, 2012 at 12:06 AM, Craig Ringer <ringerc@ringerc.id.au> wrote:
>> That [implementation of UPSERT] is incorrect; it's subject to several nasty races.
>> The best article I've seen on this is here:
>>
>> http://www.depesz.com/2012/06/10/why-is-upsert-so-complicated/
>
> You're right -- that's a thorough and lucid note.
>
> Heeding depesz's warning that advisory locks are not a GENERAL
> solution, they're appropriate for my application: my code is the only
> place where data is added to this particular table. So advisory locks
> sound like the way to go -- I'll give that a shot.

Yep, advisory locks sound like a good choice for that situation.

True predicate locking would solve this, allowing an app to
SELECT ... FOR UPDATE records that may not yet exist. Pg doesn't do
full predicate locking - it's slow, expensive in memory, hard to get
right, causes deadlocks all over the place, and usually isn't what
users want. Pg's SERIALIZABLE isolation does do predicate locking, but
only lightweight locks used to detect serialization failures, not to
block work from proceeding.

--
Craig Ringer
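[For completeness, a sketch of the SERIALIZABLE alternative Craig alludes to: instead of blocking, the database detects conflicting concurrent work and aborts one transaction, which the client must then retry. The retry loop lives in the application, not in SQL:]

```sql
BEGIN ISOLATION LEVEL SERIALIZABLE;

-- ... the two-step UPDATE / INSERT batch as before ...

COMMIT;

-- If the COMMIT (or an earlier statement) fails with SQLSTATE 40001
-- ("could not serialize access due to ..."), the client must roll back
-- and rerun the entire transaction from the top.
```

[This optimistic approach can allow more concurrency than a single advisory lock when conflicts are rare, but with heavily overlapping import batches the retries may cost more than simply serializing the importers.]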