Thread: BUG #4945: Parallel update(s) gone wild

BUG #4945: Parallel update(s) gone wild

From
"dan boeriu"
Date:
The following bug has been logged online:

Bug reference:      4945
Logged by:          dan boeriu
Email address:      dan.boeriu@roost.com
PostgreSQL version: 8.3.6, 8.3.7, 8.4
Operating system:   Red Hat Linux
Description:        Parallel update(s) gone wild
Details:

In my update I use one table to read from (let's call it the READ table) and
one table to update (let's call it the WRITE table).
Often the updates are started almost in parallel.
When I have 2 parallel updates, the first one finishes, but the second sends
the CPU to 100% and in certain situations it never finishes.
What I noticed is that the second will not finish if the READ table has many
rows to be read (1 million, let's say), but it finishes when the READ table
has only a few thousand rows.
Any idea why?
BTW - I disabled the triggers on both tables.
I also tried with autovacuum=off and it made no difference.

Re: BUG #4945: Parallel update(s) gone wild

From
Craig Ringer
Date:
On Mon, 2009-07-27 at 21:56 +0000, dan boeriu wrote:

> What I noticed is that the second will not finish if the READ table has many
> rows to be read (1 million let's say) but it finishes when the read table
> has only a few 1000s of rows.
> Any idea why?

It could be that it _does_ finish ... eventually. It might be doing
something that scales very poorly with number of input rows, like a
nested loop within a nested loop.

Can you provide EXPLAIN ANALYZE output for the problem query?

--
Craig Ringer

Re: BUG #4945: Parallel update(s) gone wild

From
Craig Ringer
Date:
Please reply to the list, not directly to me.

> I don't think it's that simple. The VERY SAME statement runs twice - one
> finishes in about 20 secs, the other doesn't finish in 24 hours.

Yep, OK, so it's not just a planning or scaling issue.

Have you checked pg_locks ?

  SELECT * FROM pg_locks;

What does pg_stat_activity indicate about the query?

  SELECT * FROM pg_stat_activity;

> The plan might change from one execution to the other - is there a way
> to get the executed plan at runtime?

I think there is in 8.4, but I haven't moved up to it and tested yet.
Not in previous versions.
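
(As an illustrative aside: 8.4 ships a contrib module, auto_explain, that can
log the plan a statement actually executed. A rough sketch of enabling it for
a single session, assuming the contrib modules are installed; the plans then
show up in the server log:)

  -- Load auto_explain into just this session (contrib module, new in 8.4).
  LOAD 'auto_explain';
  -- Log the executed plan for every statement (threshold in ms; 0 = all).
  SET auto_explain.log_min_duration = 0;
  -- Include actual row counts and timings, as EXPLAIN ANALYZE would.
  SET auto_explain.log_analyze = true;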

--
Craig Ringer

Re: BUG #4945: Parallel update(s) gone wild

From
"Dan Boeriu"
Date:
I am doing a reply-all - hope that's fine.

pg_locks shows only locks held by that PostgreSQL process - I am the only
one on the box.
pg_stat_activity shows 2 things:
1) the update
2) vacuum on the updated table

Now, I tried to isolate the problem even more and turned autovacuum OFF.
Then pg_stat_activity shows only the UPDATE statement.

Here is one weird thing:
If I cancel the second UPDATE and then restart it, it runs in 20 secs.

Let me repeat the problem: I run 2 SQL UPDATEs (the same statement) in
parallel.
The first one to acquire the locks on the updated table goes through in 20
secs.
The second will not finish in 24h, but the CPU stays at 100% and iostat
shows no IO other than the checkpoints.
If I cancel the second UPDATE (pg_cancel_backend) and restart it (new pgsql
invocation) ASAP, it goes through in 20 secs.


Dan Boeriu
Senior Architect - Roost.com
P: (415) 742 8056
Roost.com - 2008 Inman Award Winner for Most Innovative New Technology

Re: BUG #4945: Parallel update(s) gone wild

From
Robert Haas
Date:
On Jul 28, 2009, at 1:16 PM, "Dan Boeriu" <dan.boeriu@roost.com> wrote:

> I am doing a reply-all - hope that's fine.
>
> The pg_locks shows only locks held by that Postgresql process - I am
> the only one on the box.
> pg_stat_activity shows 2 things:
> 1) the update
> 2) vacuum on the updated table
>
> Now, I tried to isolate the problem even more and turned the auto
> vacuum OFF.
> Then pg_stat_activity shows only the UPDATE statement.
>
> Here is one weird thing:
> If I cancel the second UPDATE and then restart it, it runs in 20 secs.
>
> Let me repeat the problem: I run 2 SQL UPDATEs (the same statement)
> in parallel.
> The first one to acquire the locks on the updated table goes through
> in 20 secs.
> The second will not finish in 24h but the CPU stays at 100% and
> iostat shows no IO other than the checkpoints.
> IF I cancel the second UPDATE (pg_cancel_backend) and restart it
> (new pgsql invocation) ASAP it goes through in 20 secs.
>
>
> Dan Boeriu
> Senior Architect - Roost.com
> P: (415) 742 8056
> Roost.com - 2008 Inman Award Winner for Most Innovative New Technology
>
How about posting:

Server version
EXPLAIN ANALYZE output
Schemas of relevant tables

Or far better still:

A self-contained reproducible test case

...Robert

Re: BUG #4945: Parallel update(s) gone wild

From
Craig Ringer
Date:
Dan Boeriu wrote:

> The second will not finish in 24h, but the CPU stays at 100% and iostat
> shows no IO other than the checkpoints.
> If I cancel the second UPDATE (pg_cancel_backend) and restart it (new
> pgsql invocation) ASAP, it goes through in 20 secs.

OK, that's interesting when combined with the information provided.
(Though, in general, it's preferable that you quote command output as
well as your interpretation; I think we all miss things sometimes.)

You might want to connect to the problem process (the one pegged at 100%
CPU) with gdb and see if you can get a backtrace showing what's going on.

Where the pid of the problem process is shown as 9999 here:

gdb -p 9999
(gdb) bt
(gdb) cont
^C
(gdb) bt
(gdb) cont
^C
(gdb) bt
(gdb) cont
^C
(gdb) bt
(gdb) q
y


In other words: connect to the process with gdb. Issue the "bt" command
(backtrace) to get a backtrace at the point where execution stopped when gdb
attached, then tell it to continue execution normally. After a little while
(a few seconds or minutes - just tell us roughly how long) hit control-C,
request another backtrace, and tell the program to continue. Repeat a few
times, then quit gdb and tell it to detach from the process.

Then send the ENTIRE, EXACT output to the mailing list.

If the output mostly looks like this:

#0  0xb7f6a410 in ?? ()
#1  0xbfe75578 in ?? ()
#2  0x00000000 in ?? ()

then your Linux distro has stripped debug info out of the Pg binaries.
You may need to install debuginfo RPMs for PostgreSQL and glibc at the
very least, then repeat the process of collecting backtrace information.

--
Craig Ringer

Re: BUG #4945: Parallel update(s) gone wild

From
"Dan Boeriu"
Date:

Attached is the reproducible test case - I was able to reproduce the problem on 32- and 64-bit 8.3.6 and 8.4.0, Red Hat 5.3, kernel 2.6.18-128.1.16.el5 #1 SMP.

JUST IN CASE - if the second update ever finishes, just run the following a couple of times:

INSERT INTO xxx SELECT * FROM xxx;

That should do it.

Please let me know if you were able to reproduce the problem.


Dan Boeriu
Senior Architect - Roost.com
P: (415) 742 8056
Roost.com - 2008 Inman Award Winner for Most Innovative New Technology

Attachment

Re: BUG #4945: Parallel update(s) gone wild

From
Tom Lane
Date:
"Dan Boeriu" <dan.boeriu@roost.com> writes:
> Attached is the reproducible test case - I was able to reproduce the
> problem on 32- and 64-bit 8.3.6 and 8.4.0, Red Hat 5.3, kernel
> 2.6.18-128.1.16.el5 #1 SMP

I looked at this a bit.  It's the same issue discussed at
http://archives.postgresql.org/pgsql-bugs/2008-09/msg00045.php
namely, that the second update finds itself trying to update a large
number of tuples that were already updated since its snapshot was taken.
That means it has to re-verify that the updated versions of those tuples
meet its WHERE qualification.  That's done by a function EvalPlanQual
that's pretty darn inefficient for complex queries like this one.
It's essentially redoing the join (and recomputing the whole sub-SELECT)
for each row that needs to be updated.

Someday I'd like us to redesign that mechanism, but don't hold
your breath ...
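
(To make the failure mode concrete, a minimal two-session sketch of the
pattern described above; the table and column names are invented for
illustration and stand in for the ones in the test case:)

  -- Session 1: commits first, rewriting a large fraction of the rows.
  UPDATE write_table w
     SET cnt = s.cnt
    FROM (SELECT read_id, count(*) AS cnt
            FROM read_table GROUP BY read_id) s
   WHERE w.read_id = s.read_id;

  -- Session 2: the very same statement, started at almost the same time.
  -- Once session 1 commits, every row session 2 wants to touch has been
  -- updated since session 2's snapshot was taken, so EvalPlanQual must
  -- re-verify the WHERE condition - effectively re-running the join and
  -- sub-SELECT - once per conflicting row.
  UPDATE write_table w
     SET cnt = s.cnt
    FROM (SELECT read_id, count(*) AS cnt
            FROM read_table GROUP BY read_id) s
   WHERE w.read_id = s.read_id;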

            regards, tom lane

Re: BUG #4945: Parallel update(s) gone wild

From
"Dan Boeriu"
Date:
Is there a workaround?
To us this is pretty bad news; we receive updates from several partners and
constantly update the counts as in the example I sent you...
Obviously we could serialize the updates, but that would be a pretty sad
thing to do in a database.
Realistically, when will we see this fixed (I understand it has pretty low
priority...)?
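
(For the record, a minimal sketch of what explicitly serializing the
competing statements could look like, using an advisory lock; the key 12345
is arbitrary and write_table/read_table are stand-in names:)

  -- Each session takes the same advisory lock before its UPDATE, so the
  -- second one simply waits instead of grinding through EvalPlanQual
  -- re-checks against already-updated rows.
  SELECT pg_advisory_lock(12345);

  UPDATE write_table w
     SET cnt = s.cnt
    FROM (SELECT read_id, count(*) AS cnt
            FROM read_table GROUP BY read_id) s
   WHERE w.read_id = s.read_id;

  -- Release the lock so the other session can run its own (now fast) pass.
  SELECT pg_advisory_unlock(12345);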

Thanks a bunch for your time,

Dan Boeriu
Senior Architect - Roost.com
P: (415) 742 8056
Roost.com - 2008 Inman Award Winner for Most Innovative New Technology