Thread: BUG #5443: Undetected deadlock situation

BUG #5443: Undetected deadlock situation

From
"Claudio"
Date:
The following bug has been logged online:

Bug reference:      5443
Logged by:          Claudio
Email address:      claudio@livra.com
PostgreSQL version: 8.3.7
Operating system:   CentOS
Description:        Undetected deadlock situation
Details:

During a massive update of a table A, a single-row update of table A in
another session created an undetected deadlock situation, as evidenced by
the following locks (taken from pg_locks):

   locktype    | database | relation | page  | tuple | virtualxid |
transactionid | classid | objid | objsubid | virtualtransaction |  pid  |
   mode       | granted
 transactionid |          |          |       |       |            |
39773877 |         |       |          | 63/15761           | 11157 |
ShareLock        | f
 transactionid |          |          |       |       |            |
39773877 |         |       |          | 4/10902            |  6421 |
ExclusiveLock    | t


Only the deadlocked locks have been pasted, but I have saved the entire lock
list in case you need it.

It *might* be related to bug #3883 (
http://archives.postgresql.org/pgsql-hackers/2008-01/msg00873.php ) but it's
not clear. What makes me think so is that I've carefully set everything up
so that those updates make heavy use of HOT, and, if I'm not mistaken, HOT
does on-demand "vacuuming" of other dead HOT, so it *might* be calling
LockBufferForCleanup (I'm not familiar with the code though).

Re: BUG #5443: Undetected deadlock situation

From
Peter Eisentraut
Date:
On fre, 2010-04-30 at 03:51 +0000, Claudio wrote:
> The following bug has been logged online:
>
> Bug reference:      5443
> Logged by:          Claudio
> Email address:      claudio@livra.com
> PostgreSQL version: 8.3.7
> Operating system:   CentOS
> Description:        Undetected deadlock situation
> Details:
>
> During a massive update of a table A, a single-row update of table A in
> another session created an undetected deadlock situation, as evidenced by
> the following locks (taken from pg_locks):
>
>    locktype    | database | relation | page  | tuple | virtualxid |
> transactionid | classid | objid | objsubid | virtualtransaction |  pid  |
>    mode       | granted
>  transactionid |          |          |       |       |            |
> 39773877 |         |       |          | 63/15761           | 11157 |
> ShareLock        | f
>  transactionid |          |          |       |       |            |
> 39773877 |         |       |          | 4/10902            |  6421 |
> ExclusiveLock    | t
>

This is not a deadlock.  It's just two locks on the same table, one is
granted, and the other one is waiting.  There needs to be more
information to make sense of this (and maybe a bit more readable
formatting).

Re: BUG #5443: Undetected deadlock situation

From
"Kevin Grittner"
Date:
Peter Eisentraut <peter_e@gmx.net> wrote:

> This is not a deadlock.

Agreed.

> It's just two locks on the same table

Not as I read it.

> (and maybe a bit more readable formatting).

Eliminating null columns and mangling column headers for length, I
get this:

>>   locktype    |  tranid  | virtualx |  pid  |      mode     | gr
>> transactionid | 39773877 | 63/15761 | 11157 | ShareLock     | f
>> transactionid | 39773877 | 4/10902  |  6421 | ExclusiveLock | t

So it looks like two locks on the same transaction ID by different
transactions.  How does that happen?

-Kevin

Re: BUG #5443: Undetected deadlock situation

From
Claudio Freire
Date:
On Fri, 2010-04-30 at 08:47 -0500, Kevin Grittner wrote:

> Peter Eisentraut <peter_e@gmx.net> wrote:
>
> > This is not a deadlock.
>
> Agreed.
>
> > It's just two locks on the same table
>
> Not as I read it.
>
> > (and maybe a bit more readable formatting).
>
> Eliminating null columns and mangling column headers for length, I
> get this:
>
> >>   locktype    |  tranid  | virtualx |  pid  |      mode     | gr
> >> transactionid | 39773877 | 63/15761 | 11157 | ShareLock     | f
> >> transactionid | 39773877 | 4/10902  |  6421 | ExclusiveLock | t
>
> So it looks like two locks on the same transaction ID by different
> transactions.  How does that happen?


Exactly, I found that odd.

And too, you're right in that looking at the locks doesn't suggest a
deadlock.

But the reality of pid 6421 stopping making any progress whatsoever
makes me think about an unregistered lock - perhaps a mutex somewhere
that isn't tracked by pg_locks.

When it doesn't happen, the related queries take at most a few minutes
to update ~30k rows. When the deadlock happens, though, same number of
rows (same query in fact), it remained "working" for day and a half, no
progress, every other update on the same page blocked by that big
update. The server's load drops suggesting the big update isn't doing
much other than waiting.

If I manually cancel the small query, for instance, the big update
starts consuming resources again (suggesting progress), but quickly
deadlocks again (because my app retries the query I just cancelled). If
I manually cancel the big query, of course, everything goes back to
normal (after the little while it takes to roll back).

I can't reproduce it in a test environment because it's tied to
concurrency patterns that I couldn't reproduce yet. I will keep trying
though.

Re: BUG #5443: Undetected deadlock situation

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Eliminating null columns and mangling column headers for length, I
> get this:

> locktype    |  tranid  | virtualx |  pid  |      mode     | gr
> transactionid | 39773877 | 63/15761 | 11157 | ShareLock     | f
> transactionid | 39773877 | 4/10902  |  6421 | ExclusiveLock | t

> So it looks like two locks on the same transaction ID by different
> transactions.  How does that happen?

That's perfectly normal --- it indicates that pid 11157 is waiting for
a row-level lock that's currently held by pid 6421.  We translate
row-level locking delays into waits for XID locks in order to limit
the number of distinct locks that have to be remembered in the shared
lock table.  (We'd soon blow out shared memory if per-row lock data had
to be kept there.)

As Peter stated, there's no evidence of an actual problem in this
bug report.  I'd go looking for clients sitting with open
transactions...

            regards, tom lane

Re: BUG #5443: Undetected deadlock situation

From
"Kevin Grittner"
Date:
Claudio Freire <claudio@livra.com> wrote:

> I know it doesn't look like a deadlock from where you're standing,
> but it does so from where I am.

It's possible that you're filtering out the information which would
be most useful to diagnosing the problem.  Next time it happens,
please send the entire output from pg_locks and pg_stat_activity
(preferably as attachments to avoid wrapping issues), along with the
information listed under "Things you need to mention" on this page:

http://wiki.postgresql.org/wiki/Guide_to_reporting_problems

-Kevin

Re: BUG #5443: Undetected deadlock situation

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Claudio Freire <claudio@livra.com> wrote:
>> I know it doesn't look like a deadlock from where you're standing,
>> but it does so from where I am.

> It's possible that you're filtering out the information which would
> be most useful to diagnosing the problem.  Next time it happens,
> please send the entire output from pg_locks and pg_stat_activity
> (preferably as attachments to avoid wrapping issues), along with the
> information listed under "Things you need to mention" on this page:

An even better thing, if you can spare a few minutes for analysis
next time it happens, is to attach to each backend process with gdb
and get a stack trace.  If there is a deadlock that would probably
be enough information to understand what it is.

            regards, tom lane

Re: BUG #5443: Undetected deadlock situation

From
Claudio Freire
Date:
On Fri, 2010-04-30 at 11:50 -0400, Tom Lane wrote:

> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> > Eliminating null columns and mangling column headers for length, I
> > get this:
>
> > locktype    |  tranid  | virtualx |  pid  |      mode     | gr
> > transactionid | 39773877 | 63/15761 | 11157 | ShareLock     | f
> > transactionid | 39773877 | 4/10902  |  6421 | ExclusiveLock | t
>
> > So it looks like two locks on the same transaction ID by different
> > transactions.  How does that happen?

...

> As Peter stated, there's no evidence of an actual problem in this
> bug report.  I'd go looking for clients sitting with open
> transactions...
>
>             regards, tom lane


Well, I don't have the pg_stats_activity output at hand, but there was
no other long-running transaction at the time.

I just witnessed it happen again, I've had this problem (undetected
deadlocks) happen before, but not nearly as often. Could it be some form
of database corruption?
(I'll issue a REINDEX just in case, as soon as there's an open window).
Even if it is indeed database corruption this time, I've had it happen
before.

I guess I'll have to work on a script that reproduces this behavior.

Anyway, I'd suggest you don't dismiss this bug report so lightly. I know
it doesn't look like a deadlock from where you're standing, but it does
so from where I am. It behaves like one. If there's any diagnostic input
I can provide to help you track it down, I'll be glad to - just bear in
mind that this is hard for me to reproduce, and since I'm in a
production environment, recompiling and tinkering with the postgresql
cluster would be... undesirable.

Re: BUG #5443: Undetected deadlock situation

From
Claudio Freire
Date:
On Fri, 2010-04-30 at 11:14 -0500, Kevin Grittner wrote:
Claudio Freire <claudio@livra.com> wrote:
> I know it doesn't look like a deadlock from where you're standing,
> but it does so from where I am.
It's possible that you're filtering out the information which would
be most useful to diagnosing the problem.  Next time it happens,
please send the entire output from pg_locks and pg_stat_activity
(preferably as attachments to avoid wrapping issues), along with the
information listed under "Things you need to mention" on this page:

pg_stat_activity not, but pg_locks I did save:


Attachment

Re: BUG #5443: Undetected deadlock situation

From
Claudio Freire
Date:
On Fri, 2010-04-30 at 12:32 -0400, Tom Lane wrote:

> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> > Claudio Freire <claudio@livra.com> wrote:
> >> I know it doesn't look like a deadlock from where you're standing,
> >> but it does so from where I am.
>
> > It's possible that you're filtering out the information which would
> > be most useful to diagnosing the problem.  Next time it happens,
> > please send the entire output from pg_locks and pg_stat_activity
> > (preferably as attachments to avoid wrapping issues), along with the
> > information listed under "Things you need to mention" on this page:
>
> An even better thing, if you can spare a few minutes for analysis
> next time it happens, is to attach to each backend process with gdb
> and get a stack trace.  If there is a deadlock that would probably
> be enough information to understand what it is.


Interesting idea... I'll see if I have gdb (or can install it) on the
server.

Re: BUG #5443: Undetected deadlock situation

From
Tom Lane
Date:
Claudio Freire <claudio@livra.com> writes:
> What I did do is analyze server load during the events, and as I
> suspected, disk activity during the "deadlocks" seems to suggest a
> vacuuming taking place. Although there was no autovacuum entry in
> pg_stat_activity every time I checked, disk activity precisely matches
> the case when autovacuum decides to vacuum a big table.

[ shrug... ]  If autovacuum isn't shown in pg_stat_activity then it's
pretty hard to credit that there was an autovacuum going on.  Moreover,
if there *was* an undetected deadlock that autovacuum was somehow
involved in, then the autovacuum would be blocked too so it's hardly
possible that you'd miss seeing it in pg_stat_activity.

> That's about as much information I can give. We've worked around the
> issue successfully and it hasn't happened since. Even if it is not a
> proper deadlock, the performance drop is unacceptable. I've done massive
> updates before, and that performance drop was not expected (more than
> one day for updating 30k rows on a table with a couple indices).

I'm afraid we're not going to be able to do much with this report
if you can't provide a reproduction scenario.

            regards, tom lane

Re: BUG #5443: Undetected deadlock situation

From
Claudio Freire
Date:
On Fri, 2010-04-30 at 11:50 -0400, Tom Lane wrote:

> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> > Eliminating null columns and mangling column headers for length, I
> > get this:
>
> > locktype    |  tranid  | virtualx |  pid  |      mode     | gr
> > transactionid | 39773877 | 63/15761 | 11157 | ShareLock     | f
> > transactionid | 39773877 | 4/10902  |  6421 | ExclusiveLock | t
>
> > So it looks like two locks on the same transaction ID by different
> > transactions.  How does that happen?
>
> That's perfectly normal --- it indicates that pid 11157 is waiting for
> a row-level lock that's currently held by pid 6421.  We translate
> row-level locking delays into waits for XID locks in order to limit
> the number of distinct locks that have to be remembered in the shared
> lock table.  (We'd soon blow out shared memory if per-row lock data had
> to be kept there.)
>
> As Peter stated, there's no evidence of an actual problem in this
> bug report.  I'd go looking for clients sitting with open
> transactions...


It may be that indeed there isn't a deadlock, but an abnormal
performance drop.

I couldn't get a gdb trace before I just had to work around the issue
since it happens in a production server, and when it does it means
trouble for our app.

The workaround is to break the transaction into a SELECT for the rows to
be updated, followed by individual updates batched in transactions of a
couple hundred.

With that pattern, everything works as expected. The SELECT takes a
sweet time (but doesn't block any other transaction), and the updates go
pretty fast thanks to HOT.

I didn't try a SELECT FOR UPDATE followed by all the updates in a single
transaction yet, that would result in the same behavior as the massive
update, but perhaps without the "deadlock" (or performance drop).

What I did do is analyze server load during the events, and as I
suspected, disk activity during the "deadlocks" seems to suggest a
vacuuming taking place. Although there was no autovacuum entry in
pg_stat_activity every time I checked, disk activity precisely matches
the case when autovacuum decides to vacuum a big table.

That's about as much information I can give. We've worked around the
issue successfully and it hasn't happened since. Even if it is not a
proper deadlock, the performance drop is unacceptable. I've done massive
updates before, and that performance drop was not expected (more than
one day for updating 30k rows on a table with a couple indices).

Re: BUG #5443: Undetected deadlock situation

From
Claudio Freire
Date:
On Thu, 2010-06-03 at 13:42 -0400, Tom Lane wrote:

> Claudio Freire <claudio@livra.com> writes:
> > What I did do is analyze server load during the events, and as I
> > suspected, disk activity during the "deadlocks" seems to suggest a
> > vacuuming taking place. Although there was no autovacuum entry in
> > pg_stat_activity every time I checked, disk activity precisely matches
> > the case when autovacuum decides to vacuum a big table.
>
> [ shrug... ]  If autovacuum isn't shown in pg_stat_activity then it's
> pretty hard to credit that there was an autovacuum going on.  Moreover,
> if there *was* an undetected deadlock that autovacuum was somehow
> involved in, then the autovacuum would be blocked too so it's hardly
> possible that you'd miss seeing it in pg_stat_activity.


I know. However, I seem to recall that HOT did a sort-of vacuuming of
full pages, hoping to make space and not resort to regular updates. Now
that wouldn't show up in pg_stat_activity, would it?


> > That's about as much information I can give. We've worked around the
> > issue successfully and it hasn't happened since. Even if it is not a
> > proper deadlock, the performance drop is unacceptable. I've done massive
> > updates before, and that performance drop was not expected (more than
> > one day for updating 30k rows on a table with a couple indices).
>
> I'm afraid we're not going to be able to do much with this report
> if you can't provide a reproduction scenario.


Hold your horses there... I may be able to give you a stripped dump of
the tables involved in the query.

They have sensitive data, but I guess I could randomize the contents and
keep only the relevant distributions.

Re: BUG #5443: Undetected deadlock situation

From
"Kevin Grittner"
Date:
Claudio Freire <claudio@livra.com> wrote:
> On Thu, 2010-06-03 at 13:42 -0400, Tom Lane wrote:
>> Claudio Freire <claudio@livra.com> writes:
>> > What I did do is analyze server load during the events, and as
>> > I suspected, disk activity during the "deadlocks" seems to
>> > suggest a vacuuming taking place. Although there was no
>> > autovacuum entry in pg_stat_activity every time I checked, disk
>> > activity precisely matches the case when autovacuum decides to
>> > vacuum a big table.
>>
>> [ shrug... ]  If autovacuum isn't shown in pg_stat_activity then
>> it's pretty hard to credit that there was an autovacuum going
>> on.  Moreover, if there *was* an undetected deadlock that
>> autovacuum was somehow involved in, then the autovacuum would be
>> blocked too so it's hardly possible that you'd miss seeing it in
>> pg_stat_activity.
>
> I know. However, I seem to recall that HOT did a sort-of vacuuming
> of full pages, hoping to make space and not resort to regular
> updates. Now that wouldn't show up in pg_stat_activity, would it?

It would show up as part of whatever DML statement initiated the
cleanup.  It also wouldn't generally look a whole lot like a
table-level vacuum.  Other prime suspects would be checkpoint
activity (you might want to turn on checkpoint logging and see if
those match up to the events) and hint-bit rewrites (which can cause
disk writes the first time tuples are read after they are
committed).  Based on what I've heard so far, I wouldn't entirely
rule out some sort of bloat being a factor, either.

-Kevin

Re: BUG #5443: Undetected deadlock situation

From
Claudio Freire
Date:
On Thu, 2010-06-03 at 13:11 -0500, Kevin Grittner wrote:

> Based on what I've heard so far, I wouldn't entirely
>
> rule out some sort of bloat being a factor, either.
>
> -Kevin


Index/table bloat is the first thing I suspected, and both table and
indices look all the right size. Knowing how frequently updated those
tables are, I bet there's a healthy amount of bloat, but nothing I would
consider abnormal.