Philip Warner <pjw@rhyme.com.au> writes:
> At 11:04 PM 3/05/2004, Tom Lane wrote:
>> How confident are you in those "processes"? I don't know of any other
>> mechanism for 'tuple concurrently updated' failures in ANALYZE than
>> concurrent analyze runs ...
> Fairly. In this particular instance the error was probably caused bu a
> manually run VACUUM (part of me stressing it to encourage the error).
Yeah, I see a process 14295 in your dump that seems to be trying to
ANALYZE (at least, it's got write lock on pg_statistic...). 8631 is the
incumbent ANALYZE, and it's got locks all over the place :-(
I think what we have here is an evil side-effect of the change a couple
versions back to allow standalone ANALYZE to run as a single
transaction. A database-wide ANALYZE will therefore accumulate
AccessShareLock on each table it touches, and it won't release these
locks until commit. So the scenario goes like this:
1. Somewhere relatively early in its run, ANALYZE processes
pg_statistic. So it's now holding AccessShareLock on pg_statistic.
2. As the ANALYZE proceeds, it issues sinval messages due to the updates
it's making in pg_statistic. This is normal. There will be (at least)
one such message per column analyzed, and it sounds like your database
has many columns. Plus of course other catalog updates could be
occurring in other backends.
3. There is at least one other backend sitting idle and therefore not
reading sinval messages. So eventually the sinval queue gets 70% full
and the SIGUSR2 escape-hatch is triggered.
4. The idle backends (and eventually non-idle ones, too, whenever they
next reach the idle loop) try to do the NOTIFY thing, and get blocked
trying to acquire AccessExclusiveLock on pg_listener. They will now be
stuck until the ANALYZE completes.
As a quick-hack fix, I think it would do to reduce the locks taken on
pg_listener in async.c from AccessExclusiveLock to ExclusiveLock. This
would serve the purpose of serializing async.c processing without
creating a conflict against ANALYZE's AccessShareLock.
Some other things we ought to think about for the future:
* Is it really a good idea for database-wide ANALYZE to run as a single
transaction? Holding all those locks is a recipe for deadlocks, even
if they're as inoffensive as AccessShareLocks normally are.
* Can we use something less heavyweight than ProcessIncomingNotify
to deal with the sinval-clearing problem? Not only is that routine
expensive, but it is a serialization bottleneck, which is exactly what
we *don't* want in something that all the backends are getting told to
do at the same time. I think the original motivation for that hack was
because we didn't have a spare signal number available to dedicate to
sinval response, but SIGUSR1 has been free for a couple releases now.
I'm very tempted to commandeer it for sinval.
regards, tom lane