Re: Heap truncation without AccessExclusiveLock (9.4) - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Heap truncation without AccessExclusiveLock (9.4)
Date
Msg-id CA+TgmoZMWP5nTwkL1uMqTruhN4VTTtQTm89Vm53DaNS7O5wXEQ@mail.gmail.com
Whole thread Raw
In response to Heap truncation without AccessExclusiveLock (9.4)  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: Heap truncation without AccessExclusiveLock (9.4)  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Heap truncation without AccessExclusiveLock (9.4)  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers
On Wed, May 15, 2013 at 11:35 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Shared memory space is limited, but we only need the watermarks for any
> in-progress truncations. Let's keep them in shared memory, in a small
> fixed-size array. That limits the number of concurrent truncations that can
> be in-progress, but that should be ok.

Would it only limit the number of concurrent transactions that can be
in progress *due to vacuum*?  Or would it limit the total number of
TOTAL concurrent truncations?  Because a table could have arbitrarily
many inheritance children, and you might try to truncate the whole
thing at once...

> To not slow down common backend
> operations, the values (or lack thereof) are cached in relcache. To sync the
> relcache when the values change, there will be a new shared cache
> invalidation event to force backends to refresh the cached watermark values.
> A backend (vacuum) can ensure that all backends see the new value by first
> updating the value in shared memory, sending the sinval message, and waiting
> until everyone has received it.

AFAIK, the sinval mechanism isn't really well-designed to ensure that
these kinds of notifications arrive in a timely fashion.  There's no
particular bound on how long you might have to wait.  Pretty much all
inner loops have CHECK_FOR_INTERRUPTS(), but they definitely do not
all have AcceptInvalidationMessages(), nor would that be safe or
practical.  The sinval code sends catchup interrupts, but only for the
purpose of preventing sinval overflow, not for timely receipt.

Another problem is that sinval resets are bad for performance, and
anything we do that pushes more messages through sinval will increase
the frequency of resets.  Now if those are operations are things that
are relatively uncommon, it's not worth worrying about - but if it's
something that happens on every relation extension, I think that's
likely to cause problems.  That could leave to wrapping the sinval
queue around in a fraction of a second, leading to system-wide sinval
resets.  Ouch.

> With the watermarks, truncation works like this:
>
> 1. Set soft watermark to the point where we think we can truncate the
> relation. Wait until everyone sees it (send sinval message, wait).

I'm also concerned about how you plan to synchronize access to this
shared memory arena.  I thought about implementing a relation size
cache during the 9.2 cycle, to avoid the overhead of the approximately
1 gazillion lseek calls we do under e.g. a pgbench workload.  But the
thing is, at least on Linux, the system calls are pretty cheap, and on
kernels >= 3.2, they are lock-free.  On earlier kernels, there's a
spinlock acquire/release cycle for every lseek, and performance tanks
with >= 44 cores.  That spinlock is around a single memory fetch, so a
spinlock or lwlock around the entire array would presumably be a lot
worse.

It seems to me that under this system, everyone who would under
present circumstances invoke lseek() would have to first have to query
this shared memory area, and then if they miss (which is likely, since
most of the time there won't be a truncation in progress) they'll
still have to do the lseek.  So even if there's no contention problem,
there could still be a raw loss of performance.  I feel like I might
be missing a trick though; it seems like somehow we ought to be able
to cache the relation size for long periods of time when no extension
is in progress.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Josh Berkus
Date:
Subject: Re: pg_dump versus defaults on foreign tables
Next
From: Dev Kumkar
Date:
Subject: Re: "on existing update" construct