Thread: Heap truncation without AccessExclusiveLock (9.4)
Truncating a heap at the end of vacuum, to release unused space back to the OS, currently requires taking an AccessExclusiveLock. Although it's only held for a short duration, it can be enough to cause a hiccup in query processing while it's held. Also, if there is a continuous stream of queries on the table, autovacuum never succeeds in acquiring the lock, and thus the table never gets truncated.

I'd like to eliminate the need for AccessExclusiveLock while truncating.

Design
------

In shared memory, keep two watermarks: a "soft" truncation watermark, and a "hard" truncation watermark. If there is no truncation in progress, the values are not set and everything works like today.

The soft watermark is the relation size (ie. number of pages) that vacuum wants to truncate the relation to. Backends can read pages above the soft watermark normally, but should refrain from inserting new tuples there. However, it's OK to update a page above the soft watermark, including adding new tuples, if the page is not completely empty (vacuum will check and not truncate away non-empty pages). If a backend nevertheless has to insert a new tuple to an empty page above the soft watermark, for example if there is no more free space in any lower-numbered pages, it must grab the extension lock, and update the soft watermark while holding it.

The hard watermark is the point above which there are guaranteed to be no tuples. A backend must not try to read or write any pages above the hard watermark - it should be thought of as the end of file for all practical purposes. If a backend needs to write above the hard watermark, ie. to extend the relation, it must first grab the extension lock, and raise the hard watermark. The hard watermark is always >= the soft watermark.

Shared memory space is limited, but we only need the watermarks for any in-progress truncations. Let's keep them in shared memory, in a small fixed-size array. That limits the number of concurrent truncations that can be in-progress, but that should be ok. To not slow down common backend operations, the values (or lack thereof) are cached in relcache. To sync the relcache when the values change, there will be a new shared cache invalidation event to force backends to refresh the cached watermark values. A backend (vacuum) can ensure that all backends see the new value by first updating the value in shared memory, sending the sinval message, and waiting until everyone has received it.

With the watermarks, truncation works like this:

1. Set soft watermark to the point where we think we can truncate the relation. Wait until everyone sees it (send sinval message, wait).

2. Scan the pages to verify they are still empty.

3. Grab extension lock. Set hard watermark to current soft watermark (a backend might have inserted a tuple and raised the soft watermark while we were scanning). Release lock.

4. Wait until everyone sees the new hard watermark.

5. Grab extension lock.

6. Check (or wait) that there are no pinned buffers above the current hard watermark. (A backend might have a scan in progress that started before any of this, still holding a buffer pinned, even though it's empty.)

7. Truncate relation to the current hard watermark.

8. Release extension lock.

If a backend inserts a new tuple before step 2, the vacuum scan will see it. If it's inserted after step 2, the backend's cached soft watermark is already up-to-date, and thus the backend will update the soft watermark before the insert.
Thus after vacuum has finished the scan at step 2, all pages above the current soft watermark must still be empty.

Implementation details
----------------------

There are three kinds of access to a heap page:

A) As a target for a new tuple
B) Following an index pointer, ctid or similar
C) A sequential scan (and bitmap heap scan?)

To refrain from inserting new tuples to empty pages above the soft watermark (A), RelationGetBufferForTuple() is modified to check the soft watermark (and raise it if necessary).

An index scan (B) should never try to read beyond the hard watermark, because there are no tuples above it, and thus there should be no pointers to pages above it either.

A sequential scan (C) must refrain from reading beyond the hard watermark. This can be implemented by always checking the (cached) hard watermark value before stepping to the next page.

Truncation during hot standby is a lot simpler: set soft and hard watermarks to the truncation point, wait until everyone sees the new values, and truncate the relation.

Does anyone see a flaw in this?

- Heikki
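As a rough, concrete illustration of the shared-memory area this design calls for (a sketch only; the struct names, fields, and slot count below are invented, not taken from any patch):

    /* Hypothetical sketch only -- not from the proposal or the PostgreSQL tree. */
    #include "postgres.h"
    #include "storage/block.h"          /* BlockNumber */

    #define MAX_CONCURRENT_TRUNCATIONS 8    /* small fixed-size array */

    typedef struct TruncationSlot
    {
        Oid         relid;              /* relation being truncated, or InvalidOid */
        BlockNumber soft_watermark;     /* no new tuples on empty pages >= this */
        BlockNumber hard_watermark;     /* effective end of file; always >= soft */
    } TruncationSlot;

    typedef struct TruncationShmem
    {
        /* protected by an LWLock (or one per slot, see downthread) */
        TruncationSlot slots[MAX_CONCURRENT_TRUNCATIONS];
    } TruncationShmem;

Each backend's relcache entry would cache its relation's slot values (or the fact that there is no slot), so the array itself only needs to be consulted when a truncation is actually in progress.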
On Wed, May 15, 2013 at 11:35 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> Shared memory space is limited, but we only need the watermarks for any
> in-progress truncations. Let's keep them in shared memory, in a small
> fixed-size array. That limits the number of concurrent truncations that can
> be in-progress, but that should be ok.

Would it only limit the number of concurrent transactions that can be in progress *due to vacuum*? Or would it limit the TOTAL number of concurrent truncations? Because a table could have arbitrarily many inheritance children, and you might try to truncate the whole thing at once...

> To not slow down common backend
> operations, the values (or lack thereof) are cached in relcache. To sync the
> relcache when the values change, there will be a new shared cache
> invalidation event to force backends to refresh the cached watermark values.
> A backend (vacuum) can ensure that all backends see the new value by first
> updating the value in shared memory, sending the sinval message, and waiting
> until everyone has received it.

AFAIK, the sinval mechanism isn't really well-designed to ensure that these kinds of notifications arrive in a timely fashion. There's no particular bound on how long you might have to wait. Pretty much all inner loops have CHECK_FOR_INTERRUPTS(), but they definitely do not all have AcceptInvalidationMessages(), nor would that be safe or practical. The sinval code sends catchup interrupts, but only for the purpose of preventing sinval overflow, not for timely receipt.

Another problem is that sinval resets are bad for performance, and anything we do that pushes more messages through sinval will increase the frequency of resets. Now if those operations are relatively uncommon, it's not worth worrying about - but if it's something that happens on every relation extension, I think that's likely to cause problems. That could lead to wrapping the sinval queue around in a fraction of a second, leading to system-wide sinval resets. Ouch.

> With the watermarks, truncation works like this:
>
> 1. Set soft watermark to the point where we think we can truncate the
> relation. Wait until everyone sees it (send sinval message, wait).

I'm also concerned about how you plan to synchronize access to this shared memory arena. I thought about implementing a relation size cache during the 9.2 cycle, to avoid the overhead of the approximately 1 gazillion lseek calls we do under e.g. a pgbench workload. But the thing is, at least on Linux, the system calls are pretty cheap, and on kernels >= 3.2, they are lock-free. On earlier kernels, there's a spinlock acquire/release cycle for every lseek, and performance tanks with >= 44 cores. That spinlock is around a single memory fetch, so a spinlock or lwlock around the entire array would presumably be a lot worse.

It seems to me that under this system, everyone who would under present circumstances invoke lseek() would first have to query this shared memory area, and then if they miss (which is likely, since most of the time there won't be a truncation in progress) they'll still have to do the lseek. So even if there's no contention problem, there could still be a raw loss of performance. I feel like I might be missing a trick though; it seems like somehow we ought to be able to cache the relation size for long periods of time when no extension is in progress.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, May 15, 2013 at 11:35 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> To not slow down common backend
>> operations, the values (or lack thereof) are cached in relcache. To sync the
>> relcache when the values change, there will be a new shared cache
>> invalidation event to force backends to refresh the cached watermark values.

> AFAIK, the sinval mechanism isn't really well-designed to ensure that
> these kinds of notifications arrive in a timely fashion.

Yeah; currently it's only meant to guarantee that you see updates that were protected by obtaining a heavyweight lock with which your own lock request conflicts. It will *not* work for the usage Heikki proposes, at least not without sprinkling sinval queue checks into a lot of places where they aren't now. And as you say, the side-effects of that would be worrisome.

> Another problem is that sinval resets are bad for performance, and
> anything we do that pushes more messages through sinval will increase
> the frequency of resets.

I've been thinking that we should increase the size of the sinval ring; now that we're out from under SysV shmem size limits, it wouldn't be especially painful to do that. That's not terribly relevant to this issue though. I agree that we don't want an sinval message per relation extension, no matter what the ring size is.

			regards, tom lane
On Wed, May 15, 2013 at 7:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Another problem is that sinval resets are bad for performance, and
>> anything we do that pushes more messages through sinval will increase
>> the frequency of resets.
>
> I've been thinking that we should increase the size of the sinval ring;
> now that we're out from under SysV shmem size limits, it wouldn't be
> especially painful to do that. That's not terribly relevant to this
> issue though. I agree that we don't want an sinval message per relation
> extension, no matter what the ring size is.

I've been thinking for a while that we need some other system for managing other kinds of invalidations. For example, suppose we want to cache relation sizes in blocks. So we allocate 4kB of shared memory, interpreted as an array of 512 8-byte entries. Whenever you extend a relation, you hash the relfilenode and take the low-order 9 bits of the hash value as an index into the array. You increment that value either under a spinlock or perhaps using fetch-and-add where available.

On the read side, every backend can cache the length of as many relations as it wants. But before relying on a cached value, it must index into the shared array and see whether the value has been updated. On 64-bit systems, this requires no lock, only a barrier, and some 32-bit systems have special instructions that can be used for an 8-byte atomic read, and hence could avoid the lock as well. This would almost certainly be cheaper than doing an lseek every time, although maybe not by enough to matter. At least on Linux, the syscall seems to be pretty cheap.

Now, a problem with this is that we keep doing things that make it hard for people to run very low memory instances of PostgreSQL. So potentially whether or not we allocate space for this could be controlled by a GUC. Or maybe the structure could be made somewhat larger and shared among multiple caching needs.

I'm not sure whether this idea can be adapted to do what Heikki is after. But I think these kinds of techniques are worth thinking about as we look for ways to further improve performance.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
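For concreteness, here is a rough sketch of the counter array described above (all names are invented; locking is reduced to a single spinlock, and the memory-barrier concerns for lock-free readers are glossed over):

    #include "postgres.h"
    #include "access/hash.h"            /* hash_any() */
    #include "storage/relfilenode.h"    /* RelFileNode */
    #include "storage/spin.h"           /* slock_t, SpinLockAcquire() */

    #define REL_EXT_SLOTS 512           /* 512 * 8 bytes = 4kB */

    typedef struct RelExtCounters
    {
        slock_t     mutex;              /* or per-slot locks, or fetch-and-add */
        uint64      counters[REL_EXT_SLOTS];
    } RelExtCounters;

    static RelExtCounters *relExtCounters;  /* set up at shared-memory init */

    static inline int
    rel_ext_slot(RelFileNode *rnode)
    {
        /* low-order 9 bits of a hash of the relfilenode */
        return DatumGetUInt32(hash_any((unsigned char *) rnode, sizeof(*rnode)))
            & (REL_EXT_SLOTS - 1);
    }

    /* Called by whoever extends a relation, while holding the extension lock. */
    static void
    rel_ext_bump(RelFileNode *rnode)
    {
        int         slot = rel_ext_slot(rnode);

        SpinLockAcquire(&relExtCounters->mutex);
        relExtCounters->counters[slot]++;
        SpinLockRelease(&relExtCounters->mutex);
    }

    /*
     * A reader trusts the relation size cached in its relcache only if the
     * counter it saw when caching the size is unchanged; otherwise it falls
     * back to lseek and re-caches.
     */
    static bool
    rel_size_cache_still_valid(RelFileNode *rnode, uint64 counter_at_caching)
    {
        return relExtCounters->counters[rel_ext_slot(rnode)] == counter_at_caching;
    }

As noted later in the thread, this mirrors the fast-path locking trick: a hash collision only costs a spurious revalidation (an extra lseek), never a wrong answer.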
Robert Haas <robertmhaas@gmail.com> writes:
> I've been thinking for a while that we need some other system for
> managing other kinds of invalidations. For example, suppose we want
> to cache relation sizes in blocks. So we allocate 4kB of shared
> memory, interpreted as an array of 512 8-byte entries. Whenever you
> extend a relation, you hash the relfilenode and take the low-order 9
> bits of the hash value as an index into the array. You increment that
> value either under a spinlock or perhaps using fetch-and-add where
> available.

I'm not sure I believe the details of that.

1. 4 bytes is not enough to store the exact identity of the table that the cache entry belongs to, so how do you disambiguate?

2. If you don't find an entry for your target rel in the cache, aren't you still going to have to do an lseek?

Having said that, the idea of specialized caches in shared memory seems plenty reasonable to me. One thing that's bothering me about Heikki's proposal is that it's not clear that it's a *cache*; that is, I don't see the fallback logic to use when there's no entry for a relation for lack of room.

			regards, tom lane
On 2013-05-15 18:35:35 +0300, Heikki Linnakangas wrote:
> Truncating a heap at the end of vacuum, to release unused space back to
> the OS, currently requires taking an AccessExclusiveLock. Although it's only
> held for a short duration, it can be enough to cause a hiccup in query
> processing while it's held. Also, if there is a continuous stream of queries
> on the table, autovacuum never succeeds in acquiring the lock, and thus the
> table never gets truncated.
>
> I'd like to eliminate the need for AccessExclusiveLock while truncating.

Couldn't we "just" take the extension lock and then walk backwards from the rechecked end of the relation, trying ConditionalLockBufferForCleanup() on the buffers? For every such locked page we check whether it's still empty. If we find a page that we couldn't lock or that isn't empty, or we have already locked a sufficient number of pages, we truncate.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, May 15, 2013 at 8:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I've been thinking for a while that we need some other system for
>> managing other kinds of invalidations. For example, suppose we want
>> to cache relation sizes in blocks. So we allocate 4kB of shared
>> memory, interpreted as an array of 512 8-byte entries. Whenever you
>> extend a relation, you hash the relfilenode and take the low-order 9
>> bits of the hash value as an index into the array. You increment that
>> value either under a spinlock or perhaps using fetch-and-add where
>> available.
>
> I'm not sure I believe the details of that.
>
> 1. 4 bytes is not enough to store the exact identity of the table that
> the cache entry belongs to, so how do you disambiguate?

You don't. The idea is that it's inexact. When a relation is extended, every backend is forced to recheck the length of every relation whose relfilenode hashes to the same array slot as the one that was actually extended. So if you happen to be repeatedly scanning relation A, and somebody else is repeatedly scanning relation B, you'll *probably* not have to invalidate anything. But if A and B happen to hash to the same slot, then you'll keep getting bogus invalidations. Fortunately, that isn't very expensive.

The fast-path locking code uses a similar trick to detect conflicting strong locks, and it works quite well. In that case, as here, you can reduce the collision probability as much as you like by increasing the number of slots, at the cost of increased shared memory usage.

> 2. If you don't find an entry for your target rel in the cache, aren't
> you still going to have to do an lseek?

Don't think of it as a cache. The caching happens inside each backend's relcache; the shared memory structure is just a tool to force those caches to be revalidated when necessary.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
>> 2. If you don't find an entry for your target rel in the cache, aren't
>> you still going to have to do an lseek?

> Don't think of it as a cache. The caching happens inside each
> backend's relcache; the shared memory structure is just a tool to
> force those caches to be revalidated when necessary.

Hmm. Now I see: it's not a cache, it's a Bloom filter. The failure mode I was thinking of is inapplicable, but there's a different one: you have to be absolutely positive that *any* operation that extends the file will update the relevant filter entry. Still, I guess that we're already assuming that any such op will take the relation's extension lock, so it should be easy enough to find all the places to fix.

			regards, tom lane
On Thu, May 16, 2013 at 1:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>>> 2. If you don't find an entry for your target rel in the cache, aren't
>>> you still going to have to do an lseek?
>
>> Don't think of it as a cache. The caching happens inside each
>> backend's relcache; the shared memory structure is just a tool to
>> force those caches to be revalidated when necessary.
>
> Hmm. Now I see: it's not a cache, it's a Bloom filter.

Yes.

> The failure
> mode I was thinking of is inapplicable, but there's a different one:
> you have to be absolutely positive that *any* operation that extends the
> file will update the relevant filter entry. Still, I guess that we're
> already assuming that any such op will take the relation's extension
> lock, so it should be easy enough to find all the places to fix.

I would think so. The main thing that's held me back from actually implementing this is the fact that lseek is so darn cheap on Linux, and I don't have reliable data one way or the other for any other platform.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 16.05.2013 00:18, Robert Haas wrote:
> On Wed, May 15, 2013 at 11:35 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> Shared memory space is limited, but we only need the watermarks for any
>> in-progress truncations. Let's keep them in shared memory, in a small
>> fixed-size array. That limits the number of concurrent truncations that can
>> be in-progress, but that should be ok.
>
> Would it only limit the number of concurrent transactions that can be
> in progress *due to vacuum*? Or would it limit the TOTAL number of
> concurrent truncations? Because a table could have arbitrarily
> many inheritance children, and you might try to truncate the whole
> thing at once...

It would only limit the number of concurrent *truncations*. Vacuums in general would not count, only vacuums that have reached the end of the vacuum process and are trying to truncate the heap.

>> To not slow down common backend
>> operations, the values (or lack thereof) are cached in relcache. To sync the
>> relcache when the values change, there will be a new shared cache
>> invalidation event to force backends to refresh the cached watermark values.
>> A backend (vacuum) can ensure that all backends see the new value by first
>> updating the value in shared memory, sending the sinval message, and waiting
>> until everyone has received it.
>
> AFAIK, the sinval mechanism isn't really well-designed to ensure that
> these kinds of notifications arrive in a timely fashion. There's no
> particular bound on how long you might have to wait. Pretty much all
> inner loops have CHECK_FOR_INTERRUPTS(), but they definitely do not
> all have AcceptInvalidationMessages(), nor would that be safe or
> practical. The sinval code sends catchup interrupts, but only for the
> purpose of preventing sinval overflow, not for timely receipt.

Currently, vacuum will have to wait for all transactions that have touched the relation to finish, to get the AccessExclusiveLock. If we don't change anything in the sinval mechanism, the wait would be similar - until all currently in-progress transactions have finished. It's not quite the same; you'd have to wait for all in-progress transactions to finish, not only those that have actually touched the relation. But on the plus side, you would not block new transactions from accessing the relation, so it's not too bad if it takes a long time.

If we could use the catchup interrupts to speed that up though, that would be much better. I think vacuum could simply send a catchup interrupt, and wait until everyone has caught up. That would significantly increase the traffic on the sinval queue and the number of catchup interrupts, compared to what it is today, but I think it would still be ok. It would still only be a few sinval messages and catchup interrupts per truncation (ie. per vacuum).

> Another problem is that sinval resets are bad for performance, and
> anything we do that pushes more messages through sinval will increase
> the frequency of resets. Now if those operations are relatively
> uncommon, it's not worth worrying about - but if it's something that
> happens on every relation extension, I think that's likely to cause
> problems.

It would not be on every relation extension, only on truncation.

>> With the watermarks, truncation works like this:
>>
>> 1. Set soft watermark to the point where we think we can truncate the
>> relation. Wait until everyone sees it (send sinval message, wait).
>
> I'm also concerned about how you plan to synchronize access to this
> shared memory arena.
I was thinking of a simple lwlock, or perhaps one lwlock per slot in the arena. It would not be accessed very frequently, because the watermark values would be cached in the relcache. It would only need to be accessed when:

1. Truncating the relation, by vacuum, to set the watermark values.

2. By backends, to update the relcache, when they receive the sinval message sent by vacuum.

3. By backends, when writing above the cached watermark value. IOW, when extending a relation that's being truncated at the same time.

In particular, it would definitely not be accessed every time a backend currently needs to do an lseek. Nor every time a backend needs to extend a relation.

- Heikki
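To make access point 3 above concrete, a sketch of what the insertion path might do (the relcache field and the helper below are invented names, not existing PostgreSQL code):

    /*
     * Hypothetical fragment inside RelationGetBufferForTuple(): before
     * placing a new tuple on an empty page at or above the cached soft
     * watermark, raise the watermark under the extension lock.
     */
    if (relation->rd_cached_soft_watermark != InvalidBlockNumber &&
        targetBlock >= relation->rd_cached_soft_watermark)
    {
        LockRelationForExtension(relation, ExclusiveLock);
        /* re-read the shared slot and move the soft watermark past our page */
        raise_soft_watermark(relation, targetBlock + 1);    /* hypothetical helper */
        UnlockRelationForExtension(relation, ExclusiveLock);
    }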
On 16.05.2013 04:15, Andres Freund wrote:
> On 2013-05-15 18:35:35 +0300, Heikki Linnakangas wrote:
>> Truncating a heap at the end of vacuum, to release unused space back to
>> the OS, currently requires taking an AccessExclusiveLock. Although it's only
>> held for a short duration, it can be enough to cause a hiccup in query
>> processing while it's held. Also, if there is a continuous stream of queries
>> on the table, autovacuum never succeeds in acquiring the lock, and thus the
>> table never gets truncated.
>>
>> I'd like to eliminate the need for AccessExclusiveLock while truncating.
>
> Couldn't we "just" take the extension lock and then walk backwards from
> the rechecked end of the relation, trying ConditionalLockBufferForCleanup()
> on the buffers? For every such locked page we check whether it's still empty.
> If we find a page that we couldn't lock or that isn't empty, or we have
> already locked a sufficient number of pages, we truncate.

You need an AccessExclusiveLock on the relation to make sure that after you have checked that pages 10-15 are empty, and truncated them away, a backend doesn't come along a few seconds later and try to read page 10 again. There might be an old sequential scan in progress, for example, that thinks that the pages are still there.

- Heikki
On 2013-05-17 10:45:26 +0300, Heikki Linnakangas wrote:
> On 16.05.2013 04:15, Andres Freund wrote:
>> On 2013-05-15 18:35:35 +0300, Heikki Linnakangas wrote:
>>> Truncating a heap at the end of vacuum, to release unused space back to
>>> the OS, currently requires taking an AccessExclusiveLock. Although it's only
>>> held for a short duration, it can be enough to cause a hiccup in query
>>> processing while it's held. Also, if there is a continuous stream of queries
>>> on the table, autovacuum never succeeds in acquiring the lock, and thus the
>>> table never gets truncated.
>>>
>>> I'd like to eliminate the need for AccessExclusiveLock while truncating.
>>
>> Couldn't we "just" take the extension lock and then walk backwards from
>> the rechecked end of the relation, trying ConditionalLockBufferForCleanup()
>> on the buffers? For every such locked page we check whether it's still empty.
>> If we find a page that we couldn't lock or that isn't empty, or we have
>> already locked a sufficient number of pages, we truncate.
>
> You need an AccessExclusiveLock on the relation to make sure that after you
> have checked that pages 10-15 are empty, and truncated them away, a backend
> doesn't come along a few seconds later and try to read page 10 again. There
> might be an old sequential scan in progress, for example, that thinks that
> the pages are still there.

But that seems easily enough handled: We know the current page in its scan cannot be removed since it's pinned. So make heapgettup()/heapgetpage() pass something like RBM_IFEXISTS to ReadBuffer, and if the read fails, recheck the length of the relation before throwing an error.

There isn't much besides seqscans that can have that behaviour afaics:

- (bitmap) indexscans et al. won't point to completely empty pages
- there cannot be a concurrent vacuum since we have the appropriate locks
- if a trigger or something else has a tid referencing a page, there must be unremovable tuples on it

The only other case that I immediately see is tidscans, which should be handleable in a similar manner to seqscans.

Sure, there are some callsites that need to be adapted but it still seems noticeably easier than what you proposed upthread.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
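A rough sketch of what that could look like in heapgetpage() (RBM_IFEXISTS is the hypothetical read mode mentioned above, not an existing ReadBufferMode):

    /* Hypothetical fragment inside heapgetpage(scan, page). */
    Buffer      buffer;

    buffer = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM, page,
                                RBM_IFEXISTS,       /* hypothetical mode */
                                scan->rs_strategy);
    if (!BufferIsValid(buffer))
    {
        /* The page no longer exists; maybe the relation was truncated. */
        scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
        if (page >= scan->rs_nblocks)
            return;                     /* treat it as end of scan */
        elog(ERROR, "could not read block %u of relation \"%s\"",
             page, RelationGetRelationName(scan->rs_rd));
    }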
On Fri, May 17, 2013 at 3:38 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> If we could use the catchup interrupts to speed that up though, that would
> be much better. I think vacuum could simply send a catchup interrupt, and
> wait until everyone has caught up. That would significantly increase the
> traffic on the sinval queue and the number of catchup interrupts, compared
> to what it is today, but I think it would still be ok. It would still only
> be a few sinval messages and catchup interrupts per truncation (ie. per
> vacuum).

Hmm. So your proposal is to only send these sinval messages while a truncation is in progress, not any time the relation is extended? That would certainly be far more appealing from the point of view of not blowing out sinval.

It shouldn't be difficult to restrict the set of backends that have to be signaled to those that have the relation open. You could have a special kind of catchup signal that means "catch yourself up, but don't chain" - and send that only to those backends returned by GetConflictingVirtualXIDs.

It might be better to disconnect this mechanism from sinval entirely. In other words, just stick a version number on your shared memory data structure and have everyone advertise the last version number they've seen via PGPROC. The sinval message doesn't really seem necessary; it's basically just a no-op message to say "reread shared memory", and a plain old signal can carry that same message more efficiently.

One other thought regarding point 6 from your original proposal. If it's true that a scan could hold a pin on a buffer above the current hard watermark, which I think it is, then that means there's a scan in progress which is going to try to fetch pages beyond that also, up to whatever the end-of-file position was when the scan started. I suppose that heap scans will need to be taught to check some backend-local flag before fetching each new block, to see if a hard watermark change might have intervened. Is that what you have in mind?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
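A sketch of the sinval-free variant suggested above (truncShmem, the PGPROC field it implies, and both helper functions are made-up names):

    /*
     * Vacuum side, after updating the watermarks in shared memory: bump a
     * version counter, then wait until every backend has advertised -- via
     * its PGPROC entry -- that it has re-read the shared area.
     */
    uint32      newversion = ++truncShmem->version;

    while (!AllBackendsHaveSeenTruncationVersion(newversion))
    {
        SendTruncationCatchupSignals();     /* plain signal: "reread shared memory" */
        pg_usleep(10000L);                  /* 10 ms */
    }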
On 20.05.2013 16:59, Robert Haas wrote:
> On Fri, May 17, 2013 at 3:38 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> If we could use the catchup interrupts to speed that up though, that would
>> be much better. I think vacuum could simply send a catchup interrupt, and
>> wait until everyone has caught up. That would significantly increase the
>> traffic on the sinval queue and the number of catchup interrupts, compared
>> to what it is today, but I think it would still be ok. It would still only
>> be a few sinval messages and catchup interrupts per truncation (ie. per
>> vacuum).
>
> Hmm. So your proposal is to only send these sinval messages while a
> truncation is in progress, not any time the relation is extended?
> That would certainly be far more appealing from the point of view of
> not blowing out sinval.

Right.

> It shouldn't be difficult to restrict the set of backends that have to
> be signaled to those that have the relation open. You could have a
> special kind of catchup signal that means "catch yourself up, but
> don't chain"

What does "chain" mean above?

> - and send that only to those backends returned by
> GetConflictingVirtualXIDs.

Yeah, that might be a worthwhile optimization.

> It might be better to disconnect this mechanism from sinval entirely.
> In other words, just stick a version number on your shared memory
> data structure and have everyone advertise the last version number
> they've seen via PGPROC. The sinval message doesn't really seem
> necessary; it's basically just a no-op message to say "reread shared
> memory", and a plain old signal can carry that same message more
> efficiently.

Hmm. The sinval message makes sure that when a backend locks a relation, it will see the latest value, because of the AcceptInvalidationMessages call in LockRelation. If there is no sinval message, you'd need to always check the shared memory area when you lock a relation.

> One other thought regarding point 6 from your original proposal. If
> it's true that a scan could hold a pin on a buffer above the current
> hard watermark, which I think it is, then that means there's a scan in
> progress which is going to try to fetch pages beyond that also, up to
> whatever the end-of-file position was when the scan started. I
> suppose that heap scans will need to be taught to check some
> backend-local flag before fetching each new block, to see if a hard
> watermark change might have intervened. Is that what you have in
> mind?

Yes, a heap scan will need to check the locally cached hard watermark value every time it steps to a new page.

- Heikki
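A sketch of that per-page check in the seqscan page-stepping code (rd_cached_hard_watermark is a hypothetical relcache field, not an existing one):

    /* Hypothetical fragment in heapgettup(), when advancing to the next page. */
    page = scan->rs_cblock + 1;             /* next page of the forward scan */
    if (scan->rs_rd->rd_cached_hard_watermark != InvalidBlockNumber &&
        page >= scan->rs_rd->rd_cached_hard_watermark)
    {
        /* Everything at and above the hard watermark is guaranteed empty. */
        scan->rs_inited = false;
        tuple->t_data = NULL;               /* end the scan, as heapgettup does */
        return;
    }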
On 17.05.2013 12:35, Andres Freund wrote:
> On 2013-05-17 10:45:26 +0300, Heikki Linnakangas wrote:
>> On 16.05.2013 04:15, Andres Freund wrote:
>>> Couldn't we "just" take the extension lock and then walk backwards from
>>> the rechecked end of the relation, trying ConditionalLockBufferForCleanup()
>>> on the buffers? For every such locked page we check whether it's still empty.
>>> If we find a page that we couldn't lock or that isn't empty, or we have
>>> already locked a sufficient number of pages, we truncate.
>>
>> You need an AccessExclusiveLock on the relation to make sure that after you
>> have checked that pages 10-15 are empty, and truncated them away, a backend
>> doesn't come along a few seconds later and try to read page 10 again. There
>> might be an old sequential scan in progress, for example, that thinks that
>> the pages are still there.
>
> But that seems easily enough handled: We know the current page in its scan
> cannot be removed since it's pinned. So make heapgettup()/heapgetpage() pass
> something like RBM_IFEXISTS to ReadBuffer, and if the read fails, recheck the
> length of the relation before throwing an error.

Hmm. For the above to work, you'd need to atomically check that the pages you're truncating away are not pinned, and truncate them. If those steps are not atomic, a backend might pin a page after you've checked that it's not pinned, but before you've truncated the underlying file. I guess that would be doable; it needs some new infrastructure in the buffer manager, however.

> There isn't much besides seqscans that can have that behaviour afaics:
>
> - (bitmap) indexscans et al. won't point to completely empty pages
> - there cannot be a concurrent vacuum since we have the appropriate locks
> - if a trigger or something else has a tid referencing a page, there must
>   be unremovable tuples on it
>
> The only other case that I immediately see is tidscans, which should be
> handleable in a similar manner to seqscans.
>
> Sure, there are some callsites that need to be adapted but it still seems
> noticeably easier than what you proposed upthread.

Yeah. I'll think some more about how the required buffer manager changes could be done.

- Heikki
On Mon, May 20, 2013 at 10:19 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>> It shouldn't be difficult to restrict the set of backends that have to
>> be signaled to those that have the relation open. You could have a
>> special kind of catchup signal that means "catch yourself up, but
>> don't chain"
>
> What does "chain" mean above?

Normally, when sinval catchup is needed, we signal the backend that is furthest behind. After catching up, it signals the backend that is next-furthest behind, which in turn catches up and signals the next laggard, and so forth.

> Hmm. The sinval message makes sure that when a backend locks a relation, it
> will see the latest value, because of the AcceptInvalidationMessages call in
> LockRelation. If there is no sinval message, you'd need to always check the
> shared memory area when you lock a relation.

The latest value of what?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 15 May 2013 16:35, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> Shared memory space is limited, but we only need the watermarks for any
> in-progress truncations. Let's keep them in shared memory, in a small
> fixed-size array. That limits the number of concurrent truncations that can
> be in-progress, but that should be ok. To not slow down common backend
> operations, the values (or lack thereof) are cached in relcache. To sync the
> relcache when the values change, there will be a new shared cache
> invalidation event to force backends to refresh the cached watermark values.
> A backend (vacuum) can ensure that all backends see the new value by first
> updating the value in shared memory, sending the sinval message, and waiting
> until everyone has received it.

I think we could use a similar scheme for 2 other use cases.

1. Unlogged tables. It would be useful to have a persistent "safe high watermark" for an unlogged table, so that in the event of a crash we truncate back to the safe high watermark rather than truncating the whole table. That watermark would get updated at each checkpoint. Unlogged tables would get much more useful with that change. (Issues with indexes would need to be resolved also.)

2. Table extension during COPY operations is difficult. We need to be able to extend in larger chunks, so we would need to change the algorithm for how extension works. I'm thinking there's a relationship there with watermarks.

Can we look at the needs of multiple areas at once, so we come up with a more useful design that covers more than just one use case, please?

--
Simon Riggs                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services