Thread: Heap truncation without AccessExclusiveLock (9.4)
Truncating a heap at the end of vacuum, to release unused space back to the OS, currently requires taking an AccessExclusiveLock. Although it's only held for a short duration, it can be enough to cause a hiccup in query processing while it's held. Also, if there is a continuous stream of queries on the table, autovacuum never succeeds in acquiring the lock, and thus the table never gets truncated.

I'd like to eliminate the need for AccessExclusiveLock while truncating.

Design
------

In shared memory, keep two watermarks: a "soft" truncation watermark, and a "hard" truncation watermark. If there is no truncation in progress, the values are not set and everything works like today.

The soft watermark is the relation size (ie. number of pages) that vacuum wants to truncate the relation to. Backends can read pages above the soft watermark normally, but should refrain from inserting new tuples there. However, it's OK to update a page above the soft watermark, including adding new tuples, if the page is not completely empty (vacuum will check and not truncate away non-empty pages). If a backend nevertheless has to insert a new tuple to an empty page above the soft watermark, for example if there is no more free space in any lower-numbered pages, it must grab the extension lock, and update the soft watermark while holding it.

The hard watermark is the point above which there are guaranteed to be no tuples. A backend must not try to read or write any pages above the hard watermark - it should be thought of as the end of file for all practical purposes. If a backend needs to write above the hard watermark, ie. to extend the relation, it must first grab the extension lock, and raise the hard watermark. The hard watermark is always >= the soft watermark.

Shared memory space is limited, but we only need the watermarks for any in-progress truncations. Let's keep them in shared memory, in a small fixed-size array. That limits the number of concurrent truncations that can be in-progress, but that should be ok. To not slow down common backend operations, the values (or lack thereof) are cached in relcache. To sync the relcache when the values change, there will be a new shared cache invalidation event to force backends to refresh the cached watermark values. A backend (vacuum) can ensure that all backends see the new value by first updating the value in shared memory, sending the sinval message, and waiting until everyone has received it.

With the watermarks, truncation works like this:

1. Set soft watermark to the point where we think we can truncate the relation. Wait until everyone sees it (send sinval message, wait).

2. Scan the pages to verify they are still empty.

3. Grab extension lock. Set hard watermark to current soft watermark (a backend might have inserted a tuple and raised the soft watermark while we were scanning). Release lock.

4. Wait until everyone sees the new hard watermark.

5. Grab extension lock.

6. Check (or wait) that there are no pinned buffers above the current hard watermark. (A backend might have a scan in progress that started before any of this, still holding a buffer pinned, even though it's empty.)

7. Truncate relation to the current hard watermark.

8. Release extension lock.

If a backend inserts a new tuple before step 2, the vacuum scan will see it. If it's inserted after step 2, the backend's cached soft watermark is already up-to-date, and thus the backend will update the soft watermark before the insert.
Thus after vacuum has finished the scan at step 2, all pages above the current soft watermark must still be empty.

Implementation details
----------------------

There are three kinds of access to a heap page:

A) As a target for a new tuple
B) Following an index pointer, ctid or similar
C) A sequential scan (and bitmap heap scan?)

To refrain from inserting new tuples to empty pages above the soft watermark (A), RelationGetBufferForTuple() is modified to check the soft watermark (and raise it if necessary).

An index scan (B) should never try to read beyond the hard watermark, because there are no tuples above it, and thus there should be no pointers to pages above it either.

A sequential scan (C) must refrain from reading beyond the hard watermark. This can be implemented by always checking the (cached) hard watermark value before stepping to the next page.

Truncation during hot standby is a lot simpler: set soft and hard watermarks to the truncation point, wait until everyone sees the new values, and truncate the relation.

Does anyone see a flaw in this?

- Heikki
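As a rough, concrete illustration of the shared-memory area this design calls for (a sketch only; the struct names, fields, and slot count below are invented, not taken from any patch):

    /* Hypothetical sketch only -- not from the proposal or the PostgreSQL tree. */
    #include "postgres.h"
    #include "storage/block.h"          /* BlockNumber */

    #define MAX_CONCURRENT_TRUNCATIONS 8    /* small fixed-size array */

    typedef struct TruncationSlot
    {
        Oid         relid;              /* relation being truncated, or InvalidOid */
        BlockNumber soft_watermark;     /* no new tuples on empty pages >= this */
        BlockNumber hard_watermark;     /* effective end of file; always >= soft */
    } TruncationSlot;

    typedef struct TruncationShmem
    {
        /* protected by an LWLock (or one per slot, see downthread) */
        TruncationSlot slots[MAX_CONCURRENT_TRUNCATIONS];
    } TruncationShmem;

Each backend's relcache entry would cache its relation's slot values (or the fact that there is no slot), so the array itself only needs to be consulted when a truncation is actually in progress.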
On Wed, May 15, 2013 at 11:35 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> Shared memory space is limited, but we only need the watermarks for any
> in-progress truncations. Let's keep them in shared memory, in a small
> fixed-size array. That limits the number of concurrent truncations that can
> be in-progress, but that should be ok.

Would it only limit the number of concurrent transactions that can be in progress *due to vacuum*? Or would it limit the TOTAL number of concurrent truncations? Because a table could have arbitrarily many inheritance children, and you might try to truncate the whole thing at once...

> To not slow down common backend
> operations, the values (or lack thereof) are cached in relcache. To sync the
> relcache when the values change, there will be a new shared cache
> invalidation event to force backends to refresh the cached watermark values.
> A backend (vacuum) can ensure that all backends see the new value by first
> updating the value in shared memory, sending the sinval message, and waiting
> until everyone has received it.

AFAIK, the sinval mechanism isn't really well-designed to ensure that these kinds of notifications arrive in a timely fashion. There's no particular bound on how long you might have to wait. Pretty much all inner loops have CHECK_FOR_INTERRUPTS(), but they definitely do not all have AcceptInvalidationMessages(), nor would that be safe or practical. The sinval code sends catchup interrupts, but only for the purpose of preventing sinval overflow, not for timely receipt.

Another problem is that sinval resets are bad for performance, and anything we do that pushes more messages through sinval will increase the frequency of resets. Now if those operations are relatively uncommon, it's not worth worrying about - but if it's something that happens on every relation extension, I think that's likely to cause problems. That could lead to wrapping the sinval queue around in a fraction of a second, leading to system-wide sinval resets. Ouch.

> With the watermarks, truncation works like this:
>
> 1. Set soft watermark to the point where we think we can truncate the
> relation. Wait until everyone sees it (send sinval message, wait).

I'm also concerned about how you plan to synchronize access to this shared memory arena. I thought about implementing a relation size cache during the 9.2 cycle, to avoid the overhead of the approximately 1 gazillion lseek calls we do under e.g. a pgbench workload. But the thing is, at least on Linux, the system calls are pretty cheap, and on kernels >= 3.2, they are lock-free. On earlier kernels, there's a spinlock acquire/release cycle for every lseek, and performance tanks with >= 44 cores. That spinlock is around a single memory fetch, so a spinlock or lwlock around the entire array would presumably be a lot worse.

It seems to me that under this system, everyone who would under present circumstances invoke lseek() would first have to query this shared memory area, and then if they miss (which is likely, since most of the time there won't be a truncation in progress) they'll still have to do the lseek. So even if there's no contention problem, there could still be a raw loss of performance. I feel like I might be missing a trick though; it seems like somehow we ought to be able to cache the relation size for long periods of time when no extension is in progress.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, May 15, 2013 at 11:35 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> To not slow down common backend
>> operations, the values (or lack thereof) are cached in relcache. To sync the
>> relcache when the values change, there will be a new shared cache
>> invalidation event to force backends to refresh the cached watermark values.

> AFAIK, the sinval mechanism isn't really well-designed to ensure that
> these kinds of notifications arrive in a timely fashion.

Yeah; currently it's only meant to guarantee that you see updates that were protected by obtaining a heavyweight lock with which your own lock request conflicts. It will *not* work for the usage Heikki proposes, at least not without sprinkling sinval queue checks into a lot of places where they aren't now. And as you say, the side-effects of that would be worrisome.

> Another problem is that sinval resets are bad for performance, and
> anything we do that pushes more messages through sinval will increase
> the frequency of resets.

I've been thinking that we should increase the size of the sinval ring; now that we're out from under SysV shmem size limits, it wouldn't be especially painful to do that. That's not terribly relevant to this issue though. I agree that we don't want an sinval message per relation extension, no matter what the ring size is.

			regards, tom lane
On Wed, May 15, 2013 at 7:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Another problem is that sinval resets are bad for performance, and
>> anything we do that pushes more messages through sinval will increase
>> the frequency of resets.
>
> I've been thinking that we should increase the size of the sinval ring;
> now that we're out from under SysV shmem size limits, it wouldn't be
> especially painful to do that. That's not terribly relevant to this
> issue though. I agree that we don't want an sinval message per relation
> extension, no matter what the ring size is.

I've been thinking for a while that we need some other system for managing other kinds of invalidations. For example, suppose we want to cache relation sizes in blocks. So we allocate 4kB of shared memory, interpreted as an array of 512 8-byte entries. Whenever you extend a relation, you hash the relfilenode and take the low-order 9 bits of the hash value as an index into the array. You increment that value either under a spinlock or perhaps using fetch-and-add where available.

On the read side, every backend can cache the length of as many relations as it wants. But before relying on a cached value, it must index into the shared array and see whether the value has been updated. On 64-bit systems, this requires no lock, only a barrier, and some 32-bit systems have special instructions that can be used for an 8-byte atomic read, and hence could avoid the lock as well. This would almost certainly be cheaper than doing an lseek every time, although maybe not by enough to matter. At least on Linux, the syscall seems to be pretty cheap.

Now, a problem with this is that we keep doing things that make it hard for people to run very low memory instances of PostgreSQL. So potentially whether or not we allocate space for this could be controlled by a GUC. Or maybe the structure could be made somewhat larger and shared among multiple caching needs.

I'm not sure whether this idea can be adapted to do what Heikki is after. But I think these kinds of techniques are worth thinking about as we look for ways to further improve performance.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
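For concreteness, here is a rough sketch of the counter array described above (all names are invented; locking is reduced to a single spinlock, and the memory-barrier concerns for lock-free readers are glossed over):

    #include "postgres.h"
    #include "access/hash.h"            /* hash_any() */
    #include "storage/relfilenode.h"    /* RelFileNode */
    #include "storage/spin.h"           /* slock_t, SpinLockAcquire() */

    #define REL_EXT_SLOTS 512           /* 512 * 8 bytes = 4kB */

    typedef struct RelExtCounters
    {
        slock_t     mutex;              /* or per-slot locks, or fetch-and-add */
        uint64      counters[REL_EXT_SLOTS];
    } RelExtCounters;

    static RelExtCounters *relExtCounters;  /* set up at shared-memory init */

    static inline int
    rel_ext_slot(RelFileNode *rnode)
    {
        /* low-order 9 bits of a hash of the relfilenode */
        return DatumGetUInt32(hash_any((unsigned char *) rnode, sizeof(*rnode)))
            & (REL_EXT_SLOTS - 1);
    }

    /* Called by whoever extends a relation, while holding the extension lock. */
    static void
    rel_ext_bump(RelFileNode *rnode)
    {
        int         slot = rel_ext_slot(rnode);

        SpinLockAcquire(&relExtCounters->mutex);
        relExtCounters->counters[slot]++;
        SpinLockRelease(&relExtCounters->mutex);
    }

    /*
     * A reader trusts the relation size cached in its relcache only if the
     * counter it saw when caching the size is unchanged; otherwise it falls
     * back to lseek and re-caches.
     */
    static bool
    rel_size_cache_still_valid(RelFileNode *rnode, uint64 counter_at_caching)
    {
        return relExtCounters->counters[rel_ext_slot(rnode)] == counter_at_caching;
    }

As noted later in the thread, this mirrors the fast-path locking trick: a hash collision only costs a spurious revalidation (an extra lseek), never a wrong answer.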
Robert Haas <robertmhaas@gmail.com> writes:
> I've been thinking for a while that we need some other system for
> managing other kinds of invalidations. For example, suppose we want
> to cache relation sizes in blocks. So we allocate 4kB of shared
> memory, interpreted as an array of 512 8-byte entries. Whenever you
> extend a relation, you hash the relfilenode and take the low-order 9
> bits of the hash value as an index into the array. You increment that
> value either under a spinlock or perhaps using fetch-and-add where
> available.

I'm not sure I believe the details of that.

1. 4 bytes is not enough to store the exact identity of the table that the cache entry belongs to, so how do you disambiguate?

2. If you don't find an entry for your target rel in the cache, aren't you still going to have to do an lseek?

Having said that, the idea of specialized caches in shared memory seems plenty reasonable to me. One thing that's bothering me about Heikki's proposal is that it's not clear that it's a *cache*; that is, I don't see the fallback logic to use when there's no entry for a relation for lack of room.

			regards, tom lane
On 2013-05-15 18:35:35 +0300, Heikki Linnakangas wrote:
> Truncating a heap at the end of vacuum, to release unused space back to
> the OS, currently requires taking an AccessExclusiveLock. Although it's only
> held for a short duration, it can be enough to cause a hiccup in query
> processing while it's held. Also, if there is a continuous stream of queries
> on the table, autovacuum never succeeds in acquiring the lock, and thus the
> table never gets truncated.
>
> I'd like to eliminate the need for AccessExclusiveLock while truncating.

Couldn't we "just" take the extension lock and then walk backwards from the rechecked end of the relation, trying ConditionalLockBufferForCleanup() on the buffers? For every such locked page we check whether it's still empty. If we find a page that we couldn't lock or that isn't empty, or we have already locked a sufficient number of pages, we truncate.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, May 15, 2013 at 8:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I've been thinking for a while that we need some other system for
>> managing other kinds of invalidations. For example, suppose we want
>> to cache relation sizes in blocks. So we allocate 4kB of shared
>> memory, interpreted as an array of 512 8-byte entries. Whenever you
>> extend a relation, you hash the relfilenode and take the low-order 9
>> bits of the hash value as an index into the array. You increment that
>> value either under a spinlock or perhaps using fetch-and-add where
>> available.
>
> I'm not sure I believe the details of that.
>
> 1. 4 bytes is not enough to store the exact identity of the table that
> the cache entry belongs to, so how do you disambiguate?

You don't. The idea is that it's inexact. When a relation is extended, every backend is forced to recheck the length of every relation whose relfilenode hashes to the same array slot as the one that was actually extended. So if you happen to be repeatedly scanning relation A, and somebody else is repeatedly scanning relation B, you'll *probably* not have to invalidate anything. But if A and B happen to hash to the same slot, then you'll keep getting bogus invalidations. Fortunately, that isn't very expensive.

The fast-path locking code uses a similar trick to detect conflicting strong locks, and it works quite well. In that case, as here, you can reduce the collision probability as much as you like by increasing the number of slots, at the cost of increased shared memory usage.

> 2. If you don't find an entry for your target rel in the cache, aren't
> you still going to have to do an lseek?

Don't think of it as a cache. The caching happens inside each backend's relcache; the shared memory structure is just a tool to force those caches to be revalidated when necessary.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
>> 2. If you don't find an entry for your target rel in the cache, aren't
>> you still going to have to do an lseek?

> Don't think of it as a cache. The caching happens inside each
> backend's relcache; the shared memory structure is just a tool to
> force those caches to be revalidated when necessary.

Hmm. Now I see: it's not a cache, it's a Bloom filter. The failure mode I was thinking of is inapplicable, but there's a different one: you have to be absolutely positive that *any* operation that extends the file will update the relevant filter entry. Still, I guess that we're already assuming that any such op will take the relation's extension lock, so it should be easy enough to find all the places to fix.

			regards, tom lane
On Thu, May 16, 2013 at 1:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>>> 2. If you don't find an entry for your target rel in the cache, aren't
>>> you still going to have to do an lseek?
>
>> Don't think of it as a cache. The caching happens inside each
>> backend's relcache; the shared memory structure is just a tool to
>> force those caches to be revalidated when necessary.
>
> Hmm. Now I see: it's not a cache, it's a Bloom filter.

Yes.

> The failure
> mode I was thinking of is inapplicable, but there's a different one:
> you have to be absolutely positive that *any* operation that extends the
> file will update the relevant filter entry. Still, I guess that we're
> already assuming that any such op will take the relation's extension
> lock, so it should be easy enough to find all the places to fix.

I would think so. The main thing that's held me back from actually implementing this is the fact that lseek is so darn cheap on Linux, and I don't have reliable data one way or the other for any other platform.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 16.05.2013 00:18, Robert Haas wrote:
> On Wed, May 15, 2013 at 11:35 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> Shared memory space is limited, but we only need the watermarks for any
>> in-progress truncations. Let's keep them in shared memory, in a small
>> fixed-size array. That limits the number of concurrent truncations that can
>> be in-progress, but that should be ok.
>
> Would it only limit the number of concurrent transactions that can be
> in progress *due to vacuum*? Or would it limit the TOTAL number of
> concurrent truncations? Because a table could have arbitrarily
> many inheritance children, and you might try to truncate the whole
> thing at once...

It would only limit the number of concurrent *truncations*. Vacuums in general would not count, only vacuums that have reached the end of the vacuum process and are trying to truncate the heap.

>> To not slow down common backend
>> operations, the values (or lack thereof) are cached in relcache. To sync the
>> relcache when the values change, there will be a new shared cache
>> invalidation event to force backends to refresh the cached watermark values.
>> A backend (vacuum) can ensure that all backends see the new value by first
>> updating the value in shared memory, sending the sinval message, and waiting
>> until everyone has received it.
>
> AFAIK, the sinval mechanism isn't really well-designed to ensure that
> these kinds of notifications arrive in a timely fashion. There's no
> particular bound on how long you might have to wait. Pretty much all
> inner loops have CHECK_FOR_INTERRUPTS(), but they definitely do not
> all have AcceptInvalidationMessages(), nor would that be safe or
> practical. The sinval code sends catchup interrupts, but only for the
> purpose of preventing sinval overflow, not for timely receipt.

Currently, vacuum will have to wait for all transactions that have touched the relation to finish, to get the AccessExclusiveLock. If we don't change anything in the sinval mechanism, the wait would be similar - until all currently in-progress transactions have finished. It's not quite the same; you'd have to wait for all in-progress transactions to finish, not only those that have actually touched the relation. But on the plus side, you would not block new transactions from accessing the relation, so it's not too bad if it takes a long time.

If we could use the catchup interrupts to speed that up though, that would be much better. I think vacuum could simply send a catchup interrupt, and wait until everyone has caught up. That would significantly increase the traffic on the sinval queue and the number of catchup interrupts, compared to what it is today, but I think it would still be ok. It would still only be a few sinval messages and catchup interrupts per truncation (ie. per vacuum).

> Another problem is that sinval resets are bad for performance, and
> anything we do that pushes more messages through sinval will increase
> the frequency of resets. Now if those operations are relatively
> uncommon, it's not worth worrying about - but if it's something that
> happens on every relation extension, I think that's likely to cause
> problems.

It would not be on every relation extension, only on truncation.

>> With the watermarks, truncation works like this:
>>
>> 1. Set soft watermark to the point where we think we can truncate the
>> relation. Wait until everyone sees it (send sinval message, wait).
>
> I'm also concerned about how you plan to synchronize access to this
> shared memory arena.
I was thinking of a simple lwlock, or perhaps one lwlock per slot in the arena. It would not be accessed very frequently, because the watermark values would be cached in the relcache. It would only need to be accessed when:

1. Truncating the relation, by vacuum, to set the watermark values.

2. By backends, to update the relcache, when they receive the sinval message sent by vacuum.

3. By backends, when writing above the cached watermark value. IOW, when extending a relation that's being truncated at the same time.

In particular, it would definitely not be accessed every time a backend currently needs to do an lseek. Nor every time a backend needs to extend a relation.

- Heikki
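To make access point 3 above concrete, a sketch of what the insertion path might do (the relcache field and the helper below are invented names, not existing PostgreSQL code):

    /*
     * Hypothetical fragment inside RelationGetBufferForTuple(): before
     * placing a new tuple on an empty page at or above the cached soft
     * watermark, raise the watermark under the extension lock.
     */
    if (relation->rd_cached_soft_watermark != InvalidBlockNumber &&
        targetBlock >= relation->rd_cached_soft_watermark)
    {
        LockRelationForExtension(relation, ExclusiveLock);
        /* re-read the shared slot and move the soft watermark past our page */
        raise_soft_watermark(relation, targetBlock + 1);    /* hypothetical helper */
        UnlockRelationForExtension(relation, ExclusiveLock);
    }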
On 16.05.2013 04:15, Andres Freund wrote:
> On 2013-05-15 18:35:35 +0300, Heikki Linnakangas wrote:
>> Truncating a heap at the end of vacuum, to release unused space back to
>> the OS, currently requires taking an AccessExclusiveLock. Although it's only
>> held for a short duration, it can be enough to cause a hiccup in query
>> processing while it's held. Also, if there is a continuous stream of queries
>> on the table, autovacuum never succeeds in acquiring the lock, and thus the
>> table never gets truncated.
>>
>> I'd like to eliminate the need for AccessExclusiveLock while truncating.
>
> Couldn't we "just" take the extension lock and then walk backwards from
> the rechecked end of the relation, trying ConditionalLockBufferForCleanup()
> on the buffers? For every such locked page we check whether it's still empty.
> If we find a page that we couldn't lock or that isn't empty, or we have
> already locked a sufficient number of pages, we truncate.

You need an AccessExclusiveLock on the relation to make sure that after you have checked that pages 10-15 are empty, and truncated them away, a backend doesn't come along a few seconds later and try to read page 10 again. There might be an old sequential scan in progress, for example, that thinks that the pages are still there.

- Heikki
On 2013-05-17 10:45:26 +0300, Heikki Linnakangas wrote:
> On 16.05.2013 04:15, Andres Freund wrote:
>> On 2013-05-15 18:35:35 +0300, Heikki Linnakangas wrote:
>>> Truncating a heap at the end of vacuum, to release unused space back to
>>> the OS, currently requires taking an AccessExclusiveLock. Although it's only
>>> held for a short duration, it can be enough to cause a hiccup in query
>>> processing while it's held. Also, if there is a continuous stream of queries
>>> on the table, autovacuum never succeeds in acquiring the lock, and thus the
>>> table never gets truncated.
>>>
>>> I'd like to eliminate the need for AccessExclusiveLock while truncating.
>>
>> Couldn't we "just" take the extension lock and then walk backwards from
>> the rechecked end of the relation, trying ConditionalLockBufferForCleanup()
>> on the buffers? For every such locked page we check whether it's still empty.
>> If we find a page that we couldn't lock or that isn't empty, or we have
>> already locked a sufficient number of pages, we truncate.
>
> You need an AccessExclusiveLock on the relation to make sure that after you
> have checked that pages 10-15 are empty, and truncated them away, a backend
> doesn't come along a few seconds later and try to read page 10 again. There
> might be an old sequential scan in progress, for example, that thinks that
> the pages are still there.

But that seems easily enough handled: We know the current page in its scan cannot be removed since it's pinned. So make heapgettup()/heapgetpage() pass something like RBM_IFEXISTS to ReadBuffer, and if the read fails, recheck the length of the relation before throwing an error.

There isn't much besides seqscans that can have that behaviour afaics:

- (bitmap) indexscans et al. won't point to completely empty pages
- there cannot be a concurrent vacuum since we have the appropriate locks
- if a trigger or something else has a tid referencing a page, there must be unremovable tuples on it

The only other case that I immediately see is tidscans, which should be handleable in a similar manner to seqscans.

Sure, there are some callsites that need to be adapted but it still seems noticeably easier than what you proposed upthread.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
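A rough sketch of what that could look like in heapgetpage() (RBM_IFEXISTS is the hypothetical read mode mentioned above, not an existing ReadBufferMode):

    /* Hypothetical fragment inside heapgetpage(scan, page). */
    Buffer      buffer;

    buffer = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM, page,
                                RBM_IFEXISTS,       /* hypothetical mode */
                                scan->rs_strategy);
    if (!BufferIsValid(buffer))
    {
        /* The page no longer exists; maybe the relation was truncated. */
        scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
        if (page >= scan->rs_nblocks)
            return;                     /* treat it as end of scan */
        elog(ERROR, "could not read block %u of relation \"%s\"",
             page, RelationGetRelationName(scan->rs_rd));
    }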
On Fri, May 17, 2013 at 3:38 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> If we could use the catchup interrupts to speed that up though, that would
> be much better. I think vacuum could simply send a catchup interrupt, and
> wait until everyone has caught up. That would significantly increase the
> traffic on the sinval queue and the number of catchup interrupts, compared
> to what it is today, but I think it would still be ok. It would still only
> be a few sinval messages and catchup interrupts per truncation (ie. per
> vacuum).

Hmm. So your proposal is to only send these sinval messages while a truncation is in progress, not any time the relation is extended? That would certainly be far more appealing from the point of view of not blowing out sinval.

It shouldn't be difficult to restrict the set of backends that have to be signaled to those that have the relation open. You could have a special kind of catchup signal that means "catch yourself up, but don't chain" - and send that only to those backends returned by GetConflictingVirtualXIDs.

It might be better to disconnect this mechanism from sinval entirely. In other words, just stick a version number on your shared memory data structure and have everyone advertise the last version number they've seen via PGPROC. The sinval message doesn't really seem necessary; it's basically just a no-op message to say "reread shared memory", and a plain old signal can carry that same message more efficiently.

One other thought regarding point 6 from your original proposal. If it's true that a scan could hold a pin on a buffer above the current hard watermark, which I think it is, then that means there's a scan in progress which is going to try to fetch pages beyond that also, up to whatever the end-of-file position was when the scan started. I suppose that heap scans will need to be taught to check some backend-local flag before fetching each new block, to see if a hard watermark change might have intervened. Is that what you have in mind?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
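A sketch of the sinval-free variant suggested above (truncShmem, the PGPROC field it implies, and both helper functions are made-up names):

    /*
     * Vacuum side, after updating the watermarks in shared memory: bump a
     * version counter, then wait until every backend has advertised -- via
     * its PGPROC entry -- that it has re-read the shared area.
     */
    uint32      newversion = ++truncShmem->version;

    while (!AllBackendsHaveSeenTruncationVersion(newversion))
    {
        SendTruncationCatchupSignals();     /* plain signal: "reread shared memory" */
        pg_usleep(10000L);                  /* 10 ms */
    }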
On 20.05.2013 16:59, Robert Haas wrote:
> On Fri, May 17, 2013 at 3:38 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> If we could use the catchup interrupts to speed that up though, that would
>> be much better. I think vacuum could simply send a catchup interrupt, and
>> wait until everyone has caught up. That would significantly increase the
>> traffic on the sinval queue and the number of catchup interrupts, compared
>> to what it is today, but I think it would still be ok. It would still only
>> be a few sinval messages and catchup interrupts per truncation (ie. per
>> vacuum).
>
> Hmm. So your proposal is to only send these sinval messages while a
> truncation is in progress, not any time the relation is extended?
> That would certainly be far more appealing from the point of view of
> not blowing out sinval.

Right.

> It shouldn't be difficult to restrict the set of backends that have to
> be signaled to those that have the relation open. You could have a
> special kind of catchup signal that means "catch yourself up, but
> don't chain"

What does "chain" mean above?

> - and send that only to those backends returned by
> GetConflictingVirtualXIDs.

Yeah, that might be a worthwhile optimization.

> It might be better to disconnect this mechanism from sinval entirely.
> In other words, just stick a version number on your shared memory
> data structure and have everyone advertise the last version number
> they've seen via PGPROC. The sinval message doesn't really seem
> necessary; it's basically just a no-op message to say "reread shared
> memory", and a plain old signal can carry that same message more
> efficiently.

Hmm. The sinval message makes sure that when a backend locks a relation, it will see the latest value, because of the AcceptInvalidationMessages call in LockRelation. If there is no sinval message, you'd need to always check the shared memory area when you lock a relation.

> One other thought regarding point 6 from your original proposal. If
> it's true that a scan could hold a pin on a buffer above the current
> hard watermark, which I think it is, then that means there's a scan in
> progress which is going to try to fetch pages beyond that also, up to
> whatever the end-of-file position was when the scan started. I
> suppose that heap scans will need to be taught to check some
> backend-local flag before fetching each new block, to see if a hard
> watermark change might have intervened. Is that what you have in
> mind?

Yes, a heap scan will need to check the locally cached hard watermark value every time it steps to a new page.

- Heikki
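A sketch of that per-page check in the seqscan page-stepping code (rd_cached_hard_watermark is a hypothetical relcache field, not an existing one):

    /* Hypothetical fragment in heapgettup(), when advancing to the next page. */
    page = scan->rs_cblock + 1;             /* next page of the forward scan */
    if (scan->rs_rd->rd_cached_hard_watermark != InvalidBlockNumber &&
        page >= scan->rs_rd->rd_cached_hard_watermark)
    {
        /* Everything at and above the hard watermark is guaranteed empty. */
        scan->rs_inited = false;
        tuple->t_data = NULL;               /* end the scan, as heapgettup does */
        return;
    }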
On 17.05.2013 12:35, Andres Freund wrote:
> On 2013-05-17 10:45:26 +0300, Heikki Linnakangas wrote:
>> On 16.05.2013 04:15, Andres Freund wrote:
>>> Couldn't we "just" take the extension lock and then walk backwards from
>>> the rechecked end of the relation, trying ConditionalLockBufferForCleanup()
>>> on the buffers? For every such locked page we check whether it's still empty.
>>> If we find a page that we couldn't lock or that isn't empty, or we have
>>> already locked a sufficient number of pages, we truncate.
>>
>> You need an AccessExclusiveLock on the relation to make sure that after you
>> have checked that pages 10-15 are empty, and truncated them away, a backend
>> doesn't come along a few seconds later and try to read page 10 again. There
>> might be an old sequential scan in progress, for example, that thinks that
>> the pages are still there.
>
> But that seems easily enough handled: We know the current page in its scan
> cannot be removed since it's pinned. So make heapgettup()/heapgetpage() pass
> something like RBM_IFEXISTS to ReadBuffer, and if the read fails, recheck the
> length of the relation before throwing an error.

Hmm. For the above to work, you'd need to atomically check that the pages you're truncating away are not pinned, and truncate them. If those steps are not atomic, a backend might pin a page after you've checked that it's not pinned, but before you've truncated the underlying file. I guess that would be doable; it needs some new infrastructure in the buffer manager, however.

> There isn't much besides seqscans that can have that behaviour afaics:
>
> - (bitmap) indexscans et al. won't point to completely empty pages
> - there cannot be a concurrent vacuum since we have the appropriate locks
> - if a trigger or something else has a tid referencing a page, there must
>   be unremovable tuples on it
>
> The only other case that I immediately see is tidscans, which should be
> handleable in a similar manner to seqscans.
>
> Sure, there are some callsites that need to be adapted but it still seems
> noticeably easier than what you proposed upthread.

Yeah. I'll think some more about how the required buffer manager changes could be done.

- Heikki
On Mon, May 20, 2013 at 10:19 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>> It shouldn't be difficult to restrict the set of backends that have to
>> be signaled to those that have the relation open. You could have a
>> special kind of catchup signal that means "catch yourself up, but
>> don't chain"
>
> What does "chain" mean above?

Normally, when sinval catchup is needed, we signal the backend that is furthest behind. After catching up, it signals the backend that is next-furthest behind, which in turn catches up and signals the next laggard, and so forth.

> Hmm. The sinval message makes sure that when a backend locks a relation, it
> will see the latest value, because of the AcceptInvalidationMessages call in
> LockRelation. If there is no sinval message, you'd need to always check the
> shared memory area when you lock a relation.

The latest value of what?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 15 May 2013 16:35, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> Shared memory space is limited, but we only need the watermarks for any
> in-progress truncations. Let's keep them in shared memory, in a small
> fixed-size array. That limits the number of concurrent truncations that can
> be in-progress, but that should be ok. To not slow down common backend
> operations, the values (or lack thereof) are cached in relcache. To sync the
> relcache when the values change, there will be a new shared cache
> invalidation event to force backends to refresh the cached watermark values.
> A backend (vacuum) can ensure that all backends see the new value by first
> updating the value in shared memory, sending the sinval message, and waiting
> until everyone has received it.

I think we could use a similar scheme for 2 other use cases.

1. Unlogged tables. It would be useful to have a persistent "safe high watermark" for an unlogged table, so that in the event of a crash we truncate back to the safe high watermark rather than truncating the whole table. That watermark would get updated at each checkpoint. Unlogged tables would get much more useful with that change. (Issues with indexes would need to be resolved also.)

2. Table extension during COPY operations is difficult. We need to be able to extend in larger chunks, so we would need to change the algorithm for how extension works. I'm thinking there's a relationship there with watermarks.

Can we look at the needs of multiple areas at once, so we come up with a more useful design that covers more than just one use case, please?

--
Simon Riggs                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services