Thread: collateral benefits of a crash-safe visibility map
On Tue, May 10, 2011 at 9:59 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> no, that wasn't my intent at all, except in the sense of wondering if
> a crash-safe visibility map provides a route of displacing a lot of
> hint bit i/o and by extension, making alternative approaches of doing
> that, including mine, a lot less useful. that's a good thing.

Sadly, I don't think it's going to have that effect. The page-is-all-visible bits seem to offer a significant performance benefit over the xmin-committed hint bits; but the benefit of xmin-committed all by itself is too much to ignore. The advantages of the xmin-committed hint bit (as opposed to the all-visible page-level bit) are:

(1) Setting the xmin-committed hint bit is a much more light-weight operation than setting the all-visible page-level bit. It can be done on-the-fly by any backend, rather than only by VACUUM, and need not be XLOG'd.

(2) If there are long-running transactions on the system, xmin-committed can be set much sooner than all-visible - the transaction need only commit. All-visible can't be set until overlapping transactions have ended.

(3) xmin-committed is useful on standby servers, whereas all-visible is ignored there. (Note that neither this patch nor index-only scans changes anything about that: it's existing behavior, necessitated by different xmin horizons.)

So I think that attempts to minimize the overhead of setting the xmin-committed bit are not likely to be mooted by anything I'm doing. Keep up the good work. :-)

Where I do think that we can possibly squeeze some additional benefit out of a crash-safe visibility map is in regards to anti-wraparound vacuuming. The existing visibility map is used to skip vacuuming of all-visible pages, but it's not used when XID wraparound is at issue.
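For concreteness, the light weight of point (1) can be modeled with a toy structure. This is illustrative C, not PostgreSQL source; the flag value mirrors HEAP_XMIN_COMMITTED from htup_details.h, but the types and function are invented:

```c
/* Toy model of the tuple-level hint bit (not actual PostgreSQL code). */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HEAP_XMIN_COMMITTED 0x0100  /* t_infomask flag, as in htup_details.h */

typedef struct
{
    uint32_t t_xmin;      /* inserting transaction ID */
    uint16_t t_infomask;  /* flag bits */
} ToyTuple;

/*
 * Any backend may do this on the fly after consulting clog: the page is
 * merely dirtied, and no WAL record is required for the hint itself.
 * Contrast the all-visible page bit, which only VACUUM sets and which a
 * crash-safe visibility map must XLOG.
 */
static void
set_xmin_committed_hint(ToyTuple *tup, bool xmin_committed_in_clog)
{
    if (xmin_committed_in_clog)
        tup->t_infomask |= HEAP_XMIN_COMMITTED;
}
```

Once the bit is set, later visibility checks on the tuple can skip the clog lookup entirely.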
The reason is fairly obvious: a regular vacuum only needs to worry about getting rid of dead tuples (and a visibility map bit being set is good evidence that there are none), but an anti-wraparound vacuum also needs to worry about live tuples with xmins that are about to wrap around from past to future (such tuples must be frozen). There's a second reason, too: the visibility map bit, not being crash-safe, has a small chance of being wrong, and we'd like to eventually get rid of any dead tuples that slip through the cracks.

Making the visibility map crash-safe doesn't directly address the first problem, but it does (if or when we're convinced that it's fairly bug-free) address the second one. To address the first problem, what we've talked about doing is something along the lines of freezing the tuples at the time we mark the page all-visible, so we don't have to go back and do it again later. Unfortunately, it's not quite that simple, because freezing tuples that early would cause all sorts of headaches for hot standby, not to mention making Tom and Alvaro grumpy when they're trying to figure out a corruption problem and all the xmins are FrozenXID rather than whatever they were originally. We floated the idea of a tuple-level bit HEAP_XMIN_FROZEN that would tell the system to treat the tuple as frozen, but wouldn't actually overwrite the xmin field. That would solve the forensic problem with earlier freezing, but it doesn't do anything to resolve the Hot Standby problem.

There is a performance issue to worry about, too: freezing operations must be xlog'd, as we update relfrozenxid based on the results, and therefore can't risk losing a freezing operation later on. So freezing sooner means more xlog activity for pages that might very well never benefit from it (if the tuples therein don't stick around long enough for it to matter).

Nonetheless, I haven't completely given up hope.
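The floated HEAP_XMIN_FROZEN bit might look like this in a toy model. The flag value and accessor are hypothetical; only the idea itself - treat the tuple as frozen while leaving the raw xmin in place for forensics - comes from the discussion above:

```c
/* Sketch of the HEAP_XMIN_FROZEN idea (hypothetical flag value). */
#include <assert.h>
#include <stdint.h>

#define FROZEN_XID       2       /* FrozenTransactionId in PostgreSQL */
#define HEAP_XMIN_FROZEN 0x0200  /* hypothetical t_infomask bit */

typedef struct
{
    uint32_t t_xmin;
    uint16_t t_infomask;
} ToyTuple;

/*
 * Visibility code would consult this instead of reading t_xmin directly:
 * a flagged tuple behaves as if frozen, but tup->t_xmin is never
 * overwritten, so the original inserting XID remains available when
 * debugging a corruption problem.
 */
static uint32_t
effective_xmin(const ToyTuple *tup)
{
    return (tup->t_infomask & HEAP_XMIN_FROZEN) ? FROZEN_XID : tup->t_xmin;
}
```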
The current situation is that a big table into which new records are slowly being inserted has to be repeatedly scanned in its entirety for unfrozen tuples even though only a small and readily identifiable part of it can actually contain any such tuples, which is clearly less than ideal.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, May 10, 2011 at 3:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> To address the first problem, what we've talked about doing is
> something along the line of freezing the tuples at the time we mark
> the page all-visible, so we don't have to go back and do it again
> later. Unfortunately, it's not quite that simple, because freezing
> tuples that early would cause all sorts of headaches for hot standby,
> not to mention making Tom and Alvaro grumpy when they're trying to
> figure out a corruption problem and all the xmins are FrozenXID rather
> than whatever they were originally. We floated the idea of a
> tuple-level bit HEAP_XMIN_FROZEN that would tell the system to treat
> the tuple as frozen, but wouldn't actually overwrite the xmin field.
> That would solve the forensic problem with earlier freezing, but it
> doesn't do anything to resolve the Hot Standby problem. There is a
> performance issue to worry about, too: freezing operations must be
> xlog'd, as we update relfrozenxid based on the results, and therefore
> can't risk losing a freezing operation later on. So freezing sooner
> means more xlog activity for pages that might very well never benefit
> from it (if the tuples therein don't stick around long enough for it
> to matter).

Hmmm, do we really need to WAL log freezing?

Can we break down freezing into a 2 stage process, so that we can have first stage as a lossy operation and a second stage that is WAL logged?

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, May 10, 2011 at 12:57 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Hmmm, do we really need to WAL log freezing?
>
> Can we break down freezing into a 2 stage process, so that we can have
> first stage as a lossy operation and a second stage that is WAL
> logged?

That might solve the relfrozenxid problem - set the bits in the heap, sync the heap, then update relfrozenxid once the heap is guaranteed safely on disk - but it again seems problematic for Hot Standby.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
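The two-stage scheme, with the ordering constraint Robert describes, can be sketched as a toy state machine (hypothetical names and structure; not a real PostgreSQL API):

```c
/* Toy model of the proposed two-stage freeze: a lossy first stage,
 * then a durable relfrozenxid advance gated on the heap being synced. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct
{
    bool     bits_set;     /* stage 1: frozen bits written, no WAL */
    bool     heap_synced;  /* heap pages known to be on disk */
    uint32_t relfrozenxid;
} ToyRel;

static void
stage1_set_frozen_bits(ToyRel *rel)
{
    rel->bits_set = true;       /* lossy: may be lost in a crash */
}

static void
fsync_heap(ToyRel *rel)
{
    rel->heap_synced = rel->bits_set;
}

/*
 * Stage 2: only safe once the heap is guaranteed on disk; otherwise a
 * crash could leave relfrozenxid claiming tuples are frozen when the
 * frozen bits themselves were lost.  This is the step that must be
 * WAL-logged before pg_clog can be truncated.
 */
static bool
stage2_advance_relfrozenxid(ToyRel *rel, uint32_t new_xid)
{
    if (!rel->heap_synced)
        return false;           /* must not advance yet */
    rel->relfrozenxid = new_xid;
    return true;
}
```

As Tom notes downthread, even this ordering still requires WAL before pg_clog truncation; the sketch only captures the heap-side dependency.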
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, May 10, 2011 at 12:57 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Hmmm, do we really need to WAL log freezing?

> That might solve the relfrozenxid problem - set the bits in the heap,
> sync the heap, then update relfrozenxid once the heap is guaranteed
> safely on disk - but it again seems problematic for Hot Standby.

... or even warm standby. You basically *have to* WAL-log freezing before you can truncate pg_clog. The only freedom you have here is freedom to mess with the policy about how soon you try to truncate pg_clog.

(Doing an unlogged freeze operation first is right out, too, if it causes the system to fail to perform/log the operation later.)

			regards, tom lane
On 10.05.2011 17:47, Robert Haas wrote:
> On Tue, May 10, 2011 at 9:59 AM, Merlin Moncure<mmoncure@gmail.com> wrote:
>> no, that wasn't my intent at all, except in the sense of wondering if
>> a crash-safe visibility map provides a route of displacing a lot of
>> hint bit i/o and by extension, making alternative approaches of doing
>> that, including mine, a lot less useful. that's a good thing.
>
> Sadly, I don't think it's going to have that effect. The
> page-is-all-visible bits seem to offer a significant performance
> benefit over the xmin-committed hint bits; but the benefit of
> xmin-committed all by itself is too much to ignore. The advantages of
> the xmin-committed hint bit (as opposed to the all-visible page-level
> bit) are:
>
> (1) Setting the xmin-committed hint bit is a much more light-weight
> operation than setting the all-visible page-level bit. It can be done
> on-the-fly by any backend, rather than only by VACUUM, and need not be
> XLOG'd.
> (2) If there are long-running transactions on the system,
> xmin-committed can be set much sooner than all-visible - the
> transaction need only commit. All-visible can't be set until
> overlapping transactions have ended.
> (3) xmin-committed is useful on standby servers, whereas all-visible
> is ignored there. (Note that neither this patch nor index-only scans
> changes anything about that: it's existing behavior, necessitated by
> different xmin horizons.)

(4) An xmin-committed flag attached directly to the tuple provides some robustness in case of corruption due to bad hardware. Without the flag, a single bit flip in the clog could in the worst case render all of your bulk-loaded data invisible and vacuumable. Of course, corruption will always eat your data to some extent, but the hint bits provide some robustness. Hint bits are close to the data itself, not in another file like the clog, which can come in handy at disaster recovery.
A flag in the heap page header isn't too different from a per-tuple hint bit from that point of view; it's still in the same page as the data itself. A bit in the clog or visibility map is not. Not sure how much performance we're willing to sacrifice for that, but it's something to keep in mind.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
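Heikki's worst-case scenario follows from how densely clog packs transaction status: two bits per XID, so one byte covers four transactions and one 8 kB page covers 32768 of them. A small sketch using PostgreSQL's default sizes (the constants match clog.c's layout, though the helper names here are invented):

```c
/* Where in the clog a given XID's two status bits live.  Constants
 * follow PostgreSQL's defaults (BLCKSZ = 8192); helper names invented. */
#include <assert.h>
#include <stdint.h>

#define CLOG_BITS_PER_XACT  2
#define CLOG_XACTS_PER_BYTE 4
#define BLCKSZ              8192
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)  /* 32768 */

static uint32_t
clog_page_for_xid(uint32_t xid)
{
    return xid / CLOG_XACTS_PER_PAGE;
}

static uint32_t
clog_byte_for_xid(uint32_t xid)
{
    return (xid % CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_BYTE;
}

static uint32_t
clog_shift_for_xid(uint32_t xid)
{
    return CLOG_BITS_PER_XACT * (xid % CLOG_XACTS_PER_BYTE);
}
```

So a single corrupted clog byte misreports the status of four transactions, and a bulk load whose XIDs cluster in one page shares fate across tens of thousands of tuples - whereas a flipped hint bit damages only the page it lives on.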
On Tue, May 10, 2011 at 9:47 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, May 10, 2011 at 9:59 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> no, that wasn't my intent at all, except in the sense of wondering if
>> a crash-safe visibility map provides a route of displacing a lot of
>> hint bit i/o and by extension, making alternative approaches of doing
>> that, including mine, a lot less useful. that's a good thing.
>
> Sadly, I don't think it's going to have that effect. The
> page-is-all-visible bits seem to offer a significant performance
> benefit over the xmin-committed hint bits; but the benefit of
> xmin-committed all by itself is too much to ignore. The advantages of
> the xmin-committed hint bit (as opposed to the all-visible page-level
> bit) are:
>
> (1) Setting the xmin-committed hint bit is a much more light-weight
> operation than setting the all-visible page-level bit. It can be done
> on-the-fly by any backend, rather than only by VACUUM, and need not be
> XLOG'd.
> (2) If there are long-running transactions on the system,
> xmin-committed can be set much sooner than all-visible - the
> transaction need only commit. All-visible can't be set until
> overlapping transactions have ended.
> (3) xmin-committed is useful on standby servers, whereas all-visible
> is ignored there. (Note that neither this patch nor index-only scans
> changes anything about that: it's existing behavior, necessitated by
> different xmin horizons.)

right. #1 could maybe be worked around somehow and #2 is perhaps arguable, at least in some workloads, but #3 is admittedly a killer, especially since the bit is on the page. I noted your earlier skepticism regarding moving the page visibility check completely to the VM: "In some ways, that would make things much simpler.
But to make that work, every insert/update/delete to a page would have to pin the visibility map page and clear PD_ALL_VISIBLE if appropriate, so it might not be good from a performance standpoint, especially in high-concurrency workloads. Right now, if PD_ALL_VISIBLE isn't set, we don't bother touching the visibility map page, which seems like a possibly important optimization."

That's debatable, but probably moot. Thanks for thinking that through though.

merlin
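The optimization quoted above can be sketched as a toy model. PD_ALL_VISIBLE's value mirrors bufpage.h; everything else (types, pin counting) is invented for illustration:

```c
/* Toy model: a heap modification touches the visibility map page only
 * when PD_ALL_VISIBLE is currently set on the heap page. */
#include <assert.h>
#include <stdbool.h>

#define PD_ALL_VISIBLE 0x0004  /* page header flag, as in bufpage.h */

typedef struct { unsigned pd_flags; } ToyPage;
typedef struct { bool bit; int pins; } ToyVMPage;

static void
modify_page(ToyPage *page, ToyVMPage *vm)
{
    if (page->pd_flags & PD_ALL_VISIBLE)
    {
        vm->pins++;                       /* extra VM pin, only here */
        vm->bit = false;                  /* clear the map bit */
        page->pd_flags &= ~PD_ALL_VISIBLE;
    }
    /* ... perform the actual heap change ... */
}
```

In the common case (bit already clear - i.e., a recently modified page) the visibility map page is never pinned, which is what would be lost if the page-level check moved entirely into the VM.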
On Tue, May 10, 2011 at 6:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, May 10, 2011 at 12:57 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Hmmm, do we really need to WAL log freezing?
>>
>> Can we break down freezing into a 2 stage process, so that we can have
>> first stage as a lossy operation and a second stage that is WAL
>> logged?
>
> That might solve the relfrozenxid problem - set the bits in the heap,
> sync the heap, then update relfrozenxid once the heap is guaranteed
> safely on disk - but it again seems problematic for Hot Standby.

How about we truncate the clog differently on each server? We could have a special kind of VACUUM that runs during Hot Standby, setting frozen hint bits only.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, May 10, 2011 at 1:49 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Tue, May 10, 2011 at 6:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, May 10, 2011 at 12:57 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> Hmmm, do we really need to WAL log freezing?
>>>
>>> Can we break down freezing into a 2 stage process, so that we can have
>>> first stage as a lossy operation and a second stage that is WAL
>>> logged?
>>
>> That might solve the relfrozenxid problem - set the bits in the heap,
>> sync the heap, then update relfrozenxid once the heap is guaranteed
>> safely on disk - but it again seems problematic for Hot Standby.
>
> How about we truncate the clog differently on each server? We could
> have a special kind of VACUUM that runs during Hot Standby, setting
> frozen hint bits only.

Interesting idea. It does seem complicated.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, May 10, 2011 at 6:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, May 10, 2011 at 12:57 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> Hmmm, do we really need to WAL log freezing?
>
>> That might solve the relfrozenxid problem - set the bits in the heap,
>> sync the heap, then update relfrozenxid once the heap is guaranteed
>> safely on disk - but it again seems problematic for Hot Standby.
>
> ... or even warm standby. You basically *have to* WAL-log freezing
> before you can truncate pg_clog. The only freedom you have here is
> freedom to mess with the policy about how soon you try to truncate
> pg_clog.
>
> (Doing an unlogged freeze operation first is right out, too, if it
> causes the system to fail to perform/log the operation later.)

Trying to think outside the box, given all these things we can't do.

Can we keep track of the relfrozenxid and then note when we fsync the relevant file, then issue a single WAL record to indicate that? Still WAL logging, but 1 record per table, not 1 record per tuple.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services